This section contains quick start guides to help you get up and running with Pinot.
We want your experience getting started with Pinot to be both low effort and high reward. Here you'll find a collection of quick start guides that contain starter distributions of the Pinot platform.
This video will show you a step-by-step walkthrough for launching the individual components of Pinot and scaling them to multiple instances. This is an excellent resource for developers and operators who want to understand how to set up each component and debug a cluster.
We also have a step-by-step guide for manually setting up a Pinot cluster using Docker or shell scripts.
Manual cluster setup
Getting data into Pinot is easy. Take a look at these two quick start guides, which will help you get up and running with sample data for offline and real-time tables.
Batch import example
Stream ingestion example
This page has a collection of frequently asked questions with answers from the community.
We have the toJsonStr(key)
function, which can store a top-level JSON field as a STRING in Pinot.
Then you can use the jsonExtractScalar(JSON_STRING_FIELD, JSON_PATH, OUTPUT_FORMAT)
function at query time to fetch the desired field from the JSON string. For example:
Select jsonExtractScalar(myJsonMapStr,'$.k1','STRING')
from myTable
where jsonExtractScalar(myJsonMapStr,'$.k1','STRING') = 'value-k1-0'
Select sum(jsonExtractScalar(complexMapStr,'$.k4.met','INT'))
from myTable
group by jsonExtractScalar(complexMapStr,'$.k1','STRING')
NOTE: This works well if some of your fields are nested JSON but most of your fields are top-level JSON keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as one column, which is not ideal.
Support for flattening during ingestion is on the roadmap: https://github.com/apache/incubator-pinot/issues/5264
Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. Here's the documentation for tableIndexConfig: https://docs.pinot.apache.org/basics/components/table#tableindexconfig-1 along with a sample table that has set inverted indexes on some columns.
Applying inverted indexes to a table config will generate inverted indexes for all new segments. To apply the inverted indexes to all existing segments, follow the steps in How to apply inverted index to existing setup?
Add the columns you wish to index to the tableIndexConfig -> invertedIndexColumns list. This sample table config shows inverted indexes set: https://docs.pinot.apache.org/basics/components/table#offline-table-config To update the table config, use the Pinot Swagger API: http://localhost:9000/help#!/Table/updateTableConfig
Invoke the reload API: http://localhost:9000/help#!/Segment/reloadAllSegments
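For example, a rough curl sketch of that reload call (assuming a local controller at localhost:9000 and a hypothetical table named myTable; the exact endpoint is the one behind the Swagger link above):
# Assumption: controller running locally on port 9000, OFFLINE table named "myTable"
# Reload all segments so the newly added inverted indexes get generated
curl -X POST "http://localhost:9000/segments/myTable/reload?type=OFFLINE"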
Right now, there's no easy way to confirm that the reload succeeded. One way is to check the index_map file inside the segment metadata; you should see inverted index entries for the new columns. An API for this is coming soon: https://github.com/apache/incubator-pinot/issues/5390
Here's the page explaining the Pinot response format: https://docs.pinot.apache.org/users/api/querying-pinot-using-standard-sql/response-format
"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.
select "timestamp" from myTable
Other commonly encountered reserved keywords are date, time, table.
For filtering on STRING columns, use single quotes
SELECT COUNT(*) from myTable WHERE column = 'foo'
The fields in the ORDER BY
clause must be either part of the GROUP BY clause or an aggregation, BEFORE applying the alias. Therefore, this will not work:
SELECT count(colA) as aliasA, colA from tableA GROUP BY colA ORDER BY aliasA
Instead, this will work
SELECT count(colA) as sumA, colA from tableA GROUP BY colA ORDER BY count(colA)
You can change the number of replicas by updating the table config's segmentsConfig section. Make sure you have at least as many servers as the replication.
For an OFFLINE table, update replication:
{
"tableName": "pinotTable",
"tableType": "OFFLINE",
"segmentsConfig": {
"replication": "3",
...
}
..
For a REALTIME table, update replicasPerPartition:
{
"tableName": "pinotTable",
"tableType": "REALTIME",
"segmentsConfig": {
"replicasPerPartition": "3",
...
}
..
After changing the replication, run a table rebalance.
A rebalance is run to reassign all the segments of a table to the available servers. This is typically done after capacity changes, i.e. when adding servers to or removing servers from a table.
Offline
Use the rebalance API from the Swagger APIs on the controller http://localhost:9000/help#!/Table/rebalance, with tableType OFFLINE
Realtime
Use the rebalance API from the Swagger APIs on the controller http://localhost:9000/help#!/Table/rebalance, with tableType REALTIME.
A realtime table has 2 components: the consuming segments and the completed segments. By default, only the completed segments get rebalanced. The consuming segments will pick up the right assignment once they complete. You can, however, force the consuming segments to also be included in the rebalance by setting the param includeConsuming
to true. Note that rebalancing the consuming segments means they will drop the data consumed so far and restart consumption from the last offset, which may lead to a short period of data staleness.
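As a rough sketch, a rebalance with consuming segments included could look like this (assuming a controller at localhost:9000 and a table named pinotTable; check the Swagger page above for the full list of parameters):
# Assumption: controller at localhost:9000, REALTIME table named "pinotTable"
# Rebalance the table and include the consuming segments in the reassignment
curl -X POST "http://localhost:9000/tables/pinotTable/rebalance?type=REALTIME&includeConsuming=true"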
You can check the status of the rebalance by
Checking the controller logs
Running rebalance again after a while, you should receive status "status": "NO_OP"
Checking the External View of the table, to see that the changes in capacity/replicas have taken effect (see the sketch below).
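For the last check, here is a hedged sketch of fetching the External View through the controller REST API (again assuming localhost:9000 and a table named pinotTable):
# Assumption: controller at localhost:9000, table named "pinotTable"
# The external view shows the segment-to-server assignment the cluster has actually converged to
curl "http://localhost:9000/tables/pinotTable/externalview"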
Yes, replica groups work for realtime. There are 2 parts to enabling replica groups:
Replica groups segment assignment
Replica group query routing
Replica group segment assignment
Replica group segment assignment is achieved in realtime if the number of servers is a multiple of the number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups.
For example, consider we have 6 partitions, 2 replicas, and 4 servers.
     r1    r2
p1   S0    S1
p2   S2    S3
p3   S0    S1
p4   S2    S3
p5   S0    S1
p6   S2    S3
As you can see, the set (S0, S2) contains r1 of every partition, and (S1, S3) contains r2 of every partition. The query will only be routed to one of the sets, and not span every server. If you are adding/removing servers from an existing table setup, you have to run rebalance for segment assignment changes to take effect.
Replica group query routing
Once replica group segment assignment is in effect, the query routing can take advantage of it. For replica group based query routing, set the following in the table config's routing section, and then restart brokers
{
"tableName": "pinotTable",
"tableType": "REALTIME",
"routing": {
"instanceSelectorType": "replicaGroup"
}
..
}
This quick start guide will help you bootstrap a Pinot standalone instance on your local machine.
In this guide you'll learn how to download and install Apache Pinot as a standalone instance.
This quickstart guide shows you how to run an example recipe in a standalone instance and is meant for learning. To run Pinot in cluster mode, please take a look at Manual cluster setup.
First, let's download the Pinot distribution for this tutorial. You can either build the distribution from source or download a packaged release.
# define the pinot version
PINOT_VERSION=0.3.0
Follow these steps to check out the code from GitHub and build Pinot locally:
# checkout pinot
git clone https://github.com/apache/incubator-pinot.git
cd incubator-pinot
# build pinot
mvn install package -DskipTests -Pbin-dist
# navigate to directory containing the setup scripts
cd pinot-distribution/target/apache-pinot-incubating-$PINOT_VERSION-bin/apache-pinot-incubating-$PINOT_VERSION-bin
Download the latest binary release from Apache Pinot, or use this command
wget https://downloads.apache.org/incubator/pinot/apache-pinot-incubating-$PINOT_VERSION/apache-pinot-incubating-$PINOT_VERSION-bin.tar.gz
Once you have the tar file,
# untar it
tar -zxvf apache-pinot-incubating-$PINOT_VERSION-bin.tar.gz
# navigate to directory containing the launcher scripts
cd apache-pinot-incubating-$PINOT_VERSION-bin
We'll be using a quick-start script, which does the following:
Sets up the Pinot cluster QuickStartCluster
Creates a sample table and loads sample data
There are 3 kinds of quick start:
Batch quick start creates the Pinot cluster, creates an offline table baseballStats
and pushes sample offline data to the table.
bin/quick-start-batch.sh
That's it! We've spun up a Pinot cluster. You can continue playing with other types of quick start, or simply head on to Pinot Data Explorer to check out the data in the baseballStats
table.
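If you prefer the command line over the UI, here's a rough sketch of querying the table through the broker's SQL endpoint (this assumes the quick-start broker is listening on the default port 8099; check the startup logs for the actual port in your setup):
# Assumption: quick-start broker reachable at localhost:8099
# Run a sample SQL query against the baseballStats table
curl -H "Content-Type: application/json" -X POST \
  -d '{"sql":"SELECT playerName, SUM(runs) FROM baseballStats GROUP BY playerName ORDER BY SUM(runs) DESC LIMIT 5"}' \
  http://localhost:8099/query/sql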
Streaming quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a realtime table meetupRSVP
which ingests data from the Kafka topic.
# stop previous quick start cluster, if any
bin/quick-start-streaming.sh
We now have a Pinot cluster with a realtime table! You can head over to Pinot Data Explorer to check out the data in the meetupRSVP
table.
Hybrid quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a hybrid table airlineStats
. The realtime table ingests data from the Kafka topic. Lastly, sample data is pushed into the offline table.
# stop previous quick start cluster, if any
bin/quick-start-hybrid.sh
Let's head over to Pinot Data Explorer to check out the data we pushed to the airlineStats
table.
This quick start guide will show you how to run a Pinot cluster using Docker.
Create an isolated bridge network in docker
docker network create -d bridge pinot-demo
We'll be using our docker image apachepinot/pinot:latest
to run this quick start, which does the following:
Sets up the Pinot cluster
Creates a sample table and loads sample data
There are 3 types of quick start examples.
Batch example
Streaming example
Hybrid example
In this example we demonstrate how to do batch processing with Pinot.
Starts Pinot deployment by starting
Apache Zookeeper
Pinot Controller
Pinot Broker
Pinot Server
Creates a demo table
baseballStats
Launches a standalone data ingestion job
Builds one Pinot segment for a given CSV data file for table baseballStats
Pushes the built segment to the Pinot controller
Issues sample queries to Pinot
docker run \
--network=pinot-demo \
--name pinot-quickstart \
-p 9000:9000 \
-d apachepinot/pinot:latest QuickStart \
-type batch
Once the Docker container is running, you can view the logs by running the following command.
docker logs pinot-quickstart -f
That's it! We've spun up a Pinot cluster. Your cluster is ready once you see the cluster setup completion messages and sample queries in the logs.
You can head over to Exploring Pinot to check out the data in the baseballStats
table.
In this example we demonstrate how to do stream processing with Pinot.
Starts Pinot deployment by starting
Apache Kafka
Apache Zookeeper
Pinot Controller
Pinot Broker
Pinot Server
Creates a demo table
meetupRsvp
Launches a meetup stream
Publishes data to a Kafka topic meetupRSVPEvents
to be subscribed to by Pinot
Issues sample queries to Pinot
# stop previous container, if any, or use different network
docker run \
--network=pinot-demo \
--name pinot-quickstart \
-p 9000:9000 \
-d apachepinot/pinot:latest QuickStart \
-type stream
Once the cluster is up, you can head over to Exploring Pinot to check out the data in the meetupRsvp
table.
In this example we demonstrate how to do hybrid stream and batch processing with Pinot.
Starts Pinot deployment by starting
Apache Kafka
Apache Zookeeper
Pinot Controller
Pinot Broker
Pinot Server
Creates a demo table
airlineStats
Launches a standalone data ingestion job
Builds Pinot segments under a given directory of Avro files for table airlineStats
Pushes built segments to Pinot controller
Launches a stream of flight stats
Publishes data to a Kafka topic airlineStatsEvents
to be subscribed to by Pinot
Issues sample queries to Pinot
# stop previous container, if any, or use different network
docker run \
--network=pinot-demo \
--name pinot-quickstart \
-p 9000:9000 \
-d apachepinot/pinot:latest QuickStart \
-type hybrid
Once the cluster is up, you can head over to Exploring Pinot to check out the data in the airlineStats
table.
Pinot quick start in Kubernetes
Before continuing, please make sure that you've downloaded Apache Pinot. The scripts for the setup in this guide can be found in our open source project on GitHub.
The scripts can be found in the Pinot source at ./incubator-pinot/kubernetes/helm
# checkout pinot
git clone https://github.com/apache/incubator-pinot.git
cd incubator-pinot/kubernetes/helm
The Pinot repo has pre-packaged Helm charts for Pinot and Presto. The Helm repo index file is here.
helm repo add pinot https://raw.githubusercontent.com/apache/incubator-pinot/master/kubernetes/helm
kubectl create ns pinot-quickstart
helm install pinot pinot/pinot \
-n pinot-quickstart \
--set cluster.name=pinot \
--set server.replicaCount=2
helm dependency update
For Helm v2.12.1
If your Kubernetes cluster is recently provisioned, ensure Helm is initialized by running:
helm init --service-account tiller
Then deploy a new HA Pinot cluster using the following command:
helm install --namespace "pinot-quickstart" --name "pinot" .
For Helm v3.0.0
kubectl create ns pinot-quickstart
helm install -n pinot-quickstart pinot .
If you encounter the following error, run the commands below to resolve it:
Error: could not find tiller.
Resolution:
kubectl -n kube-system delete deployment tiller-deploy
kubectl -n kube-system delete service/tiller-deploy
helm init --service-account tiller
If you encounter a permission issue like the following, run the command below to resolve it:
Error: release pinot failed: namespaces "pinot-quickstart" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "namespaces" in API group "" in the namespace "pinot-quickstart"
Resolution:
kubectl apply -f helm-rbac.yaml
kubectl get all -n pinot-quickstart
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install -n pinot-quickstart kafka incubator/kafka --set replicas=1
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --namespace "pinot-quickstart" --name kafka incubator/kafka
kubectl get all -n pinot-quickstart |grep kafka
Ensure the Kafka deployment is ready before executing the scripts in the next steps.
pod/kafka-0 1/1 Running 0 2m
pod/kafka-zookeeper-0 1/1 Running 0 10m
pod/kafka-zookeeper-1 1/1 Running 0 9m
pod/kafka-zookeeper-2 1/1 Running 0 8m
The scripts below will create two Kafka topics for data ingestion:
kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --topic flights-realtime --create --partitions 1 --replication-factor 1
kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --topic flights-realtime-avro --create --partitions 1 --replication-factor 1
The script below will deploy 3 batch jobs.
Ingest 19492 JSON messages to Kafka topic flights-realtime
at a speed of 1 msg/sec
Ingest 19492 Avro messages to Kafka topic flights-realtime-avro
at a speed of 1 msg/sec
Upload Pinot schema airlineStats
Create Pinot table airlineStats
to ingest data from JSON encoded Kafka topic flights-realtime
Create Pinot table airlineStatsAvro
to ingest data from Avro encoded Kafka topic flights-realtime-avro
kubectl apply -f pinot-realtime-quickstart.yml
Please use the script below to perform local port-forwarding, which will also open Pinot query console in your default web browser.
This script can be found in the Pinot source at ./incubator-pinot/kubernetes/helm
./query-pinot-data.sh
kubectl apply -f superset.yaml
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'flask fab create-admin'
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset db upgrade'
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset init'
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset import_datasources -p /etc/superset/pinot_example_datasource.yaml'
kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset import_dashboards -p /etc/superset/pinot_example_dashboard.json'
You can run the command below to open Superset in your browser and log in with the admin credentials created above.
./open-superset-ui.sh
You can open the imported dashboard by clicking the Dashboards banner and then clicking on AirlineStats.
You can run the command below to deploy a customized Presto with the Pinot plugin installed.
helm install presto pinot/presto -n pinot
kubectl apply -f presto-coordinator.yaml
Once Presto is deployed, you can run the command below.
./pinot-presto-cli.sh
List all catalogs
presto:default> show catalogs;
Catalog
---------
pinot
system
(2 rows)
Query 20191112_050827_00003_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
List all tables
presto:default> show tables;
Table
--------------
airlinestats
(1 row)
Query 20191112_050907_00004_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [1 rows, 29B] [1 rows/s, 41B/s]
Show schema
presto:default> DESCRIBE pinot.dontcare.airlinestats;
Column | Type | Extra | Comment
----------------------+---------+-------+---------
flightnum | integer | |
origin | varchar | |
quarter | integer | |
lateaircraftdelay | integer | |
divactualelapsedtime | integer | |
......
Query 20191112_051021_00005_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:02 [80 rows, 6.06KB] [35 rows/s, 2.66KB/s]
Count total documents
presto:default> select count(*) as cnt from pinot.dontcare.airlinestats limit 10;
cnt
------
9745
(1 row)
Query 20191112_051114_00006_xkm4g, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [1 rows, 8B] [2 rows/s, 19B/s]
kubectl delete ns pinot-quickstart
This page contains multiple quick start guides for deploying Pinot to a public cloud provider.
The following quick start guides will show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.
Running on Azure
Running on GCP
Running on AWS
This starter guide provides a quick start for running Pinot on Microsoft Azure.
This document provides the basic instructions to set up a Kubernetes cluster on Azure Kubernetes Service (AKS).
Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.
For Mac users
brew install kubernetes-cli
Please check kubectl version after installation.
kubectl version
Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.
For Mac users
brew install kubernetes-helm
Please check helm version after installation.
helm version
Please follow this link (https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) to install Azure CLI.
For Mac users
brew update && brew install azure-cli
The script below will open your default browser to sign in to your Azure account.
az login
The script below will create a resource group in the eastus location.
AKS_RESOURCE_GROUP=pinot-demo
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} \
--location ${AKS_RESOURCE_GROUP_LOCATION}
The script below will create a 3-node cluster named pinot-quickstart for demo purposes.
Please modify the parameters in the example command below:
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks create --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME} \
--node-count 3
Once the command succeeds, the cluster is ready to be used.
Simply run the command below to get the credentials for the pinot-quickstart cluster that you just created, or for your existing cluster.
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
To verify the connection, you can run:
kubectl get nodes
Please follow this Kubernetes QuickStart to deploy your Pinot Demo.
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
This starter guide provides a quick start for running Pinot on Google Cloud Platform (GCP).
This document provides the basic instructions to set up a Kubernetes cluster on Google Kubernetes Engine (GKE).
Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.
For Mac users
brew install kubernetes-cli
Please check kubectl version after installation.
kubectl version
Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.
For Mac users
brew install kubernetes-helm
Please check helm version after installation.
helm version
Please follow this link (https://cloud.google.com/sdk/install) to install Google Cloud SDK.
Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash
Restart your shell
exec -l $SHELL
gcloud init
The script below will create a 3-node cluster named pinot-quickstart in us-west1-b with n1-standard-2 machines for demo purposes.
Please modify the parameters in the example command below:
GCLOUD_PROJECT=[your gcloud project name]
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
GCLOUD_MACHINE_TYPE=n1-standard-2
GCLOUD_NUM_NODES=3
gcloud container clusters create ${GCLOUD_CLUSTER} \
--num-nodes=${GCLOUD_NUM_NODES} \
--machine-type=${GCLOUD_MACHINE_TYPE} \
--zone=${GCLOUD_ZONE} \
--project=${GCLOUD_PROJECT}
You can monitor the cluster status with the following command:
gcloud compute instances list
Once the cluster is in RUNNING status, it's ready to be used.
Simply run the command below to get the credentials for the pinot-quickstart cluster that you just created, or for your existing cluster.
GCLOUD_PROJECT=[your gcloud project name]
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
gcloud container clusters get-credentials ${GCLOUD_CLUSTER} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT}
To verify the connection, you can run:
kubectl get nodes
Please follow this Kubernetes QuickStart to deploy your Pinot Demo.
GCLOUD_ZONE=us-west1-b
gcloud container clusters delete pinot-quickstart --zone=${GCLOUD_ZONE}
This guide provides a quick start for running Pinot on Amazon Web Services (AWS).
This document provides the basic instructions to set up a Kubernetes cluster on Amazon Elastic Kubernetes Service (Amazon EKS).
Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.
For Mac users
brew install kubernetes-cli
Please check kubectl version after installation.
kubectl version
Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.
For Mac users
brew install kubernetes-helm
Please check helm version after installation.
helm version
Please follow this link (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html#install-tool-bundled) to install AWS CLI.
For Mac users
curl "https://d1vvhvl2y92vvt.cloudfront.net/awscli-exe-macos.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Please follow this link (https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html#installing-eksctl) to install eksctl.
For Mac users
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
If you're a first-time AWS user, please register your account at https://aws.amazon.com/.
Once you've created the account, go to AWS Identity and Access Management (IAM) to create a user and create access keys under the Security Credentials tab.
aws configure
The script below will create a 3-node cluster named pinot-quickstart in us-west-2 with t3.small machines for demo purposes.
Please modify the parameters in the example command below:
EKS_CLUSTER_NAME=pinot-quickstart
eksctl create cluster \
--name ${EKS_CLUSTER_NAME} \
--version 1.14 \
--region us-west-2 \
--nodegroup-name standard-workers \
--node-type t3.small \
--nodes 3 \
--nodes-min 3 \
--nodes-max 4 \
--node-ami auto
You can monitor the cluster status with the following command:
EKS_CLUSTER_NAME=pinot-quickstart
aws eks describe-cluster --name ${EKS_CLUSTER_NAME}
Once the cluster is in ACTIVE status, it's ready to be used.
Simply run the command below to get the credentials for the pinot-quickstart cluster that you just created, or for your existing cluster.
EKS_CLUSTER_NAME=pinot-quickstart
aws eks update-kubeconfig --name ${EKS_CLUSTER_NAME}
To verify the connection, you can run:
kubectl get nodes
Please follow this Kubernetes QuickStart to deploy your Pinot Demo.
EKS_CLUSTER_NAME=pinot-quickstart
aws eks delete-cluster --name ${EKS_CLUSTER_NAME}
This quick start guide will show you how to set up a Pinot cluster manually.
You can try out the pre-built Pinot all-in-one Docker image.
export PINOT_VERSION=0.3.0-SNAPSHOT
export PINOT_IMAGE=apachepinot/pinot:${PINOT_VERSION}
docker pull ${PINOT_IMAGE}
(Optional) You can also follow the instructions here to build your own images.
Create an isolated bridge network in docker
docker network create -d bridge pinot-demo
Start Zookeeper in daemon mode. This is a single-node Zookeeper setup. Zookeeper is the central metadata store for Pinot and should be set up with replication for production use. See https://zookeeper.apache.org/doc/r3.6.0/zookeeperStarted.html#sc_RunningReplicatedZooKeeper for more information.
docker run \
--network=pinot-demo \
--name pinot-zookeeper \
--restart always \
-p 2181:2181 \
-d zookeeper:3.5.6
Start ZKUI to browse Zookeeper data at http://localhost:9090.
docker run --rm -ti \
--network pinot-demo --name=zkui \
-p 9090:9090 \
-e ZK_SERVER=pinot-zookeeper:2181 \
-d qnib/plain-zkui:latest
Alternatively, you can use Zooinspector.
Start Pinot Controller in daemon mode and connect it to Zookeeper.
docker run --rm -ti \
--network=pinot-demo \
--name pinot-controller \
-p 9000:9000 \
-d ${PINOT_IMAGE} StartController \
-zkAddress pinot-zookeeper:2181
Start Pinot Broker in daemon mode and connect it to Zookeeper.
docker run --rm -ti \
--network=pinot-demo \
--name pinot-broker \
-d ${PINOT_IMAGE} StartBroker \
-zkAddress pinot-zookeeper:2181
Start Pinot Server in daemon mode and connect it to Zookeeper.
docker run --rm -ti \
--network=pinot-demo \
--name pinot-server \
-d ${PINOT_IMAGE} StartServer \
-zkAddress pinot-zookeeper:2181
Optionally, you can also start Kafka for setting up realtime streams. This brings up the Kafka broker on port 9092.
docker run --rm -ti \
--network pinot-demo --name=kafka \
-e KAFKA_ZOOKEEPER_CONNECT=pinot-zookeeper:2181/kafka \
-e KAFKA_BROKER_ID=0 \
-e KAFKA_ADVERTISED_HOST_NAME=kafka \
-d wurstmeister/kafka:latest
Now all Pinot-related components are started as an empty cluster.
You can run the command below to check the container status.
docker container ls -a
Sample Console Output
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9ec20e4463fa wurstmeister/kafka:latest "start-kafka.sh" 43 minutes ago Up 43 minutes kafka
0775f5d8d6bf apachepinot/pinot:latest "./bin/pinot-admin.s…" 44 minutes ago Up 44 minutes 8096-8099/tcp, 9000/tcp pinot-server
64c6392b2e04 apachepinot/pinot:latest "./bin/pinot-admin.s…" 44 minutes ago Up 44 minutes 8096-8099/tcp, 9000/tcp pinot-broker
b6d0f2bd26a3 apachepinot/pinot:latest "./bin/pinot-admin.s…" 45 minutes ago Up 45 minutes 8096-8099/tcp, 0.0.0.0:9000->9000/tcp pinot-controller
570416fc530e zookeeper:3.5.6 "/docker-entrypoint.…" 45 minutes ago Up 45 minutes 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp pinot-zookeeper
cd apache-pinot-incubating-${PINOT_VERSION}-bin
bin/pinot-admin.sh StartZookeeper \
-zkPort 2191
You can use Zooinspector to browse the Zookeeper instance.
bin/pinot-admin.sh StartController \
-zkAddress localhost:2191 \
-controllerPort 9000
bin/pinot-admin.sh StartBroker \
-zkAddress localhost:2191
bin/pinot-admin.sh StartServer \
-zkAddress localhost:2191
bin/pinot-admin.sh StartKafka \
-zkAddress=localhost:2191/kafka \
-port 19092
Now all Pinot-related components are started as an empty cluster.
Now it's time to start adding data to the cluster. Check out some of the Recipes or follow the Batch upload sample data and Stream sample data for instructions on loading your own data.
Step-by-step guide on pushing your own data into the Pinot cluster
So far, we've set up our cluster, run some queries, and explored the admin endpoints. Now it's time to get our own data into Pinot.
Let's gather our data files and put them in pinot-quick-start/rawdata
.
mkdir -p /tmp/pinot-quick-start/rawdata
Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, and ORC. If you don't have sample data, you can use this sample CSV.
studentID,firstName,lastName,gender,subject,score,timestamp
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
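For example, you can save those sample rows into the rawdata directory like this (the file name transcript.csv matches the directory listing shown later in this guide):
# Save the sample data shown above as /tmp/pinot-quick-start/rawdata/transcript.csv
cat > /tmp/pinot-quick-start/rawdata/transcript.csv << 'EOF'
studentID,firstName,lastName,gender,subject,score,timestamp
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
EOF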
Schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.
Briefly, we categorize our columns into 3 types:
Column Type    Description
Dimensions     Typically used in filters and group by, for slicing and dicing into data
Metrics        Typically used in aggregations, represents the quantitative data
Time           Optional column, represents the timestamp associated with each row
For example, in our sample table, the playerID, yearID, teamID, league, playerName
columns are the dimensions, the playerStint, numberOfgames, numberOfGamesAsBatter, AtBatting, runs, hits, doules, triples, homeRuns, runsBattedIn, stolenBases, caughtStealing, baseOnBalls, strikeouts, intentionalWalks, hitsByPitch, sacrificeHits, sacrificeFlies, groundedIntoDoublePlays, G_old
columns are the metrics and there is no time column.
Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the reference below.
{
"schemaName": "transcript",
"dimensionFieldSpecs": [
{
"name": "studentID",
"dataType": "INT"
},
{
"name": "firstName",
"dataType": "STRING"
},
{
"name": "lastName",
"dataType": "STRING"
},
{
"name": "gender",
"dataType": "STRING"
},
{
"name": "subject",
"dataType": "STRING"
}
],
"metricFieldSpecs": [
{
"name": "score",
"dataType": "FLOAT"
}
],
"dateTimeFieldSpecs": [{
"name": "timestamp",
"dataType": "LONG",
"format" : "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS"
}]
}
A table config is used to define the config related to the Pinot table. A detailed overview of the table can be found in Table.
Here's the table config for the sample CSV file. You can use this as a reference to build your own table config. Simply edit the tableName and schemaName.
{
"tableName": "transcript",
"segmentsConfig" : {
"timeColumnName": "timestamp",
"timeType": "MILLISECONDS",
"replication" : "1",
"schemaName" : "transcript"
},
"tableIndexConfig" : {
"invertedIndexColumns" : [],
"loadMode" : "MMAP"
},
"tenants" : {
"broker":"DefaultTenant",
"server":"DefaultTenant"
},
"tableType":"OFFLINE",
"metadata": {}
}
Check the directory structure so far
$ ls /tmp/pinot-quick-start
rawdata transcript-schema.json transcript-table-offline.json
$ ls /tmp/pinot-quick-start/rawdata
transcript.csv
Upload the schema and table config using the following command
docker run --rm -ti \
--network=pinot-demo \
-v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
--name pinot-batch-table-creation \
apachepinot/pinot:latest AddTable \
-schemaFile /tmp/pinot-quick-start/transcript-schema.json \
-tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
-controllerHost pinot-quickstart \
-controllerPort 9000 -exec
bin/pinot-admin.sh AddTable \
-tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
-schemaFile /tmp/pinot-quick-start/transcript-schema.json -exec
Check out the table config and schema in the Rest API to make sure it was successfully uploaded.
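For instance, a quick sketch using curl against the controller (assuming localhost:9000; the same endpoints are exposed in the Swagger UI at /help):
# Assumption: controller at localhost:9000
curl "http://localhost:9000/schemas/transcript"   # fetch the uploaded schema
curl "http://localhost:9000/tables/transcript"    # fetch the uploaded table config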
A Pinot table's data is stored as Pinot segments. A detailed overview of the segment can be found in Segment.
To generate a segment, we first need to create a job spec YAML file. The job spec YAML file has all the information regarding data format, input data location, and Pinot cluster coordinates. You can just copy over this job spec file. If you're using your own data, be sure to 1) replace transcript
with your table name 2) set the right recordReaderSpec
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://localhost:9000/tables/transcript/schema'
  tableConfigURI: 'http://localhost:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://pinot-quickstart:9000/tables/transcript/schema'
  tableConfigURI: 'http://pinot-quickstart:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-quickstart:9000'
Use the following command to generate a segment and upload it
docker run --rm -ti \
--network=pinot-demo \
-v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
--name pinot-data-ingestion-job \
apachepinot/pinot:latest LaunchDataIngestionJob \
-jobSpecFile /tmp/pinot-quick-start/docker-job-spec.yml
bin/pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml
Sample output
SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**\/*.csv
inputDirURI: /tmp/pinot-quick-start/rawdata/
jobType: SegmentCreationAndTarPush
outputDirURI: /tmp/pinot-quick-start/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader,
configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig,
configs: null, dataFormat: csv}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/transcript/schema', tableConfigURI: 'http://localhost:9000/tables/transcript',
tableName: transcript}
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed bytes value dictionary for column: studentID, size: 9
Created dictionary for STRING column: studentID with cardinality: 3, max length in bytes: 3, range: 200 to 202
Using fixed bytes value dictionary for column: firstName, size: 12
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Using fixed bytes value dictionary for column: lastName, size: 15
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Using fixed bytes value dictionary for column: gender, size: 12
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Using fixed bytes value dictionary for column: subject, size: 21
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Created dictionary for LONG column: timestamp with cardinality: 4, range: 1570863600000 to 1572418800000
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to v3 format
v3 segment location for segment: transcript_OFFLINE_1570863600000_1572418800000_0 is /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3
Deleting files in v1 segment directory: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]
Generated 3 star-tree records from 4 segment records
Finished constructing star-tree, got 9 tree nodes and 4 records under star-node
Finished creating aggregated documents, got 6 aggregated records
Finished building star-tree in 10ms
Finished building 1 star-trees in 27ms
Computed crc = 3454627653, based on files [/var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/columns.psf, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/index_map, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/metadata.properties, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index_map]
Driver, record read time : 0
Driver, stats collector time : 0
Driver, indexing time : 0
Tarring segment from: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz
Size for segment: transcript_OFFLINE_1570863600000_1572418800000_0, uncompressed: 6.73KB, compressed: 1.89KB
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/tmp/pinot-quick-start/segments/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz]... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@243c4f91] for table transcript
Pushing segment: transcript_OFFLINE_1570863600000_1572418800000_0 to location: http://localhost:9000 for table transcript
Sending request: http://localhost:9000/v2/segments?tableName=transcript to controller: nehas-mbp.hsd1.ca.comcast.net, version: Unknown
Response for pushing table transcript segment transcript_OFFLINE_1570863600000_1572418800000_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: transcript_OFFLINE_1570863600000_1572418800000_0 of table: transcript"}
Check that your segment made it to the table using the Rest API
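For example, a hedged sketch against the controller (assuming localhost:9000; the exact path is listed under the Segment section of the Swagger UI):
# Assumption: controller at localhost:9000
# List the segments of the transcript table; the segment pushed above should appear
curl "http://localhost:9000/segments/transcript"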
You're all set! You should see your table in the Query Console and be able to run queries against it now.
The Docker instructions on this page are still WIP
So far, we've set up our cluster, run some queries on the demo tables, and explored the admin endpoints. We also uploaded some sample batch data for the transcript table.
Now, it's time to ingest from a sample stream into Pinot.
First, we need to set up a stream. Pinot has out-of-the-box realtime ingestion support for Kafka. Other streams can be plugged in; more details in Pluggable Streams.
Let's set up a demo Kafka cluster locally and create a sample topic transcript-topic
Start Kafka
docker run \
--network pinot-demo --name=kafka \
-e KAFKA_ZOOKEEPER_CONNECT=pinot-quickstart:2123/kafka \
-e KAFKA_BROKER_ID=0 \
-e KAFKA_ADVERTISED_HOST_NAME=kafka \
-d wurstmeister/kafka:latest
Create a Kafka Topic
docker exec \
-t kafka \
/opt/kafka/bin/kafka-topics.sh \
--zookeeper pinot-quickstart:2123/kafka \
--partitions=1 --replication-factor=1 \
--create --topic transcript-topic
Start Kafka
Start Kafka cluster on port 9876
using the same Zookeeper from the quick-start examples
bin/pinot-admin.sh StartKafka -zkAddress=localhost:2123/kafka -port 9876
Create a Kafka topic
Download the latest Kafka. Create a topic
bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 1 --topic transcript-topic
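If you want to confirm the topic was created, a quick check using the same Kafka download and broker port:
# List topics on the local Kafka broker started above; transcript-topic should appear
bin/kafka-topics.sh --list --bootstrap-server localhost:9876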
If you followed the Batch upload sample data, you have already pushed a schema for your sample table. If not, head over to Creating a schema on that page, to learn how to create a schema for your sample data.
If you followed Batch upload sample data, you learned how to push an offline table and schema. Similar to the offline table config, we will create a realtime table config for the sample. Here's the realtime table config for the transcript table. For a more detailed overview about tables, check out Table.
{
"tableName": "transcript",
"tableType": "REALTIME",
"segmentsConfig": {
"timeColumnName": "timestamp",
"timeType": "MILLISECONDS",
"schemaName": "transcript",
"replicasPerPartition": "1"
},
"tenants": {},
"tableIndexConfig": {
"loadMode": "MMAP",
"streamConfigs": {
"streamType": "kafka",
"stream.kafka.consumer.type": "lowlevel",
"stream.kafka.topic.name": "transcript-topic",
"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
"stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
"stream.kafka.broker.list": "localhost:9876",
"realtime.segment.flush.threshold.time": "3600000",
"realtime.segment.flush.threshold.size": "50000",
"stream.kafka.consumer.prop.auto.offset.reset": "smallest"
}
},
"metadata": {
"customConfigs": {}
}
}
Now that we have our table and schema, let's upload them to the cluster. As soon as the realtime table is created, it will begin ingesting from the Kafka topic.
docker run \
--network=pinot-demo \
-v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
--name pinot-streaming-table-creation \
apachepinot/pinot:latest AddTable \
-schemaFile /tmp/pinot-quick-start/transcript-schema.json \
-tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
-controllerHost pinot-quickstart \
-controllerPort 9000 \
-exec
bin/pinot-admin.sh AddTable \
-schemaFile /tmp/pinot-quick-start/transcript-schema.json \
-tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
-exec
Here's a JSON file for transcript table data:
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
{"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}
Push the sample JSON into the Kafka topic, using the Kafka script from the Kafka download:
bin/kafka-console-producer.sh \
--broker-list localhost:9876 \
--topic transcript-topic < /tmp/pinot-quick-start/rawData/transcript.json
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the realtime data.
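If you'd rather check from the command line, here's a rough sketch against the broker's SQL endpoint (assuming the broker is on its default port 8099):
# Assumption: broker at localhost:8099
# Count the rows ingested into the realtime transcript table so far
curl -H "Content-Type: application/json" -X POST \
  -d '{"sql":"SELECT COUNT(*) FROM transcript"}' \
  http://localhost:8099/query/sql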