Public cloud examples

This page contains multiple quick start guides for deploying Pinot to a public cloud provider.

The following quick start guides will show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.

Running on Azure
Running on GCP
Running on AWS

General

FAQ for general questions around Pinot

How does Pinot use deep storage?

When data is pushed into Pinot, Pinot makes a backup copy of the data and stores it on the configured deep storage (S3/GCP/ADLS/NFS/etc.). This copy is stored as tar.gz Pinot segments. Note that Pinot servers keep an untarred copy of the segments on their local disk as well, for performance reasons.
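As a minimal sketch (assuming S3; the property names mirror the HDFS setup shown later on this page, and the bucket and region values are placeholders), deep storage is configured on the controller roughly like this:

# Sketch of controller.conf entries for S3 as deep storage
controller.data.dir=s3://my-pinot-bucket/pinot-data/controller
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher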

How does Pinot use Zookeeper?

Pinot uses Apache Helix for cluster management, which in turn is built on top of Zookeeper. Helix uses Zookeeper to store the cluster state, including Ideal State, External View, Participants, etc. Besides that, Pinot uses Zookeeper to store other information such as Table configs, schema, Segment Metadata, etc.
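For illustration, you can browse this state with the Zookeeper CLI; a sketch, assuming a cluster named PinotCluster and a local Zookeeper (the paths follow Helix's standard layout):

bin/zkCli.sh -server localhost:2181
# Helix-managed cluster state
ls /PinotCluster/IDEALSTATES
ls /PinotCluster/EXTERNALVIEW
ls /PinotCluster/CONFIGS/PARTICIPANT
# Table configs, schemas, and segment metadata
ls /PinotCluster/PROPERTYSTORE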

Why am I getting "Could not find or load class" error when running Quickstart using the 0.8.0 release?

Please check the JDK version you are using. The 0.8.0 release binary is built with JDK 11. You may be getting this error if you are using JDK 8. In that case, please consider using JDK 11, or download the source code for the release and build it locally.

Getting Started

This section contains quick start guides to help you get up and running with Pinot.

Running Pinot

We want your experience getting started with Pinot to be both low effort and high reward. Here you'll find a collection of quick start guides that contain starter distributions of the Pinot platform.

Pinot On Kubernetes FAQ

How to increase server disk size on AWS

Below is an example for AWS EKS.

Frequently Asked Questions (FAQs)

This page has a collection of frequently asked questions with answers from the community.

This is a list of frequent questions most often asked in our troubleshooting channel on Slack. Please feel free to contribute your questions and answers here and make a pull request.

Ingestion FAQ
Query FAQ
Operations FAQ
Bootstrapping a cluster

Deploy to a public cloud

How to setup a Pinot cluster

This video will show you a step-by-step walkthrough for launching the individual components of Pinot and scaling them to multiple instances, presented by Neha Pawar from the Apache Pinot team. This is an excellent resource for developers and operators who want to understand setting up each component and debugging a cluster.

You can find the commands shown in this video on GitHub: https://github.com/npawar/pinot-tutorial

We also have a step-by-step guide for manually setting up a Pinot cluster using Docker or shell scripts.

Data import examples

Getting data into Pinot is easy. Take a look at these two quick start guides which will help you get up and running with sample data for offline and real-time tables.

Running Pinot locally
Running Pinot in Docker
Running Pinot in Kubernetes
Running on Azure
Running on GCP
Running on AWS
Manual cluster setup
Batch import example
Stream ingestion example
1. Update Storage Class

In the K8s cluster, check the storage class: in AWS, it should be gp2.
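For example (a sketch, assuming a default EKS setup):

# List storage classes; on AWS the default is typically gp2
kubectl get storageclass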

Then update the StorageClass to ensure volume expansion is allowed (allowVolumeExpansion: true).
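One way to do this is with kubectl patch (a sketch; gp2 is the storage class name from the previous step):

# Allow PVCs backed by this StorageClass to be resized
kubectl patch storageclass gp2 -p '{"allowVolumeExpansion": true}'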

Once the StorageClass is updated, describing it should show allowVolumeExpansion: true.

2. Update PVC

Once the storage class is updated, we can update the PVC for the server disk size.

Now we want to double the disk size for pinot-server-3.

Below is an example of the current disks:
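A sketch of how to list them (the pinot namespace matches the commands used elsewhere in this guide):

kubectl get pvc -n pinot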

Below is the output for the PVC data-pinot-server-3:

Now, let's change the PVC size to 2T by editing the server PVC.
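Edit the PVC as follows (the field path is standard Kubernetes):

kubectl edit pvc data-pinot-server-3 -n pinot
# In the editor, change spec.resources.requests.storage (e.g. from 1Ti to 2Ti)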

Once updated, the spec's PVC size is updated to 2T, but the status's PVC size is still 1T.

3. Restart the pod to reflect the change

Restart pinot-server-3 pod:
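For example (a sketch; Pinot servers typically run as a StatefulSet, so the deleted pod is recreated automatically):

kubectl delete pod pinot-server-3 -n pinot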

Recheck PVC size:
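For example (the status should now also show the new size):

kubectl get pvc data-pinot-server-3 -n pinot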


Running on Azure

This starter guide provides a quick start for running Pinot on Microsoft Azure.

This document provides basic instructions to set up a Kubernetes cluster on Azure Kubernetes Service (AKS).

1. Tooling Installation

1.1 Install Kubectl

Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.

For Mac User

Please check kubectl version after installation.

QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12

1.2 Install Helm

Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.

For Mac User

Please check helm version after installation.

This QuickStart provides Helm support for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.

1.3 Install Azure CLI

Please follow this link (https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) to install Azure CLI.

For Mac User

2. (Optional) Login to your Azure account

The script below will open the default browser to sign in to your Azure account.

3. (Optional) Create a Resource Group

The script below will create a resource group in the eastus location.

4. (Optional) Create a Kubernetes cluster (AKS) in Azure

The script below will create a three-node cluster named pinot-quickstart for demo purposes.

Please modify the parameters in the example command below:

Once the command succeeds, the cluster is ready to be used.

5. Connect to an existing cluster

Run the command below to get credentials for the cluster pinot-quickstart that you just created, or for your existing cluster.

To verify the connection, you can run:

6. Pinot Quickstart

Please follow the Kubernetes QuickStart to deploy your Pinot demo.

7. Delete a Kubernetes Cluster

Running on GCP

This starter guide provides a quick start for running Pinot on Google Cloud Platform (GCP).

This document provides basic instructions to set up a Kubernetes cluster on Google Kubernetes Engine (GKE).

1. Tooling Installation

1.1 Install Kubectl

Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.

For Mac User

Please check kubectl version after installation.

QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12

1.2 Install Helm

Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.

For Mac User

Please check helm version after installation.

This QuickStart provides Helm support for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.

1.3 Install Google Cloud SDK

Please follow this link (https://cloud.google.com/sdk/install) to install Google Cloud SDK.

1.3.1 For Mac User

  • Install Google Cloud SDK

  • Restart your shell

2. (Optional) Initialize Google Cloud Environment

3. (Optional) Create a Kubernetes cluster(GKE) in Google Cloud

The script below will create a three-node cluster named pinot-quickstart in us-west1-b with n1-standard-2 machines for demo purposes.

Please modify the parameters in the example command below:

You can monitor the cluster status with this command:

Once the cluster is in RUNNING status, it's ready to be used.

4. Connect to an existing cluster

Run the command below to get credentials for the cluster pinot-quickstart that you just created, or for your existing cluster.

To verify the connection, you can run:

5. Pinot Quickstart

Please follow the Kubernetes QuickStart to deploy your Pinot demo.

6. Delete a Kubernetes Cluster

Hdfs as Deep Storage

This guide will help you set up HDFS as deep storage for Pinot segments.

To use HDFS as deep storage you need to include HDFS dependency jars and plugins.

Server Setup

Configuration.

Executable.

Controller Setup

Configuration.

Executable.

Broker Setup

Configuration.

Executable.

Running on AWS

This guide provides a quick start for running Pinot on Amazon Web Services (AWS).

This document provides basic instructions to set up a Kubernetes cluster on Amazon Elastic Kubernetes Service (Amazon EKS).
1. Tooling Installation

Running Pinot locally

This quick start guide will help you bootstrap a Pinot standalone instance on your local machine.

In this guide you'll learn how to download and install Apache Pinot as a standalone instance.

This is a quick start guide that will show you how to quickly start an example recipe in a standalone instance and is meant for learning. To run Pinot in cluster mode, please take a look at Manual cluster setup.

Troubleshooting Pinot

Is there any debug information available in Pinot?

Pinot offers various ways to assist with troubleshooting and debugging problems that might happen. It is recommended to start off with the debug API, which may quickly surface some of the commonly occurring problems. The debug API provides information such as tableSize, ingestion status, and any error messages related to state transitions on servers, among other things.

The table debug api can be invoked via the Swagger UI as follows:

1.1 Install Kubectl

Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.

For Mac User

Please check kubectl version after installation.

QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12

1.2 Install Helm

Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.

For Mac User

Please check helm version after installation.

This QuickStart provides Helm support for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.

1.3 Install AWS CLI

Please follow this link (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html#install-tool-bundled) to install AWS CLI.

For Mac User

1.4 Install Eksctl

Please follow this link (https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html#installing-eksctl) to install eksctl.

For Mac User

2. (Optional) Login to your AWS account.

If you are a first-time AWS user, please register your account at https://aws.amazon.com/.

Once you have created the account, you can go to AWS Identity and Access Management (IAM) to create a user and create access keys under the Security Credentials tab.

The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY will override the AWS configuration stored in the ~/.aws/credentials file.

3. (Optional) Create a Kubernetes cluster (EKS) in AWS

The script below will create a 1 node cluster named pinot-quickstart in us-west-2 with a t3.xlarge machine for demo purposes:

You can monitor the cluster status via this command:

Once the cluster is in ACTIVE status, it's ready to be used.

4. Connect to an existing cluster

Run the command below to get credentials for the cluster pinot-quickstart that you just created, or for your existing cluster.

To verify the connection, you can run:

5. Pinot Quickstart

Please follow this Kubernetes QuickStart to deploy your Pinot Demo.

6. Delete a Kubernetes Cluster

brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
brew update && brew install azure-cli
az login
AKS_RESOURCE_GROUP=pinot-demo
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} \
                --location ${AKS_RESOURCE_GROUP_LOCATION}
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks create --resource-group ${AKS_RESOURCE_GROUP} \
              --name ${AKS_CLUSTER_NAME} \
              --node-count 3
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
                       --name ${AKS_CLUSTER_NAME}
kubectl get nodes
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
              --name ${AKS_CLUSTER_NAME}
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
GCLOUD_PROJECT=[your gcloud project name]
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
GCLOUD_MACHINE_TYPE=n1-standard-2
GCLOUD_NUM_NODES=3
gcloud container clusters create ${GCLOUD_CLUSTER} \
  --num-nodes=${GCLOUD_NUM_NODES} \
  --machine-type=${GCLOUD_MACHINE_TYPE} \
  --zone=${GCLOUD_ZONE} \
  --project=${GCLOUD_PROJECT}
gcloud compute instances list
GCLOUD_PROJECT=[your gcloud project name]
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
gcloud container clusters get-credentials ${GCLOUD_CLUSTER} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT}
kubectl get nodes
GCLOUD_ZONE=us-west1-b
gcloud container clusters delete pinot-quickstart --zone=${GCLOUD_ZONE}
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
pinot.set.instance.id.to.hostname=true
pinot.server.instance.dataDir=/path/in/local/filesystem/for/pinot/data/server/index
pinot.server.instance.segmentTarDir=/path/in/local/filesystem/for/pinot/data/server/segment
pinot.server.grpc.enable=true
pinot.server.grpc.port=8090
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.8.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181


export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-server.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-server.sh  -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName ${SERVER_CONF_DIR}/server.conf
controller.data.dir=hdfs://path/in/hdfs/for/controller/segment
controller.local.temp.dir=/tmp/pinot/
controller.zk.str=<ZOOKEEPER_HOST:ZOOKEEPER_PORT>
controller.enable.split.commit=true
controller.access.protocols.http.port=9000
controller.helix.cluster.name=PinotCluster
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
controller.vip.port=9000
controller.port=9000
pinot.set.instance.id.to.hostname=true
pinot.server.grpc.enable=true
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.8.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181


export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms8G -Xmx12G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-controller.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-controller.sh -configFileName ${SERVER_CONF_DIR}/controller.conf
pinot.set.instance.id.to.hostname=true
pinot.server.grpc.enable=true
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.8.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181


export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-broker.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-broker.sh -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName  ${SERVER_CONF_DIR}/broker.conf
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
curl "https://d1vvhvl2y92vvt.cloudfront.net/awscli-exe-macos.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
aws configure
EKS_CLUSTER_NAME=pinot-quickstart
eksctl create cluster \
--name ${EKS_CLUSTER_NAME} \
--version 1.16 \
--region us-west-2 \
--nodegroup-name standard-workers \
--node-type t3.xlarge \
--nodes 1 \
--nodes-min 1 \
--nodes-max 1 \
--node-ami auto
EKS_CLUSTER_NAME=pinot-quickstart
aws eks describe-cluster --name ${EKS_CLUSTER_NAME}
EKS_CLUSTER_NAME=pinot-quickstart
aws eks update-kubeconfig --name ${EKS_CLUSTER_NAME}
kubectl get nodes
EKS_CLUSTER_NAME=pinot-quickstart
aws eks delete-cluster --name ${EKS_CLUSTER_NAME}
Download Apache Pinot

First, let's download the Pinot distribution for this tutorial. You can either build the distribution from source or download a packaged release.

Prerequisites

Install JDK 11 or higher (JDK 16 is not yet supported). For JDK 8 support, use Pinot 0.7.1 or compile from source.

Build from source or download the distribution

Follow these steps to check out the code from GitHub and build Pinot locally.

Prerequisites

Install Apache Maven 3.6 or higher.

# checkout pinot
git clone https://github.com/apache/pinot.git
cd pinot

# build pinot
mvn install package -DskipTests -Pbin-dist

# navigate to directory containing the setup scripts
cd pinot-distribution/target/apache-pinot-$PINOT_VERSION-bin/apache-pinot-$PINOT_VERSION-bin
Add maven option -Djdk.version=8 when building with JDK 8

Note that the Pinot scripts are located under pinot-distribution/target, not the target directory under the root.

Download the latest binary release from Apache Pinot, or use this command:

Once you have the tar file:

Setting up a Pinot cluster

We'll be using the quick-start scripts provided with the Pinot distribution, which do the following:

  1. Set up the Pinot cluster QuickStartCluster

  2. Create a sample table and load sample data

The following quick start scripts are available. Please note, though, that these scripts launch the Pinot cluster with minimal resources. If you intend to play with sizable data (more than a few MB), you may want to follow the Manual cluster setup and provide the required resources.

Batch

Batch quick start creates the Pinot cluster, creates an offline table baseballStats, and pushes sample offline data to the table.

That's it! We've spun up a Pinot cluster. You can continue playing with other types of quick start, or simply head on to Pinot Data Explorer to check out the data in the baseballStats table.

Streaming

Streaming quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a realtime table meetupRSVP which ingests data from the Kafka topic.

We now have a Pinot cluster with a realtime table! You can head over to Pinot Data Explorer to check out the data in the meetupRSVP table.

Hybrid

Hybrid quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a hybrid table airlineStats. The realtime table ingests data from the Kafka topic. Lastly, sample data is pushed into the offline table.

Let's head over to Pinot Data Explorer to check out the data we pushed to the airlineStats table.

It can also be invoked directly by accessing the URL as follows. The API requires the tableName, and can optionally take tableType (offline|realtime) and a verbosity level.

Pinot also provides a wide variety of operational metrics that can be used for creating dashboards, alerting and monitoring. Also, all Pinot components log debug information related to error conditions that can be used for troubleshooting.

How do I debug a slow query or a query which keeps timing out?

Please use these steps:

  1. If the query executes, look at the query result. Specifically look at numEntriesScannedInFilter and numDocsScanned.

    1. If numEntriesScannedInFilter is very high, consider adding indexes for the corresponding columns being used in the filter predicates. You should also think about partitioning the incoming data based on the dimension most heavily used in your filter queries.

    2. If numDocsScanned is very high, that means the selectivity for the query is low and lots of documents need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.

  2. If the query is not executing, you can extend the query timeout by appending a timeoutMs parameter to the query (eg: select * from mytable limit 10 option(timeoutMs=60000)). Then you can repeat step 1.

  3. You can also look at GC stats for the corresponding Pinot servers. If a particular server seems to be running full GC all the time, you can do a couple of things such as

    1. Increase JVM heap (Xmx)

    2. Consider using off-heap memory for segments

    3. Decrease the total number of segments per server (by partitioning the data in a better way)

Swagger - Table Debug Api

Running Pinot in Docker

This quick start guide will show you how to run a Pinot cluster using Docker.

This is a quickstart guide that will show you how to quickly start an example recipe in a standalone instance and is meant for learning. To run Pinot in cluster mode, please take a look at Manual cluster setup.

Prerequisites

bin/quick-start-batch.sh
# stop previous quick start cluster, if any
bin/quick-start-streaming.sh
# stop previous quick start cluster, if any
bin/quick-start-hybrid.sh
curl -X GET "http://localhost:9000/debug/tables/airlineStats?verbosity=0" -H "accept: application/json"
PINOT_VERSION=0.8.0 #set to the Pinot version you decide to use

wget https://downloads.apache.org/pinot/apache-pinot-$PINOT_VERSION/apache-pinot-$PINOT_VERSION-bin.tar.gz
# untar it
tar -zxvf apache-pinot-$PINOT_VERSION-bin.tar.gz

# navigate to directory containing the launcher scripts
cd apache-pinot-$PINOT_VERSION-bin


Install Docker

You can also try the Kubernetes quick start if you already have a local minikube cluster installed or a Docker Kubernetes setup.

If running locally, please ensure your Docker cluster has enough resources; below is a sample config.

We'll be using our docker image apachepinot/pinot:latest to run this quick start, which does the following:

  • Sets up the Pinot cluster

  • Creates a sample table and loads sample data

The following quick-start scripts are available

  • Batch example

  • Streaming example

  • Hybrid example

Before running the scripts, create an isolated bridge network pinot-demo in Docker. This will allow all Docker containers to easily communicate with each other. You can create the network using the following command:

Batch example

In this example we demonstrate how to do batch processing with Pinot.

  • Starts Pinot deployment by starting

    • Apache Zookeeper

    • Pinot Controller

    • Pinot Broker

    • Pinot Server

  • Creates a demo table

    • baseballStats

  • Launches a standalone data ingestion job

    • Builds one Pinot segment for a given CSV data file for table baseballStats

    • Pushes the built segment to the Pinot controller

  • Issues sample queries to Pinot

Once the Docker container is running, you can view the logs by running the following command.

That's it! We've spun up a Pinot cluster.

It may take a while for all the Pinot components to start and for the sample data to be loaded.

Use the below command to check the status in the container logs.

Your cluster is ready once you see the cluster setup completion messages and sample queries, as demonstrated below.

Cluster Setup Completion Example

You can head over to Exploring Pinot to check out the data in the baseballStats table.

Streaming example

In this example we demonstrate how to do stream processing with Pinot.

  • Starts Pinot deployment by starting

    • Apache Kafka

    • Apache Zookeeper

    • Pinot Controller

    • Pinot Broker

    • Pinot Server

  • Creates a demo table

    • meetupRsvp

  • Launches a meetup stream

  • Publishes data to a Kafka topic meetupRSVPEvents to be subscribed to by Pinot

  • Issues sample queries to Pinot

Once the cluster is up, you can head over to Exploring Pinot to check out the data in the meetupRSVPEvents table.

Hybrid example

In this example we demonstrate how to do hybrid stream and batch processing with Pinot.

  1. Starts Pinot deployment by starting

    • Apache Kafka

    • Apache Zookeeper

    • Pinot Controller

    • Pinot Broker

    • Pinot Server

  2. Creates a demo table

    • airlineStats

  3. Launches a standalone data ingestion job

    • Builds Pinot segments under a given directory of Avro files for table airlineStats

    • Pushes built segments to Pinot controller

  4. Launches a stream of flights stats

  5. Publishes data to a Kafka topic airlineStatsEvents to be subscribed to by Pinot

  6. Issues sample queries to Pinot

Once the cluster is up, you can head over to Exploring Pinot to check out the data in the airlineStats table.


Stream ingestion example

The Docker instructions on this page are still WIP

So far, we have set up our cluster, run some queries on the demo tables, and explored the admin endpoints. We also uploaded some sample batch data for the transcript table.

Now it's time to ingest from a sample stream into Pinot. The rest of the instructions assume you're using Pinot in Docker (inside a pinot-quickstart container).

Data Stream

First, we need to set up a stream. Pinot has out-of-the-box realtime ingestion support for Kafka. Other streams can be plugged in; more details in Pluggable Streams.

Let's set up a demo Kafka cluster locally, and create a sample topic transcript-topic.

Start Kafka

Start the Kafka cluster on port 9876, using the same Zookeeper from the quick-start examples.

Create a Kafka topic

Creating a Schema

If you followed the Batch upload sample data guide, you have already pushed a schema for your sample table. If not, head over to Creating a schema on that page to learn how to create a schema for your sample data.

Creating a table config

If you followed Batch upload sample data, you learned how to push an offline table and schema. Similar to the offline table config, we will create a realtime table config for the sample. Here's the realtime table config for the transcript table. For a more detailed overview about tables, check out Table.

Uploading your schema and table config

Now that we have our table and schema, let's upload them to the cluster. As soon as the realtime table is created, it will begin ingesting from the Kafka topic.

Loading sample data into stream

Here's a JSON file for transcript table data:

Push the sample JSON into the Kafka topic, using the Kafka script from the Kafka download:

Ingesting streaming data

As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the realtime data.

docker logs pinot-quickstart -f
docker network create -d bridge pinot-demo
docker run \
    --network=pinot-demo \
    --name pinot-quickstart \
    -p 9000:9000 \
    -d apachepinot/pinot:latest QuickStart \
    -type batch
docker logs pinot-quickstart -f
# stop previous container, if any, or use different network
docker run \
    --network=pinot-demo \
    --name pinot-quickstart \
    -p 9000:9000 \
    -d apachepinot/pinot:latest QuickStart \
    -type stream
# stop previous container, if any, or use different network
docker run \
    --network=pinot-demo \
    --name pinot-quickstart \
    -p 9000:9000 \
    -d apachepinot/pinot:latest QuickStart \
    -type hybrid
Download the latest Kafka. Create a topic:
docker run \
    --network pinot-demo --name=kafka \
    -e KAFKA_ZOOKEEPER_CONNECT=pinot-quickstart:2123/kafka \
    -e KAFKA_BROKER_ID=0 \
    -e KAFKA_ADVERTISED_HOST_NAME=kafka \
    -d wurstmeister/kafka:latest
docker exec \
  -t kafka \
  /opt/kafka/bin/kafka-topics.sh \
  --zookeeper pinot-quickstart:2123/kafka \
  --partitions=1 --replication-factor=1 \
  --create --topic transcript-topic
bin/pinot-admin.sh  StartKafka -zkAddress=localhost:2123/kafka -port 9876
bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 1 --topic transcript-topic

Manual cluster setup

This quick start guide will show you how to set up a Pinot cluster manually.

Start Pinot components (scripts or Docker images)

A manual cluster setup consists of the following components:

  1. Zookeeper

  2. Controller

  3. Broker

  4. Server

  5. Kafka

We will run each of these components in separate containers.

Start Pinot Components using docker

Prerequisites

If running locally, please ensure your Docker cluster has enough resources; below is a sample config.

Pull docker image

You can try out the pre-built Pinot all-in-one docker image.

(Optional) You can also follow the instructions here to build your own images.

0. Create a Network

Create an isolated bridge network in Docker:

1. Start Zookeeper

Start Zookeeper in daemon mode. This is a single-node Zookeeper setup. Zookeeper is the central metadata store for Pinot and should be set up with replication for production use. For more information, see Running Replicated Zookeeper.

2. Start Pinot Controller

Start the Pinot Controller in daemon mode and connect to Zookeeper.

The command below expects a 4GB memory container. Please tune -Xms and -Xmx if your machine doesn't have enough resources.

3. Start Pinot Broker

Start the Pinot Broker in daemon mode and connect to Zookeeper.

The command below expects a 4GB memory container. Please tune -Xms and -Xmx if your machine doesn't have enough resources.

4. Start Pinot Server

Start the Pinot Server in daemon mode and connect to Zookeeper.

The command below expects a 16GB memory container. Please tune -Xms and -Xmx if your machine doesn't have enough resources.

5. Start Kafka

Optionally, you can also start Kafka for setting up realtime streams. This brings up the Kafka broker on port 9092.

Now all Pinot related components are started as an empty cluster.

You can run the below command to check container status.

Sample Console Output

Start Pinot Components using Docker Compose

Prerequisites

Follow the instructions in Getting Pinot to get Pinot.

Run docker-compose up to launch all the components.

You can run the below command to check container status.

Sample Console Output

Now it's time to start adding data to the cluster. Check out some of the Recipes, or follow the Batch upload sample data and Stream sample data guides for instructions on loading your own data.

/tmp/pinot-quick-start/transcript-table-realtime.json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.size": "0",
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.desired.size": "50M",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}
docker run \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-streaming-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
    -controllerHost pinot-quickstart \
    -controllerPort 9000 \
    -exec
bin/pinot-admin.sh AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
    -exec
/tmp/pinot-quick-start/rawData/transcript.json
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestampInEpoch":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestampInEpoch":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestampInEpoch":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestampInEpoch":1572418800000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestampInEpoch":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestampInEpoch":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestampInEpoch":1572678000000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestampInEpoch":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestampInEpoch":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestampInEpoch":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestampInEpoch":1572854400000}
{"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestampInEpoch":1572854400000}
bin/kafka-console-producer.sh \
    --broker-list localhost:9876 \
    --topic transcript-topic < /tmp/pinot-quick-start/rawData/transcript.json
If running locally, please ensure your Docker cluster has enough resources; below is a sample config.

Sample docker resources

Create a file called docker-compose.yml that contains the following:

Start Pinot components via launcher scripts

1. Start Zookeeper

You can use Zooinspector to browse the Zookeeper instance.

2. Start Pinot Controller

The examples below are for Java 8 users.

For Java 11+ users, please remove the GC settings inside JAVA_OPTS, so it looks like: export JAVA_OPTS="-Xms4G -Xmx8G"

3. Start Pinot Broker

4. Start Pinot Server

5. Start Kafka

Now all Pinot related components are started as an empty cluster.


Query FAQ

Querying

I get the following error when running a query, what does it mean?

This essentially implies that the Pinot broker assigned to the table specified in the query was not found. A common root cause for this is a typo in the table name in the query. Another, less common reason could be that there isn't actually a broker with the required broker tenant tag for the table.

What are all the fields in the Pinot query's JSON response?

Here's the page explaining the Pinot response format:

SQL Query fails with "Encountered 'timestamp' was expecting one of..."

"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.

Other commonly encountered reserved keywords are date, time, table.

Filtering on STRING column WHERE column = "foo" does not work?

For filtering on STRING columns, use single quotes:

ORDER BY using an alias doesn't work?

The fields in the ORDER BY clause must be one of the group by clauses or aggregations, BEFORE applying the alias. Therefore, this will not work:

Instead, this will work:

Does pagination work in GROUP BY queries?

No. Pagination only works for SELECTION queries.

How do I increase the timeout for a query?

You can add this at the end of your query: option(timeoutMs=X). For example, the following query will use a timeout of 20 seconds:

How do I optimize my Pinot table for doing aggregations and group-by on high cardinality columns?

In order to speed up aggregations, you can enable metrics aggregation on the required column by adding a metric field in the corresponding schema and setting aggregateMetrics to true in the table config. You can also use a star-tree index config for such columns (read more about star-tree here).

How do I verify that an index is created on a particular column?

There are 2 ways to verify this:

  1. Log in to a server that hosts segments of this table. Inside the data directory, locate the segment directory for this table. In this directory, there is a file named index_map which lists all the indexes and other data structures created for each segment. Verify that the requested index is present here.

  2. During query: use the column in the filter predicate and check the value of numEntriesScannedInFilter. If this value is 0, then indexing is working as expected (works for inverted index).
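As an illustration of the first approach (a sketch; the actual path depends on your dataDir config, table name, and segment version):

# On the server, inside the segment directory for the table
cd /path/to/server/dataDir/myTable_OFFLINE/mySegment/v3
cat index_map
# Look for entries for your column, e.g. "myColumn.inverted_index.startOffset = ..."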

Does Pinot use a default value for LIMIT in queries?

Yes, Pinot uses a default value of LIMIT 10 in queries. The reason behind this default value is to avoid unintentionally submitting expensive queries that end up fetching or processing a lot of data from Pinot. Users can always overwrite this by explicitly specifying a LIMIT value.

Does Pinot cache query results?

Pinot does not cache query results; each query is computed in its entirety. Note, though, that running the same or a similar query multiple times will naturally pull segment pages into memory, making subsequent calls faster. Also, for realtime systems, the data is changing in realtime, so results cannot be cached. For offline-only systems, a caching layer can be built on top of Pinot, with a built-in invalidation mechanism to invalidate the cache when data is pushed into Pinot.

How do I determine if StarTree index is being used for my query?

The query execution engine will prefer to use StarTree index for all queries where it can be used. The criteria to determine whether StarTree index can be used is as follows:

  • All aggregation function + column pairs in the query must exist in the StarTree index.

  • All dimensions that appear in filter predicates and group-by should be StarTree dimensions.

For queries where the above is true, the StarTree index is used. For other queries, the execution engine will default to the next best index available.

Ingestion FAQ

Data processing

What is a good segment size?

While Pinot can work with segments of various sizes, for optimal use of Pinot, you want to get your segments sized in the 100MB to 500MB (untarred/uncompressed) range. Please note that having too many (thousands or more) tiny segments for a single table just creates more overhead in terms of the metadata storage in Zookeeper as well as in the Pinot servers' heap. At the same time, having too few really large (GBs) segments reduces parallelism of query execution, as on the server side, the thread parallelism of query execution is at the segment level.

export PINOT_VERSION=0.9.0
export PINOT_IMAGE=apachepinot/pinot:${PINOT_VERSION}
docker pull ${PINOT_IMAGE}
docker network create -d bridge pinot-demo
docker run \
    --network=pinot-demo \
    --name  pinot-zookeeper \
    --restart always \
    -p 2181:2181 \
    -d zookeeper:3.5.6
docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-controller \
    -p 9000:9000 \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log" \
    -d ${PINOT_IMAGE} StartController \
    -zkAddress pinot-zookeeper:2181
docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-broker \
    -p 8099:8099 \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log" \
    -d ${PINOT_IMAGE} StartBroker \
    -zkAddress pinot-zookeeper:2181
docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-server \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log" \
    -d ${PINOT_IMAGE} StartServer \
    -zkAddress pinot-zookeeper:2181
docker run --rm -ti \
    --network pinot-demo --name=kafka \
    -e KAFKA_ZOOKEEPER_CONNECT=pinot-zookeeper:2181/kafka \
    -e KAFKA_BROKER_ID=0 \
    -e KAFKA_ADVERTISED_HOST_NAME=kafka \
    -d wurstmeister/kafka:latest
docker container ls -a
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                                                  NAMES
9ec20e4463fa        wurstmeister/kafka:latest   "start-kafka.sh"         43 minutes ago      Up 43 minutes                                                              kafka
0775f5d8d6bf        apachepinot/pinot:latest    "./bin/pinot-admin.s…"   44 minutes ago      Up 44 minutes       8096-8099/tcp, 9000/tcp                                pinot-server
64c6392b2e04        apachepinot/pinot:latest    "./bin/pinot-admin.s…"   44 minutes ago      Up 44 minutes       8096-8099/tcp, 9000/tcp                                pinot-broker
b6d0f2bd26a3        apachepinot/pinot:latest    "./bin/pinot-admin.s…"   45 minutes ago      Up 45 minutes       8096-8099/tcp, 0.0.0.0:9000->9000/tcp                  pinot-quickstart
570416fc530e        zookeeper:3.5.6             "/docker-entrypoint.…"   45 minutes ago      Up 45 minutes       2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp   pinot-zookeeper
docker container ls 
CONTAINER ID   IMAGE                     COMMAND                  CREATED              STATUS              PORTS                                                                     NAMES
ba5cb0868350   apachepinot/pinot:0.9.0   "./bin/pinot-admin.s…"   About a minute ago   Up About a minute   8096-8099/tcp, 9000/tcp                                                   manual-pinot-server
698f160852f9   apachepinot/pinot:0.9.0   "./bin/pinot-admin.s…"   About a minute ago   Up About a minute   8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp, :::8099->8099/tcp        manual-pinot-broker
b1ba8cf60d69   apachepinot/pinot:0.9.0   "./bin/pinot-admin.s…"   About a minute ago   Up About a minute   8096-8099/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp                  manual-pinot-controller
54e7e114cd53   zookeeper:3.5.6           "/docker-entrypoint.…"   About a minute ago   Up About a minute   2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 8080/tcp   manual-zookeeper
docker-compose.yml
version: '3.7'
services:
  zookeeper:
    image: zookeeper:3.5.6
    hostname: zookeeper
    container_name: manual-zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  pinot-controller:
    image: apachepinot/pinot:0.9.0
    command: "StartController -zkAddress manual-zookeeper:2181"
    container_name: "manual-pinot-controller"
    restart: unless-stopped
    ports:
      - "9000:9000"
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log"
    depends_on:
      - zookeeper
  pinot-broker:
    image: apachepinot/pinot:0.9.0
    command: "StartBroker -zkAddress manual-zookeeper:2181"
    restart: unless-stopped
    container_name: "manual-pinot-broker"
    ports:
      - "8099:8099"
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log"
    depends_on:
      - pinot-controller
  pinot-server:
    image: apachepinot/pinot:0.9.0
    command: "StartServer -zkAddress manual-zookeeper:2181"
    restart: unless-stopped
    container_name: "manual-pinot-server" 
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log"
    depends_on:
      - pinot-broker
cd apache-pinot-${PINOT_VERSION}-bin
bin/pinot-admin.sh StartZookeeper \
  -zkPort 2191
export JAVA_OPTS="-Xms4G -Xmx8G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log"
bin/pinot-admin.sh StartController \
    -zkAddress localhost:2191 \
    -controllerPort 9000
export JAVA_OPTS="-Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log"
bin/pinot-admin.sh StartBroker \
    -zkAddress localhost:2191
export JAVA_OPTS="-Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log"
bin/pinot-admin.sh StartServer \
    -zkAddress localhost:2191
bin/pinot-admin.sh  StartKafka \ 
  -zkAddress=localhost:2191/kafka \
  -port 19092
{'errorCode': 410, 'message': 'BrokerResourceMissingError'}
https://docs.pinot.apache.org/users/api/querying-pinot-using-standard-sql/response-format

Can multiple Pinot tables consume from the same Kafka topic?

Yes. Each table can be independently configured to consume from any given Kafka topic, regardless of whether there are other tables that are also consuming from the same Kafka topic.

How do I enable partitioning in Pinot, when using Kafka stream?

Set up the partitioner in the Kafka producer: https://docs.confluent.io/current/clients/producer.html

The partitioning logic in the stream should match the partitioning config in Pinot. Kafka uses murmur2, and the equivalent in Pinot is the Murmur function.

Set the partitioning config as below, using the same column used in Kafka:

and also set

More details about how the partitioner works in Pinot can be found here.

How do I store BYTES column in JSON data?

For JSON, you can use a hex encoded string to ingest BYTES.

How do I flatten my JSON Kafka stream?

We have the json_format(field) function, which can store a top-level JSON field as a STRING in Pinot.

Then you can use these JSON functions during query time to extract fields from the JSON string.
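For example (a sketch with hypothetical names: payloadStr is a STRING column produced by json_format at ingestion time):

SELECT jsonExtractScalar(payloadStr, '$.user.name', 'STRING') AS userName
FROM myTable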

NOTE: This works well if some of your fields are nested JSON, but most of your fields are top-level JSON keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as one column, which is not ideal.

Support for flattening during ingestion is on the roadmap: https://github.com/apache/pinot/issues/5264

How do I escape Unicode in my Job Spec YAML file?

To use explicit code points, you must double-quote (not single-quote) the string, and escape the code point via "\uHHHH", where HHHH is the four digit hex code for the character. See https://yaml.org/spec/spec.html#escaping/in%20double-quoted%20scalars/ for more details.

Is there a limit on the maximum length of a string column in Pinot?

By default, Pinot limits the length of a String column to 512 bytes. If you want to overwrite this value, you can set the maxLength attribute in the schema as follows:

When can new events become queryable when getting ingested into a real-time table?

Events are available to be read by queries as soon as they are ingested. This is because events are instantly indexed in-memory upon ingestion.

The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.

However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all its replicas are guaranteed to be consistent, with the commit protocol.

Indexing

How to set inverted indexes?

Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. Here's the documentation for tableIndexConfig: https://docs.pinot.apache.org/basics/components/table#tableindexconfig-1, along with a sample table that has inverted indexes set on some columns.

Applying inverted indexes to a table config will generate inverted indexes for all new segments. In order to apply inverted indexes to all existing segments, follow the steps in How to apply inverted index to existing setup?
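A minimal sketch of the relevant table config section (the column name is an example):

"tableIndexConfig": {
  "invertedIndexColumns": ["column_foo"]
}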

How to apply inverted index to existing setup?

  1. Add the columns you wish to index to the tableIndexConfig -> invertedIndexColumns list. This sample table config shows inverted indexes set: https://docs.pinot.apache.org/basics/components/table#offline-table-config. To update the table config, use the Pinot Swagger API: http://localhost:9000/help#!/Table/updateTableConfig

  2. Invoke the reload API: http://localhost:9000/help#!/Segment/reloadAllSegments
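For example, the reload can also be invoked directly (a sketch against a local controller; the table name is a placeholder):

curl -X POST "http://localhost:9000/segments/myTable/reload"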

Right now, there's no easy way to confirm that the reload succeeded. One way is to check the index_map file inside the segment metadata; you should see inverted index entries for the new columns. An API for this is coming soon: https://github.com/apache/pinot/issues/5390

How to create star-tree indexes?

Star-tree indexes are configured in the table config under tableIndexConfig -> starTreeIndexConfigs (list) and enableDefaultStarTree (boolean). Read more about how to configure star-tree indexes: https://docs.pinot.apache.org/basics/indexing/star-tree-index#index-generation

The new segments will have star-tree indexes generated after applying the star-tree index configs to the table config. Currently, Pinot does not support adding star-tree indexes to existing segments.

Handling time in Pinot

How does Pinot’s real-time ingestion handle out-of-order events?

Pinot does not require ordering of event timestamps. Out-of-order events are still consumed and indexed into the "currently consuming" segment. In a pathological case, if you have a 2-day-old event come in "now", it will still be stored in the segment that is open for consumption "now". There is no strict time-based partitioning for segments, but star-tree indexes and hybrid tables will handle this as appropriate.

See Components > Broker for more details about how hybrid tables handle this. Specifically, the time boundary is computed as max(OfflineTime) - 1 unit of granularity. Pinot does store the min-max time for each segment and uses it for pruning segments, so segments with multiple time intervals may not be perfectly pruned.

When generating star-tree indexes, the time column will be part of the star-tree, so the tree can still be efficiently queried for segments with multiple time intervals.

What is the purpose of a hybrid table not using max(OfflineTime) to determine the time-boundary, and instead using an offset?

This lets you have an old event come in without building complex offline pipelines that perfectly partition your events by event timestamps. With this offset, even if your offline data pipeline produces segments with a maximum timestamp, Pinot will not use the offline dataset for that last chunk of segments. The expectation is that if you process the next time range of data offline, your data pipeline will include any late events.

Why are segments not strictly time-partitioned?

It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time intervals for the performance of hybrid tables and time-filtered data.

When generating offline segments, the segments are generated such that each segment contains only one time interval and is well partitioned by the time column.

select "timestamp" from myTable
SELECT COUNT(*) from myTable WHERE column = 'foo'
SELECT count(colA) as aliasA, colA from tableA GROUP BY colA ORDER BY aliasA
SELECT count(colA) as sumA, colA from tableA GROUP BY colA ORDER BY count(colA)
SELECT COUNT(*) from myTable option(timeoutMs=20000)
"tableIndexConfig": {
      ..
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "column_foo": {
            "functionName": "Murmur",
            "numPartitions": 12 // same as number of kafka partitions
          }
        }
      }
"routing": {
      "segmentPrunerTypes": ["partition"]
    }
    {
      "dataType": "STRING",
      "maxLength": 1000,
      "name": "textDim1"
    },

Operations FAQ

Operations

How much heap should I allocate for my Pinot instances?

Typically, Pinot components try to use as much off-heap memory (MMAP/direct memory) as possible. For example, Pinot servers load segments as memory-mapped files in MMAP mode (recommended), or in direct memory in HEAP mode. Heap memory is used mostly for query execution and storing some metadata. We have seen production deployments with high throughput and low latency work well with just 16 GB of heap for Pinot servers and brokers. The Pinot controller may also cache some metadata (table configs etc.) in heap, so if there are just a few tables in the Pinot cluster, a few GB of heap should suffice.

hashtag
Does Pinot provide any backup/restore mechanism?

Pinot relies on deep storage for storing backup copies of segments (offline as well as realtime). It relies on Zookeeper to store metadata (table configs, schema, cluster state, etc.). It does not explicitly provide tools to back up or restore this data, but relies on the deep storage (ADLS/S3/GCP/etc.) and ZK to persist it.

hashtag
Can I change a column name in my table, without losing data?

Changing a column name or data type is considered a backward-incompatible change. While Pinot supports schema evolution for backward-compatible changes, it does not support backward-incompatible changes like changing the name or data type of a column.

hashtag
How to change number of replicas of a table?

You can change the number of replicas by updating the segmentsConfig section of the table config. Make sure you have at least as many servers as the replication.

For an OFFLINE table, update replication.

For a REALTIME table, update replicasPerPartition.

After changing the replication, run a table rebalance.

hashtag
How to run a rebalance on a table?

Refer to Rebalance.
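As a quick sketch, a rebalance can also be triggered through the controller REST API; the exact query parameters vary by release, so check your controller's Swagger UI:

curl -X POST "{host}/tables/{tableName}/rebalance?type=OFFLINE"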

hashtag
How to control number of segments generated?

The number of segments generated depends on the number of input files. If you provide only 1 input file, you will get 1 segment. If you break the input file up into multiple files, you will get as many segments as there are input files.
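For example, a minimal sketch of splitting one large CSV into several input files so that the ingestion job produces several segments (the file names are hypothetical; note that each part still needs a CSV header row, which split does not add for you):

split -l 500000 rawdata/large.csv rawdata/part_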

hashtag
What are the common reasons my segment is in a BAD state?

This typically happens when the server is unable to load the segment. Possible causes: out of memory, no disk space, inability to download the segment from the deep store, and similar errors. Check the server logs for more information.

hashtag
How to reset a segment when it runs into a BAD state?

Use the segment reset controller REST API to reset the segment:

hashtag
What's the difference between Reset, Refresh, and Reload for a segment?

RESET: this gets a segment in ERROR state back to ONLINE or CONSUMING state. Behind the scenes, the Pinot controller takes the segment to OFFLINE state, waits for the External View to stabilize, and then moves it back to ONLINE/CONSUMING state, thus effectively resetting segments or consumers in error states.

REFRESH: this replaces the segment with a new one, with the same name but often different data. Under the hood, the Pinot controller sets new segment metadata in Zookeeper, and notifies brokers and servers to check their local state of this segment and update accordingly. Servers also download the new segment to replace the old one when the checksums differ. There is no separate REST API for refreshing; it is done as part of the SegmentUpload API today.

RELOAD: this reloads the segment, often to generate a new index as updated in the table config. Under the hood, the Pinot server gets the new table config from Zookeeper and uses it to guide the segment reloading. In fact, the last step of REFRESH as explained above is to load the segment into memory to serve queries. There is a dedicated REST API for reloading. By default, it doesn't download the segment, but an option is provided to force the server to download the segment and cleanly replace the local copy.

In addition, RESET brings the segment OFFLINE temporarily, while REFRESH and RELOAD swap the segment on the server atomically without bringing down the segment or affecting ongoing queries.
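As a sketch, a reload can be triggered through the controller REST API, mirroring the reset example below; the forceDownload flag is an assumption that may not exist in older releases:

# Reload all segments of a table (e.g., after adding an index to the table config)
curl -X POST "{host}/segments/{tableNameWithType}/reload"
# Force servers to re-download segments from deep store, where supported:
curl -X POST "{host}/segments/{tableNameWithType}/reload?forceDownload=true"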

hashtag
How can I make brokers/servers join the cluster without the DefaultTenant tag?

Set this property in your controller.conf file:

Now your brokers and servers should join the cluster as broker_untagged and server_untagged. You can then directly use the POST /tenants API to create the desired tenants.

hashtag
Tuning and Optimizations

hashtag
Do replica groups work for real-time?

Yes, replica groups work for realtime tables. There are two parts to enabling replica groups:

  1. Replica groups segment assignment

  2. Replica group query routing

Replica group segment assignment

Replica group segment assignment is achieved in realtime if the number of servers is a multiple of the number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups. For example, consider we have 6 partitions, 2 replicas, and 4 servers:

      r1    r2
p1    S0    S1
p2    S2    S3
p3    S0    S1
p4    S2    S3
p5    S0    S1
p6    S2    S3

As you can see, the set (S0, S2) contains r1 of every partition, and (S1, S3) contains r2 of every partition. The query will only be routed to one of the sets, and not span every server. If you are adding/removing servers from an existing table setup, you have to run a rebalance for segment assignment changes to take effect.

Replica group query routing

Once replica group segment assignment is in effect, query routing can take advantage of it. For replica-group-based query routing, set the following in the routing section of the table config, and then restart the brokers:

{ 
    "tableName": "pinotTable", 
    "tableType": "OFFLINE", 
    "segmentsConfig": {
      "replication": "3", 
      ... 
    }
    ..
{ 
    "tableName": "pinotTable", 
    "tableType": "REALTIME", 
    "segmentsConfig": {
      "replicasPerPartition": "3", 
      ... 
    }
    ..
curl -X POST "{host}/segments/{tableNameWithType}/{segmentName}/reset"
cluster.tenant.isolation.enable=false
curl -X POST "http://localhost:9000/tenants" 
-H "accept: application/json" 
-H "Content-Type: application/json" 
-d "{\"tenantRole\":\"BROKER\",\"tenantName\":\"foo\",\"numberOfInstances\":1}"
{
    "tableName": "pinotTable", 
    "tableType": "REALTIME",
    "routing": {
        "instanceSelectorType": "replicaGroup"
    }
    ..
}

Batch import example

Step-by-step guide on pushing your own data into the Pinot cluster

So far, we have set up our cluster, run some queries, and explored the admin endpoints. Now it's time to get our own data into Pinot. The rest of the instructions assume you're using Pinot running in Docker (inside a pinot-quickstart container).

hashtag
Preparing your data

Let's gather our data files and put them in pinot-quick-start/rawdata.

Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, and ORC. If you don't have sample data, you can use this sample CSV.

hashtag
Creating a schema

A schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.

Briefly, we categorize our columns into 3 types:

Column Type    Description
Dimensions     Typically used in filters and group by, for slicing and dicing into data
Metrics        Typically used in aggregations, represents the quantitative data
Time           Optional column, represents the timestamp associated with each row

For example, in our sample table, the playerID, yearID, teamID, league, playerName columns are the dimensions; the playerStint, numberOfgames, numberOfGamesAsBatter, AtBatting, runs, hits, doubles, triples, homeRuns, runsBattedIn, stolenBases, caughtStealing, baseOnBalls, strikeouts, intentionalWalks, hitsByPitch, sacrificeHits, sacrificeFlies, groundedIntoDoublePlays, G_old columns are the metrics; and there is no time column.

Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the reference below.

hashtag
Creating a table config

A table config is used to define the configuration of the Pinot table. A detailed overview of the table config can be found in Table.

Here's the table config for the sample CSV file. You can use this as a reference to build your own table config. Simply edit the tableName and schemaName.

hashtag
Uploading your table config and schema

Check the directory structure so far

Upload the table config using the following command

Check out the table config and schema in the Rest APIarrow-up-right to make sure it was successfully uploaded.
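For example, assuming the controller is reachable on localhost:9000, you can fetch both back with:

curl http://localhost:9000/tables/transcript
curl http://localhost:9000/schemas/transcript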

hashtag
Creating a segment

A Pinot table's data is stored as Pinot segments. A detailed overview of the segment can be found in Segment.

To generate a segment, we first need to create a job spec YAML file. The job spec YAML file has all the information regarding the data format, the input data location, and the Pinot cluster coordinates. You can just copy over this job spec file. If you're using your own data, be sure to 1) replace transcript with your table name, and 2) set the right recordReaderSpec.

Use the following command to generate a segment and upload it

Sample output

Check that your segment made it to the table using the Rest APIarrow-up-right
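For example, assuming the controller is reachable on localhost:9000, you can list the table's segments with:

curl http://localhost:9000/segments/transcript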

hashtag
Querying your data

You're all set! You should see your table in the Query Consolearrow-up-right and be able to run queries against it now.
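If you prefer the command line, here is a minimal sketch of the same check via REST; the controller's /sql endpoint is assumed here, and depending on your version/setup you may need to query the broker's /query/sql endpoint (port 8099 in the quickstart) instead:

curl -H "Content-Type: application/json" -X POST \
    -d '{"sql":"SELECT COUNT(*) FROM transcript"}' \
    http://localhost:9000/sql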


Running Pinot in Kubernetes

Pinot quick start in Kubernetes

hashtag
1. Prerequisites

circle-info

This quickstart assumes that you already have a running Kubernetes cluster. Please follow the links below to set up a Kubernetes cluster.

docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
    -controllerHost pinot-quickstart \
    -controllerPort 9000 -exec
bin/pinot-admin.sh AddTable \
  -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
  -schemaFile /tmp/pinot-quick-start/transcript-schema.json -exec
/tmp/pinot-quick-start/docker-job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://pinot-quickstart:9000/tables/transcript/schema'
  tableConfigURI: 'http://pinot-quickstart:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-quickstart:9000'
/tmp/pinot-quick-start/batch-job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://localhost:9000/tables/transcript/schema'
  tableConfigURI: 'http://localhost:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec.yml
bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml
mkdir -p /tmp/pinot-quick-start/rawdata
/tmp/pinot-quick-start/rawdata/transcript.csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
/tmp/pinot-quick-start/transcript-schema.json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "format" : "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}
/tmp/pinot-quick-start/transcript-table-offline.json
{
  "tableName": "transcript",
  "segmentsConfig" : {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication" : "1",
    "schemaName" : "transcript"
  },
  "tableIndexConfig" : {
    "invertedIndexColumns" : [],
    "loadMode"  : "MMAP"
  },
  "tenants" : {
    "broker":"DefaultTenant",
    "server":"DefaultTenant"
  },
  "tableType":"OFFLINE",
  "metadata": {}
}
$ ls /tmp/pinot-quick-start
rawdata			transcript-schema.json	transcript-table-offline.json

$ ls /tmp/pinot-quick-start/rawdata 
transcript.csv
SegmentGenerationJobSpec: 
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**\/*.csv
inputDirURI: /tmp/pinot-quick-start/rawdata/
jobType: SegmentCreationAndTarPush
outputDirURI: /tmp/pinot-quick-start/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader,
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig,
  configs: null, dataFormat: csv}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/transcript/schema', tableConfigURI: 'http://localhost:9000/tables/transcript',
  tableName: transcript}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed bytes value dictionary for column: studentID, size: 9
Created dictionary for STRING column: studentID with cardinality: 3, max length in bytes: 3, range: 200 to 202
Using fixed bytes value dictionary for column: firstName, size: 12
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Using fixed bytes value dictionary for column: lastName, size: 15
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Using fixed bytes value dictionary for column: gender, size: 12
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Using fixed bytes value dictionary for column: subject, size: 21
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Created dictionary for LONG column: timestampInEpoch with cardinality: 4, range: 1570863600000 to 1572418800000
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to v3 format
v3 segment location for segment: transcript_OFFLINE_1570863600000_1572418800000_0 is /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3
Deleting files in v1 segment directory: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]
Generated 3 star-tree records from 4 segment records
Finished constructing star-tree, got 9 tree nodes and 4 records under star-node
Finished creating aggregated documents, got 6 aggregated records
Finished building star-tree in 10ms
Finished building 1 star-trees in 27ms
Computed crc = 3454627653, based on files [/var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/columns.psf, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/index_map, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/metadata.properties, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index_map]
Driver, record read time : 0
Driver, stats collector time : 0
Driver, indexing time : 0
Tarring segment from: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz
Size for segment: transcript_OFFLINE_1570863600000_1572418800000_0, uncompressed: 6.73KB, compressed: 1.89KB
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/tmp/pinot-quick-start/segments/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz]... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@243c4f91] for table transcript
Pushing segment: transcript_OFFLINE_1570863600000_1572418800000_0 to location: http://localhost:9000 for table transcript
Sending request: http://localhost:9000/v2/segments?tableName=transcript to controller: nehas-mbp.hsd1.ca.comcast.net, version: Unknown
Response for pushing table transcript segment transcript_OFFLINE_1570863600000_1572418800000_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: transcript_OFFLINE_1570863600000_1572418800000_0 of table: transcript"}
  • Enable Kubernetes on Docker-Desktoparrow-up-right

  • Install Minikube for local setuparrow-up-right (make sure to run with enough resources e.g. minikube start --vm=true --cpus=4 --memory=8g --disk-size=50g)

  • Set up a Kubernetes cluster using Amazon Elastic Kubernetes Service (Amazon EKS)

  • Set up a Kubernetes cluster using Google Kubernetes Engine (GKE)

  • Set up a Kubernetes cluster using Azure Kubernetes Service (AKS)

  • hashtag
    2. Setting up a Pinot cluster in Kubernetes

    Before continuing, please make sure that you've downloaded Apache Pinot. The scripts for the setup in this guide can be found in our open source project on GitHub.

    The scripts can be found in the Pinot source at ./pinot/kubernetes/helm

    # checkout pinot
    git clone https://github.com/apache/pinot.git
    cd pinot/kubernetes/helm

    hashtag
    2.1 Start Pinot with Helm

    The Pinot repo has pre-packaged Helm charts for Pinot and Presto. The Helm repo index file is herearrow-up-right.

    helm repo add pinot https://raw.githubusercontent.com/apache/pinot/master/kubernetes/helm
    kubectl create ns pinot-quickstart
    helm install pinot pinot/pinot \
        -n pinot-quickstart \
        --set cluster.name=pinot \
        --set server.replicaCount=2

    NOTE: Please specify a StorageClass based on your cloud vendor. For the Pinot server, please don't mount a blob store like AzureFile/GoogleCloudStorage/S3 as the data-serving file system.

    Only use Amazon EBS/GCP Persistent Disk/Azure Disk style disks; a sketch of setting this at install time follows the list below.

    • For AWS: "gp2"

    • For GCP: "pd-ssd" or "standard"

    • For Azure: "AzureDisk"

    • For Docker-Desktop: "hostpath"
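    For example, a minimal sketch of overriding the server storage class at install time; the server.persistence.storageClass value name is an assumption, so verify the exact key with helm inspect values pinot/pinot for your chart version:

    # Hypothetical override; confirm the exact value name first
    helm install pinot pinot/pinot \
        -n pinot-quickstart \
        --set server.persistence.storageClass=gp2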

    hashtag
    2.1.1 Update helm dependency

    hashtag
    2.1.2 Start Pinot with Helm

    hashtag
    2.2 Check Pinot deployment status

    hashtag
    3. Load data into Pinot using Kafka

    hashtag
    3.1 Bring up a Kafka cluster for real-time data ingestion

    hashtag
    3.2 Check Kafka deployment status

    Ensure the Kafka deployment is ready before executing the scripts in the next steps.

    hashtag
    3.3 Create Kafka topics

    The scripts below will create two Kafka topics for data ingestion:

    hashtag
    3.4 Load data into Kafka and create Pinot schema/tables

    The script below will deploy 3 batch jobs.

    • Ingest 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec

    • Ingest 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec

    • Upload Pinot schema airlineStats

    • Create Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime

    • Create Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro

    hashtag
    4. Query using Pinot Data Explorer

    hashtag
    4.1 Pinot Data Explorer

    Please use the script below to perform local port-forwarding, which will also open the Pinot query console in your default web browser.

    This script can be found in the Pinot source at ./pinot/kubernetes/helm/pinot

    hashtag
    5. Using Superset to query Pinot

    hashtag
    5.1 Bring up Superset

    Open the superset.yaml file and go to the line showing storageClass, then change it based on your cloud vendor. kubectl get sc will show the storageClass values available on your Kubernetes system. E.g.

    • For AWS: "gp2"

    • For GCP: "pd-ssd" or "standard"

    • For Azure: "AzureDisk"

    • For Docker-Desktop: "hostpath"

    Then run:

    Ensure your cluster is up by running:

    hashtag
    5.2 (First time) Set up Admin account

    hashtag
    5.3 (First time) Init Superset

    hashtag
    5.4 Load Demo data source

    hashtag
    5.5 Access Superset UI

    You can run the command below to open Superset in your browser and log in with the admin credentials you created earlier.

    You can open the imported dashboard by clicking the Dashboards banner and then clicking AirlineStats.

    hashtag
    6. Access Pinot using Trino

    hashtag
    6.1 Deploy Trino

    You can run the command below to deploy Trino with the Pinot plugin installed.

    The above command adds the Trino Helm chart repo. You can then run the command below to see the charts.

    In order to connect Trino to Pinot, we need to add the Pinot catalog, which requires extra configuration. You can run the command below to get all the configurable values.

    To add Pinot catalog, you can edit the additionalCatalogs section by adding:

    circle-info

    Pinot is deployed at namespace pinot-quickstart, so the controller serviceURL is pinot-controller.pinot-quickstart:9000

    After modifying the /tmp/trino-values.yaml file, you can deploy Trino with:

    Once Trino is deployed, you can check its deployment status with:

    hashtag
    6.2 Query Trino using Trino CLI

    Once Trino is deployed, you can run the below command to get a runnable Trino CLI.

    hashtag
    6.2.1 Download Trino CLI

    6.2.2 Port forward the Trino service to localhost if it's not already exposed

    6.2.3 Use Trino console client to connect to Trino service

    6.2.4 Query Pinot data using Trino CLI

    hashtag
    6.3 Sample queries to execute

    • List all catalogs

    • List All tables

    • Show schema

    • Count total documents

    hashtag
    7. Access Pinot using Presto

    hashtag
    7.1 Deploy Presto using Pinot plugin

    You can run the command below to deploy a customized Presto with the Pinot plugin installed.

    The above command deploys Presto with default configs. For customizing your deployment, you can run the below command to get all the configurable values.

    After modifying the /tmp/presto-values.yaml file, you can deploy Presto with:

    Once Presto is deployed, you can check its deployment status with:

    Sample Output of K8s Deployment Status

    hashtag
    7.2 Query Presto using Presto CLI

    Once Presto is deployed, you can run the below command from herearrow-up-right, or just follow steps 6.2.1 to 6.2.3.

    hashtag
    7.2.1 Download Presto CLI

    7.2.2 Port forward presto-coordinator port 8080 to localhost port 18080

    hashtag
    7.2.3 Start the Presto CLI with the pinot catalog and query it

    7.2.4 Query Pinot data using Presto CLI

    hashtag
    7.3 Sample queries to execute

    • List all catalogs

    • List All tables

    • Show schema

    • Count total documents

    hashtag
    8. Deleting the Pinot cluster in Kubernetes

    helm repo add incubator https://charts.helm.sh/incubator
    helm install -n pinot-quickstart kafka incubator/kafka --set replicas=1
    helm repo add incubator https://charts.helm.sh/incubator
    helm install --namespace "pinot-quickstart"  --name kafka incubator/kafka
    helm install presto pinot/presto -n pinot-quickstart
    kubectl apply -f presto-coordinator.yaml
    kubectl get all -n pinot-quickstart
    kubectl get all -n pinot-quickstart | grep kafka
    pod/kafka-0                                                 1/1     Running     0          2m
    pod/kafka-zookeeper-0                                       1/1     Running     0          10m
    pod/kafka-zookeeper-1                                       1/1     Running     0          9m
    pod/kafka-zookeeper-2                                       1/1     Running     0          8m
    kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --topic flights-realtime --create --partitions 1 --replication-factor 1
    kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics --zookeeper kafka-zookeeper:2181 --topic flights-realtime-avro --create --partitions 1 --replication-factor 1
    kubectl apply -f pinot/pinot-realtime-quickstart.yml
    ./query-pinot-data.sh
    kubectl apply -f superset.yaml
    kubectl get all -n pinot-quickstart | grep superset
    kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'flask fab create-admin'
    kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset db upgrade'
    kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset init'
    kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset import_datasources -p /etc/superset/pinot_example_datasource.yaml'
    kubectl exec -it pod/superset-0 -n pinot-quickstart -- bash -c 'superset import_dashboards -p /etc/superset/pinot_example_dashboard.json'
    ./open-superset-ui.sh
    helm repo add trino https://trinodb.github.io/charts/
    helm search repo trino
    helm inspect values trino/trino > /tmp/trino-values.yaml
    additionalCatalogs:
      pinot: |
        connector.name=pinot
        pinot.controller-urls=pinot-controller.pinot-quickstart:9000
    kubectl create ns trino-quickstart
    helm install my-trino trino/trino --version 0.2.0 -n trino-quickstart --values /tmp/trino-values.yaml
    kubectl get pods -n trino-quickstart
    curl -L https://repo1.maven.org/maven2/io/trino/trino-cli/363/trino-cli-363-executable.jar -o /tmp/trino && chmod +x /tmp/trino
    echo "Visit http://127.0.0.1:18080 to use your application"
    kubectl port-forward service/my-trino 18080:8080 -n trino-quickstart
    /tmp/trino --server localhost:18080 --catalog pinot --schema default
    trino:default> show catalogs;
      Catalog
    ---------
     pinot
     system
     tpcds
     tpch
    (4 rows)
    
    Query 20211025_010256_00002_mxcvx, FINISHED, 2 nodes
    Splits: 36 total, 36 done (100.00%)
    0.70 [0 rows, 0B] [0 rows/s, 0B/s]
    trino:default> show tables;
        Table
    --------------
     airlinestats
    (1 row)
    
    Query 20211025_010326_00003_mxcvx, FINISHED, 3 nodes
    Splits: 36 total, 36 done (100.00%)
    0.28 [1 rows, 29B] [3 rows/s, 104B/s]
    trino:default> DESCRIBE airlinestats;
            Column        |      Type      | Extra | Comment
    ----------------------+----------------+-------+---------
     flightnum            | integer        |       |
     origin               | varchar        |       |
     quarter              | integer        |       |
     lateaircraftdelay    | integer        |       |
     divactualelapsedtime | integer        |       |
     divwheelsons         | array(integer) |       |
     divwheelsoffs        | array(integer) |       |
    ......
    
    Query 20211025_010414_00006_mxcvx, FINISHED, 3 nodes
    Splits: 36 total, 36 done (100.00%)
    0.37 [79 rows, 5.96KB] [212 rows/s, 16KB/s]
    trino:default> select count(*) as cnt from airlinestats limit 10;
     cnt
    ------
     9746
    (1 row)
    
    Query 20211025_015607_00009_mxcvx, FINISHED, 2 nodes
    Splits: 17 total, 17 done (100.00%)
    0.24 [1 rows, 9B] [4 rows/s, 38B/s]
    helm inspect values pinot/presto > /tmp/presto-values.yaml
    helm install presto pinot/presto -n pinot-quickstart --values /tmp/presto-values.yaml
    kubectl get pods -n pinot-quickstart
    ./pinot-presto-cli.sh
    curl -L https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.246/presto-cli-0.246-executable.jar -o /tmp/presto-cli && chmod +x /tmp/presto-cli
    kubectl port-forward service/presto-coordinator 18080:8080 -n pinot-quickstart> /dev/null &
    /tmp/presto-cli --server localhost:18080 --catalog pinot --schema default
    presto:default> show catalogs;
     Catalog
    ---------
     pinot
     system
    (2 rows)
    
    Query 20191112_050827_00003_xkm4g, FINISHED, 1 node
    Splits: 19 total, 19 done (100.00%)
    0:01 [0 rows, 0B] [0 rows/s, 0B/s]
    presto:default> show tables;
        Table
    --------------
     airlinestats
    (1 row)
    
    Query 20191112_050907_00004_xkm4g, FINISHED, 1 node
    Splits: 19 total, 19 done (100.00%)
    0:01 [1 rows, 29B] [1 rows/s, 41B/s]
    presto:default> DESCRIBE pinot.dontcare.airlinestats;
            Column        |  Type   | Extra | Comment
    ----------------------+---------+-------+---------
     flightnum            | integer |       |
     origin               | varchar |       |
     quarter              | integer |       |
     lateaircraftdelay    | integer |       |
     divactualelapsedtime | integer |       |
    ......
    
    Query 20191112_051021_00005_xkm4g, FINISHED, 1 node
    Splits: 19 total, 19 done (100.00%)
    0:02 [80 rows, 6.06KB] [35 rows/s, 2.66KB/s]
    presto:default> select count(*) as cnt from pinot.dontcare.airlinestats limit 10;
     cnt
    ------
     9745
    (1 row)
    
    Query 20191112_051114_00006_xkm4g, FINISHED, 1 node
    Splits: 17 total, 17 done (100.00%)
    0:00 [1 rows, 8B] [2 rows/s, 19B/s]
    kubectl delete ns pinot-quickstart

    • For Helm v2.12.1

    If your Kubernetes cluster is recently provisioned, ensure Helm is initialized by running:

    Then deploy a new HA Pinot cluster using the following command:

    • For Helm v3.0.0

    hashtag
    2.1.3 Troubleshooting (For helm v2.12.1)

    • Error: If you encounter the following issue:

    • Resolution:

    • Error: If you encounter a permission issue like the following:

    Error: release pinot failed: namespaces "pinot-quickstart" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "namespaces" in API group "" in the namespace "pinot-quickstart"

    • Resolution:

    pinot/kubernetes/helm
    helm dependency update
    helm init --service-account tiller
    helm install --namespace "pinot-quickstart" --name "pinot" pinot
    kubectl create ns pinot-quickstart
    helm install -n pinot-quickstart pinot pinot
    Error: could not find tiller.
    kubectl -n kube-system delete deployment tiller-deploy
    kubectl -n kube-system delete service/tiller-deploy
    helm init --service-account tiller
    kubectl apply -f helm-rbac.yaml