
Tutorials

Here you will find a collection of how-to guides for operators and developers.

Authentication, Authorization, and ACLs

Set up HTTP basic auth and ACLs for access to controller and broker

Apache Pinot 0.8.0+ comes out of the box with support for HTTP Basic Auth. While disabled by default for easier setup, authentication and authorization can be added to any environment simply via configuration. ACLs can be set on both API and table levels. This upgrade can be performed with zero downtime in any environment that provides replication.

For external access, Pinot exposes two primary APIs via the following components:

pinot-controller handles cluster management and configuration
pinot-broker handles incoming SQL queries

Both components can be protected via auth and can even be configured independently. This makes it possible to separate accounts for administrative functions, such as table creation, from accounts that only read the contents of tables in production.

Additionally, all other Pinot components such as pinot-server and pinot-minion can be configured to authenticate themselves to pinot-controller via the same mechanism. This can be done independently of (and in addition to) using 2-way TLS/SSL to ensure intra-cluster authentication on the lower networking layer.

Quickstart

If you'd rather dive directly into the action with an all-in-one running example, we provide an AuthQuickstart runnable with Apache Pinot. This sample app is preconfigured with the settings below but only intended as a dev-friendly, local, single-node deployment.

Tokens and User Credentials

The configuration of HTTP Basic Auth in Apache Pinot distinguishes between Tokens, which are typically provided to service accounts, and User Credentials, which can be used by a human to log onto the web UI or issue SQL queries. While we distinguish these two concepts in the configuration of HTTP Basic Auth, they are fully convertible formats holding the same authentication information. This distinction allows us to support future token-based authentication methods not reliant on username and password pairs. Currently, Tokens are merely base64-encoded username and password tuples, similar to those you can find in HTTP Authorization header values.
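
To make the token format concrete, here is a minimal Python sketch of the conversion between user credentials and a token. It assumes the standard HTTP Basic scheme (base64 of `username:password`, prefixed with `Basic `). The helper names are illustrative and not part of Pinot.

```python
import base64
from typing import Tuple


def to_basic_token(username: str, password: str) -> str:
    """Serialize a username/password pair as an HTTP Basic token."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def from_basic_token(token: str) -> Tuple[str, str]:
    """Recover the username/password pair from an HTTP Basic token."""
    scheme, _, encoded = token.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic token")
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Only the first colon separates the fields; passwords may contain colons.
    username, _, password = decoded.partition(":")
    return username, password


token = to_basic_token("admin", "verysecret")
print(token)
print(from_basic_token(token))
```

Because the conversion is lossless in both directions, a token of this kind should be protected as carefully as the password itself.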

This is best demonstrated by introducing ACLs with a simple admin + user setup. To enable authentication on a cluster without interrupting operations, we'll go through these steps in sequence:

1. Create "admin" and "user" in the controller properties

2. Distribute service tokens to Pinot's components

For simplicity, we'll reuse the admin credentials as service tokens. In a production environment you'll keep them separate.

Restart the affected components for the configuration changes to take effect.

3. Enable ACL enforcement on the controller

After a controller restart, any access to controller APIs requires authentication information, whether from internal components, external users, or the web UI.

4. Create users and enable ACL enforcement on the Broker

After restarting the broker, any access to broker APIs requires authentication information as well.

Congratulations! You've successfully enabled authentication on Apache Pinot. Read on to learn more about the details and advanced configuration options.

Authentication with Web UI and API

Apache Pinot's Basic Auth follows the established standards for HTTP Basic Auth. Credentials are provided via an HTTP Authorization header. The pinot-controller web UI dynamically adapts to your auth configuration and displays a login prompt when basic auth is enabled. Restricted users are still shown all available UI functions, but their operations will fail with an error message if ACLs prohibit access.

If you're using Pinot's CLI clients, you can provide your credentials either via dedicated username and password arguments or as a pre-serialized token for the HTTP Authorization header. Note that while most of Apache Pinot's CLI commands support auth, not all of them have been retrofitted yet. If you encounter such a case, you can access the REST API directly, e.g. via curl.
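
As an alternative to curl, the sketch below builds an authenticated REST request with Python's standard library. The controller address `http://localhost:9000`, the `/tables` path, and the helper name are assumptions for illustration; substitute your own endpoint and credentials.

```python
import base64
import urllib.request


def build_authenticated_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Attach an HTTP Basic Authorization header to a request object."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Assumed local controller address and example credentials from this guide.
request = build_authenticated_request("http://localhost:9000/tables", "admin", "verysecret")

# Sending the request requires a running controller, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Over plain HTTP the header travels unencrypted, so this kind of access is best combined with TLS.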

Controller Authentication and Authorization

Pinot-controller has supported custom access control implementations for quite some time. We expanded the scope of this support in 0.8.0+ and added a default implementation for HTTP Basic Auth. Furthermore, the controller's web UI added support for user login workflows and graceful handling of authentication and authorization messages.

Controller Auth can be enabled via configuration in the controller properties. The configuration options allow the specification of usernames and passwords as well as optional ACL restrictions on a per-table and per-access-type (CREATE, READ, UPDATE, DELETE) basis.

The example below creates two users, admin with password verysecret and user with password secret. admin has full access, whereas user is restricted to READ operations and, additionally, to tables named myusertable, baseballStats, and stuff in all cases where the API calls are table-specific.

This configuration will automatically allow other pinot components to access pinot-controller with the shared admin service token set up earlier.

If *.principals.<user>.tables is not configured, all tables are accessible to <user>.
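
The access rules described above can be modeled in a few lines of Python. This is only an illustrative model of the semantics (permission check first, then an optional table restriction), not Pinot's actual implementation. The `Principal` class and `is_allowed` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

ALL_ACCESS_TYPES = frozenset({"CREATE", "READ", "UPDATE", "DELETE"})


@dataclass(frozen=True)
class Principal:
    name: str
    permissions: FrozenSet[str] = ALL_ACCESS_TYPES
    # None models "tables is not configured": every table is accessible.
    tables: Optional[FrozenSet[str]] = None


def is_allowed(principal: Principal, access_type: str, table: Optional[str] = None) -> bool:
    """Return True if the principal may perform access_type (optionally on table)."""
    if access_type not in principal.permissions:
        return False
    # Table restrictions only matter for table-specific calls.
    if table is None or principal.tables is None:
        return True
    return table in principal.tables


admin = Principal("admin")
user = Principal(
    "user",
    permissions=frozenset({"READ"}),
    tables=frozenset({"myusertable", "baseballStats", "stuff"}),
)

print(is_allowed(user, "READ", "baseballStats"))
print(is_allowed(user, "UPDATE", "baseballStats"))
```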

Broker Authentication and Authorization

Pinot-Broker, similar to pinot-controller above, has supported access control for a while now and we added a default implementation for HTTP Basic Auth. Since pinot-broker does not provide a web UI by itself, authentication is only relevant for SQL queries hitting the broker's REST API.

Broker Auth can be enabled via configuration in the broker properties, similar to the controller. The configuration options allow specification of usernames and passwords as well as optional ACL restrictions on a per-table basis (access type is always READ). Note that it is possible to configure a different set of users, credentials, and permissions for broker access. However, if you want a user to be able to access data via the query console on the controller web UI, that user must (a) share the same username and password on both controller and broker, and (b) have READ permissions and table-level access.

The example below again creates two users, admin with password verysecret and user with password secret. admin has full access, whereas user is restricted to tables named baseballStats and otherstuff.

If *.principals.<user>.tables is not configured, all tables are accessible to <user>.

Minion and ingestion jobs

Like any API call, offline jobs executed via the command line or minion require credentials if ACLs are enabled on pinot-controller. These credentials can be provided as part of the job spec itself, via CLI arguments, or, if Groovy templates are defined in the jobSpec, as values (via -values) or properties (via -propertyFile).

Configuring TLS/SSL

Set up TLS-secured connections inside and outside your cluster

Pinot 0.7.0+ supports client-cluster and intra-cluster TLS. TLS support comes in both 1-way and 2-way flavors. This guide walks through the relevant configuration options.

Looking to ingest from Kafka via secured connections? Check out Kafka Streaming Ingestion with TLS/SSL.

Listeners

To support incremental upgrades of unsecured Pinot clusters towards TLS, we introduce multi-ingress support via listeners. Each listener accepts connections for a specific protocol on a specific port. For example, pinot-broker may be configured to accept both HTTP on port 8099 and HTTPS on port 8443 at the same time.

Existing configuration properties such as controller.port are still parsed and automatically translated to an HTTP listener configuration to enable full backwards compatibility. TLS-secured ingress must be configured through the new listener specifications.

TLS upgrade

If you're bootstrapping a cluster from scratch, you can directly configure TLS-secured connections and you can forgo legacy http ingress. If you're upgrading an existing (production) cluster, you'll be able to perform the upgrade without downtime if your deployment is configured for high-availability.

At a high level, a zero-downtime upgrade includes the following three phases:

1. Adding a secondary TLS-secured ingress to Pinot controllers, brokers, and servers
2. Switching client and internode egress to prefer TLS-secured connections
3. Disabling unsecured ingress

This requires a rolling restart of (replicated) service containers after each re-configuration phase. The sample listener specifications below will guide you through this process.

Generating certificates

Apache Pinot leverages the JVM's native TLS infrastructure with all its benefits and limitations. Certificates should be generated to include the host IP, hostname, and fully-qualified domain names (if accessed or identified this way).

We support both the JVM's default key/truststore and configuration options to load certificates from secondary locations. Note that some connector plugins require the default truststore to contain any trusted certs, since they do not parse Pinot's configuration properties for external truststores.

The default certificate store of most JVMs can be configured with command-line arguments:

-Djavax.net.ssl.keyStore -Djavax.net.ssl.keyStorePassword -Djavax.net.ssl.trustStore -Djavax.net.ssl.trustStorePassword

Listener Specifications

This section contains a number of examples for common situations. The complete list of options can be found in each component's configuration reference.

If you're bootstrapping a new cluster, scroll down towards the end. We order this section for purposes of migrating an existing unsecured cluster to TLS-only.

Legacy HTTP config (unsecured)

This is a minimal example of network configuration options prior to 0.7.0. This specification is still supported for backwards-compatibility and translated internally to a listener specification.

HTTP with listener specification (unsecured)

This HTTP listener specification is the equivalent of manually translating the legacy configuration above to a listener specification.

HTTP/HTTPS multi-ingress (unsecured egress)

This is a common scenario for development clusters and an intermediate phase during a zero-downtime migration of an unsecured cluster towards TLS. This configuration optionally accepts secure ingress on alternate ports, but still defaults to unsecured egress for all operations.

HTTP/HTTPS multi-ingress (secure egress)

After all Pinot components have been configured and restarted to offer secure ingress, we can modify egress to default to secure internode connections. Clients such as pinot-admin.sh support an optional flag -controllerProtocol https to enable secure access. Ingestion jobs similarly support an optional tlsSpec key to configure key/truststores. Note that any console clients must have access to appropriate certificates via the JVM's default key/truststore.

TLS only

This is the default for a newly bootstrapped secure pinot cluster. It is also the final stage for any migration of an existing cluster. With this configuration applied, pinot's components will reject any unsecured connection attempt.

2-way TLS

Apache Pinot also supports 2-way TLS for environments with high security requirements. This can be enabled per component with the optional client.auth.enabled flag. Bear in mind that any client (or server) interacting with a component expecting client auth must have access to both a keystore and a truststore. This setting does NOT apply to unsecured http or netty connections.
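
For a non-JVM client, the keystore/truststore pair maps onto a client certificate plus a CA bundle. The following is a rough Python sketch under that assumption; it uses PEM files rather than JVM keystores, and the function and parameter names are hypothetical.

```python
import ssl
from typing import Optional


def build_client_context(
    truststore_pem: Optional[str] = None,
    client_cert_pem: Optional[str] = None,
    client_key_pem: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a TLS context for a client talking to a component with client auth."""
    # Truststore side: which server certificates the client accepts.
    # With no path given, the system's default CA certificates are used.
    context = ssl.create_default_context(cafile=truststore_pem)
    # Keystore side: the certificate the client presents for 2-way TLS.
    if client_cert_pem is not None:
        context.load_cert_chain(certfile=client_cert_pem, keyfile=client_key_pem)
    return context


# Without file paths this yields a plain 1-way TLS context that still verifies the server.
context = build_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
```

Passing the resulting context to an HTTPS client then covers both directions of the handshake once certificate paths are supplied.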

Kubernetes Deployment

The Pinot community provides a Helm-based Kubernetes deployment.

You can deploy it simply by running a helm install command.

However, there are a few things to note before you start benchmarking or running in production.

Container Resources

We recommend running Pinot with pre-defined resources for the container, with requests and limits set to the same values.

This makes it less likely that the container is evicted or killed when there is a sudden bump in workload.

It also makes the system simpler to benchmark, e.g. to determine the broker QPS limit.

Below is an example of values to set in the values.yaml file. No resources are set by default.

JVM Setting

Pinot Controller/Broker

JVM settings should be compliant with the container resources for Pinot Controller and Pinot Broker.

You can configure the JVM as below so that -Xmx matches the size of your container.

Pinot Server

For Pinot Server, the heap is mainly used for query processing and metadata management. The server uses off-heap memory for data loading and persistence, and for page caching of memory-mapped files. We therefore recommend keeping the JVM heap to the minimum required and leaving the rest of the container for off-heap data operations.

For example, assume the data is 100 GB on disk and the container has 4 CPUs and 10 GB of memory.

For the JVM, limit -Xmx to at most 50% of the container memory limit so that the rest of the container can be used by off-heap operations.
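
The 50% rule of thumb can be turned into a small helper. The sketch below is an illustration only: the function name is hypothetical, and it assumes 1 GB = 1024 MB when converting the container size.

```python
def server_heap_flag(container_memory_gb: float, heap_fraction: float = 0.5) -> str:
    """Return an -Xmx flag that caps the heap at a fraction of container memory."""
    if not 0 < heap_fraction <= 0.5:
        raise ValueError("heap_fraction should be in (0, 0.5] for Pinot Server")
    heap_mb = int(container_memory_gb * 1024 * heap_fraction)
    return f"-Xmx{heap_mb}M"


# The 10 GB container from the example above leaves half the memory off-heap.
print(server_heap_flag(10))
```

A smaller `heap_fraction` leaves more room for memory-mapped segments, which matters when the data on disk is much larger than the container memory.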

Deep storage

Pinot uses remote storage as deep storage to back up segments.

The default deployment creates a mounted disk (e.g. Amazon EBS) as deep storage in the controller.

Amazon EKS (Kafka)

This section applies if you need to connect non-EKS AWS jobs (Lambdas/EC2) to Kafka running inside AWS EKS.

General steps: update Kafka's advertised.listeners and make sure Kafka is accessible (e.g. allow inbound traffic in the Security Groups).

You will probably face the following problems.

If you want to connect to Kafka from outside of EKS, you will need to change advertised.listeners. When a client connects to a single Kafka bootstrap server (which is a broker like the others), the bootstrap server sends the client a list of addresses for all brokers. If you want to connect to Kafka running in EKS from outside, the default values will not be correct. This provides an excellent explanation of the field.
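
The effect of advertised.listeners can be pictured with a toy model: after bootstrapping, the client only uses the addresses the brokers advertise, so those addresses must be resolvable from where the client runs. The host names below are hypothetical, and the function is a simplification rather than actual Kafka client logic.

```python
from typing import Iterable, List, Set


def usable_brokers(advertised_addresses: Iterable[str], resolvable_hosts: Set[str]) -> List[str]:
    """Return the advertised broker addresses a client can actually reach."""
    usable = []
    for address in advertised_addresses:
        host, _, _port = address.rpartition(":")
        if host in resolvable_hosts:
            usable.append(address)
    return usable


# Hypothetical addresses: a cluster-internal DNS name versus an externally routable one.
internal = ["kafka-0.kafka-headless.default.svc.cluster.local:9092"]
external = ["kafka.example.com:9094"]
outside_client_can_resolve = {"kafka.example.com"}

print(usable_brokers(internal, outside_client_can_resolve))
print(usable_brokers(external, outside_client_can_resolve))
```

In this model a client outside the cluster can reach the bootstrap server but ends up with no usable broker until the advertised addresses are changed to externally reachable ones.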

If you use Helm to deploy Kafka to AWS EKS, please review the . It describes multiple setups for communicating into EKS.

Running helm upgrade on the Kafka chart does not always update the pods. The exact reason is unknown; it is probably an issue with the chart's implementation. Run kubectl describe pod and other commands to see the current status of the pods. During initial development, you can run helm uninstall and then helm install to force the values to update.

Amazon MSK (Kafka)

How to Connect Pinot with Amazon Managed Streaming for Apache Kafka (Amazon MSK)

This wiki documents how to connect Pinot deployed in Amazon EKS to Amazon Managed Kafka.

Prerequisite

Please follow this AWS Quickstart Wiki to run Pinot on Amazon EKS.

Create an Amazon MSK Cluster

Please go to MSK Landing Page to create a Kafka Cluster.

Note:

For demo simplicity, this MSK cluster reuses the same VPC created by the EKS cluster in the previous step. Otherwise, VPC peering is required so that the two VPCs can talk to each other.
Under the Encryption section, choose Both TLS encrypted and plaintext traffic allowed.

Below is a sample screenshot of creating an Amazon MSK cluster.

![](../../.gitbook/assets/snapshot-msk.png)

After clicking the Create button, you can take a coffee break and come back.

Once the cluster is created, you can view it and click View client information to see the Zookeeper and Kafka Broker list.

Sample Client Information

Connect to MSK

Config SecurityGroup

At this point, the MSK cluster is still not accessible. You can follow this Wiki to create an EC2 instance that connects to it for topic creation and for running a console producer and consumer.

To connect MSK to EKS, we need to allow traffic to flow between them.

This is configured through the Amazon VPC Page.

Record the Amazon MSK SecurityGroup from the Cluster page. In the above demo, it's sg-01e7ab1320a77f1a9.
Open the Amazon VPC Page and click on SecurityGroups in the left bar. Find the EKS Security group: eksctl-${PINOT_EKS_CLUSTER}-cluster/ClusterSharedNodeSecurityGroup.

Please ensure you are picking ClusterSharedNodeSecurityGroup.

In SecurityGroup, click on the MSK SecurityGroup (sg-01e7ab1320a77f1a9), then click on Edit Rules, then add the above ClusterSharedNodeSecurityGroup (sg-0402b59d7e440f8d1) to it.

Click the EKS Security Group ClusterSharedNodeSecurityGroup (sg-0402b59d7e440f8d1) and add an inbound rule for the MSK Security Group (sg-01e7ab1320a77f1a9).

Now the EKS cluster should be able to talk to Amazon MSK.

Create Kafka topic

To run the commands below, set the two environment variables ZOOKEEPER_CONNECT_STRING and BROKER_LIST_STRING (use the plaintext broker list) from the Amazon MSK client information, replacing the values accordingly.

E.g.

ZOOKEEPER_CONNECT_STRING="z-3.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181,z-1.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181,z-2.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181"
BROKER_LIST_STRING="b-1.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:9092,b-2.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:9092"

You can log into an EKS node or container and run the commands below to create a topic.

E.g. enter the Pinot controller container:

kubectl exec -it pod/pinot-controller-0 -n pinot-quickstart -- bash

Then install wget and download the Kafka binary.

apt-get update
apt-get install wget -y
wget https://archive.apache.org/dist/kafka/2.2.1/kafka_2.12-2.2.1.tgz
tar -xzf kafka_2.12-2.2.1.tgz
cd kafka_2.12-2.2.1

Create a Kafka topic:

bin/kafka-topics.sh \
  --zookeeper ${ZOOKEEPER_CONNECT_STRING} \
  --create \
  --topic pullRequestMergedEventsAwsMskDemo \
  --replication-factor 1 \
  --partitions 1

Topic creation succeeds with the message below:

Created topic "pullRequestMergedEventsAwsMskDemo".

Write sample data into Kafka

Once the topic is created, we can start a simple application to produce to it.

You can download the YAML file below, then replace:

${ZOOKEEPER_CONNECT_STRING} -> MSK Zookeeper String
${BROKER_LIST_STRING} -> MSK Plaintext Broker String in the deployment
${GITHUB_PERSONAL_ACCESS_TOKEN} -> A GitHub Personal Access Token generated from here; please grant all read permissions to it. Here is the source code to generate GitHub Events.

Then apply the YAML file:

kubectl apply -f github-events-aws-msk-demo.yaml

Once the pod is up, you can verify it by running a console consumer to read from the topic.

Run it from the Pinot Controller container entered in the step above.

bin/kafka-console-consumer.sh \
  --bootstrap-server ${BROKER_LIST_STRING} \
  --topic pullRequestMergedEventsAwsMskDemo

Create a Pinot table

This step is relatively easy.

Since the table creation request is already in the ConfigMap, we can simply enter the pinot-github-events-data-into-msk-kafka pod to execute the command.

Check if the pod is running:

kubectl get pod -n pinot-quickstart  |grep pinot-github-events-data-into-msk-kafka

Sample output:

pinot-github-events-data-into-msk-kafka-68948fb4cd-rrzlf   1/1     Running     0          14m

Enter the pod:

podname=$(kubectl get pod -n pinot-quickstart | grep pinot-github-events-data-into-msk-kafka | awk '{print $1}')
kubectl exec -it "${podname}" -n pinot-quickstart -- bash

Create Table

bin/pinot-admin.sh AddTable \
  -controllerHost pinot-controller \
  -tableConfigFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_realtime_table_config.json \
  -schemaFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_schema.json \
  -exec

Sample output:

Executing command: AddTable -tableConfigFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_realtime_table_config.json -schemaFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
Sending request: http://pinot-controller:9000/schemas to controller: pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local, version: Unknown
{"status":"Table pullRequestMergedEventsAwsMskDemo_REALTIME succesfully added"}

Then you can open the Pinot Query Console to browse the data.

Monitor Pinot using Prometheus and Grafana

This guide shows how to monitor Pinot with Prometheus and Grafana in a Kubernetes environment.

Prerequisite

Kubernetes v1.16.5
HelmCharts v3.1.2

Deploy Pinot

Install Pinot helm repo

## Adding Pinot helm repo
helm repo add pinot https://raw.githubusercontent.com/apache/pinot/master/kubernetes/helm
## Extract all the configurable values of Pinot Helm into a config.
helm inspect values pinot/pinot > /tmp/pinot-values.yaml

Configure Pinot Helm to enable Prometheus JMX Exporter

1. Configure jvmOpts:

Add the JMX Prometheus Java Agent to controller.jvmOpts, broker.jvmOpts, and server.jvmOpts. Note that the Pinot Docker image already packages jmx_prometheus_javaagent.jar.

The config below exposes Pinot metrics on port 8008 for Prometheus to scrape.

controller:
  ...
  jvmOpts: "-javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml -Xms256M -Xmx1G"

You can port-forward port 8008 to your local machine and access metrics through: http://localhost:8008/metrics
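
To sanity-check the endpoint without Prometheus, you can fetch and parse the exposition text yourself. The sketch below assumes the port-forward to localhost:8008 described above. The parser handles only simple `name value` and `name{labels} value` lines, and the labeled sample metric is made up for illustration.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {metric_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        closing = line.rfind("}")
        if closing != -1:
            # Labeled sample: everything up to the closing brace is the series name.
            name, rest = line[: closing + 1], line[closing + 1 :].split()
        else:
            name, *rest = line.split()
        samples[name] = float(rest[0])
    return samples


def fetch_metrics(url: str = "http://localhost:8008/metrics") -> Dict[str, float]:
    """Fetch and parse metrics from a (port-forwarded) JMX exporter endpoint."""
    with urllib.request.urlopen(url) as response:
        return parse_metrics(response.read().decode("utf-8"))


SAMPLE = """\
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 42.0
example_metric{table="baseballStats"} 3.0
"""

print(parse_metrics(SAMPLE))
```

Calling `fetch_metrics()` against a live port-forward returns the same kind of dictionary for the real Pinot metrics.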

2. Configure service annotations:

Add Prometheus related annotations to enable Prometheus to scrape metrics.

controller.service.annotations
broker.service.annotations
server.service.annotations
controller.podAnnotations
broker.podAnnotations
server.podAnnotations

controller:
  ...
  service:
    annotations:
      "prometheus.io/scrape": "true"
      "prometheus.io/port": "8008"
  ...
  podAnnotations:
    "prometheus.io/scrape": "true"
    "prometheus.io/port": "8008"

Deploy Pinot Helm

kubectl create ns pinot
helm install pinot pinot/pinot -n pinot --values /tmp/pinot-values.yaml

Deploy Prometheus

Once Pinot is deployed and running, we can start deploying Prometheus.

Similar to Pinot Helm, we will add the Prometheus Helm repo and extract its config YAML file:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm inspect values prometheus-community/prometheus > /tmp/prometheus-values.yaml

Configure Prometheus

Please remember to check the configs:

server.persistentVolume: data storage location/size limit/storage class
server.retention: how long to keep the data (default is 15d)

Deploy Prometheus

kubectl create ns prometheus
helm install prometheus prometheus-community/prometheus -n prometheus --values /tmp/prometheus-values.yaml

Access Prometheus

Port-forward the Prometheus service to your local machine and open the page at localhost:30080:

kubectl port-forward service/prometheus-server 30080:80 -n prometheus

Then we can query the metrics Prometheus scraped:

Deploy Grafana

Similar to Pinot Helm, we will add the Grafana Helm repo and extract its config YAML file:

helm repo add grafana https://grafana.github.io/helm-charts
helm inspect values grafana/grafana > /tmp/grafana-values.yaml

Configure Grafana

Deploy Grafana

kubectl create ns grafana
helm install grafana grafana/grafana -n grafana --values /tmp/grafana-values.yaml

Get Password to access Grafana

kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Access Grafana dashboard

You can access it locally through port forwarding:

kubectl port-forward service/grafana 20080:80 -n grafana

Once the dashboard is open, you can log in with the credentials:

admin/[PASSWORD FROM PREVIOUS STEP]

Add data source

Click on Prometheus and set the HTTP URL to: http://prometheus-server.prometheus.svc.cluster.local

Configure Pinot Dashboard

Once the data source is added, we can import a Pinot Dashboard:

A sample Pinot dashboard JSON is:

Now you can upload this file and select Prometheus as the data source to finish the import.

Then you can explore and make your own Pinot dashboard!

Build Docker Images

Overview

The scripts to build Pinot-related Docker images are located in the Pinot repository.

You can access these scripts by checking out the Pinot repo (https://github.com/apache/pinot.git).

The following three images are currently supported:

Pinot: Pinot all-in-one distribution image
Pinot-Presto: Presto image with Presto-Pinot Connector built-in.
Pinot-Superset: Superset image with Pinot connector built-in.

Pinot

This is a Docker image of Apache Pinot.

How to build a docker image

There is a docker build script which will build a given Git repo/branch and tag the image.

Usage:

./docker-build.sh [Docker Tag] [Git Branch] [Pinot Git URL] [Kafka Version] [Java Version] [JDK Version] [OpenJDK Image]

This script will check out Pinot Repo [Pinot Git URL] on branch [Git Branch] and build the docker image for that.

The docker image is tagged as [Docker Tag].

Docker Tag: Name and tag your docker image. Default is pinot:latest.

Git Branch: The Pinot branch to build. Default is master.

Pinot Git URL: The Pinot Git repo to build; users can set it to their own fork. Note that the URL must be https:// based, not git://. Default is the Apache repo: https://github.com/apache/pinot.git.

Kafka Version: The Kafka Version to build pinot with. Default is 2.0

Java Version: The Java Build and Runtime image version. Default is 11

JDK Version: The JDK parameter to build pinot, set as part of maven build option: -Djdk.version=${JDK_VERSION}. Default is 11

OpenJDK Image: Base image to use for Pinot build and runtime. Default is openjdk.

Example of building and tagging a snapshot on your own fork:

./docker-build.sh pinot_fork:snapshot-5.2 snapshot-5.2 https://github.com/your_own_fork/pinot.git

Example of building a release version:

./docker-build.sh pinot:release-0.1.0 release-0.1.0 https://github.com/apache/pinot.git

Build image with arm64 base image

Users on Mac M1 chips need to build the images with an arm64 base image, e.g. arm64v8/openjdk.

Example of building an arm64 image:

./docker-build.sh pinot:latest master https://github.com/apache/pinot.git 2.0 11 11 arm64v8/openjdk

Or run docker build directly:

docker build -t pinot:latest --no-cache --network=host --build-arg PINOT_GIT_URL=https://github.com/apache/pinot.git --build-arg PINOT_BRANCH=master --build-arg JDK_VERSION=11 --build-arg OPENJDK_IMAGE=arm64v8/openjdk -f Dockerfile .

Note that if you are not on an arm64 machine, you can still build the image by turning on Docker's experimental features and adding --platform linux/arm64 to the docker build command, e.g.

docker build -t pinot:latest --platform linux/arm64 --no-cache --network=host --build-arg PINOT_GIT_URL=https://github.com/apache/pinot.git --build-arg PINOT_BRANCH=master --build-arg JDK_VERSION=11 --build-arg OPENJDK_IMAGE=arm64v8/openjdk -f Dockerfile .

How to publish a docker image

Script docker-push.sh publishes a given docker image to your docker registry.

In order to push to your own repo, the image needs to be explicitly tagged with the repo name.

Example of publishing an image to a Docker Hub repo:

./docker-push.sh apachepinot/pinot:latest

Tag a built image, then push:

docker tag pinot:release-0.1.0 apachepinot/pinot:release-0.1.0
docker push apachepinot/pinot:release-0.1.0

Script docker-build-and-push.sh builds and publishes this docker image to your docker registry after build.

Example of building and publishing an image to a Docker Hub repo:

./docker-build-and-push.sh apachepinot/pinot:latest master https://github.com/apache/pinot.git

Kubernetes Examples

Please refer to for deployment examples.

Pinot Presto

Docker image for Presto with Pinot integration.

This docker build project is specialized for Pinot.

How to build

Usage:

./docker-build.sh [Docker Tag] [Git Branch] [Presto Git URL]

This script will check out Presto Repo [Presto Git URL] on branch [Git Branch] and build the docker image for that.

The docker image is tagged as [Docker Tag].

Docker Tag: Name and tag your docker image. Default is pinot-presto:latest.

Git Branch: The Presto branch to build. Default is master.

Presto Git URL: The Presto Git repo to build; users can set it to their own fork. Note that the URL must be https:// based, not git://. Default is the PrestoDB repo: https://github.com/prestodb/presto.git.

How to push

docker push apachepinot/pinot-presto:latest

Configuration

Follow the instructions provided by Presto for writing your own configuration files under the etc directory.

Volumes

The image defines two data volumes: one for mounting configuration into the container, and one for data.

The configuration volume is located alternatively at /home/presto/etc, which contains all the configuration and plugins.

The data volume is located at /home/presto/data.

Kubernetes Examples

Please refer to as k8s deployment example.

Pinot Superset

Docker image for Superset with Pinot integration.

This docker build project is based on Project and specialized for Pinot.

How to build

Modify the Makefile to change image and superset_version accordingly.

The command below builds the Docker image and tags it as superset_version and latest.

make latest

You can also build directly with docker build command by setting arguments:

docker build \
    --build-arg NODE_VERSION=latest \
    --build-arg PYTHON_VERSION=3.6 \
    --build-arg SUPERSET_VERSION=0.34.1 \
    --tag apachepinot/pinot-superset:0.34.1 \
    --target build .

How to push

make push

Configuration

Follow the instructions provided by Apache Superset for writing your own superset_config.py.

Volumes

The image defines two data volumes: one for mounting configuration into the container, and one for data (logs, SQLite DBs, etc.).

Kubernetes Examples

Please refer to as k8s deployment example.

Amazon MSK (Kafka)

How to Connect Pinot with Amazon Managed Streaming for Apache Kafka (Amazon MSK)

This wiki documents how to connect Pinot deployed in Amazon EKS to Amazon Managed Kafka.

Prerequisite

Please follow this AWS Quickstart Wiki to run Pinot on Amazon EKS.

Create an Amazon MSK Cluster

Please go to MSK Landing Page to create a Kafka Cluster.

Note:

For demo simplicity, this MSK cluster reuses the same VPC created by the EKS cluster in the previous step. Otherwise, VPC peering is required so that the two VPCs can talk to each other.
Under the Encryption section, choose Both TLS encrypted and plaintext traffic allowed.

Below is a sample screenshot to create an Amazon MSK cluster.

![](../../.gitbook/assets/snapshot-msk.png)

After clicking the Create button, you can take a coffee break and come back.

Once the cluster is created, you can view it and click View client information to see the Zookeeper and Kafka Broker list.

Sample Client Information

Connect to MSK

Config SecurityGroup

At this point the MSK cluster is still not accessible. You can follow this Wiki to create an EC2 instance that connects to it for topic creation and for running a console producer and consumer.

To connect MSK to EKS, we need to allow traffic to flow between them.

This is configured through Amazon VPC Page.

Record the Amazon MSK SecurityGroup from the Cluster page. In the above demo, it's sg-01e7ab1320a77f1a9.
Open the Amazon VPC Page and click SecurityGroups in the left bar. Find the EKS security group: eksctl-${PINOT_EKS_CLUSTER}-cluster/ClusterSharedNodeSecurityGroup.

Please ensure you are picking ClusterSharedNodeSecurityGroup.

In SecurityGroup, click the MSK SecurityGroup (sg-01e7ab1320a77f1a9), click Edit Rules, then add the above ClusterSharedNodeSecurityGroup (sg-0402b59d7e440f8d1) to it.

Click the EKS security group ClusterSharedNodeSecurityGroup (sg-0402b59d7e440f8d1) and add an inbound rule for the MSK security group (sg-01e7ab1320a77f1a9).

Now the EKS cluster should be able to talk to Amazon MSK.

Create Kafka topic

For example, with the following client information:

ZOOKEEPER_CONNECT_STRING="z-3.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181,z-1.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181,z-2.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181"
BROKER_LIST_STRING="b-1.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:9092,b-2.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:9092"

You can log into an EKS node or container and run the commands below to create a topic.

E.g. enter the Pinot controller container:

kubectl exec -it pod/pinot-controller-0 -n pinot-quickstart -- bash

Then install wget and download the Kafka binary.

apt-get update
apt-get install wget -y
wget https://archive.apache.org/dist/kafka/2.2.1/kafka_2.12-2.2.1.tgz
tar -xzf kafka_2.12-2.2.1.tgz
cd kafka_2.12-2.2.1

Create a Kafka topic:

bin/kafka-topics.sh \
  --zookeeper ${ZOOKEEPER_CONNECT_STRING} \
  --create \
  --topic pullRequestMergedEventsAwsMskDemo \
  --replication-factor 1 \
  --partitions 1

Topic creation succeeds with the message below:

Created topic "pullRequestMergedEventsAwsMskDemo".

Write sample data into Kafka

Once the topic is created, we can start a simple application to produce to it.

You can download the YAML file below, then replace:

${ZOOKEEPER_CONNECT_STRING} -> MSK Zookeeper String
${BROKER_LIST_STRING} -> MSK Plaintext Broker String in the deployment
${GITHUB_PERSONAL_ACCESS_TOKEN} -> A GitHub Personal Access Token generated from here. Please grant all read permissions to it. Here is the source code to generate GitHub events.

Then apply the YAML file:

kubectl apply -f github-events-aws-msk-demo.yaml
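
If you prefer to script the placeholder replacement rather than editing the file by hand, it could be sketched as follows. This is a minimal illustration using Python's standard string.Template; the render_manifest helper, the two-line sample manifest, and the host names in it are hypothetical, and the real github-events-aws-msk-demo.yaml contains more than is shown here.

```python
from string import Template


def render_manifest(manifest_text, values):
    """Fill ${NAME} placeholders; placeholders without a value are left untouched."""
    return Template(manifest_text).safe_substitute(values)


# Hypothetical two-line excerpt standing in for the downloaded YAML file.
sample_manifest = (
    "zookeeper: ${ZOOKEEPER_CONNECT_STRING}\n"
    "brokers: ${BROKER_LIST_STRING}\n"
)

rendered = render_manifest(
    sample_manifest,
    {
        "ZOOKEEPER_CONNECT_STRING": "z-1.example:2181",  # placeholder value
        "BROKER_LIST_STRING": "b-1.example:9092,b-2.example:9092",  # placeholder value
    },
)
print(rendered)
```

safe_substitute is used so that a forgotten value leaves its ${...} marker visible in the output instead of raising an error. Be aware that Template also collapses a literal $$ into $, which matters if the manifest happens to contain one.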

Once the pod is up, you can verify it by running a console consumer to read from the topic.

Run it from the Pinot controller container entered in the step above.

bin/kafka-console-consumer.sh \
  --bootstrap-server ${BROKER_LIST_STRING} \
  --topic pullRequestMergedEventsAwsMskDemo

Create a Pinot table

This step is relatively easy.

Since the table creation request is already in the ConfigMap, we can enter the pinot-github-events-data-into-msk-kafka pod and execute the command from there.

Check if the pod is running:

kubectl get pod -n pinot-quickstart  |grep pinot-github-events-data-into-msk-kafka

Sample output:

pinot-github-events-data-into-msk-kafka-68948fb4cd-rrzlf   1/1     Running     0          14m

Enter the pod:

podname=$(kubectl get pod -n pinot-quickstart | grep pinot-github-events-data-into-msk-kafka | awk '{print $1}')
kubectl exec -it ${podname} -n pinot-quickstart -- bash

Create Table

bin/pinot-admin.sh AddTable \
  -controllerHost pinot-controller \
  -tableConfigFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_realtime_table_config.json \
  -schemaFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_schema.json \
  -exec

Sample output:

Executing command: AddTable -tableConfigFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_realtime_table_config.json -schemaFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
Sending request: http://pinot-controller:9000/schemas to controller: pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local, version: Unknown
{"status":"Table pullRequestMergedEventsAwsMskDemo_REALTIME succesfully added"}

Then you can open the Pinot Query Console to browse the data.

Authentication, Authorization, and ACLs

Set up HTTP basic auth and ACLs for access to controller and broker

Quickstart

Tokens and User Credentials

1. Create "admin" and "user" in the controller properties

2. Distribute service tokens to Pinot's components

For simplicity, we'll reuse the admin credentials as service tokens. In a production environment you'll keep them separate.

# Enable the controller to fetch segments by providing the credentials as a token
controller.segment.fetcher.auth.token=Basic YWRtaW46dmVyeXNlY3JldA

# "Basic " + base64encode("admin:verysecret")

# no tokens required

segment.fetcher.auth.token=Basic YWRtaW46dmVyeXNlY3JldA
task.auth.token=Basic YWRtaW46dmVyeXNlY3JldA

pinot.server.segment.fetcher.auth.token=Basic YWRtaW46dmVyeXNlY3JldA
pinot.server.segment.uploader.auth.token=Basic YWRtaW46dmVyeXNlY3JldA
pinot.server.instance.auth.token=Basic YWRtaW46dmVyeXNlY3JldA

Restart the affected components for the configuration changes to take effect.
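
The token format used in this step ("Basic " followed by the base64 encoding of "admin:verysecret") can be reproduced with a few lines of Python. This is a minimal sketch; make_basic_token and parse_basic_token are hypothetical helper names, not part of Pinot. Standard base64 output for this input ends in == padding, whereas the values shown in this guide omit it, so the parser below restores any missing padding before decoding.

```python
import base64


def make_basic_token(username, password):
    """Return "Basic " + base64("username:password") with standard padding."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_basic_token(token):
    """Recover (username, password) from a Basic token, padded or not."""
    encoded = token.strip()
    if encoded.startswith("Basic "):
        encoded = encoded[len("Basic "):]
    encoded += "=" * (-len(encoded) % 4)  # restore padding if it was stripped
    username, _, password = base64.b64decode(encoded).decode("utf-8").partition(":")
    return username, password


print(make_basic_token("admin", "verysecret"))
print(parse_basic_token("Basic YWRtaW46dmVyeXNlY3JldA"))
```

Decoding a token this way is also a quick check that a value pasted into a properties file holds the credentials you intended.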

3. Enable ACL enforcement on the controller

controller.admin.access.control.factory.class=org.apache.pinot.controller.api.access.BasicAuthAccessControlFactory

After a controller restart, any access to controller APIs requires authentication information, whether from internal components, external users, or the Web UI.

4. Create users and enable ACL enforcement on the Broker

# the factory class property is different for the broker
pinot.broker.access.control.class=org.apache.pinot.broker.broker.BasicAuthAccessControlFactory

pinot.broker.access.control.principals=admin,user
pinot.broker.access.control.principals.admin.password=verysecret
pinot.broker.access.control.principals.user.password=secret

# No need to set READ permissions here since broker requests are read-only

After restarting the broker, any access to broker APIs requires authentication information as well.

Congratulations! You've successfully enabled authentication on Apache Pinot. Read on to learn more about the details and advanced configuration options.

Authentication with Web UI and API

$ bin/pinot-admin.sh PostQuery \
  -user user -password secret \
  -brokerPort 8000 -query 'SELECT * FROM baseballStats'

$ bin/pinot-admin.sh PostQuery \
  -authToken "Basic dXNlcjpzZWNyZXQ=" \
  -brokerPort 8000 -query 'SELECT * FROM baseballStats'

$ curl http://localhost:8000/query/sql \
  -H 'Authorization: Basic dXNlcjpzZWNyZXQ=' \
  -d '{"sql":"SELECT * FROM baseballStats"}'
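
The curl call above can also be issued from a script. The following is a minimal sketch using only the Python standard library; build_query_request is a hypothetical helper, and the host, port, and credentials are the example values from this guide. The request is only constructed here, and the lines that would send it are left commented out because they need a running broker.

```python
import base64
import json
import urllib.request


def build_query_request(broker_url, username, password, sql):
    """Build a POST to <broker_url>/query/sql carrying an HTTP Basic Authorization header."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        broker_url + "/query/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    "http://localhost:8000", "user", "secret", "SELECT * FROM baseballStats"
)
print(request.full_url, request.get_header("Authorization"))

# To send it against a running broker:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```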

Controller Authentication and Authorization

This configuration will automatically allow other Pinot components to access pinot-controller with the shared admin service token set up earlier.

If *.principals.<user>.tables is not configured, all tables are accessible to <user>.

Broker Authentication and Authorization

If *.principals.<user>.tables is not configured, all tables are accessible to <user>.

Minion and ingestion jobs

Running Pinot in Production

Requirements

You will need the following in order to run Pinot in production:

Hardware for controller/broker/servers as per your load
A working installation of ZooKeeper that Pinot can use. We recommend setting aside a path within ZooKeeper and including that path in pinot.controller.zkStr. Pinot will create its own cluster under this path (the cluster name is determined by pinot.controller.helixClusterName)
Shared storage mounted on controllers (if you plan to have multiple controllers for the same cluster). Alternatively, an implementation of PinotFS that the Pinot hosts have access to.
HTTP load balancers for spraying queries across brokers (or other mechanism to balance queries)
HTTP load balancers for spraying controller requests (e.g. segment push, or other controller APIs) or other mechanisms for distribution of these requests.

Deploying Pinot

In general, when deploying Pinot services, it is best to adhere to a specific ordering of the components. This order is recommended so that, when a release contains protocol changes or other significant differences, the deployments go out in a predictable sequence that avoids failures caused by those changes.

The ordering is as follows:

pinot-controller
pinot-broker
pinot-server
pinot-minion
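
If you script your rollouts, the ordering above can be encoded once and reused. The following is a minimal sketch; deployment_plan is a hypothetical helper, and how each component is actually deployed or restarted is left to your own tooling.

```python
# Recommended deployment order, as listed above.
DEPLOY_ORDER = ["pinot-controller", "pinot-broker", "pinot-server", "pinot-minion"]


def deployment_plan(components):
    """Return the given components sorted into the recommended deployment order."""
    unknown = set(components) - set(DEPLOY_ORDER)
    if unknown:
        raise ValueError("unknown components: {}".format(sorted(unknown)))
    return sorted(set(components), key=DEPLOY_ORDER.index)


print(deployment_plan(["pinot-server", "pinot-minion", "pinot-controller"]))
```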

Managing Pinot

Pinot provides a web-based management console and a command-line utility (pinot-admin.sh) in order to help provision and manage pinot clusters.

Pinot Management Console

The web-based management console allows operations on tables, tenants, segments, and schemas. You can access the console via http://controller-host:port/help. The console also allows you to enter queries for interactive debugging. Here are some screenshots from the console.

Listing all the schemas in the Pinot cluster:

Rebalancing segments of a table:

Command line utility (pinot-admin.sh)

The command line utility (pinot-admin.sh) can be generated by running mvn install package -DskipTests -Pbin-dist in the directory in which you checked out Pinot.

Here is an example of invoking the command to create a pinot segment:

$ ./pinot-distribution/target/apache-pinot-0.8.0-SNAPSHOT-bin/apache-pinot-0.8.0-SNAPSHOT-bin/bin/pinot-admin.sh CreateSegment -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -tableName baseballStats -segmentName baseballStats_data -overwrite -schemaFile ./pinot-distribution/target/apache-pinot-0.8.0-SNAPSHOT-bin/apache-pinot-0.8.0-SNAPSHOT-bin/sample_data/baseballStats_schema.json
Executing command: CreateSegment  -generatorConfigFile null -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -overwrite true -tableName baseballStats -segmentName baseballStats_data -timeColumnName null -schemaFile ./pinot-distribution/target/apache-pinot-0.8.0-SNAPSHOT-bin/apache-pinot-0.8.0-SNAPSHOT-bin/sample_data/baseballStats_schema.json -readerConfigFile null -enableStarTreeIndex false -starTreeIndexSpecFile null -hllSize 9 -hllColumns null -hllSuffix _hll -numThreads 1
Accepted files: [/Users/host1/Desktop/test/baseballStats_data.csv]
Finished building StatsCollector!
Collected stats for 97889 documents
Created dictionary for INT column: homeRuns with cardinality: 67, range: 0 to 73
Created dictionary for INT column: playerStint with cardinality: 5, range: 1 to 5
Created dictionary for INT column: groundedIntoDoublePlays with cardinality: 35, range: 0 to 36
Created dictionary for INT column: numberOfGames with cardinality: 165, range: 1 to 165
Created dictionary for INT column: AtBatting with cardinality: 699, range: 0 to 716
Created dictionary for INT column: stolenBases with cardinality: 114, range: 0 to 138
Created dictionary for INT column: tripples with cardinality: 32, range: 0 to 36
Created dictionary for INT column: hitsByPitch with cardinality: 41, range: 0 to 51
Created dictionary for STRING column: teamID with cardinality: 149, max length in bytes: 3, range: ALT to WSU
Created dictionary for INT column: numberOfGamesAsBatter with cardinality: 166, range: 0 to 165
Created dictionary for INT column: strikeouts with cardinality: 199, range: 0 to 223
Created dictionary for INT column: sacrificeFlies with cardinality: 20, range: 0 to 19
Created dictionary for INT column: caughtStealing with cardinality: 36, range: 0 to 42
Created dictionary for INT column: baseOnBalls with cardinality: 154, range: 0 to 232
Created dictionary for STRING column: playerName with cardinality: 11976, max length in bytes: 43, range:  to Zoilo Casanova
Created dictionary for INT column: doules with cardinality: 64, range: 0 to 67
Created dictionary for STRING column: league with cardinality: 7, max length in bytes: 2, range: AA to UA
Created dictionary for INT column: yearID with cardinality: 143, range: 1871 to 2013
Created dictionary for INT column: hits with cardinality: 250, range: 0 to 262
Created dictionary for INT column: runsBattedIn with cardinality: 175, range: 0 to 191
Created dictionary for INT column: G_old with cardinality: 166, range: 0 to 165
Created dictionary for INT column: sacrificeHits with cardinality: 54, range: 0 to 67
Created dictionary for INT column: intentionalWalks with cardinality: 45, range: 0 to 120
Created dictionary for INT column: runs with cardinality: 167, range: 0 to 192
Created dictionary for STRING column: playerID with cardinality: 18107, max length in bytes: 9, range: aardsda01 to zwilldu01
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /Users/host1/Desktop/test2/baseballStats_data_0 to v3 format
v3 segment location for segment: baseballStats_data_0 is /Users/host1/Desktop/test2/baseballStats_data_0/v3
Deleting files in v1 segment directory: /Users/host1/Desktop/test2/baseballStats_data_0
Driver, record read time : 369
Driver, stats collector time : 0
Driver, indexing time : 373

Here is an example of executing a query on a Pinot table:

$ ./pinot-distribution/target/apache-pinot-0.8.0-SNAPSHOT-bin/apache-pinot-0.8.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -query "select count(*) from baseballStats"
Executing command: PostQuery -brokerHost [broker_host] -brokerPort [broker_port] -query select count(*) from baseballStats
Result: {"aggregationResults":[{"function":"count_star","value":"97889"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":97889,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":0,"numGroupsLimitReached":false,"totalDocs":97889,"timeUsedMs":107,"segmentStatistics":[],"traceInfo":{}}

Monitoring Pinot

Pinot exposes several metrics to monitor the service and ensure that Pinot users are not experiencing issues. This section discusses some of the key metrics that are useful to monitor. A full list of metrics is available in the Metrics section.

Pinot Server

Missing Segments - NUM_MISSING_SEGMENTS
- Number of missing segments that the broker queried for (expected to be on the server) but the server didn’t have. This can be due to retention or a stale routing table.
Query latency - TOTAL_QUERY_TIME
- Total time taken from receiving the query to finishing its execution.
Query Execution Exceptions - QUERY_EXECUTION_EXCEPTIONS
- The number of exceptions that occurred during query execution.
Realtime Consumption Status - LLC_PARTITION_CONSUMING
- This gives a binary value based on whether low-level consumption is healthy (1) or unhealthy (0). It’s important to ensure at least a single replica of each partition is consuming.
Realtime Highest Offset Consumed - HIGHEST_STREAM_OFFSET_CONSUMED
- The highest offset which has been consumed so far.
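
The rule that at least one replica of each partition should be consuming can be turned into a simple alert check. The sketch below assumes you have already collected the LLC_PARTITION_CONSUMING value (1 or 0) from every replica of every partition; the input mapping and the partitions_without_consumer helper are hypothetical, since how metrics are gathered depends on your monitoring setup.

```python
def partitions_without_consumer(consuming_by_partition):
    """Return the partitions for which no replica reports LLC_PARTITION_CONSUMING == 1."""
    return sorted(
        partition
        for partition, replica_values in consuming_by_partition.items()
        if not any(value == 1 for value in replica_values)
    )


# Hypothetical readings: partition id -> metric value reported by each replica.
readings = {0: [1, 1], 1: [0, 1], 2: [0, 0]}
print(partitions_without_consumer(readings))  # [2]
```

Partition 1 is not flagged because one of its replicas is still consuming; only partition 2, where every replica reports 0, needs attention.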

Pinot Broker

Incoming QPS (per broker) - QUERIES
- The rate at which an individual broker is receiving queries. Units are in QPS.
Dropped Requests - REQUEST_DROPPED_DUE_TO_SEND_ERROR, REQUEST_DROPPED_DUE_TO_CONNECTION_ERROR, REQUEST_DROPPED_DUE_TO_ACCESS_ERROR
- These metrics indicate that a query was dropped, i.e. the processing of that query was forfeited for some reason.
Partial Responses - BROKER_RESPONSES_WITH_PARTIAL_SERVERS_RESPONDED
- Indicates a count of partial responses. A partial response is when at least 1 of the requested servers fails to respond to the query.
Table QPS quota exceeded - QUERY_QUOTA_EXCEEDED
- Binary metric that indicates whether the configured QPS quota for a table is exceeded (1) or there is capacity remaining (0).
Table QPS quota usage percent - QUERY_QUOTA_CAPACITY_UTILIZATION_RATE
- Percentage of the configured QPS quota being utilized.
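
The partial-response condition described above can also be checked for an individual query, because the broker response shown in the PostQuery example earlier carries numServersQueried and numServersResponded. The following is a minimal sketch; is_partial_response is a hypothetical helper, and the sample is an abridged copy of that earlier result.

```python
import json


def is_partial_response(broker_response):
    """True when fewer servers responded than were queried."""
    return broker_response["numServersResponded"] < broker_response["numServersQueried"]


# Abridged from the PostQuery result shown earlier in this guide.
sample = json.loads(
    '{"exceptions":[],"numServersQueried":1,"numServersResponded":1,'
    '"totalDocs":97889,"timeUsedMs":107}'
)
print(is_partial_response(sample))  # False
```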

Pinot Controller

Many of the controller metrics include a table name and thus are dynamically generated in the code. The metrics below point to the classes which generate the corresponding metrics.

To get the real metric name, the easiest route is to spin up a controller instance, create a table with the desired name and look through the generated metrics.

Todo

Give a more detailed explanation of how metrics are generated, how to identify real metrics names and where to find them in the code.

Percent Segments Available - PERCENT_SEGMENTS_AVAILABLE
- Percentage of complete online replicas in external view as compared to replicas in ideal state.
Segments in Error State - SEGMENTS_IN_ERROR_STATE
- Number of segments in an ERROR state for a given table.
Last push delay - Generated in the ValidationMetrics class.
- The time in hours since the last time an offline segment has been pushed to the controller.
Percent of replicas up - PERCENT_OF_REPLICAS
- Percentage of complete online replicas in external view as compared to replicas in ideal state.
Table storage quota usage percent - TABLE_STORAGE_QUOTA_UTILIZATION
- Shows how much of the table’s storage quota is currently being used. The metric is a percentage of the entire quota.