This page contains guides related to importing data from Apache Kafka using stream ingestion.
This section contains a collection of short guides showing how to import data from a Pinot-supported file system.
This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.
This section contains articles that provide technical and implementation details of Pinot features.
Here you will find a collection of ready-made sample applications and examples for real-world data.
The Pinot Controller is responsible for a number of things:

- Controllers maintain the global metadata (e.g. configs and schemas) of the system, with the help of Zookeeper, which is used as the persistent metadata store.
- Controllers host the Helix Controller and are responsible for managing other Pinot components (brokers, servers, minions).
- They maintain the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for, and by the brokers to decide which servers to route queries to.
- Controllers have admin endpoints for viewing, creating, updating, and deleting configs, which help us manage and operate the cluster.
- Controllers also have endpoints for segment uploads, which are used in offline data pushes. They are responsible for initializing realtime consumption and for coordinating the periodic persistence of realtime segments into the segment store.
- They undertake other management activities such as managing segment retention and running validations.
There can be multiple instances of Pinot controller for redundancy. If there are multiple controllers, Pinot expects that all of them are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or ADLS.
Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a controller:
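A minimal invocation of the admin script (a sketch assuming Zookeeper is running locally on port 2181):

```bash
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -controllerPort 9000
```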
Pinot Minion is a new component which leverages the Helix Task Framework. It can be attached to an existing Pinot cluster and then execute tasks provided by the controller. It's a single, generic place for running background jobs, helping offload computationally intensive tasks, such as adding indexes to segments and merging segments, from other components.
This quick start guide will help you bootstrap a Pinot standalone instance on your local machine.
In this guide you'll learn how to download and install Apache Pinot as a standalone instance.
This is a quickstart guide that will show you how to quickly start an example recipe in a standalone instance and is meant for learning. To run Pinot in cluster mode, please take a look at Manual cluster setup.
First, let's download the Pinot distribution for this tutorial. You can either build the distribution from source or download a packaged release.
Prerequisites
Install JDK 8 or higher.
Follow these steps to check out the code from GitHub and build Pinot locally:
Prerequisites
Install Apache Maven 3.6 or higher
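For example, a sketch assuming the incubator-era GitHub repository:

```bash
# clone the source and build the binary distribution (skipping tests for speed)
git clone https://github.com/apache/incubator-pinot.git
cd incubator-pinot
mvn clean install -DskipTests -Pbin-dist
```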
Note that the Pinot scripts are located under pinot-distribution/target, not the target directory under the root.
Download the latest binary release from Apache Pinot, or use this command:
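For example (the exact URL is an assumption; substitute VERSION with the latest release listed on the download page):

```bash
wget https://downloads.apache.org/incubator/pinot/apache-pinot-incubating-VERSION/apache-pinot-incubating-VERSION-bin.tar.gz
```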
Once you have the tar file, extract it and change into the extracted directory:
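```bash
# VERSION is the release you downloaded above
tar -xvf apache-pinot-incubating-VERSION-bin.tar.gz
cd apache-pinot-incubating-VERSION-bin
```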
We'll be using a quick-start script, which does the following:
- Sets up the Pinot cluster QuickStartCluster
- Creates a sample table and loads sample data
There are 3 kinds of quick start:
Batch quick start creates the Pinot cluster, creates an offline table baseballStats, and pushes sample offline data to the table.
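From the distribution directory, run:

```bash
./bin/quick-start-batch.sh
```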
That's it! We've spun up a Pinot cluster. You can continue playing with other types of quick start, or simply head on to Pinot Data Explorer to check out the data in the baseballStats table.
Streaming quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a realtime table meetupRSVP which ingests data from the Kafka topic.
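To run it:

```bash
./bin/quick-start-streaming.sh
```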
We now have a Pinot cluster with a realtime table! You can head over to Pinot Data Explorer to check out the data in the meetupRSVP table.
Hybrid quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a hybrid table airlineStats. The realtime table ingests data from the Kafka topic. Lastly, sample data is pushed into the offline table.
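To run it:

```bash
./bin/quick-start-hybrid.sh
```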
Let's head over to Pinot Data Explorer to check out the data we pushed to the airlineStats table.
This page has a collection of frequently asked questions with answers from the community.
This is a list of frequent questions most often asked in our troubleshooting channel on Slack. Please feel free to contribute your questions and answers here and make a pull request.
We have a toJsonStr(key) function which can store a top-level JSON field as a STRING in Pinot.
Then you can use the jsonExtractScalar(JSON_STRING_FIELD, JSON_PATH, OUTPUT_FORMAT) function at query time to fetch the desired field from the JSON string.
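For example, a sketch using a hypothetical table myTable with a JSON string column myJson:

```sql
SELECT jsonExtractScalar(myJson, '$.name', 'STRING')
FROM myTable
WHERE jsonExtractScalar(myJson, '$.age', 'INT') > 21
```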
NOTE: This works well if some of your fields are nested JSON, but most of your fields are top-level JSON keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as a single column, which is not ideal.
Support for flattening during ingestion is on the roadmap: https://github.com/apache/incubator-pinot/issues/5264
Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. Here's the documentation for tableIndexConfig: https://docs.pinot.apache.org/basics/components/table#tableindexconfig-1 along with a sample table that has inverted indexes set on some columns.
Applying inverted indexes to a table config will generate inverted indexes for all new segments. In order to apply the inverted indexes to all existing segments, follow the steps in How to apply inverted index to existing setup?
Add the columns you wish to index to the tableIndexConfig -> invertedIndexColumns list. This sample table config shows inverted indexes set: https://docs.pinot.apache.org/basics/components/table#offline-table-config To update the table config, use the Pinot Swagger API: http://localhost:9000/help#!/Table/updateTableConfig
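For example (the column names foo and bar are placeholders):

```json
"tableIndexConfig": {
  "invertedIndexColumns": ["foo", "bar"],
  "loadMode": "MMAP"
}
```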
Invoke the reload API: http://localhost:9000/help#!/Segment/reloadAllSegments
Right now, there’s no easy way to confirm that reload succeeded. One way is to check the index_map file inside the segment metadata; you should see inverted index entries for the new columns. An API for this is coming soon: https://github.com/apache/incubator-pinot/issues/5390
Here's the page explaining the Pinot response format: https://docs.pinot.apache.org/users/api/querying-pinot-using-standard-sql/response-format
"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.
Other commonly encountered reserved keywords are date, time, table.
For filtering on STRING columns, use single quotes.
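For example, a sketch combining the escaped keyword and a string filter (myTable and its columns are hypothetical):

```sql
SELECT COUNT(*)
FROM myTable
WHERE "timestamp" > 1577836800000
  AND city = 'San Francisco'
```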
The fields in the ORDER BY clause must be one of the group by clauses or aggregations, BEFORE applying the alias. Therefore, this will not work:
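Illustrated with the baseballStats table from the quick start (the cnt alias is hypothetical):

```sql
SELECT playerName, COUNT(*) AS cnt
FROM baseballStats
GROUP BY playerName
ORDER BY cnt
```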
Instead, this will work:
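The same query, ordering by the aggregation expression itself:

```sql
SELECT playerName, COUNT(*) AS cnt
FROM baseballStats
GROUP BY playerName
ORDER BY COUNT(*)
```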
You can change the number of replicas by updating the table config's segmentsConfig section. Make sure you have at least as many servers as the replication.
For an OFFLINE table, update replication. For a REALTIME table, update replicasPerPartition.
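For example, in segmentsConfig (the values are illustrative; use replication for OFFLINE tables and replicasPerPartition for REALTIME tables):

```json
"segmentsConfig": {
  "replication": "3",
  "replicasPerPartition": "3"
}
```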
After changing the replication, run a table rebalance.
A rebalance is run to reassign all the segments of a table to the available servers. This is typically done when capacity changes occur, i.e. when servers are added to or removed from a table.
Offline
Use the rebalance API from the Swagger APIs on the controller http://localhost:9000/help#!/Table/rebalance, with tableType OFFLINE
Realtime
Use the rebalance API from the Swagger APIs on the controller http://localhost:9000/help#!/Table/rebalance, with tableType REALTIME.
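For example, against a local controller (the endpoint shape follows the Swagger page above; myTable is a placeholder):

```bash
curl -X POST "localhost:9000/tables/myTable/rebalance?type=REALTIME&dryRun=false"
```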
A realtime table has 2 components: the consuming segments and the completed segments. By default, only the completed segments will get rebalanced; the consuming segments will pick the right assignment once they complete. But you can force the consuming segments to also be included in the rebalance by setting the param includeConsuming to true. Note that rebalancing the consuming segments means each consuming segment will drop the data consumed so far and restart consumption from the last offset, which may lead to a short duration of data staleness.
You can check the status of the rebalance by:

- Checking the controller logs
- Running rebalance again after a while; you should receive the status "status": "NO_OP"
- Checking the External View of the table, to see that the changes in capacity/replicas have taken effect
Yes, replica groups work for realtime. There are 2 parts to enabling replica groups:

- Replica group segment assignment
- Replica group query routing
Replica group segment assignment
Replica group segment assignment is achieved in realtime if the number of servers is a multiple of the number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups.
For example, consider we have 6 partitions, 2 replicas, and 4 servers.
|    | r1 | r2 |
| -- | -- | -- |
| p1 | S0 | S1 |
| p2 | S2 | S3 |
| p3 | S0 | S1 |
| p4 | S2 | S3 |
| p5 | S0 | S1 |
| p6 | S2 | S3 |
As you can see, the set (S0, S2) contains r1 of every partition, and (S1, S3) contains r2 of every partition. The query will only be routed to one of the sets, and not span every server. If you are adding/removing servers from an existing table setup, you have to run a rebalance for segment assignment changes to take effect.
Replica group query routing
Once replica group segment assignment is in effect, the query routing can take advantage of it. For replica group based query routing, set the following in the table config's routing section, and then restart brokers:
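A sketch of the routing section:

```json
"routing": {
  "instanceSelectorType": "replicaGroup"
}
```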
This starter provides a quick start for running Pinot on Google Cloud Platform (GCP)
This document provides the basic instructions to set up a Kubernetes cluster on Google Kubernetes Engine (GKE).
Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.
For Mac User
Please check kubectl version after installation.
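For example, via Homebrew (assumes Homebrew is installed):

```bash
brew install kubernetes-cli   # installs kubectl
kubectl version               # check client and server versions
```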
QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.
For Mac User
Please check helm version after installation.
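For example, via Homebrew (older setups used the kubernetes-helm formula):

```bash
brew install helm
helm version
```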
This QuickStart provides helm support for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.
Please follow this link (https://cloud.google.com/sdk/install) to install Google Cloud SDK.
Install Google Cloud SDK
Restart your shell
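A typical install, using the standard interactive installer from the Google Cloud SDK docs:

```bash
curl https://sdk.cloud.google.com | bash
exec -l $SHELL   # restart your shell so gcloud is on the PATH
```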
The script below will create a 3-node cluster named pinot-quickstart in us-west1-b with n1-standard-2 machines for demo purposes.
Please modify the parameters in the example command below:
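A sketch with the parameters from above pulled out as variables (replace the project id with your own):

```bash
GCLOUD_PROJECT=your-gcp-project   # assumption: your GCP project id
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
GCLOUD_MACHINE_TYPE=n1-standard-2
GCLOUD_NUM_NODES=3
gcloud container clusters create ${GCLOUD_CLUSTER} \
  --project ${GCLOUD_PROJECT} \
  --zone ${GCLOUD_ZONE} \
  --machine-type ${GCLOUD_MACHINE_TYPE} \
  --num-nodes ${GCLOUD_NUM_NODES}
```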
You can monitor cluster status by command:
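For example:

```bash
gcloud container clusters list
```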
Once the cluster is in RUNNING status, it's ready to be used.
Simply run the command below to get the credentials for the cluster pinot-quickstart that you just created, or for your existing cluster.
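For example:

```bash
gcloud container clusters get-credentials pinot-quickstart --zone us-west1-b
```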
To verify the connection, you can run:
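For example:

```bash
kubectl get nodes
```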
Please follow this Kubernetes QuickStart to deploy your Pinot Demo.
This starter guide provides a quick start for running Pinot on Microsoft Azure
This document provides the basic instructions to set up a Kubernetes cluster on Azure Kubernetes Service (AKS).
Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.
For Mac User
Please check kubectl version after installation.
QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.
For Mac User
Please check helm version after installation.
This QuickStart provides helm support for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.
Please follow this link (https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) to install Azure CLI.
For Mac User
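For example, via Homebrew:

```bash
brew update && brew install azure-cli
```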
The script below will open the default browser to sign in to your Azure account.
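For example:

```bash
az login
```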
The script below will create a resource group in the location eastus.
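A sketch (the resource group name is an assumption):

```bash
AKS_RESOURCE_GROUP=pinot-quickstart
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} --location ${AKS_RESOURCE_GROUP_LOCATION}
```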
The script below will create a 3-node cluster named pinot-quickstart for demo purposes.
Please modify the parameters in the example command below:
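A sketch, reusing the resource group created above:

```bash
AKS_CLUSTER_NAME=pinot-quickstart
az aks create \
  --resource-group ${AKS_RESOURCE_GROUP} \
  --name ${AKS_CLUSTER_NAME} \
  --node-count 3
```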
Once the command succeeds, the cluster is ready to be used.
Simply run the command below to get the credentials for the cluster pinot-quickstart that you just created, or for your existing cluster.
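For example:

```bash
az aks get-credentials --resource-group pinot-quickstart --name pinot-quickstart
```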
To verify the connection, you can run:
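For example:

```bash
kubectl get nodes
```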
Please follow this Kubernetes QuickStart to deploy your Pinot Demo.
This guide provides a quick start for running Pinot on Amazon Web Services (AWS).
This document provides the basic instructions to set up a Kubernetes cluster on Amazon Elastic Kubernetes Service (Amazon EKS).
Please follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.
For Mac User
Please check kubectl version after installation.
QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
Please follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.
For Mac User
Please check helm version after installation.
This QuickStart provides helm support for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.
Please follow this link (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html#install-tool-bundled) to install AWS CLI.
For Mac User
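For example, via Homebrew:

```bash
brew install awscli
```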
Please follow this link (https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html#installing-eksctl) to install eksctl.
For Mac User
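For example, via Homebrew (eksctl's documented tap):

```bash
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
```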
If you are a first-time AWS user, please register your account at https://aws.amazon.com/. Once you have created the account, go to AWS Identity and Access Management (IAM) to create a user and create access keys under the Security Credentials tab.
The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY will override the AWS configuration stored in the file ~/.aws/credentials.
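To set up your credentials locally, for example:

```bash
aws configure
```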
The script below will create a 3-node cluster named pinot-quickstart in us-west-2 with t3.small machines for demo purposes.
Please modify the parameters in the example command below:
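A sketch with the parameters from above (adjust as needed):

```bash
EKS_CLUSTER_NAME=pinot-quickstart
eksctl create cluster \
  --name ${EKS_CLUSTER_NAME} \
  --region us-west-2 \
  --nodes 3 \
  --node-type t3.small
```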
You can monitor cluster status by command:
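For example:

```bash
aws eks describe-cluster --name pinot-quickstart --region us-west-2
```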
Once the cluster is in ACTIVE status, it's ready to be used.
Simply run the command below to get the credentials for the cluster pinot-quickstart that you just created, or for your existing cluster.
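For example:

```bash
aws eks update-kubeconfig --name pinot-quickstart --region us-west-2
```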
To verify the connection, you can run:
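For example:

```bash
kubectl get nodes
```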
Please follow this Kubernetes QuickStart to deploy your Pinot Demo.
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.
This guide is a work in progress.
We're actively working on improving our documentation. This doc will be available very soon. Please check back in a day or two for more details.
While you're waiting, please take a look at our section on plugin architecture.
This guide shows you how to import data from files stored in Azure Data Lake Storage (ADLS)
This guide is a work in progress.
We're actively working on improving our documentation. This doc will be available very soon. Please check back in a day or two for more details.
While you're waiting, please take a look at our section on plugin architecture.
This guide shows you how to import data from HDFS.
This guide is a work in progress.
We're actively working on improving our documentation. This doc will be available very soon. Please check back in a day or two for more details.
While you're waiting, please take a look at our section on plugin architecture.
This guide shows you how to import data from GCP (Google Cloud Platform).
This guide is a work in progress.
We're actively working on improving our documentation. This doc will be available very soon. Please check back in a day or two for more details.
While you're waiting, please take a look at our section on plugin architecture.
Step-by-step guide on pushing your own data into the Pinot cluster
So far, we've set up our cluster, run some queries, and explored the admin endpoints. Now it's time to get our own data into Pinot.
Let's gather our data files and put them in pinot-quick-start/rawdata.
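For example:

```bash
mkdir -p pinot-quick-start/rawdata
```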
Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, and ORC. If you don't have sample data, you can use this sample CSV.
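For example, a small transcript-style CSV (illustrative data; save it as pinot-quick-start/rawdata/transcript.csv):

```csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
```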
Schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.
Briefly, we categorize our columns into 3 types:

| Column Type | Description |
| ----------- | ----------- |
| Dimensions | Typically used in filters and group by, for slicing and dicing into data |
| Metrics | Typically used in aggregations; represents the quantitative data |
| Time | Optional column; represents the timestamp associated with each row |
For example, in our sample table, the playerID, yearID, teamID, league, playerName columns are the dimensions; the playerStint, numberOfGames, numberOfGamesAsBatter, AtBatting, runs, hits, doubles, triples, homeRuns, runsBattedIn, stolenBases, caughtStealing, baseOnBalls, strikeouts, intentionalWalks, hitsByPitch, sacrificeHits, sacrificeFlies, groundedIntoDoublePlays, G_old columns are the metrics; and there is no time column.
Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the reference below.
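For the transcript CSV above, a minimal schema sketch (save as pinot-quick-start/transcript-schema.json) might look like:

```json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```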
A table config is used to define the config related to the Pinot table. A detailed overview of the table can be found in Table.
Here's the table config for the sample CSV file. You can use this as a reference to build your own table config. Simply edit the tableName and schemaName.
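A sketch for the transcript data (save as pinot-quick-start/transcript-table-offline.json):

```json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication": "1"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}
```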
Check the directory structure so far:
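Assuming the file names used above:

```
pinot-quick-start/
├── rawdata/
│   └── transcript.csv
├── transcript-schema.json
└── transcript-table-offline.json
```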
Upload the table config using the following command:
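For example, using the admin script (paths assume the layout above):

```bash
bin/pinot-admin.sh AddTable \
  -tableConfigFile pinot-quick-start/transcript-table-offline.json \
  -schemaFile pinot-quick-start/transcript-schema.json \
  -exec
```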
Check out the table config and schema in the Rest API to make sure it was successfully uploaded.
A Pinot table's data is stored as Pinot segments. A detailed overview of the segment can be found in Segment.
To generate a segment, we first need to create a job spec yaml file. The job spec yaml file has all the information regarding the data format, input data location, and Pinot cluster coordinates. You can just copy over this job spec file. If you're using your own data, be sure to 1) replace transcript with your table name, and 2) set the right recordReaderSpec.
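A standalone job spec sketch (save as pinot-quick-start/batch-job-spec.yml; the class names are the standard Pinot batch-ingestion plugins, and the URIs/paths are assumptions for a local setup):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://localhost:9000/tables/transcript/schema'
  tableConfigURI: 'http://localhost:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```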
Use the following command to generate a segment and upload it:
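For example:

```bash
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile pinot-quick-start/batch-job-spec.yml
```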
Sample output
Check that your segment made it to the table using the Rest API
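For example (endpoint per the controller's Swagger API):

```bash
curl -X GET "localhost:9000/segments/transcript" -H "accept: application/json"
```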
You're all set! You should see your table in the Query Console and be able to run queries against it now.
A table is a logical abstraction to refer to a collection of related data. It consists of columns and rows (documents).
Data in Pinot tables is sharded into segments. A Pinot table is modeled as a Helix resource. Each segment of a table is modeled as a Helix Partition.
A table is typically associated with a schema, which is used to define the names, data types and other information of the columns of the table.
There are 3 types of Pinot tables:
| Table type | Description |
| ---------- | ----------- |
| Offline | Offline tables ingest pre-built pinot-segments from external data stores. |
| Realtime | Realtime tables ingest data from streams (such as Kafka) and build segments. |
| Hybrid | A hybrid Pinot table has both realtime as well as offline tables under the hood. |
Note that the query does not know about the existence of offline or realtime tables; it only specifies the table name. For example, regardless of whether we have an offline table myTable_OFFLINE, a realtime table myTable_REALTIME, or a hybrid table containing both of these, the query will simply use myTable, as in select count(*) from myTable.
A table config file is used to define the table properties, such as name, type, indexing, routing, retention etc. It is written in JSON format, and stored in the property store in Zookeeper, along with the table schema.
Here's an example table config for an offline table:
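(A representative sketch: the values below are illustrative, with the storage quota and replication chosen to match the 140G / replication-3 example discussed under quota below.)

```json
{
  "tableName": "pinotTable",
  "tableType": "OFFLINE",
  "quota": {
    "storage": "140G",
    "maxQueriesPerSecond": "300"
  },
  "routing": {},
  "segmentsConfig": {
    "schemaName": "pinotTable",
    "timeColumnName": "daysSinceEpoch",
    "timeType": "DAYS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "365",
    "replication": "3"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": ["foo"],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "myBrokerTenant",
    "server": "myServerTenant"
  },
  "metadata": {
    "customConfigs": {
      "key": "value"
    }
  }
}
```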
We will now discuss each section of the table config in detail.
| Top level field | Description |
| --------------- | ----------- |
| tableName | Specifies the name of the table. Should only contain alpha-numeric characters, hyphens ('-'), or underscores ('_'). (Using a double-underscore ('__') is not allowed, as it is reserved for other features within Pinot.) |
| tableType | Defines the table type: OFFLINE for an offline table, REALTIME for a realtime table. A hybrid table is essentially 2 table configs, one of each type, with the same table name. |
| quota | Properties related to quotas, such as the storage quota and query quota (see the quota fields below). |
| routing | Properties that determine how queries are routed to servers, e.g. replica group based routing. |
| segmentsConfig | Properties related to the table's segments, such as replication, retention, and the time column. |
| tableIndexConfig | Indexing-related configuration for the table, such as invertedIndexColumns and loadMode. |
| tenants | The broker and server tenants used by this table. |
| metadata | This section is for keeping custom configs, which are expressed as key-value pairs. |
| quota fields | Description |
| ------------ | ----------- |
| storage | The maximum storage space the table is allowed to use, before replication. For example, in the example config above, the storage is 140G and replication is 3, so the maximum storage the table is allowed to use is 140*3=420G. The space used by the table is calculated by adding up the sizes of all segments from every server hosting this table. Once this limit is reached, an offline segment push throws a 403 exception with the message Quota check failed for segment: segment_0 of table: pinotTable. |
| maxQueriesPerSecond | The maximum queries per second allowed to execute on this table. If query volume exceeds this, a 429 exception with the message Request 123 exceeds query quota for table:pinotTable, query:select count(*) from pinotTable is returned. |