What is Pinot?

Learn what Apache Pinot is, what problems it solves, and whether it is the right tool for your use case.

Outcome

By the end of this page you will understand what Apache Pinot is, what problems it solves, and whether it is the right tool for your use case.

Prerequisites

None. This is the starting point of the onboarding path.

What Apache Pinot does

Apache Pinot is a real-time distributed online analytical processing (OLAP) datastore. It ingests data from streaming sources (such as Apache Kafka and Amazon Kinesis) and batch sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage) and makes that data immediately available for analytic queries with sub-second latency.

Key capabilities

  • Ultra-low-latency analytics -- Queries return in milliseconds, even at hundreds of thousands of queries per second.

  • Columnar storage with smart indexing -- A purpose-built storage format with inverted, sorted, range, text, and other indexes to accelerate query patterns.

  • Horizontal scaling -- Scale out by adding nodes, with no upper bound on cluster size.

  • Consistent performance -- Latency stays predictable as data volume and query load grow, given appropriate cluster sizing for the expected throughput.

  • Real-time ingestion -- Data is available for querying within seconds of arriving at the streaming source.

When to use Pinot

User-facing real-time analytics

Pinot was built at LinkedIn to power interactive analytics features such as Who Viewed Profile and Company Analytics. UberEats Restaurant Manager is another production example. These applications serve personalized analytics to every end user, generating hundreds of thousands of queries per second with strict latency requirements.

Real-time dashboards

Pinot supports slice-and-dice, drill-down, roll-up, and pivot operations on high-dimensional data. Connect business intelligence tools such as Apache Superset, Tableau, or Power BI to Pinot to build live dashboards over streaming data.

Enterprise analytics

Pinot works well as a highly scalable platform for business intelligence. It converges the capabilities of a big data platform with the traditional role of a data warehouse, making it suitable for analysis and reporting at scale.

Aggregate store for microservices

Application developers can use Pinot as an aggregate store that consumes events from streaming sources and exposes them through SQL. This is useful for building a unified, queryable view across a microservice architecture. As with all aggregate stores, the query model is eventually consistent.

When NOT to use Pinot

Warning: Pinot is not a general-purpose transactional database. It does not support row-level updates, deletes, or transactions the way PostgreSQL or MySQL do. If your workload requires ACID transactions or frequent single-row mutations, a relational database is a better fit.

Note: If your dataset fits comfortably in a single PostgreSQL or MySQL instance (a few million rows or less) and you do not need sub-second query latency at high concurrency, a traditional database will be simpler to operate and sufficient for your needs.

Verify

You now know:

  • What Apache Pinot is and how it differs from transactional databases.

  • The four main categories of use cases where Pinot excels.

  • When a simpler tool would be a better choice.

Next step

Continue to the 10-minute quickstart to launch a local Pinot cluster and run your first query.

    10-Minute Quickstart

    Run a complete Pinot cluster with sample data in under 10 minutes.

    Outcome

    By the end of this guide you will have a fully functional Apache Pinot cluster running locally with sample data loaded, ready to query.

    Prerequisites

    • Docker installed and running

    • Recommended resources: 8 CPUs, 16 GB RAM

    Steps

    1. Set the Pinot version

    See the Version reference page for the current stable release.

    2. Start Pinot with sample data

    This single command starts ZooKeeper, Controller, Broker, Server, and Minion, then loads a baseball statistics dataset.

    Verify

    1. Open the Pinot Query Console at http://localhost:9000.

    2. Run a sample query:

    You should see results returned within milliseconds.

    Next step

    This quickstart bundles everything in a single process for convenience. For a list of all available quickstart types (batch, streaming, hybrid, and more), see the Quick Start Examples.

    Ready for a production-style setup? Continue to Install / Deploy.

    Install / Deploy

    Choose the deployment method that matches your environment.

    Outcome

    Select the right installation method for your use case and deploy a Pinot cluster.


    Components

    Discover the core components of Apache Pinot, enabling efficient data processing and analytics. Unleash the power of Pinot's building blocks for high-performance data-driven applications.

    Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:

    • Storing data in columnar form to support high-performance scanning

    • Sharding of data to scale both storage and computation

    • A distributed architecture designed to scale capacity linearly

    • A tabular data model read by SQL queries

    Query Syntax Overview

    Query Pinot using supported syntax.


    Segment Threshold

    Learn how segment thresholds work in Pinot.

    The segment threshold determines when a segment is committed in real-time tables.

    When data is first ingested from a streaming provider like Kafka, Pinot stores the data in a consuming segment.

    This segment is on the disk of the server(s) processing a particular partition from the streaming provider.

    However, it's not until a segment is committed that the segment is written to the deep store. The segment threshold decides when that should happen.

    Why is the segment threshold important?

    The segment threshold is important because it ensures segments are a reasonable size.

    When queries are processed, smaller segments add per-segment overhead (threads spawned, metadata processing, and so on), which can increase query latency.

    Larger segments may cause servers to run out of memory. In addition, when a server restarts, the consuming segment must re-consume from its first row, causing lag between Pinot and the streaming provider.
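In recent Pinot versions, the threshold is controlled by flush settings in the real-time table's streamConfigs. As an illustrative sketch (the values here are examples, not recommendations):

```json
{
  "streamConfigs": {
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "6h",
    "realtime.segment.flush.threshold.segment.size": "200M"
  }
}
```

Setting the row threshold to 0 tells Pinot to derive the row count from the target segment size instead, so segments commit at a predictable size.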

    Mark Needham explains the segment threshold

    export PINOT_VERSION=1.4.0
    docker run \
        -p 2123:2123 \
        -p 9000:9000 \
        -p 8000:8000 \
        -p 7050:7050 \
        -p 6000:6000 \
        apachepinot/pinot:${PINOT_VERSION} QuickStart \
        -type batch
    SELECT playerName, sum(runs) AS totalRuns
    FROM baseballStats
    GROUP BY playerName
    ORDER BY totalRuns DESC
    LIMIT 10
    Deployment methods

    Method               Best for                 Time    Prerequisites
    Local                Development, debugging   10 min  JDK 11+
    Docker               Quick evaluation, CI     5 min   Docker
    Kubernetes           Staging, production      15 min  K8s cluster, Helm
    Managed Kubernetes   Production on cloud      20 min  Cloud account, CLI tools

    Next step

    Pick a method above, then continue to First table and schema.

    Components

    Learn about the major components and logical abstractions used in Pinot.

    Operator reference

    Developer reference

    • Cluster
    • Controller
    • Broker
    • Server
    • Minion
    • Tenant
    • Table
    • Schema
    • Segment

    Azure

    Provision a managed Kubernetes cluster on Azure AKS ready for Pinot.

    Outcome

    Create an Azure Kubernetes Service cluster with the required tooling, ready to deploy Apache Pinot.

    Prerequisites

    • An Azure account

    • The following CLI tools installed (see steps below)

    Steps

    1. Install tooling

    kubectl

    Verify:

    Helm

    Verify:

    Azure CLI

    Follow the Azure CLI installation guide or run:

    2. Log in to Azure

    3. Create a resource group

    4. Create an AKS cluster

    The following creates a 3-node cluster named pinot-quickstart:

    5. Connect to the cluster

    Verify

    You should see your worker nodes listed and in Ready status.

    Cleaning up

    To delete the cluster when you are done:

    Next step

    Your cluster is ready. Continue to Kubernetes install to deploy Pinot.

    Local

    Start a Pinot cluster on your local machine.

    Outcome

    Start a multi-component Pinot cluster directly on your machine without containers.

    Prerequisites

    • JDK 11 or 21 (JDK 17 should work but is not officially supported)

    • Apache Maven 3.6+ (only if building from source)

    Steps

    1. Download or build Apache Pinot

    See the Version reference page for the current stable release.

    Extract and enter the directory:

    Note: To build from source, install Apache Maven 3.6 or higher.

    2. Start ZooKeeper

    3. Start Pinot Controller

    4. Start Pinot Broker

    5. Start Pinot Server

    6. Start Pinot Minion (optional)

    7. Start Kafka (optional)

    Only needed if you plan to ingest real-time streaming data.

    Verify

    Check that the Controller is healthy:

    The response should return OK. You can also open the Pinot Query Console at http://localhost:9000.

    Next step

    Your cluster is running. Continue to First table and schema to load data.

    Managed Kubernetes

    Set up a Kubernetes cluster on your cloud provider.

    Outcome

    Provision a managed Kubernetes cluster on AWS, GCP, or Azure that is ready for a Pinot deployment.

    Overview

    These guides walk you through creating a managed Kubernetes cluster on your cloud provider. Once the cluster is running, you will use the Kubernetes install page to deploy Pinot onto it.

    Cloud providers

    Provider                Service       Guide
    Amazon Web Services     Amazon EKS    AWS setup
    Google Cloud Platform   Google GKE    GCP setup
    Microsoft Azure         Azure AKS     Azure

    Next step

    Once your cluster is ready, follow the Kubernetes install guide to deploy Pinot.

    Concepts

    Explore the fundamental concepts of Apache Pinot™ as a distributed OLAP database.

    Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:

    • Storing data in columnar form to support high-performance scanning

    • Sharding of data to scale both storage and computation

    • A distributed architecture designed to scale capacity linearly

    • A tabular data model read by SQL queries

    To learn about Pinot components, terminology, and gain a conceptual understanding of how data is stored in Pinot, review the following sections:

    GCP

    Provision a managed Kubernetes cluster on Google GKE ready for Pinot.

    Outcome

    Create a Google Kubernetes Engine cluster with the required tooling, ready to deploy Apache Pinot.


    Server

    Uncover the efficient data processing and storage capabilities of Apache Pinot's server component, optimizing performance for data-driven applications.

    Pinot servers provide the primary storage for and perform the computation required to execute queries. A production Pinot cluster contains many servers. In general, the more servers, the more data the cluster can retain in tables, the lower latency the cluster can deliver on queries, and the more concurrent queries the cluster can process.

    Servers are typically segregated into real-time and offline workloads, with "real-time" servers hosting only real-time tables and "offline" servers hosting only offline tables. This segregation is a ubiquitous operational convention, not an explicit configuration in the server process itself. There are two types of servers:

    Offline

    Broker

    Discover how Apache Pinot's broker component optimizes query processing, data retrieval, and enhances data-driven applications.

    Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return results to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.

    A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.

    Pinot brokers are modeled as Helix spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.

    The broker ensures that all the rows of the table are queried exactly once, so as to return correct, consistent results for a query. As an optimization, brokers may also prune segments from the plan, as long as doing so does not affect the correctness of the result.
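Segment pruning can be sketched as simple metadata overlap filtering (an illustrative model, not the broker's actual implementation; the field names are invented):

```python
def prune_segments(segments, query_start_ms, query_end_ms):
    """Keep only segments whose [min_time, max_time] metadata overlaps
    the query's time filter; pruned segments never reach a server."""
    return [
        s for s in segments
        if s["min_time_ms"] <= query_end_ms and s["max_time_ms"] >= query_start_ms
    ]

segments = [
    {"name": "seg_jan", "min_time_ms": 0, "max_time_ms": 100},
    {"name": "seg_feb", "min_time_ms": 101, "max_time_ms": 200},
]
print([s["name"] for s in prune_segments(segments, 150, 250)])  # ['seg_feb']
```

Because pruning relies only on segment metadata, the broker can shrink the scatter set without reading any data.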

    Ingestion

    Plan Pinot ingestion around batch, stream, upsert, dedup, formats, filesystems, and transformation choices.

    Ingestion is where Pinot tables become real. Start here to choose the right path for batch or stream data, then refine the design with upsert, dedup, file format, filesystem, transform, and aggregation decisions.

    The detailed controller and table-config material belongs in the Reference section. This section stays focused on data flow and operational choices.

    Start Here

    Stream Ingestion on Kubernetes

    Load streaming data into Pinot on Kubernetes using Kafka

    This guide walks you through loading streaming data into a Pinot cluster running in Kubernetes. Make sure you have completed Running in Kubernetes first.

    Load data into Pinot using Kafka


    Segment Retention

    In this Apache Pinot concepts guide, we'll learn how segment retention works.

    Segments in Pinot tables have a retention time, after which the segments are deleted. Typically, offline tables retain segments for a longer period of time than real-time tables.

    The removal of segments is done by the retention manager. By default, the retention manager runs once every 6 hours.
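The expiry check the retention manager applies can be sketched as follows (an illustrative model, not Pinot's actual implementation; the names are invented):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def is_expired(segment_end_time_ms: int, retention_days: int, now_ms: int) -> bool:
    """A segment is purgeable once its end time falls outside the retention window."""
    return now_ms - segment_end_time_ms > retention_days * MS_PER_DAY

# A segment that ended 10 days ago is expired under a 7-day retention policy.
now = 1_700_000_000_000
print(is_expired(now - 10 * MS_PER_DAY, 7, now))  # True
```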

    The retention manager purges two types of segments:

    • Expired segments: Segments whose end time has exceeded the retention period.

    Stream Ingestion

    Choose stream ingestion when Pinot should consume events continuously and expose new rows quickly.

    Stream ingestion keeps Pinot close to the source of truth. Use it when rows should be queryable soon after they are emitted, and when the system needs a steady flow rather than periodic batch loads.

    Core decisions

    Pick the stream connector and partitioning strategy.

    Choose how Pinot should flush, commit, and complete segments.
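These choices come together in the real-time table's streamConfigs. A minimal Kafka example might look like the following (the broker address and topic name are placeholders; key names follow recent Pinot releases):

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "my-topic",
    "stream.kafka.broker.list": "localhost:9092",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
}
```

The decoder class is chosen to match the message format on the topic; flush threshold settings (covered under Segment Threshold) live in the same map.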

    Time Boundary

    Learn about time boundaries in hybrid tables.

    A hybrid table consists of an offline table and a real-time table with the same name. When querying these tables, the Pinot broker decides which records to read from the offline table and which to read from the real-time table. It does this using the time boundary.

    How is the time boundary determined?

    The time boundary is determined by looking at the maximum end time of the offline segments and the segment ingestion frequency specified for the offline table.
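That rule can be sketched as follows (an illustrative model, not Pinot's actual code; the frequency string is a stand-in):

```python
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

def time_boundary(max_offline_end_time_ms: int, ingestion_frequency: str) -> int:
    """Derive the hybrid-table time boundary from the offline segments'
    maximum end time and the offline table's segment ingestion frequency."""
    if ingestion_frequency == "HOURLY":
        return max_offline_end_time_ms - HOUR_MS  # hourly ingestion: back off 1 hour
    return max_offline_end_time_ms - DAY_MS       # otherwise: back off 1 day
```

The broker then routes records at or before the boundary to the offline table and records after it to the real-time table.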

    Schema Evolution

    Evolve Pinot schemas safely by adding columns, reloading segments, and deciding when a new table is the cleaner path.

    Pinot schema evolution is intentionally narrow. The safe path is to add columns, reload the affected segments, and backfill only when the table type and data flow support it. If the change is more invasive than that, create a new table instead of forcing the old one to stretch.

    What is safe

    Additive schema changes are the normal path. New columns can be introduced without rewriting the whole table, as long as the ingestion flow and segment reload behavior are understood.
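For example, an additive change might append one new dimension column to the schema's dimensionFieldSpecs, with a default value that existing segments pick up on reload (the column names here are hypothetical):

```json
{
  "dimensionFieldSpecs": [
    { "name": "country", "dataType": "STRING" },
    { "name": "deviceType", "dataType": "STRING", "defaultNullValue": "unknown" }
  ]
}
```

After updating the schema, reload the affected segments so that queries against older segments return the default value for the new column.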

    Batch Ingestion

    Choose batch ingestion when Pinot should load prebuilt data from files, warehouses, or distributed processing jobs.

    Batch ingestion builds Pinot segments outside the cluster and pushes them into Pinot after the data is already shaped. Use it when the data changes in larger chunks, when you need deterministic backfills, or when the pipeline already produces files or segment artifacts.

    The most important design choice is not the framework, but the output contract: what the schema looks like, what the table expects, and where the segments land.
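That contract is captured in the batch ingestion job spec. A minimal standalone example might look like the following (paths, the file pattern, and the table name are placeholders):

```yaml
executionFrameworkSpec:
  name: standalone
jobType: SegmentCreationAndTarPush
inputDirURI: 'file:///data/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'file:///data/segments/'
recordReaderSpec:
  dataFormat: 'csv'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```

The same spec shape is reused across execution frameworks; switching from standalone to a distributed runner changes the framework spec, not the output contract.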

    Common batch paths

    Spark-based ingestion.

    Introduction

    Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics, and perfect for user-facing analytical workloads.

    Apache Pinot™ is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch data sources (including Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage).

    Note: We'd love to hear from you! Join the community to ask questions, troubleshoot, and share feedback.

    Apache Pinot includes the following:

    Querying & SQL

    Learn how to query Apache Pinot, choose the right query engine, and find SQL and function guidance quickly.

    Use this section to decide how to query Pinot, how much SQL support you need, which query engine to use, and where to look for execution controls such as quotas, cancellation, and cursors. Narrative guidance lives here. Dense syntax and endpoint detail is linked where needed.

    Start here


    SQL Insert Into From Files

    Insert a file into Pinot from Query Console

    Note: This feature is supported from the 0.11.0 release onward. Reference PR: https://github.com/apache/pinot/pull/8557

    Prerequisite

    Start Here

    Start here to learn Apache Pinot and go from zero to running your first query. Follow the guided onboarding path or jump to the section that fits your experience level.

    Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics. It ingests data from streaming and batch sources and makes it queryable in under a second. This guide walks you through everything you need to go from first contact to a working Pinot deployment.

    Onboarding path

    Follow these pages in order for a complete introduction:

    Multi-Stage Query

    Deep dive into the multi-stage engine (MSE) internals, execution model, and troubleshooting.

    For an overview of when to use the multi-stage engine (MSE) versus the single-stage engine (SSE), see . This section provides a deep dive into MSE internals. Most of the concepts explained here are related to the engine's execution model and are not required for writing queries. However, understanding them can help you take advantage of MSE's capabilities and troubleshoot issues.
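For example, a query that needs MSE-only features such as joins can opt in per query with a query option (the option name follows recent Pinot releases, and the self-join on the quickstart's baseballStats table is contrived for illustration):

```sql
SET useMultistageEngine=true;

SELECT a.playerName, SUM(b.runs) AS totalRuns
FROM baseballStats a
JOIN baseballStats b ON a.playerID = b.playerID
GROUP BY a.playerName
ORDER BY totalRuns DESC
LIMIT 10
```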

    • Replaced segments: Segments that have been replaced as part of the merge rollup task.

    There are a couple of scenarios where segments in offline tables won't be purged:

    • If the segment doesn't have an end time. This would happen if the segment doesn't contain a time column.

    • If the segment's table has a segmentIngestionType of REFRESH.

    If the retention period isn't specified, segments aren't purged from tables.
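The retention period is set in the table's segmentsConfig; for example (the 30-day value is illustrative):

```json
{
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "30"
  }
}
```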

    The retention manager initially moves these segments into a Deleted Segments area, from where they will eventually be permanently removed. The duration that deleted segments are kept is controlled by the controller.deleted.segments.retentionInDays configuration (default: 7 days).

    When deleting a table via the API, you can override this behavior by passing a retention query parameter. For example, DELETE /tables/{tableName}?retention=0d deletes all segments immediately without moving them to the deleted-segments area. See the Controller API Examples for more details.

    Pinot storage model
    Pinot architecture
    Pinot components

    Deep dives

    For explain plans, joins, optimizer behavior, and operator details, continue into the multi-stage query docs and engine-specific material linked from SSE vs MSE and SQL syntax.

    What this page covered

    This page mapped the main query workflows in Pinot: learning the query path, understanding SQL behavior, finding functions, choosing between SSE and MSE, and tuning execution controls.

    Next step

    Read Querying Pinot if you want the end-to-end query flow, or SSE vs MSE if you are deciding which engine to use.

    Related pages

    • Build with Pinot

    • Functions

    • Reference

    • Querying Pinot
    • SQL syntax
    • Overview
    • Query Engines (SSE vs MSE)
    • Query options, quotas, cancellation & cursors

  • What is Pinot? -- Understand what Pinot does and whether it fits your use case.

  • 10-minute quickstart -- Launch a local cluster and run your first query in minutes.

  • Install / deploy -- Set up Pinot for local development, Docker, or Kubernetes.

  • First table + schema -- Define a schema and create your first table.

  • First batch ingest -- Load data from a file into Pinot.

  • First stream ingest -- Connect Pinot to a streaming source for real-time data.

  • First query -- Write SQL queries against your Pinot tables.

    Choose your path

    Just exploring?

    Start with the conceptual overview, then try the quickstart to see Pinot in action with zero setup:

    Ready to build?

    Jump straight to installation and follow the linear onboarding path from step 3 onward:

    Next step

    • What is Pinot?
    • 10-Minute Quickstart
    • Install / Deploy
    export PINOT_VERSION=1.4.0
    
    wget https://downloads.apache.org/pinot/apache-pinot-${PINOT_VERSION}/apache-pinot-${PINOT_VERSION}-bin.tar.gz
    tar -zxvf apache-pinot-${PINOT_VERSION}-bin.tar.gz
    cd apache-pinot-${PINOT_VERSION}-bin
    git clone https://github.com/apache/pinot.git
    cd pinot
    mvn install package -DskipTests -Pbin-dist
    cd build
    brew install pinot
    Prerequisites
    • A Google Cloud account and project

    • The following CLI tools installed (see steps below)

    Steps

    1. Install tooling

    kubectl

    Verify:

    Helm

    Verify:

    Google Cloud SDK

    Follow the gcloud CLI installation guide or run:

    2. Initialize Google Cloud

    3. Create a GKE cluster

    The following creates a 3-node cluster named pinot-quickstart in us-west1-b using n1-standard-2 machines:

    Monitor cluster status:

    Wait until the cluster status is RUNNING.

    4. Connect to the cluster

    Verify

    You should see your worker nodes listed and in Ready status.

    Cleaning up

    To delete the cluster when you are done:

    Next step

    Your cluster is ready. Continue to Kubernetes install to deploy Pinot.

    Bring up a Kafka cluster for real-time data ingestion
    Note: The Bitnami Kafka Helm chart deploys Kafka in KRaft mode (with a built-in controller quorum) by default, so a separate ZooKeeper deployment is not required for Kafka.

    Check Kafka deployment status

    Ensure the Kafka deployment is ready before executing the scripts in the following steps. Run the following command:

    Below is an example output showing the deployment is ready:

    Create Kafka topics

    Run the scripts below to create two Kafka topics for data ingestion:

    Load data into Kafka and create Pinot schema/tables

    The script below does the following:

    • Ingests 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec

    • Ingests 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec

    • Uploads Pinot schema airlineStats

    • Creates Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime

    • Creates Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro

    Query with the Pinot Data Explorer

    Pinot Data Explorer

    The following script (located at ./pinot/helm/pinot) performs local port forwarding and opens the Pinot Query Console in your default web browser.

    If the ingestion frequency is set to hourly, the time boundary is the maximum end time of the offline segments minus 1 hour. Otherwise, it is the maximum end time of the offline segments minus 1 day.

    It is possible to force a hybrid table to use max(all offline segments' end time) as the time boundary by calling the API (version 0.12.0+).

    Note that this does not automatically update the time boundary as more segments are added to the offline table; the API must be called each time a segment with a more recent end time is uploaded. You can revert to the derived time boundary by calling the delete API:

    Querying

    When a Pinot broker receives a query for a hybrid table, the broker sends a time boundary annotated version of the query to the offline and real-time tables.

    For example, if we executed the following query:

    The broker would send the following query to the offline table:

    And the following query to the real-time table:

    The results of the two queries are merged by the broker before being returned to the client.

    timeBoundary = Maximum end time of offline segments - 1 hour

    • Pinot Minion instances are deployed and available within the cluster.

    • Pinot version is 0.11.0 or above.

    How it works

    1. The query is parsed to extract the table name, the directory URI, and a list of options for the ingestion job.

    2. The controller's minion task execution API endpoint is called to schedule the task on a minion.

    3. The response contains the table name and the task job ID.

    Usage Syntax

    INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]*

    Example

    Insert Rows into Pinot

    We are actively developing this feature...

    The details will be revealed soon.

    https://github.com/apache/pinot/pull/8557
    brew install kubernetes-cli
    kubectl version
    brew install kubernetes-helm
    helm version
    brew update && brew install azure-cli
    az login
    AKS_RESOURCE_GROUP=pinot-demo
    AKS_RESOURCE_GROUP_LOCATION=eastus
    az group create --name ${AKS_RESOURCE_GROUP} \
                    --location ${AKS_RESOURCE_GROUP_LOCATION}
    AKS_RESOURCE_GROUP=pinot-demo
    AKS_CLUSTER_NAME=pinot-quickstart
    az aks create --resource-group ${AKS_RESOURCE_GROUP} \
                  --name ${AKS_CLUSTER_NAME} \
                  --node-count 3
    AKS_RESOURCE_GROUP=pinot-demo
    AKS_CLUSTER_NAME=pinot-quickstart
    az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
                           --name ${AKS_CLUSTER_NAME}
    kubectl get nodes
    AKS_RESOURCE_GROUP=pinot-demo
    AKS_CLUSTER_NAME=pinot-quickstart
    az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
                  --name ${AKS_CLUSTER_NAME}
    ./bin/pinot-admin.sh StartZookeeper \
      -zkPort 2181
    export JAVA_OPTS="-Xms4G -Xmx8G"
    ./bin/pinot-admin.sh StartController \
        -zkAddress localhost:2181 \
        -controllerPort 9000
    export JAVA_OPTS="-Xms4G -Xmx4G"
    ./bin/pinot-admin.sh StartBroker \
        -zkAddress localhost:2181
    export JAVA_OPTS="-Xms4G -Xmx16G"
    ./bin/pinot-admin.sh StartServer \
        -zkAddress localhost:2181
    export JAVA_OPTS="-Xms4G -Xmx4G"
    ./bin/pinot-admin.sh StartMinion \
        -zkAddress localhost:2181
    ./bin/pinot-admin.sh StartKafka \
      -zkAddress=localhost:2181/kafka \
      -port 19092
    curl localhost:9000/health
    brew install kubernetes-cli
    kubectl version
    brew install kubernetes-helm
    helm version
    curl https://sdk.cloud.google.com | bash
    exec -l $SHELL
    gcloud init
    GCLOUD_PROJECT=[your gcloud project name]
    GCLOUD_ZONE=us-west1-b
    GCLOUD_CLUSTER=pinot-quickstart
    GCLOUD_MACHINE_TYPE=n1-standard-2
    GCLOUD_NUM_NODES=3
    gcloud container clusters create ${GCLOUD_CLUSTER} \
      --num-nodes=${GCLOUD_NUM_NODES} \
      --machine-type=${GCLOUD_MACHINE_TYPE} \
      --zone=${GCLOUD_ZONE} \
      --project=${GCLOUD_PROJECT}
    gcloud compute instances list
    GCLOUD_PROJECT=[your gcloud project name]
    GCLOUD_ZONE=us-west1-b
    GCLOUD_CLUSTER=pinot-quickstart
    gcloud container clusters get-credentials ${GCLOUD_CLUSTER} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT}
    kubectl get nodes
    GCLOUD_ZONE=us-west1-b
    gcloud container clusters delete pinot-quickstart --zone=${GCLOUD_ZONE}
    helm repo add kafka https://charts.bitnami.com/bitnami
    helm install -n pinot-quickstart kafka kafka/kafka \
        --set replicas=1 \
        --set listeners.client.protocol=PLAINTEXT
    kubectl get all -n pinot-quickstart | grep kafka
    pod/kafka-controller-0                   1/1     Running     0          2m
    pod/kafka-controller-1                   1/1     Running     0          2m
    pod/kafka-controller-2                   1/1     Running     0          2m
    kubectl -n pinot-quickstart exec kafka-controller-0 -- kafka-topics.sh --bootstrap-server kafka:9092 --topic flights-realtime --create --partitions 1 --replication-factor 1
    kubectl -n pinot-quickstart exec kafka-controller-0 -- kafka-topics.sh --bootstrap-server kafka:9092 --topic flights-realtime-avro --create --partitions 1 --replication-factor 1
    kubectl apply -f pinot/helm/pinot/pinot-realtime-quickstart.yml
    ./query-pinot-data.sh
    timeBoundary = Maximum end time of offline segments - 1 day
    curl -X POST \
      "http://localhost:9000/tables/{tableName}/timeBoundary" \
      -H "accept: application/json"
    curl -X DELETE \
      "http://localhost:9000/tables/{tableName}/timeBoundary" \
      -H "accept: application/json"
    SELECT count(*)
    FROM events
    SELECT count(*)
    FROM events_OFFLINE
    WHERE timeColumn <= $timeBoundary
    SELECT count(*)
    FROM events_REALTIME
    WHERE timeColumn > $timeBoundary
    SET taskName = 'myTask-s3';
    SET input.fs.className = 'org.apache.pinot.plugin.filesystem.S3PinotFS';
    SET input.fs.prop.accessKey = 'my-key';
    SET input.fs.prop.secretKey = 'my-secret';
    SET input.fs.prop.region = 'us-west-2';
    INSERT INTO "baseballStats"
    FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/'

    Azure AKS

    Amazon Web Services

    Amazon EKS

    AWS setup

    Google Cloud Platform

    Google GKE

    GCP setup

    Kubernetes install
    Kubernetes install guide

    Microsoft Azure

    Offline servers are responsible for downloading segments from the segment store and hosting them to serve queries. When a new segment is uploaded to the controller, the controller decides which servers (as many as the replication factor) will host the new segment and notifies them to download it from the segment store. On receiving this notification, the servers download the segment file and load the segment, ready to serve queries.

    hashtag
    Real-time

    Real-time servers directly ingest from a real-time stream (such as Kafka or EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.

    Pinot servers are modeled as Helix participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more Helix partitions of one or more Helix resources (that is, one or more segments of one or more tables).

    hashtag
    Starting a server

    Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a server:


    Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.

    In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.

    Take this example: we have real-time data for five days -- March 23 to March 27 -- and offline data has been pushed until March 25, which is two days behind real-time. The brokers maintain this time boundary.

    Suppose this table receives the query select sum(metric) from table. The broker splits it into two queries based on this time boundary -- one for offline and one for real-time: select sum(metric) from table_OFFLINE where date < Mar 25 and select sum(metric) from table_REALTIME where date >= Mar 25.

    The broker merges results from both these queries before returning the result to the client.
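To make the split concrete, here is a minimal Python sketch of the same federation logic (toy dates and metrics, not Pinot code), using the boundary rule of one day before the latest offline end time:

```python
from datetime import date, timedelta

def time_boundary(offline_end_times):
    # timeBoundary = maximum end time of offline segments - 1 day
    return max(offline_end_times) - timedelta(days=1)

def hybrid_sum(offline_rows, realtime_rows, boundary):
    # Offline answers "timeColumn <= boundary", real-time answers
    # "timeColumn > boundary"; the broker adds the partial results.
    offline_part = sum(m for d, m in offline_rows if d <= boundary)
    realtime_part = sum(m for d, m in realtime_rows if d > boundary)
    return offline_part + realtime_part

# Toy data: offline pushed through Mar 25, real-time covers Mar 23-27.
offline = [(date(2020, 3, d), 10) for d in range(23, 26)]
realtime = [(date(2020, 3, d), 10) for d in range(23, 28)]

boundary = time_boundary([d for d, _ in offline])  # Mar 24
total = hybrid_sum(offline, realtime, boundary)    # each day counted once
```

Because each side of the boundary is queried exclusively, the overlapping days are never double counted.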

    hashtag
    Starting a broker

    Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a broker:

    Broker interaction with other components

  • Batch Ingestion - for data that arrives in files or lands in a warehouse-style batch flow.

  • Stream Ingestion - for Kafka-style or other event streams that should be queryable quickly.

  • Upsert and Dedup - for tables that need one canonical row per key instead of raw event history.

  • Formats and Filesystems - for source formats, file systems, and deep-storage choices.

  • Transformations and Aggregations - for ingest-time cleanup and pre-aggregation decisions.

    hashtag
    Related Existing Docs

    • Import Data

    • Data Ingestion Overview

    • Ingestion Transformations

    hashtag
    What this page covered

    This landing page defines the ingestion subtree and points to the main decision pages.

    hashtag
    Next step

    Read Batch Ingestion if your source data arrives in files or prebuilt segments.

    hashtag
    Related pages

    • Batch Ingestion

    • Stream Ingestion

    • Upsert and Dedup

    Decide whether the table should remain purely realtime or later become hybrid.

    hashtag
    What matters most

    The stream has to support the consumption mode you choose. The table config has to describe the partitioning, replicas, and segment lifecycle clearly enough that the servers can behave predictably under load.

    hashtag
    Learn more

    The existing walk-throughs in Import Data and Data Ingestion Overview still contain the detailed mechanics.

    hashtag
    What this page covered

    This page covered the stream-ingestion model and the main lifecycle choices behind it.

    hashtag
    Next step

    Read Upsert and Dedup if the stream should collapse duplicate keys or keep only the latest row.

    hashtag
    Related pages

    • Ingestion

    • Batch Ingestion

    • Upsert and Dedup

    hashtag
    What is not safe

    Renaming a column, dropping a column, or changing a column type is not a small schema tweak. Treat those as table redesign work.

    hashtag
    Typical flow

    1. Add the new column to the schema.

    2. Update the table config or ingestion config if the new field needs transforms.

    3. Reload the affected segments.

    4. Backfill historical data if the use case needs it.

    hashtag
    Reference material

    The detailed walkthrough still lives in Schema Evolution tutorial.

    hashtag
    What this page covered

    This page covered the additive schema-evolution path and the cases where a new table is safer.

    hashtag
    Next step

    Read the ingestion pages to see how schema design affects batch and stream pipelines.

    hashtag
    Related pages

    • Data Modeling

    • Schema and Table Shape

    • Logical Tables

    • Hadoop-style distributed ingestion.

    • Backfill jobs for historical ranges.

    • Dimension tables and other specialized offline loads.

    hashtag
    What to decide early

    Decide on the file format, the deep-storage target, and the segment push workflow before you optimize the job itself. Most batch ingestion problems come from mismatched assumptions at those boundaries.

    hashtag
    Learn more

    The original step-by-step batch docs live in Import Data and Data Ingestion Overview.

    hashtag
    What this page covered

    This page covered when to choose batch ingestion and the main design decisions that shape it.

    hashtag
    Next step

    Read Stream Ingestion if the source system is a live event stream.

    hashtag
    Related pages

    • Ingestion

    • Stream Ingestion

    • Formats and Filesystems

  • Ultra low-latency analytics even at extremely high throughput.

  • Columnar data store with several smart indexing and pre-aggregation techniques.

  • Scaling up and out with no upper bound.

  • Consistent performance based on the size of your cluster and an expected query per second (QPS) threshold.

    It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.

    hashtag
    User-facing real-time analytics

    User-facing analytics refers to the analytical tools exposed to the end users of your product. In a user-facing analytics application, all users receive personalized analytics on their devices, resulting in hundreds of thousands of queries per second. Queries triggered by apps may grow quickly in proportion to the number of active users on the app, reaching millions of events per second. Data ingested into Pinot is immediately available for analytics, with latencies under one second.

    User-facing real-time analytics requires the following:

    • Fresh data. The system needs to be able to ingest data in real time and make it available for querying, also in real time.

    • Support for high-velocity, highly dimensional event data from a wide range of actions and from multiple sources.

    • Low latency. Queries are triggered by end users interacting with apps, resulting in hundreds of thousands of queries per second with arbitrary patterns.

    • Reliability and high availability.

    • Scalability.

    • Low cost to serve.

    hashtag
    Why Pinot?

    Pinot is designed to execute OLAP queries with low latency. It works well where you need fast analytics, such as aggregations, on both mutable and immutable data.

    User-facing, real-time analytics

    Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications, such as Who Viewed Profilearrow-up-right, Company Analyticsarrow-up-right, Talent Insightsarrow-up-right, and many more. UberEats Restaurant Managerarrow-up-right is another example of a user-facing analytics app built with Pinot.

    Real-time dashboards for business metrics

    Pinot can perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. Connect various business intelligence (BI) tools such as Supersetarrow-up-right, Tableauarrow-up-right, or PowerBIarrow-up-right to visualize data in Pinot.

    Enterprise business intelligence

    For analysts and data scientists, Pinot works well as a highly-scalable data platform for business intelligence. Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.

    Enterprise application development

    For application developers, Pinot works well as an aggregate store that sources events from streaming data sources, such as Kafka, and makes it available for a query using SQL. You can also use Pinot to aggregate data across a microservice architecture into one easily queryable view of the domain.

    Pinot tenants prevent any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent.

    hashtag
    Get started

    If you're new to Pinot, take a look at our Getting Started guide:

    To start importing data into Pinot, see how to import batch and stream data:

    To start querying data in Pinot, check out our Query guide:

    hashtag
    Learn

    For a conceptual overview that explains how Pinot works, check out the Concepts guide:

    To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:

    Join us in our Slack channelarrow-up-right
    Start Herechevron-right
    Ingestionchevron-right
    Querying & SQLchevron-right
    Conceptschevron-right
    Architecturechevron-right

    First Query

    Run your first SQL queries against Pinot using the Query Console and REST API.

    hashtag
    Outcome

    Run your first SQL queries against Pinot and understand the query interface.

    hashtag
    Prerequisites

    • You have completed either or . The transcript table exists and contains data.

    • The Pinot cluster is running (Controller on port 9000, Broker on port 8099).

    hashtag
    Steps

    hashtag
    1. Open the Query Console

    Navigate to in your browser. Click Query Console in the left sidebar. You should see the transcript table listed in the table explorer on the left.

    hashtag
    2. Run a simple SELECT

    Paste the following query into the query editor and click Run Query:

    The results panel shows all columns in the transcript table -- studentID, firstName, lastName, gender, subject, score, and timestamp. The rows returned come from whichever data you loaded (batch, stream, or both). LIMIT 10 caps the result set so the response is fast.

    hashtag
    3. Run an aggregation

    This query calculates the average score per subject and sorts the results from highest to lowest. Pinot executes aggregations directly on each server's segment data and merges the results at the Broker, making GROUP BY queries fast even on large datasets.
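Here is a small Python sketch (invented scores, not Pinot internals) of that scatter-gather pattern, carrying AVG as a mergeable (sum, count) pair so it can be finalized correctly at the Broker:

```python
from collections import defaultdict

def server_partial(segment_rows):
    # Each server scans its own segments and keeps per-group (sum, count).
    partial = defaultdict(lambda: [0.0, 0])
    for subject, score in segment_rows:
        partial[subject][0] += score
        partial[subject][1] += 1
    return partial

def broker_merge(partials):
    # The broker merges the partial states, then finalizes AVG at the end.
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for subject, (s, c) in partial.items():
            merged[subject][0] += s
            merged[subject][1] += c
    return {subject: s / c for subject, (s, c) in merged.items()}

server1 = server_partial([("Maths", 3.8), ("English", 3.5)])
server2 = server_partial([("Maths", 3.2)])
result = broker_merge([server1, server2])
```

Merging (sum, count) pairs instead of per-server averages is what keeps the final AVG correct when groups are spread unevenly across servers.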

    hashtag
    4. Run a count

    This returns the total number of rows in the table. The exact count depends on which ingestion steps you completed:

    • Batch ingest only -- 4 rows

    • Stream ingest only -- the number of events you published (up to 12 in the tutorial)

    • Both -- the combined total

    hashtag
    5. Run a filter

    This filters rows to show only students with a score above 3.5. Pinot pushes filter predicates down to the servers so only matching rows are scanned and returned.

    hashtag
    6. Try the REST API

    The Query Console UI is convenient for exploration, but production applications query Pinot through its REST API. Open a terminal and run:

    Port 8099 is the Broker, which handles all query requests. The Query Console UI uses the same API under the hood. The response is a JSON object containing the result rows, schema, and query execution metadata.
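As a sketch of what an application sees, the following Python snippet builds the JSON body for the broker's /query/sql endpoint and reads columns and rows out of a hand-written sample response. The data values are invented; only the resultTable layout reflects the broker's response shape:

```python
import json

# Request body for the broker's POST /query/sql endpoint.
payload = json.dumps(
    {"sql": "SELECT subject, AVG(score) FROM transcript GROUP BY subject"})

# Hand-written sample response (values invented); column metadata lives
# under resultTable.dataSchema and the data under resultTable.rows.
sample_response = json.loads("""
{
  "resultTable": {
    "dataSchema": {"columnNames": ["subject", "avg(score)"],
                   "columnDataTypes": ["STRING", "DOUBLE"]},
    "rows": [["Maths", 3.5], ["English", 3.5]]
  },
  "numDocsScanned": 3
}
""")

columns = sample_response["resultTable"]["dataSchema"]["columnNames"]
rows = sample_response["resultTable"]["rows"]
```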

    hashtag
    Verify

    All five queries return results without errors. You have successfully completed the end-to-end onboarding flow: you set up a Pinot cluster, defined a schema and table, loaded data, and queried it through both the UI and the REST API.

    hashtag
    What's next

    You have finished the linear Start Here path. From here, explore the areas most relevant to your use case:

    • -- the full SQL reference for Pinot's query language

    • -- enable JOINs and complex queries across tables

    • -- understand how queries flow from Broker to Server and back

    Kubernetes

    Deploy a Pinot cluster on Kubernetes using Helm.

    hashtag
    Outcome

    Deploy a production-ready Pinot cluster on Kubernetes with Helm charts.

    circle-info

    The examples in this guide are sample configurations for reference. For production deployments, customize settings as needed -- especially security features like TLS and authentication.

    hashtag
    Prerequisites

    • A running Kubernetes cluster. Options include:

      • -- start with sufficient resources: minikube start --vm=true --cpus=4 --memory=8g --disk-size=50g

    hashtag
    Steps

    hashtag
    1. Add the Pinot Helm repository

    hashtag
    2. Create a namespace

    hashtag
    3. Install Pinot

    circle-info

    StorageClass: Specify the StorageClass for your cloud vendor. Use block storage only -- do not mount blob stores (S3, GCS, AzureFile) as the data-serving file system.

    • AWS: gp2

    hashtag
    Verify

    Check the deployment status:

    All pods should reach Running status. You can port-forward the Controller to access the UI:

    Then open .

    hashtag
    Loading data

    For stream ingestion on Kubernetes, see the . For batch data loading and table creation, continue with the onboarding path below.

    hashtag
    Deleting the cluster

    To remove Pinot from your cluster:

    hashtag
    Next step

    Your cluster is running. Continue to to load data.

    Backfill Data

    Batch ingestion of backfill data into Apache Pinot.

    hashtag
    Introduction

    Pinot batch ingestion involves two parts: a routine ingestion job (hourly or daily) and backfill. Here are some examples that show how routine batch ingestion works for a Pinot offline table:

    • Batch Ingestion Overview

    High-level description

    1. Organize raw data into buckets (e.g., /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (e.g., /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro).

    2. Run a Pinot batch ingestion job that points to a specific date folder, such as /var/pinot/airlineStats/rawdata/2014/01/01. The segment generation job converts each such Avro file into a Pinot segment for that day and gives it a unique name.

    3. Run the Pinot segment push job to upload the segments with those unique names via a Controller API.

    circle-info

    IMPORTANT: The segment name uniquely identifies a segment in Pinot. If the controller receives an upload request for a segment with the same name, it replaces the existing segment with the new one.

    This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.

    hashtag
    How to backfill data in Pinot

    Pinot supports data modification only at the segment level, which means you must replace entire segments to backfill data. The high-level idea is to repeat steps 2 (segment generation) and 3 (segment upload) described above:

    • Backfill jobs must run at the same granularity as the daily job. For example, if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g., /var/pinot/airlineStats/rawdata/2014/01/01).

    • The backfill job will then generate segments with the same name as the original job (with the new data).

    • When uploading those segments to Pinot, the controller will replace the old segments with the new ones (segment names act like primary keys within Pinot) one by one.

    hashtag
    Edge case example

    Backfill jobs expect the same number of (or more) data files on the backfill date, so the segment generation job will create the same number of (or more) segments as the original run.

    For example, assume table airlineStats has 2 segments (airlineStats_2014-01-01_2014-01-01_0, airlineStats_2014-01-01_2014-01-01_1) on date 2014/01/01 and the backfill input directory contains only 1 input file. The segment generation job will then create just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only segment airlineStats_2014-01-01_2014-01-01_0 is replaced, and stale data in segment airlineStats_2014-01-01_2014-01-01_1 is still there.

    If the raw data is modified so that the original time bucket has fewer input files than the first ingestion run, the backfill will fail to fully replace the old data, and stale segments from the original run will remain.
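The name-based replacement behind this edge case can be simulated with a toy in-memory segment store (illustrative Python, not Pinot code):

```python
def push_segments(store, segments):
    # Controller-style upload: the segment name acts as a primary key,
    # so a segment with an existing name replaces the old copy.
    store.update(segments)
    return store

# Original daily run produced two segments for 2014/01/01.
store = {
    "airlineStats_2014-01-01_2014-01-01_0": "original",
    "airlineStats_2014-01-01_2014-01-01_1": "original",
}

# The backfill run produced only one segment, so only the matching name
# is replaced; the second segment keeps serving stale data.
push_segments(store, {"airlineStats_2014-01-01_2014-01-01_0": "backfill"})
```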

    Version Reference

    Current Apache Pinot release version and how to pin versions in examples.

    hashtag
    Outcome

    Know which Pinot version to use and how to pin versions in examples.

    circle-exclamation

    All code samples in the Start Here guide use PINOT_VERSION=1.4.0. If you are using a different version, set the variable accordingly before running any commands.

    This page is the single source of truth for version information across the Start Here guide and the wider documentation. When following tutorials or code samples, make sure the version you use matches your installed release.

    hashtag
    Current stable release

    Artifact
    Version

    hashtag
    Using PINOT_VERSION in examples

    Most code samples in these docs set a PINOT_VERSION environment variable near the top of each snippet. Always verify that the value matches your installed version:

    Once the variable is set, every command in the tutorial that references ${PINOT_VERSION} will use the correct value automatically.

    circle-info

    Start Here pages never use the latest Docker tag. Always pin to a specific version for reproducibility. The latest tag can change without notice and may introduce breaking changes during a tutorial.

    hashtag
    Compatibility notes

    Requirement
    Detail

    If you are running JDK 8 and cannot upgrade, use Pinot 0.12.1. For all new deployments, JDK 11 or 21 is recommended.

    hashtag
    Release links

    circle-info

    You can find all published releases on the page, and all Docker tags on .

    hashtag
    Older versions

    Older Pinot binaries are archived at .

    Deep Store

    Leverage Apache Pinot's deep store component for efficient large-scale data storage and management, enabling impactful data processing and analysis.

    The deep store (or deep storage) is the permanent store for segment files.

    It is used for backup and restore operations. New server nodes in a cluster will pull down a copy of segment files from the deep store. If the local segment files on a server get damaged in some way (or are accidentally deleted), a new copy will be pulled down from the deep store on server restart.

    The deep store stores a compressed version of the segment files and it typically won't include any indexes. These compressed files can be stored on a local file system or on a variety of other file systems. For more details on supported file systems, see File Systems.

    Note: The deep store by itself is not sufficient for restore operations. Pinot stores metadata such as the table config, schema, and segment metadata in Zookeeper. Restore operations require both the deep store and the Zookeeper metadata.

    hashtag
    How do segments get into the deep store?

    There are several different ways that segments are persisted in the deep store.

    For offline tables, the batch ingestion job writes the segment directly into the deep store, as shown in the diagram below:

    The ingestion job then sends a notification about the new segment to the controller, which in turn notifies the appropriate server to pull down that segment.

    For real-time tables, by default, a segment is first built in memory by the server. It is then uploaded to the lead controller (as part of the Segment Completion Protocol sequence), which writes the segment into the deep store, as shown in the diagram below:

    Having all segments go through the controller can become a system bottleneck under heavy load, in which case you can use the peer download policy, as described in .

    When using this configuration, the server will directly write a completed segment to the deep store, as shown in the diagram below:

    hashtag
    Configuring the deep store

    For hands-on examples of how to configure the deep store, see the following tutorials:

    Schema and Table Shape

    Understand Pinot schema design, table shape, null handling, and the schema fields that drive query and ingestion behavior.

    A Pinot schema defines the columns that exist in a table and how Pinot should treat them. The important part is not only the column list, but also the shape of the table: which fields are dimensions, metrics, and time fields, how nulls behave, and whether the table is built for offline, realtime, or hybrid ingestion.

    Pinot stores schema and table metadata separately, but the two should be designed together. Keep the schema narrow enough to match the data you actually query; the exact table-config fields belong in the dense reference pages rather than in this narrative overview.

    hashtag
    What to design

    The schema answers four practical questions:

    • What columns exist?

    • What data type does each column use?

    • Which columns are dimensions, metrics, or date-time fields?

    • How should null values behave?

    hashtag
    Good defaults

    Use column names that are stable and business-facing. Prefer simple types that match the source data. Add only the fields you need at query time, because schema changes are additive and should be deliberate.

    For time columns, keep one primary time field in mind for retention and hybrid-table boundary behavior. For null handling, decide early whether the table needs column-based or table-based semantics.

    hashtag
    Example schema
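A minimal hypothetical schema (the orders table and all field names are invented for illustration) showing the dimension, metric, and date-time field groups described above:

```json
{
  "schemaName": "orders",
  "dimensionFieldSpecs": [
    {"name": "orderId", "dataType": "STRING"},
    {"name": "region", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "amount", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "orderTime", "dataType": "LONG",
     "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}
```

Here orderTime would serve as the primary time column for retention and hybrid-table boundary behavior.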

    hashtag
    When to use the reference pages

    Use the when you need the exact JSON fields, validation rules, or date-time field formats. Use the when you need indexing, retention, or routing configuration.

    hashtag
    What this page covered

    This page covered the parts of Pinot schema design that shape ingestion and query behavior.

    hashtag
    Next step

    Read if one query name should route to multiple physical tables.

    hashtag
    Related pages

    Formats and Filesystems

    Match Pinot ingestion to the right input formats and deep-storage filesystems without overcomplicating the table design.

    Pinot supports several source formats and deep-storage choices. Pick these early, because they affect how segments are produced, moved, and recovered.

    hashtag
    Source formats

    Use the original format docs when you need the exact supported file types or loader behavior. The main landing page is Supported Data Formats.

    hashtag
    Filesystems and deep storage

    Choose the deep-storage backend that matches your operational environment. The detailed filesystem docs still live under .

    hashtag
    Keep it simple

    Do not mix format decisions with schema design. The schema says what the data means; the filesystem says where segments survive after Pinot produces them.

    hashtag
    What this page covered

    This page covered how source formats and deep storage fit into the ingestion design.

    hashtag
    Next step

    Read if data needs cleanup or pre-aggregation before query time.

    hashtag
    Related pages

    Row Expression Comparison


    hashtag
    Row Expression

    hashtag
    ROW()

    hashtag
    Description:

    ROW value expressions are supported in Pinot in comparison contexts, enabling efficient keyset pagination queries. ROW expressions allow users to write cleaner multi-column comparisons like WHERE (col1, col2, col3) > (val1, val2, val3) instead of verbose nested conditions. Row expressions are evaluated using lexicographic ordering for the comparison operators.

    hashtag
    Syntax:

    Pinot supports implicit ROW-style expressions in comparison predicates using a parenthesized list of expressions on both sides of the comparator:

    WHERE (col1, col2, col3) > (val1, val2, val3)

    Supported comparison operators:

    Note: Explicit use of the ROW() keyword (e.g., WHERE ROW(col1, col2) = ROW(1, 2)) is not yet supported due to the current SQL parser configuration (SqlConformanceEnum.BABEL). Future improvements may enable explicit row value constructors.

    hashtag
    Note:

    • ROW comparisons are lexicographic, not element-wise

    • Pinot does not materialize row types — it rewrites comparisons at planning time

    • Rewrite complexity grows linearly with the number of columns
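To see what the lexicographic expansion looks like, here is an illustrative Python generator for the greater-than case; the planner's actual output may differ in formatting:

```python
def rewrite_row_gt(cols, vals):
    # Expand (c1, ..., cn) > (v1, ..., vn) lexicographically:
    # c1 > v1 OR (c1 = v1 AND (rest of the comparison)).
    col, val = cols[0], vals[0]
    if len(cols) == 1:
        return f"{col} > {val}"
    rest = rewrite_row_gt(cols[1:], vals[1:])
    return f"{col} > {val} OR ({col} = {val} AND ({rest}))"

predicate = rewrite_row_gt(["col1", "col2", "col3"], ["val1", "val2", "val3"])
```

Each additional column adds one more OR/AND pair, which is why the rewrite complexity grows linearly with the number of columns.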

    hashtag
    Sample Example Usage:

    Equality (=)

    is rewritten to

    Greater Than (>)

    is rewritten to

    Less Than (<)

    is rewritten to

    Overview

    Build applications and data workflows with Apache Pinot using task-oriented guidance.

    Use this section when you are designing tables, ingesting data, querying Pinot, choosing indexes, or connecting Pinot to applications and tools. The goal here is to help you decide what to do next and then take you to the right detailed docs without forcing you through raw reference first.

    hashtag
    Core build workflows

    Data modelingchevron-rightIngestionchevron-rightQuerying & SQLchevron-rightIndexingchevron-rightConnectors, clients & APIschevron-right

    hashtag
    When to use Reference

    If you already know the exact property, endpoint, or plugin you need, jump to the section. Build-focused pages in this section explain how pieces fit together. Reference pages stay dense on purpose.

    hashtag
    What this page covered

    This page introduced the task-oriented Build with Pinot structure and pointed to the main workflows for modeling, ingestion, querying, indexing, and integration.

    hashtag
    Next step

    Start with the workflow that matches your immediate task, such as or .

    hashtag
    Related pages

    Data modeling

    Build Pinot tables by getting schema, table shape, logical-table, and schema-evolution decisions right before ingestion starts.

    Pinot works best when the table shape is clear before data lands. Start here to understand the structure that every ingestion and query decision depends on: schema design, table composition, logical-table layout, and how schemas evolve without breaking existing pipelines.

    If you need dense JSON config or controller endpoints, jump to the Reference section instead. This section stays narrative and decision-oriented.

    hashtag
    Start Here

    hashtag
    Related Existing Docs

    hashtag
    What this page covered

    This landing page defines the scope of Pinot data modeling and points to the core pages that matter first.

    hashtag
    Next step

    Read to lock in the table structure before designing ingestion.

    hashtag
    Related pages

    Logical Tables

    Use logical tables when one query name should span multiple physical Pinot tables without exposing the partitioning scheme to users.

    Logical tables are a naming and routing layer on top of physical tables. They let you split data by region, age, or operating mode while keeping one user-facing table name.

    Use a logical table when the split is an implementation detail, not part of the query contract. Keep the physical tables aligned on schema, and use a reference physical table only as a metadata anchor.

    hashtag
    When they help

    Logical tables are most useful when you need one of these patterns:

    • Different physical tables per region or business unit.

    • Separate offline and realtime tables that still answer one business question.

    • Time-sliced tables that should be queried together.

    hashtag
    Design rules

    Keep the underlying schemas aligned. Keep the logical name stable. Prefer this pattern only when the underlying split is operationally meaningful; do not use it to hide a modeling problem that should instead be solved with cleaner ingestion.

    For hybrid-style layouts, make the time boundary explicit so Pinot does not double count overlapping data.

    hashtag
    Example pattern

    hashtag
    Learn more

    The original logical-table walkthrough lives in .

    hashtag
    What this page covered

    This page covered when to use logical tables and how they hide physical table splits from readers.

    hashtag
    Next step

    Read if the schema needs to grow after the table is already in production.

    hashtag
    Related pages

    Transformations and Aggregations

    Use ingest-time transformations and aggregations when Pinot should normalize or reduce data before it reaches query time.

    Ingestion transformations clean up source records before they become Pinot rows. Ingestion aggregations reduce repeated values into fewer rows when a realtime table can safely store the summarized shape instead of the raw event stream.

    hashtag
    Transformations

    Use transformations to rename, reshape, extract, filter, or derive fields while ingesting. Keep the logic close to the table so the pipeline stays understandable.

    The detailed examples still live in Ingestion Transformations.

    hashtag
    Aggregations

    Use ingestion aggregation when the use case only needs summarized realtime data. This can reduce storage and improve query performance, but it changes the data you keep, so use it only when raw rows are not needed later.

    The detailed examples still live in .

    hashtag
    What this page covered

    This page covered when to transform or aggregate data during ingestion instead of waiting for query time.

    hashtag
    Next step

    Read the reference pages if you need the exact ingestion-config fields or table-config JSON.

    hashtag
    Related pages

    AWS

    Provision a managed Kubernetes cluster on Amazon EKS ready for Pinot.

    hashtag
    Outcome

    Create an Amazon EKS cluster with the required tooling, ready to deploy Apache Pinot.

    hashtag

    Offline Table Upsert

    Use upsert semantics on batch-ingested offline tables.

    Pinot supports upsert on OFFLINE tables in builds that include .

    Use it for batch corrections, replays, and late-arriving records.

    For a full overview of upsert features (comparison columns, delete columns, TTL, metadata management), see the main page. This page covers the OFFLINE-specific configuration and differences.

    hashtag

    File Systems

    This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.

    FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).

    Pinot uses distributed file systems for the following purposes:

    • Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to DFS.

    SQL syntax

    A narrative guide to Pinot SQL syntax and the main constructs you use most often.

    Pinot uses the Apache Calcite SQL parser with the MYSQL_ANSI dialect. This page is the practical overview: it explains the syntax patterns most people use every day and points to the deeper reference when you need the full operator list.

    hashtag
    Core rules

    Understanding Stages

    Learn more about multi-stage stages and how to extract stages from query plans.

    hashtag
    Deep dive into stages

    As explained in the reference documentation, the multi-stage query engine breaks down a query into multiple stages. Each stage corresponds to a subset of the query plan and is executed independently. Stages are connected in a tree-like structure where the output of one stage is the input to another stage. The stage that is at the root of the tree sends the final results to the client. The stages that are at the leaves of the tree read from the tables. The intermediate stages process the data and send it to the next stage.

    When the broker receives a query, it generates a query plan. This is a tree-like structure where each node is an operator. The plan is then optimized, moving and changing nodes to generate a plan that is semantically equivalent (it returns the same rows) but more efficient. During this phase the broker colors the nodes of the plan, assigning them to a stage. The broker also assigns a parallelism to each stage and defines which servers are going to execute each stage. For example, if a stage has a parallelism of 10, then at most 10 servers will execute that stage in parallel. One single server can execute multiple stages in parallel and it can even execute multiple instances of the same stage in parallel.

    Upsert and Dedup

    Use upsert or dedup when ingesting rows should collapse to one current record per key instead of preserving every event.

    Upsert and dedup are for tables that ingest repeated keys. Use them when the current value matters more than the raw event history, or when duplicate events should not fan out into duplicate query results.

    hashtag
    Choose the right behavior

    Use upsert when newer rows should replace older rows for the same primary key.

    Use dedup when repeated records should be filtered out and only the first or unique representation should remain.
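As a sketch, the two behaviors are enabled by different table-config blocks; both also require primaryKeyColumns in the schema. The comparison column name here is illustrative:

```json
{
  "tableType": "REALTIME",
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["updatedAt"]
  }
}
```

For dedup, use a dedupConfig block instead, e.g. "dedupConfig": { "dedupEnabled": true, "hashFunction": "NONE" }. A table uses one behavior or the other, not both.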

    docker run \
        --network=pinot-demo \
        --name pinot-server \
        -d ${PINOT_IMAGE} StartServer \
        -zkAddress pinot-zookeeper:2181
    bin/pinot-admin.sh StartServer \
        -zkAddress localhost:2181
    Usage: StartServer
    	-serverHost               <String>                      : Host name for controller. (required=false)
    	-serverPort               <int>                         : Port number to start the server at. (required=false)
    	-serverAdminPort          <int>                         : Port number to serve the server admin API at. (required=false)
    	-dataDir                  <string>                      : Path to directory containing data. (required=false)
    	-segmentDir               <string>                      : Path to directory containing segments. (required=false)
    	-zkAddress                <http>                        : Http address of Zookeeper. (required=false)
    	-clusterName              <String>                      : Pinot cluster name. (required=false)
    	-configFileName           <Config File Name>            : Broker Starter Config file. (required=false)
    	-help                                                   : Print this message. (required=false)
    docker run \
        --network=pinot-demo \
        --name pinot-broker \
        -d ${PINOT_IMAGE} StartBroker \
        -zkAddress pinot-zookeeper:2181
    bin/pinot-admin.sh StartBroker \
      -zkAddress localhost:2181 \
      -clusterName PinotCluster \
      -brokerPort 7000

    A managed cloud cluster -- see Managed Kubernetes for AWS, GCP, and Azure setup guides

  • Helm 3

  • kubectl

  • GCP: pd-ssd or standard
  • Azure: AzureDisk

  • Docker Desktop: hostpath

  • Docker Desktop with Kubernetes enabled
  • Minikube

    • Apache Pinot binary: 1.4.0

    • Docker image: apachepinot/pinot:1.4.0

    • Maven / Gradle clients: 1.4.0

    • Recommended JDK: JDK 11 or JDK 21

    • JDK 17: should work but is not officially supported

    • JDK 8: not supported in Pinot 1.0+ (JDK 11 or higher required); Pinot 0.12.1 is the last version that supports JDK 8

    Release notes
    Docker Hub
    https://archive.apache.org/dist/pinot/

    Prerequisites
    • An AWS account

    • The following CLI tools installed (see steps below)

    hashtag
    Steps

    hashtag
    1. Install tooling

    kubectl

    Verify:

    Helm

    Verify:

    AWS CLI

    Follow the AWS CLI installation guide or run:

    eksctl

    hashtag
    2. Configure AWS credentials

    circle-info

    Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY override credentials stored in ~/.aws/credentials.

    hashtag
    3. Create an EKS cluster

    The following creates a single-node cluster named pinot-quickstart in us-west-2 using t3.xlarge instances:

    For Kubernetes 1.23+, enable the EBS CSI driver to allow persistent volume provisioning:

    Monitor cluster status:

    Wait until the cluster status is ACTIVE.

    hashtag
    4. Connect to the cluster

    hashtag
    Verify

    You should see your worker nodes listed and in Ready status.

    hashtag
    Cleaning up

    To delete the cluster when you are done:

    hashtag
    Next step

    Your cluster is ready. Continue to Kubernetes install to deploy Pinot.

  • Use single quotes for string literals.
  • Use double quotes for identifiers when a column name is reserved or contains special characters.

  • SET statements apply query options before the query runs.

  • EXPLAIN PLAN FOR shows how Pinot will execute a query without returning data.

    hashtag
    Common query shapes

    Pinot supports the usual SELECT, WHERE, GROUP BY, ORDER BY, and LIMIT patterns.

    Typical query shapes include:

    • filtering a table and returning a small result set

    • grouping and aggregating by one or more dimensions

    • using ORDER BY to rank rows before a LIMIT

    • using CASE WHEN and scalar functions in select lists

    hashtag
    Engine-aware syntax

    Some SQL features depend on the engine:

    • single-stage execution is best for simple analytic queries

    • multi-stage execution is required for joins, subqueries, and several advanced distributed patterns

    • EXPLAIN PLAN FOR is the best way to see how Pinot interprets a statement

    If you are working on a query and do not know whether a feature is supported, check the engine-specific guidance before you assume the syntax is invalid.

    hashtag
    Where the details live

    This page intentionally stays light. For the full statement-by-statement reference, use the detailed SQL syntax and operators reference. For query controls and diagnostics, use the pages under query-execution-controls/.

    hashtag
    What this page covered

    This page covered the main Pinot SQL rules, the most common statement patterns, and the difference between narrative guidance and the full SQL reference.

    hashtag
    Next step

    Read Querying Pinot for the broader query workflow, or jump to Query options if you want to control runtime behavior.

    hashtag
    Related pages

    • Querying Pinot

    • Query options

    • Explain plan

    Each stage is identified by a unique stage ID. In the current implementation the stage ID is a number and the root stage has stage ID 0, although this may change in the future.

    The current implementation has some properties that are worth mentioning:

    • The leaf stages execute a slightly modified version of the single-stage query engine. Therefore these stages cannot execute joins or aggregations, which are always executed in the intermediate stages.

    • Intermediate stages execute operations using a new query execution engine that has been created for the multi-stage query engine. This is why some of the functions that are supported in the single-stage query engine are not supported in the multi-stage query engine and vice versa.

    • An intermediate stage can only have one join, one window function or one set operation. If a query has more than one of these operations, the broker will create multiple stages, each with one of these operations.

    hashtag
    Extracting Stages from Query Plans

    As explained in Explain Plan (Multi-Stage), you can use the EXPLAIN PLAN syntax to obtain the logical plan of a query. This logical plan can be used to extract the stages of the query.

    For example, if the query is:

    A possible output of the EXPLAIN PLAN command is:

    As with all queries, the logical plan forms a tree-like structure. In this default explain format, the tree structure is represented with indentation. The root of the tree is the first line, which is the last operator to be executed and marks the root stage. The boundaries between stages are the PinotLogicalExchange operators. In the example above, there are four stages:

    • The root stage starts with the LogicalSort operator at the root of the plan and ends with the PinotLogicalSortExchange operator. This is the last stage to be executed and the only one executed on the broker, which sends the result directly to the client once it is computed.

    • The next stage starts at that PinotLogicalSortExchange operator and includes the inner LogicalSort operator, the LogicalProject operator, the LogicalJoin operator, and the two PinotLogicalExchange operators. This stage is clearly not the root stage, and it does not read data from segments, so it is not a leaf stage either; it must be an intermediate stage.

    • The join has two children, which are the PinotLogicalExchange operators. In this specific case, both sides are very similar. They start with a PinotLogicalExchange operator and end with a LogicalTableScan operator. All stages that end with a LogicalTableScan operator are leaf stages.

    Now that we have identified the stages, we can understand what each stage is doing by interpreting multi-stage explain plans.

    Multi-stage query engine
    SELECT * FROM transcript LIMIT 10
    SELECT subject, AVG(score) AS avg_score
    FROM transcript
    GROUP BY subject
    ORDER BY avg_score DESC
    SELECT COUNT(*) FROM transcript
    SELECT firstName, lastName, score
    FROM transcript
    WHERE score > 3.5
    curl -X POST http://localhost:8099/query/sql \
      -H 'Content-Type: application/json' \
      -d '{"sql": "SELECT * FROM transcript LIMIT 5"}'
    git clone https://github.com/apache/pinot.git
    cd pinot/helm/pinot
    helm dependency update
    kubectl create ns pinot-quickstart
    helm install -n pinot-quickstart pinot ./pinot
    helm repo add pinot https://raw.githubusercontent.com/apache/pinot/master/helm
    kubectl create ns pinot-quickstart
    helm install pinot pinot/pinot \
        -n pinot-quickstart \
        --set cluster.name=pinot \
        --set server.replicaCount=2
    kubectl get all -n pinot-quickstart
    kubectl port-forward service/pinot-controller 9000:9000 -n pinot-quickstart
    kubectl delete ns pinot-quickstart
    export PINOT_VERSION=1.4.0
    
    # Then use ${PINOT_VERSION} in commands:
    docker pull apachepinot/pinot:${PINOT_VERSION}
    {
      "schemaName": "orders",
      "enableColumnBasedNullHandling": true,
      "dimensionFieldSpecs": [
        { "name": "orderId", "dataType": "STRING" },
        { "name": "customerId", "dataType": "STRING" },
        { "name": "region", "dataType": "STRING" }
      ],
      "metricFieldSpecs": [
        { "name": "amount", "dataType": "DOUBLE", "defaultNullValue": 0 }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "eventTime",
          "dataType": "LONG",
          "format": "EPOCH",
          "granularity": "1:DAYS"
        }
      ]
    }
    =, <>, <, <=, >, >=
    WHERE (a, b, c) = (x, y, z)
    WHERE a = x
      AND b = y
      AND c = z
    WHERE (a, b, c) > (x, y, z)
    WHERE a > x
       OR (a = x AND b > y)
       OR (a = x AND b = y AND c > z)
    WHERE (a, b, c) < (x, y, z)
    WHERE a < x
    OR (a = x AND b < y)
    OR (a = x AND b = y AND c < z)
    {
      "tableName": "orders",
      "brokerTenant": "DefaultTenant",
      "physicalTableConfigMap": {
        "ordersUS_OFFLINE": {},
        "ordersEU_OFFLINE": {}
      },
      "refOfflineTableName": "ordersUS_OFFLINE"
    }
    brew install kubernetes-cli
    kubectl version
    brew install kubernetes-helm
    helm version
    curl "https://d1vvhvl2y92vvt.cloudfront.net/awscli-exe-macos.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install
    brew tap weaveworks/tap
    brew install weaveworks/tap/eksctl
    aws configure
    EKS_CLUSTER_NAME=pinot-quickstart
    eksctl create cluster \
        --name ${EKS_CLUSTER_NAME} \
        --version 1.16 \
        --region us-west-2 \
        --nodegroup-name standard-workers \
        --node-type t3.xlarge \
        --nodes 1 \
        --nodes-min 1 \
        --nodes-max 1
    eksctl utils associate-iam-oidc-provider --region=us-west-2 --cluster=pinot-quickstart --approve
    
    eksctl create iamserviceaccount \
      --name ebs-csi-controller-sa \
      --namespace kube-system \
      --cluster pinot-quickstart \
      --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
      --approve \
      --role-only \
      --role-name AmazonEKS_EBS_CSI_DriverRole
    
    eksctl create addon --name aws-ebs-csi-driver --cluster pinot-quickstart \
      --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force
    EKS_CLUSTER_NAME=pinot-quickstart
    aws eks describe-cluster --name ${EKS_CLUSTER_NAME} --region us-west-2
    EKS_CLUSTER_NAME=pinot-quickstart
    aws eks update-kubeconfig --name ${EKS_CLUSTER_NAME}
    kubectl get nodes
    EKS_CLUSTER_NAME=pinot-quickstart
    aws eks delete-cluster --name ${EKS_CLUSTER_NAME}
    SET useMultistageEngine = true;
    SELECT "date", city, COUNT(*)
    FROM orders
    WHERE status = 'shipped'
    GROUP BY "date", city
    ORDER BY "date" DESC
    LIMIT 20;
    explain plan for
    select customer.c_address, orders.o_shippriority
    from customer
    join orders
        on customer.c_custkey = orders.o_custkey
    limit 10
    LogicalSort(offset=[0], fetch=[10])
      PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
        LogicalSort(fetch=[10])
          LogicalProject(c_address=[$0], o_shippriority=[$3])
            LogicalJoin(condition=[=($1, $2)], joinType=[inner])
              PinotLogicalExchange(distribution=[hash[1]])
                LogicalProject(c_address=[$4], c_custkey=[$6])
                  LogicalTableScan(table=[[default, customer]])
              PinotLogicalExchange(distribution=[hash[0]])
                LogicalProject(o_custkey=[$5], o_shippriority=[$10])
                  LogicalTableScan(table=[[default, orders]])
    hashtag
    How offline upsert works

    Pinot keeps one row per primary key.

    For duplicate keys, Pinot keeps the row with the greatest comparison value.

    If you do not set comparisonColumns, Pinot uses the table time column.

    Offline upsert replaces full rows.

    It does not merge partial rows.

    hashtag
    Configure offline upsert

    1

    hashtag
    Define a primary key

    Add primaryKeyColumns to the schema.

    2

    hashtag
    Enable upsert on the offline table

    Set tableType to OFFLINE.

    Set upsertConfig.mode to FULL.
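A sketch of the relevant fragment of the table config (this is not a complete table config; the comparison column name is illustrative, and comparisonColumns can be omitted to fall back to the table time column):

```json
{
  "tableName": "orders",
  "tableType": "OFFLINE",
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["updatedAt"]
  }
}
```

The paired schema must declare the key, e.g. "primaryKeyColumns": ["orderId"].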

    3

    hashtag
    Ingest or replace segments

    Generate and upload offline segments as usual.

    Pinot applies upsert semantics when it loads those segments.

    Use append-style uploads for incremental corrections.

    hashtag
    When to use it

    Use offline upsert when updates arrive in files.

    Use it for daily corrections.

    Use it for backfills.

    Use it for replaying snapshots into offline segments.

    hashtag
    Differences from real-time upsert

    Offline upsert does not consume a stream.

    It does not require low-level consumers.

    It does not depend on stream partitioning.

    It fits batch ingestion and segment replacement workflows.

    For stream-based updates, use Stream ingestion with Upsert.

    hashtag
    Operational notes

    Changing the primary key needs a full rebuild.

    Changing comparison columns also needs a full rebuild.

    Reload alone is not enough for these changes.

    If you use a hybrid table, avoid overlapping offline and realtime time ranges.

    hashtag
    Related topics

    • Batch Ingestion

    • Backfill Data

    • Create and update a table configuration

    PR #17789
    Upsert
    • Controller: When a segment is uploaded to the controller, the controller saves it in the configured DFS.

  • Server: When a server is notified of a new segment, it copies the segment from the remote DFS to its local node using the DFS abstraction.

    hashtag
    Supported file systems

    Pinot lets you choose a distributed file system provider. The following file systems are supported by Pinot:

    • Amazon S3

    • Google Cloud Storage

    • HDFS

    hashtag
    Enabling a file system

    To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:
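For example, these startup properties enable the S3 plugin (the plugin directory path is illustrative):

```
-Dplugins.dir=/opt/pinot/plugins
-Dplugins.include=pinot-s3
```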

    You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.
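A sketch of the controller configuration for S3 (bucket, path, and region are placeholders; the server uses the equivalent pinot.server.* keys):

```
controller.data.dir=s3://bucket/path/to/data
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```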

    You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:
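A sketch of the corresponding fragment of an ingestion job spec, again using S3 as the example (region is a placeholder):

```yaml
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-west-2
```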

    hashtag
    Operational notes

    These patterns need a careful schema, a stable primary key, and an ingestion flow that understands the table-level metadata Pinot uses to keep results consistent.

    The strongest detail still lives in the original docs under Upsert and Dedup.

    hashtag
    What this page covered

    This page covered the difference between upsert and dedup and when each is the better fit.

    hashtag
    Next step

    Read Formats and Filesystems to decide how Pinot should read source data and store generated segments.

    hashtag
    Related pages

    • Ingestion

    • Batch Ingestion

    • Stream Ingestion

    Batch job writing a segment into the deep store
    Server sends segment to Controller, which writes segments into the deep store
    Server writing a segment into the deep store

    Tenant

    Discover the tenant component of Apache Pinot, which facilitates efficient data isolation and resource management within Pinot clusters.

    Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data in separate workloads from being stored or processed on the same physical hardware.

    By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster. If the cluster is planned to have multiple tenants, consider setting cluster.tenant.isolation.enable=false so that servers and brokers are not automatically tagged with DefaultTenant when they are added to the cluster.

    To support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant, which controls the nodes used by the table as servers and brokers. Multi-tenancy lets Pinot group all tables belonging to a particular use case under a single tenant name.

    Tenants matter when multiple use cases share a Pinot cluster and you need quotas or some form of isolation across them. For example, consider two tables, Table A and Table B, in the same Pinot cluster.

    We can configure Table A with server tenant Tenant A and Table B with server tenant Tenant B, and tag some of the server nodes for Tenant A and some for Tenant B. This ensures that segments of Table A only reside on servers tagged with Tenant A, and segments of Table B only reside on servers tagged with Tenant B. The same isolation can be achieved at the broker level by configuring broker tenants for the tables.

    No need to create separate clusters for every table or use case!

    hashtag
    Tenant configuration

    This tenant is defined in the tenants section of the table config.

    This section contains two main fields, broker and server, which determine the tenants used for the broker and server components of this table.
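A minimal sketch of that section, with placeholder tenant names:

```json
{
  "tenants": {
    "broker": "brokerTenantName",
    "server": "serverTenantName"
  }
}
```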

    In the above example:

    • The table will be served by brokers that have been tagged as brokerTenantName_BROKER in Helix.

    • If this were an offline table, the offline segments for the table will be hosted in Pinot servers tagged in Helix as serverTenantName_OFFLINE

    hashtag
    Create a tenant

    hashtag
    Broker tenant

    Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant by tagging three untagged broker nodes as sampleBrokerTenant_BROKER.
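The sample config is missing from this extract; a sketch consistent with the description:

```json
{
  "tenantRole": "BROKER",
  "tenantName": "sampleBrokerTenant",
  "numberOfInstances": 3
}
```

You can POST this JSON to the controller's /tenants REST endpoint to create the tenant.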

    To create this tenant, use the following command. The creation will fail if the number of untagged broker nodes is less than numberOfInstances.

    Follow instructions in to get Pinot locally, and then

    Check out the table config in the to make sure it was successfully uploaded.

    hashtag
    Server tenant

    Here's a sample server tenant config. This will create a server tenant sampleServerTenant by tagging 1 untagged server node as sampleServerTenant_OFFLINE and 1 untagged server node as sampleServerTenant_REALTIME.
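The sample config is missing from this extract; a sketch consistent with the description:

```json
{
  "tenantRole": "SERVER",
  "tenantName": "sampleServerTenant",
  "offlineInstances": 1,
  "realtimeInstances": 1
}
```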

    To create this tenant, use the following command. The creation will fail if the number of untagged server nodes is less than offlineInstances + realtimeInstances.

    Follow instructions in to get Pinot locally, and then

    Check out the table config in the to make sure it was successfully uploaded.

    First Table + Schema

    Create your first Pinot schema and table, ready for data ingestion.

    hashtag
    Outcome

    By the end of this page you will have a Pinot schema and an offline table called transcript registered in your cluster, ready to receive data.

    hashtag
    Prerequisites

    • A running Pinot cluster. See the install guides for or .

    • For Docker users: the cluster must be on the pinot-demo network.

    • Confirm your Pinot version. See the page and set the PINOT_VERSION environment variable:

    hashtag
    Steps

    hashtag
    1. Understand schemas

    A Pinot schema defines every column in your table and assigns each one a column type. There are three column types:

    • Dimension: columns used to filter and group by, such as IDs, names, and categories.

    • Metric: numeric columns used in aggregations, such as counts and scores.

    • DateTime: columns that represent time, used for time-based segment management and queries.

    Every table must have a schema before it can accept data. The schema tells Pinot how to interpret, index, and store each field.

    hashtag
    2. Create the data directory
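The command for this step is missing from this extract; on a Unix-like system it is simply:

```shell
# Create a working directory for the raw data and config files
mkdir -p /tmp/pinot-quick-start/rawdata
```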

    hashtag
    3. Save the sample CSV data

    Create the file /tmp/pinot-quick-start/rawdata/transcript.csv with the following contents:
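The CSV contents are missing from this extract; a small sample in the shape the page describes (the values are illustrative):

```csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
```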

    In this dataset, studentID, firstName, lastName, gender, and subject are dimensions, score is a metric, and timestampInEpoch is the datetime column.

    hashtag
    4. Save the schema

    Create the file /tmp/pinot-quick-start/transcript-schema.json:
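The schema contents are missing from this extract; a sketch matching the columns described above (treat the original quickstart as authoritative):

```json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```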

    hashtag
    5. Understand table configs

    A table config tells Pinot how to manage the table at runtime -- which columns to index, how many replicas to keep, which tenants to assign, and whether the table is OFFLINE (batch) or REALTIME (streaming). You pair one table config with one schema.

    hashtag
    6. Save the offline table config

    Create the file /tmp/pinot-quick-start/transcript-table-offline.json:
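The table config contents are missing from this extract; a minimal sketch for a single-replica offline table (treat the original quickstart as authoritative):

```json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": 1
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  },
  "metadata": {}
}
```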

    hashtag
    7. Upload the schema and table config

    circle-info

    Replace pinot-controller with the actual container name of your Pinot controller if you used a different name during setup.
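The upload commands are missing from this extract; for the Docker setup, a sketch using pinot-admin (container name and paths assume the quickstart layout, and the config files must be visible inside the container, for example via a volume mount):

```shell
docker exec -it pinot-controller bin/pinot-admin.sh AddTable \
  -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
  -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
  -exec
```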

    hashtag
    Verify

    1. Open the Pinot Data Explorer at .

    2. Navigate to the Tables tab.

    3. Confirm you see transcript_OFFLINE listed.

    If the table appears, the schema and table config were registered successfully.

    hashtag
    Next step

    You now have an empty table. Continue to to import the CSV data into your transcript table.

    Cluster

    Learn to build and manage Apache Pinot clusters, uncovering key components for efficient data processing and optimized analysis.

    A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.

    A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop:

    • Controller: Maintains cluster metadata and manages cluster resources.

    • Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.

    • Broker: Accepts queries from client processes and forwards them to servers for processing.

    • Server: Provides storage for segment files and compute for query processing.

    • (Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).

    The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.

    Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.

    Helix is a cluster management solution that maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. Helix constantly monitors the cluster to ensure that the right hardware resources are allocated for the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.

    hashtag
    Cluster configuration

    For details of cluster configuration settings, see .

    hashtag
    Cluster components

    Helix divides nodes into logical components based on their responsibilities:

    hashtag
    Participant

    Participants are the nodes that host distributed, partitioned resources.

    Pinot servers are modeled as participants. For details about server nodes, see .

    hashtag
    Spectator

    Spectators are the nodes that observe the current state of each participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).

    Pinot brokers are modeled as spectators. For details about broker nodes, see .

    hashtag
    Controller

    The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.

    Pinot controllers are modeled as controllers. For details about controller nodes, see Controller.

    hashtag
    Logical view

    Another way to visualize the cluster is a logical view, where:

    • A cluster contains tenants

    • Tenants contain tables

    • Tables contain segments

    hashtag
    Set up a Pinot cluster

    Typically, there is only one cluster per environment/data center. There is no need to create multiple Pinot clusters because Pinot supports tenants.

    To set up a cluster, see one of the following guides:

    Controller

    Discover the controller component of Apache Pinot, enabling efficient data and query management.

    The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of real-time tables and offline tables). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

    The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.

    The Pinot controller is responsible for the following:

    • Maintaining global metadata (e.g., configs and schemas) of the system with the help of Zookeeper, which is used as the persistent metadata store.

    • Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions)

    • Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.

    • Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.

    • Serving endpoints for segment uploads, which are used in offline data pushes. Controllers are also responsible for initializing real-time consumption and coordinating the periodic persistence of real-time segments into the segment store.

    • Undertaking other management activities, such as segment retention management and validations.

    For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g., NFS). Pinot can also use other storage systems, such as HDFS or ADLS.

    hashtag
    Running the periodic task manually

    The controller runs several periodic tasks in the background to perform activities such as management and validation. Each periodic task defines its own run frequency and runs on its own schedule, but it can also be triggered manually if needed. The task runs on the lead controller for each table.

    For periodic task configuration details, see the controller configuration reference.

    Use the GET /periodictask/names API to fetch the names of all the periodic tasks running on your Pinot cluster.

    To manually run a named periodic task, use the GET /periodictask/run API:

    The Log Request Id (api-09630c07) can be used to search the pinot-controller log file for entries related to the execution of the manually triggered periodic task.

    If tableName (and its type OFFLINE or REALTIME) is not provided, the task will run against all tables.

    hashtag
    Starting a controller

    Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a controller:

    Kafka Connector Versions

    Choose the right Apache Kafka connector version for your Pinot deployment.

    Apache Pinot provides multiple Kafka connector versions to match different Kafka broker deployments. Choose the connector that matches your Kafka cluster version.

    hashtag
    Available Connectors

    Connector Plugin | Kafka Client Version | Notes
    --- | --- | ---
    pinot-kafka-3.0 | 3.9.x | Recommended for Kafka 3.x clusters. Requires Scala dependency.
    pinot-kafka-4.0 | 4.1.x | Recommended for Kafka 4.x clusters (KRaft mode). Pure Java; no Scala dependency.
    circle-exclamation

    The pinot-kafka-2.0 (kafka20) plugin has been removed. If your table config references org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory, you must migrate to either kafka30 or kafka40.

    hashtag
    Kafka 4.0 Connector

    The Kafka 4.0 connector (pinot-kafka-4.0) supports Apache Kafka 4.x brokers running in KRaft mode (ZooKeeper-free). It uses pure Java Kafka clients with no Scala dependency, resulting in a smaller deployment footprint.

    hashtag
    When to use Kafka 4.0

    • Your Kafka cluster runs Kafka 4.0+ with KRaft mode

    • You want to eliminate the Scala transitive dependency

    • You are deploying new Pinot clusters against modern Kafka infrastructure

    hashtag
    Configuration

    The Kafka 4.0 connector uses the same configuration properties as the Kafka 3.0 connector. The only difference is the stream.kafka.consumer.factory.class.name:

    hashtag
    Migration from Kafka 2.0 or 3.0

    To migrate from an older Kafka connector to Kafka 3.0 or 4.0, update the consumer factory class name in your table configuration:

    From | To
    --- | ---
    org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory | org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory (Kafka 3.x) or org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory (Kafka 4.x)
    1. Ensure the pinot-kafka-4.0 plugin JAR is available in your Pinot plugin directory.

    2. All other stream.kafka.* configuration properties remain the same.

    circle-info

    The Kafka 4.0 connector is fully compatible with all existing Kafka consumer configuration properties including SSL/TLS, SASL authentication, isolation levels, and Schema Registry integration. See the main Kafka ingestion guide for detailed configuration examples.

    hashtag
    Kafka 3.0 Connector

    The Kafka 3.0 connector (pinot-kafka-3.0) supports Apache Kafka 3.x brokers. This is the most widely deployed connector version.

    hashtag
    Configuration

    hashtag
    Common Configuration Properties

    All Kafka connector versions share the same configuration properties. See Ingest streaming data from Apache Kafka for the complete configuration reference, including:

    • SSL/TLS setup

    • SASL authentication

    • Schema Registry integration (Avro, JSON Schema, Protobuf)

    • Consumer tuning properties

    hashtag
    Passing Native Kafka Consumer Properties

    You can pass any native Kafka consumer configuration property using the stream.kafka.consumer.prop. prefix:

    Configure Indexes

    Learn how to apply indexes to a Pinot table. This guide assumes that you have followed the Ingest data from Apache Kafka guide.

    Pinot supports a series of different indexes that can be used to optimize query performance. In this guide, we'll learn how to add indexes to the events table that we set up in that guide.

    hashtag
    Why do we need indexes?

    If no indexes are applied to the columns in a Pinot segment, the query engine needs to scan through every document, checking whether that document meets the filter criteria provided in a query. This can be a slow process if there are a lot of documents to scan.
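    As a toy illustration of the idea (this is not Pinot's actual storage format), compare a full scan against a precomputed value-to-documents map:

```python
# Toy illustration of the idea behind an inverted index; not Pinot internals.
docs = [{"uuid": "a"}, {"uuid": "b"}, {"uuid": "a"}, {"uuid": "c"}]

# Without an index: scan every document and test the filter predicate.
scan_hits = [i for i, d in enumerate(docs) if d["uuid"] == "a"]

# With an inverted index: build a value -> document-ids map once at
# ingestion time, then answer the filter with a single lookup.
inverted_index = {}
for i, d in enumerate(docs):
    inverted_index.setdefault(d["uuid"], []).append(i)
index_hits = inverted_index.get("a", [])

assert scan_hits == index_hits == [0, 2]
```

    The scan costs work proportional to the number of documents on every query, while the index lookup pays that cost once at ingestion time.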

    Google Cloud Storage

    This guide shows you how to import data from GCP (Google Cloud Platform).

    Enable the Google Cloud Storage file system using the pinot-gcs plugin. In the controller or server, add the config:

    circle-info

    By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g., pinot-json, pinot-avro.

    Azure Data Lake Storage

    This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2).

    Enable the Azure Data Lake Storage Gen2 file system using the pinot-adls plugin. In the controller or server, add the config:

    circle-info

    By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g., pinot-json, pinot-avro.

    {
      "schemaName": "orders",
      "primaryKeyColumns": ["order_id"]
    }
    -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-plugin-to-include-1,pinot-plugin-to-include-2
    #CONTROLLER
    
    pinot.controller.storage.factory.class.[scheme]=className of the pinot file system
    pinot.controller.segment.fetcher.protocols=file,http,[scheme]
    pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    #SERVER
    
    pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
    pinot.server.segment.fetcher.protocols=file,http,[scheme]
    pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinotFSSpecs
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    Azure Data Lake Storage
    Formats and Filesystems
    Original Upsert and Dedup Docs
    What is Apache Pinot? (and User-Facing Analytics) by Tim Berglund
    Use refresh-style uploads when replacing an existing batch.
    Stream ingestion with Upsert


    Local
    Docker
    Version reference
    http://localhost:9000arrow-up-right
    First batch ingest
    ADLSarrow-up-right
    its own configuration
    Controller configuration reference
    set up Zookeeper
    pull the Pinot Docker image
  • Isolation levels (read_committed / read_uncommitted)



    org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory

    org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory

    main Kafka ingestion guide
    Ingest streaming data from Apache Kafka

    When indexes are applied, the query engine can more quickly work out which documents satisfy the filter criteria, reducing the time it takes to execute the query.

    hashtag
    What indexes does Pinot support?

    By default, Pinot creates a forward index for every column. The forward index generally stores documents in insertion order.

    However, before flushing the segment, Pinot does a single pass over every column to see whether the data is sorted. If data is sorted, Pinot creates a sorted (forward) index for that column instead of the forward index.
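    The single pass can be sketched as follows (an illustrative check, not Pinot's implementation):

```python
def is_sorted(column_values):
    # One pass over the column: true if every value is >= its predecessor.
    return all(a <= b for a, b in zip(column_values, column_values[1:]))

assert is_sorted([1, 2, 2, 5])
assert not is_sorted([3, 1, 2])
```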

    For real-time tables you can also explicitly tell Pinot that one of the columns should be sorted. For more details, see the [Sorted Index Documentation](../../../../build-with-pinot/indexing/forward-index.md#real-time-tables).

    For filtering documents within a segment, Pinot supports the following indexing techniques:

    • Inverted index: Used for exact lookups.

    • Range index: Used for range queries.

    • Text index: Used for phrase, term, boolean, prefix, or regex queries.

    • Geospatial index: Based on H3, a hexagon-based hierarchical gridding. Used for finding points that exist within a certain distance from another point.

    • JSON index: Used for querying columns in JSON documents.

    • Star-tree index: Pre-aggregates results across multiple columns.

    hashtag
    View events table

    Let's see how we can apply these indexing techniques to our data. To recap, the events table has the following fields:

    Date Time Fields | Dimension Fields | Metric Fields
    --- | --- | ---
    ts | uuid | count

    We might want to write queries that filter on the ts and uuid columns, so these are the columns on which we would want to configure indexes.

    Since the data we're ingesting into the Kafka topic is implicitly ordered by timestamp, the ts column already has a sorted index, so any queries that filter on this column are already optimized.

    So that leaves us with the uuid column.

    hashtag
    Add an inverted index

    We're going to add an inverted index to the uuid column so that queries that filter on that column return more quickly. We need to add the following line to the tableIndexConfig section:

    Copy the following to the clipboard:

    /tmp/pinot/table-config-stream.json

    Navigate to localhost:9000/#/tenants/table/events_REALTIMEarrow-up-right, click on Edit Table, paste the next table config, and then click Save.

    Once you've done that, you'll need to click Reload All Segments and then Yes to apply the indexing change to all segments.

    hashtag
    Check the index has been applied

    We can check that the index has been applied to all our segments by querying Pinot's REST API. You can find Swagger documentation at localhost:9000/helparrow-up-right.

    The following query will return the indexes defined on the uuid column:

    Output

    We're using the jq command line JSON processorarrow-up-right to extract the fields that we're interested in.

    We can see from looking at the inverted-index property that the index has been applied.

    hashtag
    Querying

    You can now run some queries that filter on the uuid column, as shown below:

    You'll need to change the actual uuid value to a value that exists in your database, because the UUIDs are generated randomly by our script.

    Ingest data from Apache Kafka
    Ingest data from Apache Kafka

    The GCS file system provides the following options:

    • projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.

    • gcpKey - Location of the JSON file containing the GCP keys. Refer to Creating and managing service account keysarrow-up-right to download the keys.

    Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:

    hashtag
    Examples

    hashtag
    Job spec

    hashtag
    Controller config

    hashtag
    Server config

    hashtag
    Minion config

    Google Cloud Storagearrow-up-right

    The Azure Data Lake Storage file system provides the following options:

    • accountName: Name of the Azure account under which the storage is created.

    • accessKey: Access key required for the authentication.

    • fileSystemName: Name of the file system to use, for example, the container name (similar to the bucket name in S3).

    • enableChecksum: Enable MD5 checksum for verification. Default is false.

    Each of these properties should be prefixed by pinot.[node].storage.factory.class.adl2. where node is either controller or server depending on the config, like this:

    hashtag
    Examples

    hashtag
    Job spec

    hashtag
    Controller config

    hashtag
    Server config

    hashtag
    Minion config

    {
      "tableName": "orders_OFFLINE",
      "tableType": "OFFLINE",
      "segmentsConfig": {
        "timeColumnName": "event_time",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "30",
        "replication": "3"
      },
      "upsertConfig": {
        "mode": "FULL",
        "comparisonColumns": ["event_time"]
      }
    }
    export PINOT_VERSION=<your-pinot-version>
    mkdir -p /tmp/pinot-quick-start/rawdata
    /tmp/pinot-quick-start/rawdata/transcript.csv
    studentID,firstName,lastName,gender,subject,score,timestampInEpoch
    200,Lucy,Smith,Female,Maths,3.8,1570863600000
    200,Lucy,Smith,Female,English,3.5,1571036400000
    201,Bob,King,Male,Maths,3.2,1571900400000
    202,Nick,Young,Male,Physics,3.6,1572418800000
    /tmp/pinot-quick-start/transcript-schema.json
    {
      "schemaName": "transcript",
      "dimensionFieldSpecs": [
        { "name": "studentID", "dataType": "INT" },
        { "name": "firstName", "dataType": "STRING" },
        { "name": "lastName", "dataType": "STRING" },
        { "name": "gender", "dataType": "STRING" },
        { "name": "subject", "dataType": "STRING" }
      ],
      "metricFieldSpecs": [
        { "name": "score", "dataType": "FLOAT" }
      ],
      "dateTimeFieldSpecs": [{
        "name": "timestampInEpoch",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }]
    }
    /tmp/pinot-quick-start/transcript-table-offline.json
    {
      "tableName": "transcript",
      "segmentsConfig": {
        "timeColumnName": "timestampInEpoch",
        "timeType": "MILLISECONDS",
        "replication": "1",
        "schemaName": "transcript"
      },
      "tableIndexConfig": {
        "invertedIndexColumns": [],
        "loadMode": "MMAP"
      },
      "tenants": {
        "broker": "DefaultTenant",
        "server": "DefaultTenant"
      },
      "tableType": "OFFLINE",
      "metadata": {}
    }
    bin/pinot-admin.sh AddTable \
      -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
      -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
      -exec
    docker run --rm -ti \
        --network=pinot-demo \
        -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
        --name pinot-table-creation \
        apachepinot/pinot:${PINOT_VERSION} AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
        -controllerHost pinot-controller \
        -controllerPort 9000 \
        -exec
    curl -X GET "http://localhost:9000/periodictask/names" -H "accept: application/json"
    
    [
      "RetentionManager",
      "OfflineSegmentIntervalChecker",
      "RealtimeSegmentValidationManager",
      "BrokerResourceValidationManager",
      "SegmentStatusChecker",
      "SegmentRelocator",
      "StaleInstancesCleanupTask",
      "TaskMetricsEmitter"
    ]
    curl -X GET "http://localhost:9000/periodictask/run?taskname=SegmentStatusChecker&tableName=jsontypetable&type=OFFLINE" -H "accept: application/json"
    
    {
      "Log Request Id": "api-09630c07",
      "Controllers notified": true
    }
    docker run \
        --network=pinot-demo \
        --name pinot-controller \
        -p 9000:9000 \
        -d ${PINOT_IMAGE} StartController \
        -zkAddress pinot-zookeeper:2181
    bin/pinot-admin.sh StartController \
      -zkAddress localhost:2181 \
      -clusterName PinotCluster \
      -controllerPort 9000
    {
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "your-topic",
        "stream.kafka.broker.list": "kafka:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "realtime.segment.flush.threshold.rows": "0",
        "realtime.segment.flush.threshold.time": "24h",
        "realtime.segment.flush.threshold.segment.size": "100M"
      }
    }
    {
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "your-topic",
        "stream.kafka.broker.list": "kafka:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
      }
    }
    {
      "streamConfigs": {
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.consumer.prop.max.poll.records": "500",
        "stream.kafka.consumer.prop.fetch.min.bytes": "100000",
        "stream.kafka.consumer.prop.session.timeout.ms": "30000"
      }
    }
    "invertedIndexColumns": ["uuid"]
    {
      "tableName": "events",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "ts",
        "schemaName": "events",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "invertedIndexColumns": ["uuid"],
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.topic.name": "events",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.broker.list": "kafka:9092",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    curl -X GET "http://localhost:9000/segments/events/metadata?columns=uuid" \
      -H "accept: application/json" 2>/dev/null | 
      jq '.[] | [.segmentName, .indexes]'
    [
      "events__0__1__20220214T1106Z",
      {
        "uuid": {
          "bloom-filter": "NO",
          "dictionary": "YES",
          "forward-index": "YES",
          "inverted-index": "YES",
          "null-value-vector-reader": "NO",
          "range-index": "NO",
          "json-index": "NO"
        }
      }
    ]
    [
      "events__0__0__20220214T1053Z",
      {
        "uuid": {
          "bloom-filter": "NO",
          "dictionary": "YES",
          "forward-index": "YES",
          "inverted-index": "YES",
          "null-value-vector-reader": "NO",
          "range-index": "NO",
          "json-index": "NO"
        }
      }
    ]
    SELECT * 
    FROM events 
    WHERE uuid = 'f4a4f'
    LIMIT 10
    -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-gcs
    pinot.controller.storage.factory.class.gs.projectId=test-project
    executionFrameworkSpec:
        name: 'standalone'
        segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
        segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
        segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: 'gs://my-bucket/path/to/input/directory/'
    outputDirURI: 'gs://my-bucket/path/to/output/directory/'
    overwriteOutput: true
    pinotFSSpecs:
        - scheme: gs
          className: org.apache.pinot.plugin.filesystem.GcsPinotFS
          configs:
            projectId: 'my-project'
            gcpKey: 'path-to-gcp json key file'
    recordReaderSpec:
        dataFormat: 'csv'
        className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
        configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
        tableName: 'students'
    pinotClusterSpecs:
        - controllerURI: 'http://localhost:9000'
    controller.data.dir=gs://path/to/data/directory/
    controller.local.temp.dir=/path/to/local/temp/directory
    controller.enable.split.commit=true
    pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
    pinot.controller.storage.factory.gs.projectId=my-project
    pinot.controller.storage.factory.gs.gcpKey=path/to/gcp/key.json
    pinot.controller.segment.fetcher.protocols=file,http,gs
    pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
    pinot.server.storage.factory.gs.projectId=my-project
    pinot.server.storage.factory.gs.gcpKey=path/to/gcp/key.json
    pinot.server.segment.fetcher.protocols=file,http,gs
    pinot.server.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.minion.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
    pinot.minion.storage.factory.gs.projectId=my-project
    pinot.minion.storage.factory.gs.gcpKey=path/to/gcp/key.json
    pinot.minion.segment.fetcher.protocols=file,http,gs
    pinot.minion.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-adls
    pinot.controller.storage.factory.class.adl2.accountName=test-user
    executionFrameworkSpec:
        name: 'standalone'
        segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
        segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
        segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: 'adl2://path/to/input/directory/'
    outputDirURI: 'adl2://path/to/output/directory/'
    overwriteOutput: true
    pinotFSSpecs:
        - scheme: adl2
          className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
          configs:
            accountName: 'my-account'
            accessKey: 'foo-bar-1234'
            fileSystemName: 'fs-name'
    recordReaderSpec:
        dataFormat: 'csv'
        className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
        configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
        tableName: 'students'
    pinotClusterSpecs:
        - controllerURI: 'http://localhost:9000'
    controller.data.dir=adl2://path/to/data/directory/
    controller.local.temp.dir=/path/to/local/temp/directory
    controller.enable.split.commit=true
    pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    pinot.controller.storage.factory.adl2.accountName=my-account
    pinot.controller.storage.factory.adl2.accessKey=foo-bar-1234
    pinot.controller.storage.factory.adl2.fileSystemName=fs-name
    pinot.controller.segment.fetcher.protocols=file,http,adl2
    pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    pinot.server.storage.factory.adl2.accountName=my-account
    pinot.server.storage.factory.adl2.accessKey=foo-bar-1234
    pinot.server.storage.factory.adl2.fileSystemName=fs-name
    pinot.server.segment.fetcher.protocols=file,http,adl2
    pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    storage.factory.adl2.accountName=my-account
    storage.factory.adl2.fileSystemName=fs-name
    storage.factory.adl2.accessKey=foo-bar-1234
    segment.fetcher.protocols=file,http,adl2
    segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    If this were a real-time table, the real-time segments (both consuming and completed ones) would be hosted in Pinot servers tagged in Helix as serverTenantName_REALTIME.
    tenants
    Getting Pinot
    Rest APIarrow-up-right
    Defining tenants for tables
    Table isolation using tenants
    Apache Zookeeperarrow-up-right
    Apache Helixarrow-up-right
    Cluster configuration reference
    Server
    Broker
    Controller
    tenants
    tables
    segments
    tenants
    Running Pinot in Docker
    Running Pinot locally

    First Batch Ingest

    Import your first batch of data into Pinot and see it appear in the query console.

    hashtag
    Outcome

    By the end of this page you will have imported CSV data into your transcript offline table and confirmed the rows are queryable.

    hashtag
    Prerequisites

    • Completed the previous step -- the transcript_OFFLINE table must already exist.

    • The sample CSV file at /tmp/pinot-quick-start/rawdata/transcript.csv from the previous step.

    • For Docker users: set the PINOT_VERSION environment variable.

    hashtag
    Steps

    hashtag
    1. Understand batch ingestion

    Batch ingestion reads data from files (CSV, JSON, Avro, Parquet, and others), converts them into Pinot segments, and pushes those segments to the cluster. A job specification YAML file tells Pinot where to find the input data, what format it is in, and where to send the finished segments.

    hashtag
    2. Create the ingestion job spec

    Create the file /tmp/pinot-quick-start/batch-job-spec.yml:
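    A minimal job spec for this quickstart could look like the following (a sketch assuming the standalone execution framework and the local file paths from the previous step; adjust paths and table name to your setup):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```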

    When running inside Docker, the ingestion job container must reach the controller by its Docker network hostname, not localhost. Create the file /tmp/pinot-quick-start/batch-job-spec.yml:

    circle-info

    Replace pinot-controller with the hostname of your controller container if you used a different name.

    hashtag
    3. Run the ingestion job

    The job reads the CSV file, builds a segment, and pushes it to the controller. You should see log output ending with a success message.

    hashtag
    Verify

    1. Open the query console at localhost:9000 in your browser.

    2. Run the following query:
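    For example (assuming the table name transcript from the previous step):

```sql
SELECT * FROM transcript
```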

    3. You should see 4 rows returned, matching the CSV data you loaded, with columns studentID, firstName, lastName, gender, subject, score, and timestampInEpoch.

    hashtag
    Next step

    Continue to Ingest data from Apache Kafka to learn how to set up real-time ingestion.

    Schema

    Explore the Schema component in Apache Pinot, vital for defining the structure and data types of Pinot tables, enabling efficient data processing and analysis.

    Each table in Pinot is associated with a schema. A schema defines:

    • Fields in the table with their data types.

    • Whether the table uses column-based or table-based null handling. For more information, see Null value support.

    The schema is stored in Zookeeper along with the table configuration.

    circle-info

    Schema naming in Pinot follows typical database table naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.

    hashtag
    Categories

    A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:

    Category | Description
    --- | ---
    Dimension | Used in filters and GROUP BY clauses for slicing and dicing data.
    Metric | Used in aggregations; represents quantitative measurements.
    DateTime | Represents the timestamp associated with each row.

    Pinot does not enforce strict rules on which of these categories columns belong to; rather, the categories can be thought of as hints that let Pinot perform internal optimizations.

    For example, metrics may be stored without a dictionary and can have a different default null value.

    The categories are also relevant when doing segment merge and rollups. Pinot uses the dimension and time fields to identify records against which to apply merge/rollups.

    Metrics aggregation is another example: Pinot uses the dimension and time fields as the key and automatically aggregates values for the metric columns.
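    A toy sketch of the idea (not Pinot internals): dimension and time columns form the grouping key, and metric columns are summed per key.

```python
from collections import defaultdict

# Toy rollup: (dimension, time) is the key; the metric value is summed.
# Column roles are illustrative, not tied to a real Pinot table.
rows = [
    ("us", 1570863600000, 5),   # (country, ts, count)
    ("us", 1570863600000, 3),
    ("eu", 1570863600000, 7),
]
rollup = defaultdict(int)
for country, ts, count in rows:
    rollup[(country, ts)] += count

assert rollup[("us", 1570863600000)] == 8
assert rollup[("eu", 1570863600000)] == 7
```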

    For configuration details, see .

    hashtag
    Date and time fields

    Since Pinot doesn't have a dedicated DATETIME datatype, you need to input time in STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format, such as an epoch timestamp, to perform operations. Refer to the date-time field documentation for more details on supported formats.
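    For example, an epoch-millisecond timestamp stored as a LONG can be declared like this (the field name is illustrative):

```json
"dateTimeFieldSpecs": [{
  "name": "timestampInEpoch",
  "dataType": "LONG",
  "format": "1:MILLISECONDS:EPOCH",
  "granularity": "1:MILLISECONDS"
}]
```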

    hashtag
    Creating a schema

    First, make sure your cluster is up and running.

    Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.
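    A minimal sketch of what such a flight-data schema could look like (field names here are illustrative assumptions, not the exact sample):

```json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    { "name": "airline", "dataType": "STRING" },
    { "name": "origin", "dataType": "STRING" },
    { "name": "destination", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "delayMinutes", "dataType": "INT" }
  ],
  "dateTimeFieldSpecs": [{
    "name": "departureTime",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}
```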

    circle-info

    For more details on constructing a schema file, see the schema configuration reference.

    Then, we can upload the sample schema provided above using either a Bash command or REST API call.

Check out the schema in the to make sure it was successfully uploaded.

    Stream Ingestion with Dedup

    Deduplication support in Apache Pinot.

Pinot provides native support for deduplication (dedup) during real-time ingestion (v0.11.0+).

    hashtag
    Prerequisites for enabling dedup

    To enable dedup on a Pinot table, make the following table configuration and schema changes:

    hashtag
    Define the primary key in the schema

    To be able to dedup records, a primary key is needed to uniquely identify a given record. To define a primary key, add the field primaryKeyColumns to the schema definition.

    Note this field expects a list of columns, as the primary key can be composite.

    While ingesting a record, if its primary key is found to be already present, the record will be dropped.
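For example, a schema with a single-column primary key would include the primaryKeyColumns field at the top level (event_id is a hypothetical column name):

```json
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    { "name": "event_id", "dataType": "STRING" }
  ],
  "primaryKeyColumns": ["event_id"]
}
```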

    hashtag
    Partition the input stream by the primary key

    An important requirement for the Pinot dedup table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the API. If the original stream is not partitioned, then a streaming processing job (e.g. Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.

    hashtag
    Use strictReplicaGroup for routing

    The dedup Pinot table can use only the low-level consumer for the input streams. As a result, it uses the for the segments. Moreover, dedup poses the additional requirement that all segments of the same partition must be served from the same server to ensure the data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use that, configure instanceSelectorType in Routing as the following:
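For example, in the table config:

```json
{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```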

    circle-exclamation

    instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.

    hashtag
    Other limitations

    • The incoming stream must be partitioned by the primary key such that, all records with a given primaryKey must be consumed by the same Pinot server instance.

    hashtag
    Enable dedup in the table configurations

    To enable dedup for a REALTIME table, add the following to the table config.
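For example:

```json
{
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE"
  }
}
```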

    Supported values for hashFunction are NONE, MD5 and MURMUR3, with the default being NONE.

    hashtag
    Metadata TTL

The server stores the existing primary keys in a dedup metadata map kept on the JVM heap. As the dedup metadata grows, heap memory pressure increases, which may affect the performance of ingestion and queries. You can set a positive metadata TTL to enable the TTL mechanism and keep the metadata size bounded. By default, the table's time column is used as the dedup time column. The time unit of the TTL is the same as that of the dedup time column. The TTL should be set long enough so that new records can be deduplicated before their primary keys get removed. The time column must be of a NUMERIC data type when metadataTTL is enabled.
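For example, to keep dedup metadata for one day on a table whose time column is in milliseconds (mtime is an illustrative column name):

```json
{
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE",
    "metadataTTL": 86400000,
    "dedupTimeColumn": "mtime"
  }
}
```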

    hashtag
    Enable preload for faster server restarts

When ingesting new records, the server has to read the metadata map to check for duplicates. But when a server restarts, the documents in existing segments are all unique, as ensured by the dedup logic during real-time ingestion, so the server can perform write-only operations to bootstrap the metadata map faster.

    The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. It's 0 by default to disable the preloading feature. This preloading thread pool is shared with .
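A sketch of the table-level config, assuming the enablePreload flag inside dedupConfig (the server-level thread count is set separately, as described above):

```json
{
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE",
    "enablePreload": true
  }
}
```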

    hashtag
    Immutable dedup configuration fields

    triangle-exclamation

    Certain dedup and schema configuration fields cannot be modified after table creation.

    Changing these fields on an existing dedup table can lead to data inconsistencies or data loss between replicas. Pinot uses these configurations to determine which records to keep or discard, so altering them after data has been ingested will cause existing metadata to become inconsistent with the new configuration.

    The following fields are immutable after table creation:

    hashtag
    Best practices

Unlike other real-time tables, a dedup table takes up more memory resources because it needs to keep the primary keys and their corresponding segment references in memory. As a result, it's important to plan capacity beforehand and monitor resource usage. Here are some recommended practices for using dedup tables.

• Create the Kafka topic with more partitions. The number of Kafka partitions determines the partition count of the Pinot table. The more partitions you have in the Kafka topic, the more Pinot servers you can distribute the Pinot table to, and therefore the more you can scale the table horizontally. Note that you can't increase the number of partitions in the future for dedup-enabled tables, so start with enough partitions (at least 2-3X the number of Pinot servers).

    • For Dedup tables, updating primary key columns or the dedupTimeColumn is not recommended, as it may lead to data loss and inconsistencies between replicas. If a change is unavoidable, ensure that consumption is paused and all servers are restarted for the change to take effect. Even then, consistency is not guaranteed.

    Confluent Schema Registry Decoders

    Decode Avro, JSON, and Protobuf messages from Kafka using Confluent Schema Registry.

Pinot supports decoding Kafka messages serialized with Confluent Schema Registry for Avro, JSON Schema, and Protocol Buffers formats. These decoders automatically fetch and cache schemas from the registry, ensuring data is deserialized according to the registered schema.

    hashtag
    Available Decoders

    Format
    Decoder Class
    Plugin

    hashtag
    Common Configuration

    All Confluent Schema Registry decoders share the same configuration properties:

    Property
    Required
    Default
    Description

    hashtag
    SSL/TLS Configuration

    To connect to a Schema Registry endpoint over SSL/TLS, add properties with the schema.registry. prefix:

    Property
    Description

    hashtag
    Confluent Avro Decoder

    Decodes Avro-serialized Kafka messages with schema managed by Confluent Schema Registry.
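A typical streamConfigs fragment for the Avro decoder (topic name and registry URL are placeholders):

```json
{
  "streamType": "kafka",
  "stream.kafka.topic.name": "events",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://localhost:8081"
}
```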

    hashtag
    Confluent JSON Schema Decoder

    Decodes JSON messages serialized with Confluent's JSON Schema serializer. Messages include a schema ID header that the decoder uses to fetch the JSON Schema from the registry for validation.

    circle-info

    The JSON Schema decoder validates incoming messages against the schema registered in Schema Registry. Messages that don't match the magic byte format (non-Confluent messages) are silently dropped.

    hashtag
    Confluent Protobuf Decoder

    Decodes Protocol Buffer messages serialized with Confluent's Protobuf serializer. The decoder fetches the .proto schema definition from the registry and deserializes the binary payload.

    hashtag
    SSL/TLS Example

    To connect to a secured Schema Registry:
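For example (hostnames, paths, and passwords are placeholders):

```json
{
  "stream.kafka.decoder.prop.schema.registry.rest.url": "https://registry.example.com:8081",
  "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "/path/to/truststore.jks",
  "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "changeit",
  "stream.kafka.decoder.prop.schema.registry.ssl.keystore.location": "/path/to/keystore.jks",
  "stream.kafka.decoder.prop.schema.registry.ssl.keystore.password": "changeit"
}
```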

    hashtag
    How Schema Resolution Works

    1. Each Confluent-serialized message starts with a magic byte (0x00) followed by a 4-byte schema ID

    2. The decoder extracts the schema ID from the message header

    3. The schema is fetched from Schema Registry and cached locally (up to cached.schema.map.capacity)

Messages without the Confluent magic byte prefix are dropped and logged as errors.
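The wire format described above can be sketched in a few lines of Python (an illustration of the framing, not Pinot's actual implementation):

```python
import struct

def parse_confluent_header(payload: bytes):
    """Split a Confluent-framed Kafka message into (schema_id, body).

    Wire format: magic byte 0x00, then a 4-byte big-endian schema ID,
    then the serialized record bytes.
    """
    if len(payload) < 5 or payload[0] != 0x00:
        raise ValueError("not a Confluent-framed message")
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id, payload[5:]

# Schema ID 42 followed by the record bytes
sid, body = parse_confluent_header(b"\x00\x00\x00\x00\x2a" + b"record-bytes")
```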

    hashtag
    See Also

    • — General Kafka ingestion guide

    • — Full connector configuration reference

    • — All supported input formats

    Pinot Storage Model

    Apache Pinot™ uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system, including:

    • Tables to store data

    • Segments to partition data

• Tenants to isolate data

• Clusters to manage data

Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. To achieve this, all data needs to be distributed across multiple nodes. Pinot achieves this by breaking data into smaller chunks known as segments (similar to shards/partitions in HA relational databases). Segments can also be seen as time-based partitions.

    hashtag
    Table

    Similar to traditional databases, Pinot has the concept of a —a logical abstraction to refer to a collection of related data. As is the case with relational database management systems (RDBMS), a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a , which defines the columns in a table as well as their data types.

    As opposed to RDBMS schemas, multiple tables can be created in Pinot (real-time or batch) that inherit a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, and replication.

    Pinot stores data in . A Pinot table is conceptually identical to a relational database table with rows and columns. Columns have the same name and data type, known as the table's .

    Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.

    Pinot table types include:

    • real-time: Ingests data from a streaming source like Apache Kafka®

    • offline: Loads data from a batch source

    • hybrid: Loads data from both a batch source and a streaming source

    hashtag
    Segment

    Pinot tables are stored in one or more independent shards called . A small table may be contained by a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ). Segments have time-based partitions of table data, and are stored on Pinot that scale horizontally as needed for both storage and computation.

    hashtag
    Tenant

    To support multi-tenancy, Pinot has first class support for tenants. A table is associated with a . This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation between tenants provides different namespaces for applications and teams to prevent sharing tables or schemas. Development teams building applications do not have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.

    Every table is associated with a , or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., and ) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data from separate workloads from being stored or processed on the same physical hardware.

    By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.

    hashtag
    Cluster

    A Pinot is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see .

    hashtag
    Physical architecture


    A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop.

    • Controller: Maintains cluster metadata and manages cluster resources.

    • Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.

• Broker: Accepts queries from client processes and forwards them to servers for processing.

• Server: Provides storage for segment files and compute for query processing.

• (Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency.

    The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.

Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.

    Helix is a cluster management solution created by the authors of Pinot. Helix maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. It constantly monitors the cluster to ensure that the right hardware resources are allocated to implement the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.

    hashtag
    Controller

A controller is the core orchestrator that drives the consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and have visibility into the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent that is a participant responsible for observing and driving state changes that are subscribed to by other components.

The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time tables and offline tables). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.

    hashtag
    Server

Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can be either a real-time server or an offline server.

Real-time and offline servers have very different resource usage requirements: real-time servers continually consume new messages from external systems (such as Kafka topics), which are ingested and allocated to segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.

    hashtag
    Broker

Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return them to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.

    A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.

    hashtag
    Pinot minion

    Pinot minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation). As Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a solution for this purpose that complies with GDPR while optimizing Pinot segments and building additional indices that guarantees performance in the presence of the possibility of data deletion. One can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (Minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.

A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.

    Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.


    Multistage Lite Mode

    Introduces the Multistage Engine Lite Mode

    circle-exclamation

    MSE Lite Mode is included in Pinot 1.4 and is currently in Beta. This Beta label applies to Lite Mode specifically, not to the core multi-stage engine, which is generally available.


    Multistage Engine (MSE) Lite Mode is an optional, guardrail-oriented execution mode for self-service and high-QPS tenants. Without additional bounds, queries can scan a large number of records or run expensive operations, which can impact the reliability of a shared tenant and create friction in onboarding new use-cases. Lite Mode addresses this by capping the rows returned from each leaf stage and applying tighter resource bounds automatically.

It is based on the observation that most users need access to advanced SQL features like window functions and subqueries, but aren't interested in scanning a lot of data or running fully distributed joins.

    hashtag
    Overview

    MSE Lite Mode has the following key characteristics:

    • Users can still use all MSE query features like Window Functions, Subqueries, Joins, etc.

• But the maximum number of rows returned by a leaf stage is capped at a user-configurable value. The default value is 100,000.

    • Query execution follows a scatter-gather paradigm, similar to the Single-stage Engine. This is different from regular MSE that uses shuffles across Pinot Servers.

    Leaf Stage in a Multistage Engine query usually refers to Table Scan, an optional Project, an optional Filter and an optional Aggregate Plan Node.

    At present, all joins in MSE Lite Mode are run in the Broker. This may change with the next release, since Colocated Joins can theoretically be run in the Servers.

    hashtag
    Example

    To illustrate how MSE Lite Mode applies automatic resource bounds, consider the query below based on the colocated_join Quickstart. If this query were allowed in production with the regular MSE, it would scan all the rows of the userFactEvents table. With Lite Mode, the full scan will be prevented because Lite Mode will automatically add a Sort to the leaf stage with a configurable limit (aka "fetch") value.

    The query plan for this query would be as follows. The window function, the filter in the filtered-events table, and the aggregation would be run in the Pinot Broker using a single thread. We assume that the Pinot Broker is configured with the lite mode limit value of 100k records:


    hashtag
    Enabling Lite Mode

    To use Lite Mode, you can use the following query options.
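As an illustration, assuming the query option is named liteMode (verify the option name against your Pinot version, since Lite Mode is in Beta; the query reuses the userFactEvents table from the Quickstart mentioned above):

```sql
SET liteMode = true;
SELECT userID, COUNT(*)
FROM userFactEvents
GROUP BY userID
```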

    hashtag
    Running Non-Leaf Stages in Pinot Servers

By default, Lite Mode runs the non-leaf stages in the Broker. If you want to run the non-leaf stages in Pinot Servers, set the following query option to false. In this case, a random server will be picked for the non-leaf stages.

    hashtag
    Configuration

    You can set the following configs in your Pinot Broker.

    Configuration Key
    Default
    Description

    hashtag
    FAQ

    hashtag
    Q1: What is the Lite Mode intended for?

    Lite Mode was contributed by Uber and is inspired from . Lite Mode is an optional execution mode with tighter scan and resource bounds, designed for use-cases where users need advanced SQL features (window functions, subqueries, etc.) but do not need fully distributed execution of joins or CTEs. One can think of this as an advanced version of the Single-Stage Engine.

    hashtag
    Q2: Why use a single thread in the broker for the non-leaf stages?

    Using a single thread, or more importantly a single Operator Chain, means that the entire stage can be run without any Exchange. It also keeps the design simple and makes it easy to reason about performance and debugging.

    hashtag
    Q3: Can Lite Mode be used in tandem with server/segment pruning for high QPS use-cases?

Yes, if you set up segmentPrunerTypes in your Table Config, then segments and servers will be pruned. You can use this to scale out read QPS.

    JOINs

Pinot supports JOINs, including left, right, full, semi, anti, lateral, and equi JOINs. Use JOINs to connect two tables to generate a unified view, based on a related column between the tables.

This page explains the syntax used to write joins. For a more in-depth understanding of how joins work, it is recommended to read Optimizing joins and also this blog from StarTree.

    circle-info

    Important: To query using JOINs, you must use Pinot's multi-stage engine (MSE).

    hashtag
    INNER JOIN

    The inner join selects rows that have matching values in both tables.

    hashtag
    Syntax

    hashtag
    Example of inner join

    Joins a table containing user transactions with a table containing promotions shown to the users, to show the spending for every userID.
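A sketch of such a query (table and column names are illustrative):

```sql
SELECT t.userID, SUM(t.amount) AS totalSpend
FROM transactions AS t
INNER JOIN promotions AS p
    ON t.userID = p.userID
GROUP BY t.userID
```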

    hashtag
    LEFT JOIN

    A left join returns all values from the left relation and the matched values from the right table, or appends NULL if there is no match. Also referred to as a left outer join.

    hashtag
    Syntax:

    hashtag
    RIGHT JOIN

    A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.

    hashtag
    Syntax:

    hashtag
    FULL JOIN

    A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.

    hashtag
    Syntax:

    hashtag
    CROSS JOIN

A cross join returns the Cartesian product of two relations. If no WHERE clause is used along with CROSS JOIN, this produces a result set that is the number of rows in the first table multiplied by the number of rows in the second table. If a WHERE clause is included with CROSS JOIN, it functions like an inner join.

    hashtag
    Syntax:

    hashtag
    SEMI JOIN

    Semi-join returns rows from the first table where matches are found in the second table. Returns one copy of each row in the first table for which a match is found.

    hashtag
    Syntax:

    Some subqueries, like the following are also implemented as a semi-join under the hood:
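For example, an IN subquery of this shape is executed as a semi-join (table and column names are illustrative):

```sql
SELECT orderID, amount
FROM orders
WHERE customerID IN (
    SELECT customerID FROM vipCustomers
)
```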

    hashtag
    ANTI JOIN

    Anti-join returns rows from the first table where no matches are found in the second table. Returns one copy of each row in the first table for which no match is found.

    hashtag
    Syntax:

    Some subqueries, like the following are also implemented as an anti-join under the hood:
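For example, a NOT IN subquery of this shape is executed as an anti-join (table and column names are illustrative):

```sql
SELECT orderID, amount
FROM orders
WHERE customerID NOT IN (
    SELECT customerID FROM vipCustomers
)
```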

    hashtag
    Equi join

    An equi join uses an equality operator to match a single or multiple column values of the relative tables.

    hashtag
    Syntax:

    hashtag
    ASOF JOIN

    An ASOF JOIN selects rows from two tables based on a "closest match" algorithm.

    hashtag
    Syntax:

The comparison operator in the MATCH_CONDITION can be one of <, >, <=, >=. Similar to an inner join, an ASOF join first calculates the set of matching rows in the right table for each row in the left table based on the ON condition. But instead of returning all of these rows, only the closest match (if one exists) based on the match condition is returned. Note that the two columns in the MATCH_CONDITION should be of the same type.

    The join condition in ON is mandatory and has to be a conjunction of equality comparisons (i.e., non-equi join conditions and clauses joined with OR aren't allowed). ON true can be used in case the join should only be performed using the MATCH_CONDITION.
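A sketch of an ASOF join (table and column names are illustrative), matching each order to the latest quote at or before the order's timestamp for the same symbol:

```sql
SELECT o.orderID, o.ts, q.price
FROM orders AS o
ASOF JOIN quotes AS q
    MATCH_CONDITION (o.ts >= q.ts)
    ON o.symbol = q.symbol
```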

    hashtag
    LEFT ASOF JOIN

A LEFT ASOF JOIN is similar to the ASOF JOIN, with the only difference being that all rows from the left table are returned, even those without a match in the right table, with the unmatched rows padded with NULL values (similar to the difference between an INNER JOIN and a LEFT JOIN).

    hashtag
    Syntax:

    Segment Compaction on Upserts

    Use segment compaction on upsert-enabled real-time tables.

    hashtag
    Overview of segment compaction

    Compacting a segment replaces the completed segment with a compacted segment that only contains the latest version of records. For more information about how to use upserts on a real-time table in Pinot, see Stream Ingestion with Upsert.

    The Pinot upsert feature stores all versions of the record ingested into immutable segments on disk. Even though the previous versions are not queried, they continue to add to the storage overhead. To remove older records (no longer used in query results) and reclaim storage space, we need to compact Pinot segments periodically. Segment compaction is done via a new minion task. To schedule Pinot tasks periodically, see the Minion documentation.

    hashtag
    Compact segments on upserts in a real-time table

    To compact segments on upserts, complete the following steps:

    1. Ensure task scheduling is enabled and a minion is available.

2. Add the following to your table configuration. These configurations (except schedule) determine which segments to compact.

• bufferTimePeriod: To compact segments as soon as they complete, set to “0d”. To delay compaction, specify the number of days to wait after a segment completes (the configuration above uses "7d", a 7-day delay).

    • invalidRecordsThresholdPercent (Optional) Limits the older records allowed in the completed segment as a percentage of the total number of records in the segment. In the example above, the completed segment may be selected for compaction when 30% of the records in the segment are old.
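The task configuration described above looks like this in the table config (the cron schedule is an example):

```json
"task": {
  "taskTypeConfigsMap": {
    "UpsertCompactionTask": {
      "schedule": "0 */5 * ? * *",
      "bufferTimePeriod": "7d",
      "invalidRecordsThresholdPercent": "30",
      "invalidRecordsThresholdCount": "100000"
    }
  }
}
```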

    circle-exclamation

When using the two in-memory types, if the server is restarted, the upsert view becomes consistent again once the server re-ingests the data it had ingested before the restart. The in-memory bitmaps are updated as the server ingests data into the consuming segment, even before the consuming segment is committed. So if the server is restarted while still consuming data, the upsert view becomes consistent once it catches up with the previously ingested data. In contrast, the bitmap snapshots are only taken after committing the segment, so they can be more consistent on server restarts, but they are also eventually consistent if the server is restarted while ingesting data.

    circle-info

    Because segment compaction is an expensive operation, we do not recommend setting invalidRecordsThresholdPercent and invalidRecordsThresholdCount too low (close to 1). By default, all configurations above are 0, so no thresholds are applied.

    hashtag
    Example

    The following example includes a dataset with 24M records and 240K unique keys that have each been duplicated 100 times. After ingesting the data, there are 6 segments (5 completed segments and 1 consuming segment) with a total estimated size of 22.8MB.

    Example dataset

    Submitting the query “set skipUpsert=true; select count(*) from transcript_upsert” before compaction produces 24,000,000 results:

    Results before segment compaction

    After the compaction tasks are complete, the reports the following.

    Minion compaction task completed

Segment compaction generates a task for each segment to compact. Five tasks were generated in this case because 90% of the records (3.6M–4.5M records) in the completed segments are considered ready for compaction, exceeding the configured thresholds.

    circle-info

    If a completed segment only contains old records, Pinot immediately deletes the segment (rather than creating a task to compact it).

    Submitting the query again shows the count matches the set of 240K unique keys.

    Results after segment compaction

Once segment compaction has completed, the total number of segments remains the same and the total estimated size drops to 2.77MB.

    circle-info

To further improve query latency, merge small segments into larger ones.

    First Stream Ingest

    Set up real-time streaming ingestion from Kafka and watch data arrive in Pinot.

    circle-info

    For Kubernetes-specific streaming ingestion, see .

    hashtag
    Outcome

    By the end of this page you will have a realtime Pinot table consuming data from a Kafka topic, with 12 rows visible in the query console.

    Hadoop

    Batch ingestion of data into Apache Pinot using Apache Hadoop.

    hashtag
    Segment Creation and Push

Pinot supports Apache Hadoop as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Hadoop code to process your files and convert and upload them to Pinot.

    You can follow the to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar

    Ingest from Amazon Kinesis

This guide shows you how to ingest a stream of records from an Amazon Kinesis stream into a Pinot table.

    To ingest events from an Amazon Kinesis stream into Pinot, set the following configs into your table config:
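A sketch of the streamConfigs section in the table config (stream name and region are placeholders):

```json
{
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "my-kinesis-stream",
  "region": "us-east-1",
  "shardIteratorType": "LATEST",
  "stream.kinesis.consumer.type": "lowlevel",
  "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
  "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "maxRecordsToFetch": "20"
}
```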

    where the Kinesis specific properties are:

    Property
    Description
    "tenants": {
      "broker": "brokerTenantName",
      "server": "serverTenantName"
    }
    sample-broker-tenant.json
    {
         "tenantRole" : "BROKER",
         "tenantName" : "sampleBrokerTenant",
         "numberOfInstances" : 3
    }
bin/pinot-admin.sh AddTenant \
    -name sampleBrokerTenant \
    -role BROKER \
    -instanceCount 3 -exec
    curl -i -X POST -H 'Content-Type: application/json' -d @sample-broker-tenant.json localhost:9000/tenants
    sample-server-tenant.json
    {
         "tenantRole" : "SERVER",
         "tenantName" : "sampleServerTenant",
         "offlineInstances" : 1,
         "realtimeInstances" : 1
    }
    bin/pinot-admin.sh AddTenant \
        -name sampleServerTenant \
        -role SERVER \
        -offlineInstanceCount 1 \
        -realtimeInstanceCount 1 -exec
    curl -i -X POST -H 'Content-Type: application/json' -d @sample-server-tenant.json localhost:9000/tenants

    Server: Provides storage for segment files and compute for query processing.

  • (Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).


    | studentID | firstName | lastName | gender | subject | score | timestampInEpoch |
    | --- | --- | --- | --- | --- | --- | --- |
    | 200 | Lucy | Smith | Female | Maths | 3.8 | 1570863600000 |
    | 200 | Lucy | Smith | Female | English | 3.5 | 1571036400000 |
    | 201 | Bob | King | Male | Maths | 3.2 | 1571900400000 |
    | 202 | Nick | Young | Male | Physics | 3.6 | 1572418800000 |

    /tmp/pinot-quick-start/batch-job-spec.yml
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: '/tmp/pinot-quick-start/rawdata/'
    includeFileNamePattern: 'glob:**/*.csv'
    outputDirURI: '/tmp/pinot-quick-start/segments/'
    overwriteOutput: true
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: 'transcript'
      schemaURI: 'http://localhost:9000/tables/transcript/schema'
      tableConfigURI: 'http://localhost:9000/tables/transcript'
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'


    Dimension

    Dimension columns are typically used in slice-and-dice operations for answering business queries. Some operations for which dimension columns are used:

    • GROUP BY -- group by one or more dimension columns along with aggregations on one or more metric columns

    • Filter clauses such as WHERE

    Metric

    These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns. Some operations for which metric columns are used:

    • Aggregation -- SUM, MIN, MAX, COUNT, AVG, etc.

    • Filter clauses such as WHERE

    DateTime

    This column represents time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config. Pinot uses the primary time column to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND, and optional if the push type is REFRESH. Common operations on the time column:

    • GROUP BY

    • Filter clauses such as WHERE


    • The message payload is deserialized using the resolved schema

    • Fields are extracted into Pinot's GenericRow format for ingestion

    | Format | Decoder class | Plugin |
    | --- | --- | --- |
    | Avro | org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder | pinot-confluent-avro |
    | JSON Schema | org.apache.pinot.plugin.inputformat.json.confluent.KafkaConfluentSchemaRegistryJsonMessageDecoder | pinot-confluent-json |
    | Protocol Buffers | org.apache.pinot.plugin.inputformat.protobuf.KafkaConfluentSchemaRegistryProtoBufMessageDecoder | pinot-confluent-protobuf |

    | Property | Required | Default | Description |
    | --- | --- | --- | --- |
    | schema.registry.rest.url | Yes | — | Confluent Schema Registry REST endpoint URL |
    | cached.schema.map.capacity | No | 1000 | Maximum number of schemas to cache locally |
    | schema.registry.ssl.truststore.location | No | — | Path to truststore file |
    | schema.registry.ssl.truststore.password | No | — | Truststore password |
    | schema.registry.ssl.keystore.location | No | — | Path to keystore file |
    | schema.registry.ssl.keystore.password | No | — | Keystore password |
    | schema.registry.ssl.key.password | No | — | Private key password |

    Leaf stage(s) are run in the Servers, and all other operators are run using a single thread in the Broker.

    | Property | Default | Description |
    | --- | --- | --- |
    | pinot.broker.multistage.run.in.broker | true | Whether to run the non-leaf stages in the broker by default. This controls the default value of the query option runInBroker. |
    | pinot.broker.multistage.lite.mode.leaf.stage.limit | 100000 | The maximum number of records that a given leaf stage instance on a server is allowed to return. Recommended value is 100k records or lower. |
    | pinot.broker.multistage.use.lite.mode | false | Default value of the query option useLiteMode. |

    INNER JOIN

    invalidRecordsThresholdCount (Optional) Limits the older records allowed in the completed segment by record count. In the example above, if the segment contains more than 100K records, it may be selected for compaction.

  • tableMaxNumTasks (Optional) Limits the number of tasks allowed to be scheduled.

  • validDocIdsType (Optional) Specifies the source of validDocIds to fetch when running the data compaction. The valid types are SNAPSHOT, IN_MEMORY, IN_MEMORY_WITH_DELETE

    • SNAPSHOT: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot from the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.

    • IN_MEMORY: This indicates that the validDocIds bitmap is loaded from the real-time server's in-memory.

    • IN_MEMORY_WITH_DELETE: This indicates that the validDocIds bitmap is read from the real-time server's in-memory state. The valid document ids here do take the deleted records into account. UpsertConfig's deleteRecordColumn must be provided for this type.


    | Property | Description |
    | --- | --- |
    | stream.kinesis.topic.name | Kinesis stream name |
    | region | Kinesis region, e.g. us-west-1 |
    | accessKey | Kinesis access key |
    | secretKey | Kinesis secret key |
    | shardIteratorType | Only supports TRIM_HORIZON to consume from earliest. Support for LATEST, AT_SEQUENCE_NUMBER, and AFTER_SEQUENCE_NUMBER is in progress but unsupported at this point. |
    | maxRecordsToFetch | Specifies the maximum number of records to retrieve in a single getRecords API call to Kinesis. This parameter controls the batch size for data retrieval. Can be set between 1 and 10,000 (Kinesis API limit set by AWS). Larger values reduce the number of API calls needed but may increase latency and memory usage per batch. Defaults to the maximum of 10000; only lower this when you have memory constraints. |
    | requests_per_second_limit | Controls the maximum number of getRecords requests per second that the consumer will make to a Kinesis shard. This parameter is crucial for avoiding AWS Kinesis API throttling. Kinesis enforces a hard limit of 5 getRecords requests per second per shard; exceeding this limit results in ProvisionedThroughputExceededException. The default value of 1 is intentionally conservative to prevent throttling in replicated setups where multiple consumers might be reading from the same shard simultaneously. Only increase this if you are experiencing slow consumption rates and do not yet see ProvisionedThroughputExceededException in the logs. |
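To see why the conservative default of 1 for requests_per_second_limit is safe, consider how the 5 requests/second/shard AWS budget is shared among replicated consumers. A hedged sketch of the arithmetic (the helper name and floor-at-1 policy are illustrative, not part of Pinot):

```python
# AWS enforces a hard limit of 5 getRecords requests per second per shard.
AWS_GETRECORDS_LIMIT_PER_SHARD = 5

def safe_requests_per_second(consumers_per_shard):
    # Split the shard's request budget across all consumers reading it
    # (e.g. replicas), never going below 1 request per second.
    return max(1, AWS_GETRECORDS_LIMIT_PER_SHARD // consumers_per_shard)

print(safe_requests_per_second(1))  # 5 -- a single consumer can use the full budget
print(safe_requests_per_second(3))  # 1 -- three replicas must stay conservative
```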

    Kinesis supports authentication using the DefaultCredentialsProviderChainarrow-up-right. The credential provider looks for the credentials in the following order:

    • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)

    • Java System Properties - aws.accessKeyId and aws.secretKey

    • Web Identity Token credentials from the environment or container

    • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI

    • Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable

    • Instance profile credentials delivered through the Amazon EC2 metadata service

    circle-info

    You must provide all read access level permissions for Pinot to work with an AWS Kinesis data stream. See the AWS documentation for details.

    Although you can also specify the accessKey and secretKey in the properties above, we don't recommend this insecure method; use it only for non-production proof-of-concept (POC) setups. You can also supply other AWS fields, such as AWS_SESSION_TOKEN, as environment variables or configs, and they will work.

    hashtag
    Resharding

    In Kinesis, whenever you reshard a stream, it is done via split or merge operations on shards. If you split a shard, the shard closes and creates 2 new children shards. So if you started with shard0, and then split it, it would result in shard1 and shard2. Similarly, if you merge 2 shards, both those will close and create a child shard. So in the same example, if you merge shards 1 and 2, you'll end up with shard3 as the active shard, while shard0, shard1, shard2 will remain closed forever.
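The split/merge lineage described above can be modeled in a few lines (a toy illustration; the shard naming follows the shard0/shard1/... example in the text, not the real Kinesis API):

```python
def split(shards, parent, next_id):
    # Splitting closes the parent and creates two child shards.
    shards[parent] = "CLOSED"
    c1, c2 = f"shard{next_id}", f"shard{next_id + 1}"
    shards[c1] = shards[c2] = "ACTIVE"
    return c1, c2

def merge(shards, p1, p2, next_id):
    # Merging closes both parents and creates one child shard.
    shards[p1] = shards[p2] = "CLOSED"
    child = f"shard{next_id}"
    shards[child] = "ACTIVE"
    return child

shards = {"shard0": "ACTIVE"}
split(shards, "shard0", 1)            # shard0 -> shard1 + shard2
merge(shards, "shard1", "shard2", 3)  # shard1 + shard2 -> shard3
print(shards)  # shard0..shard2 stay CLOSED forever; shard3 is the active shard
```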

    Please check out this recipe for more details: https://dev.startree.ai/docs/pinot/recipes/github-events-stream-kinesis#resharding-kinesis-streamarrow-up-right

    In Pinot, resharding of any stream is detected by the periodic task RealtimeValidationManager (docs), which runs hourly. If you reshard, your new shards will not get detected until:

    1. Ingestion from the parent shards has completed

    2. The RealtimeValidationManager has run after that

    You will see a period where the ideal state will show all segments ONLINE, as parents have naturally completed ingesting, and we're waiting for RealtimeValidationManager to kickstart the ingestion from children.

    If you need the ingestion to happen sooner, you can manually invoke the RealtimeValidationManager: docs

    hashtag
    Limitations

    1. ShardID is of the format "shardId-000000000001". We use the numeric part as the partitionId, and the partitionId variable is an integer. If shardIds grow beyond Integer.MAX_VALUE, we will overflow into the partitionId space.

    2. Segment-size-based thresholds for segment completion will not work. The logic assumes that partition "0" always exists; however, once shard 0 is split or merged, there will no longer be a partition 0.
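The first limitation can be made concrete with the shardId-to-partitionId mapping (a hedged sketch; the parsing helper is illustrative, not Pinot's actual code):

```python
def kinesis_partition_id(shard_id):
    # Pinot uses the numeric suffix of the shardId as its integer partitionId,
    # so "shardId-000000000001" maps to partition 1.
    return int(shard_id.split("-")[1])

JAVA_INTEGER_MAX_VALUE = 2**31 - 1  # values beyond this overflow a Java int

pid = kinesis_partition_id("shardId-000000000001")
print(pid)                            # 1
print(pid <= JAVA_INTEGER_MAX_VALUE)  # True -- no overflow for this shard
```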

    | Property | Description |
    | --- | --- |
    | streamType | This should be set to "kinesis" |
    | stream.kinesis.topic.name | Kinesis stream name |

    Schema fields:
    • primaryKeyColumns

    dedupConfig fields:

    • hashFunction

    • dedupTimeColumn

    • timeColumnName (when used as the default dedup time column)

    Attempting to update these fields will return an error:

    Recommended workaround: Create a new table with the desired configuration and reingest all data.

    Alternative (use with caution): If you must modify these fields without recreating the table, you can use the force=true query parameter on the table config update API. Before doing so, pause consumption and restart all servers. Note that this approach only guarantees consistency for newly ingested keys; existing data may remain inconsistent.

    A dedup table maintains an in-memory map from the primary key to the segment reference, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. In addition, consider the hashFunction config in the dedup config, which can be MD5 or MURMUR3, to store the 128-bit hashcode of the primary key instead. This is useful when your primary key takes more space. Keep in mind that the hash may introduce collisions, though the chance is very low.

  • Monitoring: Set up a dashboard over the metric pinot.server.dedupPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. This is useful for tracking growth, which is proportional to memory usage growth.

  • Capacity planning: Plan capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the number of distinct primary keys in the Kafka throughput per partition and multiply by the per-key space cost to approximate memory usage. A heap dump is also useful to check the memory usage so far on a dedup table instance.
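The capacity-planning advice above amounts to multiplying the number of distinct primary keys by a per-key cost. A back-of-envelope sketch; the 48-byte map overhead is an illustrative assumption, while the 16-byte hashed size follows from the 128-bit MD5/MURMUR3 hashcode:

```python
def dedup_memory_bytes(num_keys, key_bytes, use_hash=False, overhead_bytes=48):
    # With hashFunction MD5 or MURMUR3, the key is stored as a 128-bit
    # (16-byte) hashcode instead of the raw key bytes. The per-entry map
    # overhead (overhead_bytes) is an assumed figure, not a measured one.
    per_key = (16 if use_hash else key_bytes) + overhead_bytes
    return num_keys * per_key

plain = dedup_memory_bytes(10_000_000, key_bytes=64)
hashed = dedup_memory_bytes(10_000_000, key_bytes=64, use_hash=True)
print(plain, hashed)  # hashing shrinks the per-key footprint for large keys
```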

  • sendarrow-up-right
    partitioned replica-group assignment
    upsert table's preloading
    like upsert tables
    Next, you need to change the execution config in the job spec to the following:

    You can check out the sample job spec here.

    Finally, execute the Hadoop job using the command:

    Ensure the environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.

    hashtag
    Data Preprocessing before Segment Creation

    We've seen some requests that data should be massaged (partitioned, sorted, resized) before creating and pushing segments to Pinot.

    The MapReduce job SegmentPreprocessingJob is the best fit for this use case, regardless of whether the input data is in AVRO or ORC format.

    Check the example below to see how to use SegmentPreprocessingJob.

    In Hadoop properties, set the following to enable this job:

    In the table config, specify the operations you'd like the MR job to run in preprocessing.operations, and then specify the exact configs for those operations:

    hashtag
    preprocessing.num.reducers

    Minimum number of reducers. Optional. Used when partitioning is disabled and resizing is enabled. This parameter avoids producing too many small input files for Pinot, which would leave the Pinot server holding too many small segments and running too many threads.

    hashtag
    preprocessing.max.num.records.per.file

    Maximum number of records per reducer. Optional. Unlike preprocessing.num.reducers, this parameter avoids producing too few large input files for Pinot, which would miss the advantage of multi-threading when querying. When not set, each reducer generates one output file. When set (e.g. to M), the original output file is split into multiple files, each containing at most M records. It does not matter whether partitioning is enabled or not.
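The effect of preprocessing.max.num.records.per.file can be sketched as simple arithmetic (the helper name is illustrative):

```python
import math

def num_output_files(records_in_reducer, max_records_per_file):
    # With the limit set to M, a reducer that would have written N records
    # now writes ceil(N / M) files, each holding at most M records.
    return math.ceil(records_in_reducer / max_records_per_file)

print(num_output_files(250, 100))  # 3 -- files of 100, 100, and 50 records
print(num_output_files(100, 100))  # 1 -- no split needed
```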

    For more details on this MR job, refer to this documentarrow-up-right.

    /tmp/pinot-quick-start/batch-job-spec.yml
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: '/tmp/pinot-quick-start/rawdata/'
    includeFileNamePattern: 'glob:**/*.csv'
    outputDirURI: '/tmp/pinot-quick-start/segments/'
    overwriteOutput: true
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: 'transcript'
      schemaURI: 'http://pinot-controller:9000/tables/transcript/schema'
      tableConfigURI: 'http://pinot-controller:9000/tables/transcript'
    pinotClusterSpecs:
      - controllerURI: 'http://pinot-controller:9000'
    bin/pinot-admin.sh LaunchDataIngestionJob \
        -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml
    docker run --rm -ti \
        --network=pinot-demo \
        -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
        --name pinot-data-ingestion-job \
        apachepinot/pinot:${PINOT_VERSION} LaunchDataIngestionJob \
        -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml
    SELECT * FROM transcript
    flights-schema.json
    {
      "schemaName": "flights",
      "enableColumnBasedNullHandling": true,
      "dimensionFieldSpecs": [
        {
          "name": "flightNumber",
          "dataType": "LONG",
          "notNull": true
        },
        {
          "name": "tags",
          "dataType": "STRING",
          "singleValueField": false,
          "defaultNullValue": "null"
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "price",
          "dataType": "DOUBLE",
          "notNull": true,
          "defaultNullValue": 0
        }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "millisSinceEpoch",
          "dataType": "LONG",
          "format": "EPOCH",
          "granularity": "15:MINUTES"
        },
        {
          "name": "hoursSinceEpoch",
          "dataType": "INT",
          "notNull": true,
          "format": "EPOCH|HOURS",
          "granularity": "1:HOURS"
        },
        {
          "name": "dateString",
          "dataType": "STRING",
          "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd",
          "granularity": "1:DAYS"
        }
      ]
    }
    bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec
    
    OR
    
    bin/pinot-admin.sh AddTable -schemaFile flights-schema.json -tableFile flights-table.json -exec
    curl -F schemaName=@flights-schema.json localhost:9000/schemas
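The granularity strings in the dateTimeFieldSpecs above follow a size:unit pattern (e.g. "15:MINUTES", "1:DAYS"). A tiny hedged parser to illustrate the convention (not Pinot's actual parsing code):

```python
def parse_granularity(granularity):
    # "15:MINUTES" -> (15, "MINUTES"); "1:DAYS" -> (1, "DAYS")
    size, unit = granularity.split(":")
    return int(size), unit

print(parse_granularity("15:MINUTES"))  # (15, 'MINUTES')
print(parse_granularity("1:DAYS"))      # (1, 'DAYS')
```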
    {
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "my-avro-topic",
        "stream.kafka.broker.list": "kafka:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
        "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
      }
    }
    {
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "my-json-topic",
        "stream.kafka.broker.list": "kafka:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.confluent.KafkaConfluentSchemaRegistryJsonMessageDecoder",
        "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
      }
    }
    {
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "my-protobuf-topic",
        "stream.kafka.broker.list": "kafka:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.protobuf.KafkaConfluentSchemaRegistryProtoBufMessageDecoder",
        "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
      }
    }
    {
      "streamConfigs": {
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
        "stream.kafka.decoder.prop.schema.registry.rest.url": "https://schema-registry:8082",
        "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "/path/to/truststore.jks",
        "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "changeit",
        "stream.kafka.decoder.prop.schema.registry.ssl.keystore.location": "/path/to/keystore.jks",
        "stream.kafka.decoder.prop.schema.registry.ssl.keystore.password": "changeit"
      }
    }
    SET useMultistageEngine = true;
    SET usePhysicalOptimizer = true;
    SET useLiteMode = true;
    
    EXPLAIN PLAN FOR WITH ordered_events AS (
      SELECT 
        cityName,
        tripAmount,
        ROW_NUMBER() OVER (
          ORDER BY ts DESC
        ) as row_num
      FROM userFactEvents
    ),
    filtered_events AS (
      SELECT 
        *
      FROM ordered_events
      WHERE row_num < 1000
    )
    SELECT 
      cityName,
      SUM(tripAmount) as cityTotal
    FROM filtered_events
    GROUP BY cityName
    PhysicalAggregate(group=[{0}], agg#0=[$SUM0($1)], aggType=[DIRECT])
      PhysicalFilter(condition=[<($3, 1000)])
        PhysicalWindow(window#0=[window(order by [2 DESC] rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])])
          PhysicalExchange(exchangeStrategy=[SINGLETON_EXCHANGE], collation=[[2 DESC]])
            PhysicalSort(fetch=[100000], collation=[[2 DESC]])  <== added by Lite Mode
              PhysicalProject(cityName=[$3], tripAmount=[$7], ts=[$9])
                PhysicalTableScan(table=[[default, userFactEvents]])
    SET useMultistageEngine=true;
    SET usePhysicalOptimizer=true;  -- enables the new Physical MSE Query Optimizer
    SET useLiteMode=true;           -- enables Lite Mode
    SET runInBroker=false;
    SELECT myTable.column1, myTable.column2, myOtherTable.column1, ....
    FROM myTable INNER JOIN myOtherTable
    ON myTable.matching_column = myOtherTable.matching_column;
    SELECT 
      p.userID, t.spending_val
    
    FROM promotion AS p JOIN transaction AS t 
      ON p.userID = t.userID
    
    WHERE
      p.promotion_val > 10
      AND t.transaction_type IN ('CASH', 'CREDIT')  
      AND t.transaction_epoch >= p.promotion_start_epoch
      AND t.transaction_epoch < p.promotion_end_epoch  
    SELECT myTable.column1, myTable.column2, myOtherTable.column1, ....
    FROM myTable LEFT JOIN myOtherTable
    ON myTable.matching_column = myOtherTable.matching_column;
    SELECT table1.column1,table1.column2,table2.column1,....
    FROM table1 
    RIGHT JOIN table2
    ON table1.matching_column = table2.matching_column;
    SELECT table1.column1,table1.column2,table2.column1,....
    FROM table1 
    FULL JOIN table2
    ON table1.matching_column = table2.matching_column;
    SELECT * 
    FROM table1 
    CROSS JOIN table2;
    SELECT myOtherTable.column1
     FROM myOtherTable
     WHERE EXISTS [ join_criteria ]
    SELECT table1.strCol
     FROM  table1
     WHERE table1.intCol IN (select table2.anotherIntCol from table2 where ...)
    SELECT myOtherTable.column1
     FROM myOtherTable
     WHERE NOT EXISTS [ join_criteria ]
    SELECT table1.strCol
     FROM  table1
     WHERE table1.intCol NOT IN (select table2.anotherIntCol from table2 where ...)
    SELECT *
    FROM table1 
    JOIN table2
    [ON (join_condition)]
    
    OR
    
    SELECT column_list 
    FROM table1, table2....
    WHERE table1.column_name =
    table2.column_name; 
    SELECT * FROM table1 ASOF JOIN table2 
    MATCH_CONDITION(table1.col1 <comparison_operator> table2.col1)
    ON table1.col2 = table2.col2;
    SELECT * FROM table1 LEFT ASOF JOIN table2 
    MATCH_CONDITION(table1.col1 <comparison_operator> table2.col1)
    ON table1.col2 = table2.col2;
    "task": {
      "taskTypeConfigsMap": {
        "UpsertCompactionTask": {
          "schedule": "0 */5 * ? * *",
          "bufferTimePeriod": "7d",
          "invalidRecordsThresholdPercent": "30",
          "invalidRecordsThresholdCount": "100000",
          "tableMaxNumTasks": "100",
          "validDocIdsType": "SNAPSHOT"
        }
      }
    }
    {
      "tableName": "kinesisTable",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kinesis",
          "stream.kinesis.topic.name": "<your kinesis stream name>",
          "region": "<your region>",
          "accessKey": "<your access key>",
          "secretKey": "<your secret key>",
          "shardIteratorType": "AFTER_SEQUENCE_NUMBER",
          "stream.kinesis.fetch.timeout.millis": "30000",
          "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
          "realtime.segment.flush.threshold.rows": "1000000",
          "realtime.segment.flush.threshold.time": "6h"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    schemaWithPK.json
    {
        "primaryKeyColumns": ["id"]
    }
    routing
    {
      "routing": {
        "instanceSelectorType": "strictReplicaGroup"
      }
    }
    tableConfigWithDedup.json
    { 
     ...
      "dedupConfig": { 
            "dedupEnabled": true, 
            "hashFunction": "NONE" 
       }, 
     ...
    }
    { 
     ...
      "dedupConfig": { 
            "dedupEnabled": true, 
            "hashFunction": "NONE",
            "dedupTimeColumn": "mtime",
            "metadataTTL": 30000
       }, 
     ...
    }
    { 
     ...
      "dedupConfig": { 
            "dedupEnabled": true, 
            "hashFunction": "NONE",
            "dedupTimeColumn": "mtime",
            "metadataTTL": 30000,
            "enablePreload": true
       }, 
     ...
    }
    Failed to update table '<tableName>': Cannot modify [<field>] as it may lead to data inconsistencies. Please create a new table instead.
    # executionFrameworkSpec: Defines ingestion jobs to be running.
    executionFrameworkSpec:
    
        # name: execution framework name
      name: 'hadoop'
    
      # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
    
      # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
    
      # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
    
      # segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentMetadataPushJobRunner'
    
        # extraConfigs: extra configs for execution framework.
      extraConfigs:
    
        # stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
        stagingDir: your/local/dir/staging
    export PINOT_VERSION=1.4.0 #set to the Pinot version you have installed
    export PINOT_DISTRIBUTION_DIR=${PINOT_ROOT_DIR}/build/
    export HADOOP_CLIENT_OPTS="-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml"
    
    hadoop jar \
            ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
            org.apache.pinot.tools.admin.PinotAdministrator \
            LaunchDataIngestionJob \
            -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpec.yaml
    enable.preprocessing = true
    preprocess.path.to.output = <output_path>
    {
        "OFFLINE": {
            "metadata": {
                "customConfigs": {
                    "preprocessing.operations": "resize, partition, sort", // To enable the following preprocessing operations
                    "preprocessing.max.num.records.per.file": "100",       // To enable resizing
                    "preprocessing.num.reducers": "3"                      // To enable resizing
                }
            },
            ...
            "tableIndexConfig": {
                "aggregateMetrics": false,
                "autoGeneratedInvertedIndex": false,
                "bloomFilterColumns": [],
                "createInvertedIndexDuringSegmentGeneration": false,
                "invertedIndexColumns": [],
                "loadMode": "MMAP",
                "nullHandlingEnabled": false,
                "segmentPartitionConfig": {       // To enable partitioning
                    "columnPartitionMap": {
                        "item": {
                            "functionName": "murmur",
                            "numPartitions": 4
                        }
                    }
                },
                "sortedColumn": [                // To enable sorting
                    "actorId"
                ],
                "streamConfigs": {}
            },
            "tableName": "tableName_OFFLINE",
            "tableType": "OFFLINE",
            "tenants": {
                ...
            }
        }
    }

    hashtag
    Prerequisites

    • Completed First table and schema -- the transcript schema must already exist in the cluster.

    • A running Pinot cluster. See the install guides for Local or Docker.

    • For Docker users: set the PINOT_VERSION environment variable. See the Version reference page.

    hashtag
    Steps

    hashtag
    1. Understand streaming ingestion

    Streaming ingestion lets Pinot consume data from a message queue in real time. As messages arrive in a Kafka topic, Pinot reads them and makes the rows queryable within seconds. The realtime table config specifies the Kafka broker, topic, and decoder so that Pinot knows how to connect and interpret incoming records.
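Concretely, the realtime table config carries a streamConfigs block naming those three pieces. A minimal sketch mirroring this guide's setup (the topic name transcript-topic and broker localhost:9876 are assumptions based on the steps below, not fixed values):

```python
import json

# Illustrative streamConfigs block: broker, topic, and decoder are the three
# pieces Pinot needs to connect to Kafka and interpret incoming records.
stream_configs = {
    "streamType": "kafka",
    "stream.kafka.topic.name": "transcript-topic",  # assumed topic name
    "stream.kafka.broker.list": "localhost:9876",   # Kafka started on port 9876 below
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.decoder.class.name":
        "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
}
print(json.dumps({"streamConfigs": stream_configs}, indent=2))
```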

    hashtag
    2. Start Kafka

    Start Kafka on port 9876 using the same ZooKeeper from the Pinot quick-start:

    bin/pinot-admin.sh StartKafka -zkAddress=localhost:2123/kafka -port 9876

    Kafka 4.0 runs in KRaft mode and does not require ZooKeeper:

    docker run \
        --network pinot-demo --name=kafka
    

    hashtag
    3. Create a Kafka topic

    Download Apache Kafkaarrow-up-right if you have not already, then create the topic:

    hashtag
    4. Save the realtime table config

    Create the file /tmp/pinot-quick-start/transcript-table-realtime.json:

    circle-info

    The Docker version uses kafka:9092 as the broker address because both the Kafka and Pinot containers are on the same pinot-demo Docker network.

    hashtag
    5. Upload the realtime table config

    As soon as the realtime table is created, Pinot begins consuming from the Kafka topic.

    circle-info

    If the transcript schema was already uploaded during First table and schema, you can omit the -schemaFile flag. Including it is safe -- Pinot will skip re-creating an identical schema.

    circle-info

    Replace pinot-controller with the actual container name of your Pinot controller if you used a different name during setup.

    hashtag
    6. Save the sample streaming data

    Create the file /tmp/pinot-quick-start/rawdata/transcript.json:

    hashtag
    7. Push data into the Kafka topic

    hashtag
    Verify

    1. Open the Query Consolearrow-up-right in your browser.

    2. Run the following query:

    3. You should see 12 rows of streaming data. Pinot ingests from Kafka in real time, so the rows appear within seconds of being pushed to the topic.

    hashtag
    Next step

    Continue to First query to learn how to write analytical queries against your Pinot tables.

    Stream ingestion (Kubernetes)

    Docker

    Start a Pinot cluster using Docker containers.

    hashtag
    Outcome

    Start a multi-component Pinot cluster using Docker, suitable for local evaluation and CI environments.

    hashtag
    Prerequisites

    • Docker installed and running

    • Recommended Docker resource settings:

      • CPUs: 8

      • Memory: 16 GB

      • Swap: 4 GB

      • Disk image size: 60 GB

    hashtag
    Steps

    hashtag
    1. Set the image versions

    See the Version reference page for the current stable release.

    hashtag
    2. Pull the Pinot image

    View all available tags on Docker Hub.

    hashtag
    3. Start the cluster

    Create a file called docker-compose.yml with the following content:

    Launch the cluster:

    To also start Kafka for real-time streaming:

    Create a network

    Start ZooKeeper

    Start Pinot Controller

    Start Pinot Broker

    Start Pinot Server

    Start Pinot Minion (optional)

    Start Kafka (optional)

    hashtag
    Verify

    Check that all containers are running:

    You should see containers for ZooKeeper, Controller, Broker, Server, and Minion all in a healthy state. Open the Pinot Query Console at http://localhost:9000 to confirm the cluster is ready.

    hashtag
    Next step

    Your cluster is running. Continue to First table and schema to load data.

    Spark

    Batch ingestion of data into Apache Pinot using Apache Spark.

    Pinot supports Apache Spark 3.x as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Spark code needed to process your files, convert them into segments, and upload them to Pinot.

    To set up Spark, do one of the following:

    • Use the Spark-Pinot Connector. For more information, see the ReadMearrow-up-right.

    • Follow the instructions below.

    You can follow the instructions to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar.

    If you do build Pinot from source, consider opting into the build-shaded-jar profile with -Pbuild-shaded-jar. While Pinot does not bundle Spark into its JAR, it does bundle certain Hadoop libraries.

    Next, you need to change the execution config in the job spec to the following:

    To run Spark ingestion, you need the following jars in your classpath:

    • pinot-batch-ingestion-spark plugin jar - available in plugins-external directory in the package

    • pinot-all jar - available in lib directory in the package

    These jars can be specified using spark.driver.extraClassPath or any other option.

    For loading any other plugins that you want to use, use:

    The complete spark-submit command should look like this:

    Ensure environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.

    Note: You should change the master to yarn and deploy-mode to cluster for production environments.

    circle-info

    The spark-core dependency is not included in Pinot jars since the 0.10.0 release. If you run into runtime issues, make sure your Spark environment provides the dependency, or build Pinot from source with the matching Spark profile.

    hashtag
    Running in Cluster Mode on YARN

    If you want to run the Spark job in cluster mode on a YARN/EMR cluster, do the following:

    • Build Pinot from source with option -DuseProvidedHadoop

    • Copy Pinot binaries to S3, HDFS or any other distributed storage that is accessible from all nodes.

    • Copy the ingestion spec YAML file to S3, HDFS, or any other distributed storage. Mention this path as part of the --files argument in the command.

    • Add --jars options that contain the s3/hdfs paths to all the required plugin and pinot-all jars.

    • Point the classpath to the Spark working directory. Generally, just specifying the jar names without any paths works. The same should be done for the main jar as well as the spec YAML file.

    Example

    hashtag
    FAQ

    Q - I am getting the following exception - Class has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0

    Since the 0.8.0 release, Pinot binaries are compiled with JDK 11. If you are using Spark along with Hadoop 2.7+, you need to use the Java 8 version of Pinot. Currently, you need to build the JDK 8 version from source.

    Q - I am not able to find pinot-batch-ingestion-spark jar.

    Since Pinot 0.10.0, the Spark plugin is located in the plugins-external directory of the binary distribution (in older versions it was in plugins).

    Q - Spark is not able to find the jars leading to java.nio.file.NoSuchFileException

    This means the classpath for the Spark job has not been configured properly. If you are running Spark in a distributed environment such as YARN or k8s, make sure both spark.driver.classpath and spark.executor.classpath are set. Also, the jars in driver.classpath should be added to the --jars argument in spark-submit so that Spark can distribute those jars to all the nodes in your cluster. You also need to provide the appropriate scheme with the file path when running the jar. In this doc, we have used local:// but it can be different depending on your cluster setup.

    Q - Spark job failing while pushing the segments.

    It can be because of a misconfigured controllerURI in the job spec YAML file. If the controllerURI is correct, make sure it is accessible from all the nodes of your YARN or k8s cluster.

    Q - My data gets overwritten during ingestion.

    Set segmentPushType to APPEND in the tableConfig.

    If it is already set to APPEND, this is likely due to a missing timeColumnName in your table config. If you can't provide a time column, use our segment name generation configs in the ingestion spec. Generally, using the inputFile segment name generator should fix your issue.

    Q - I am getting java.lang.RuntimeException: java.io.IOException: Failed to create directory: pinot-plugins-dir-0/plugins/*

    Removing -Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins from spark.driver.extraJavaOptions should fix this. As long as plugins are mentioned in classpath and jars argument it should not be an issue.

    Q - Getting a Class not found exception

    Check if the extraClassPath arguments contain all the plugin jars for both driver and executors, and that all the plugin jars are mentioned in the --jars argument. If both of these are correct, check that extraClassPath contains local filesystem classpaths and not s3, hdfs, or any other distributed filesystem classpaths.

    Filtering with IdSet

    Learn how to write fast queries for looking up IDs in a list of values.

    circle-info

    Filtering with IdSet is only supported with the single-stage query engine (v1).

    A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but using IN doesn't perform well with large lists of IDs. For large lists of IDs, we recommend using an IdSet.

    hashtag
    Functions

    hashtag
    ID_SET

    ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )

    This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:

    • INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.

    • LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.

    • Other types - Bloom Filter

    The following parameters are used to configure the Bloom Filter:

    • expectedInsertions - Number of expected insertions for the BloomFilter, must be positive

    • fpp - False positive probability to use for the BloomFilter. Must be positive and less than 1.0.

    Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
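To get a feel for what these parameters control, the standard Bloom filter sizing formulas can be applied to the defaults shown above. This is a sketch of the textbook math, not Pinot's internal implementation:

```python
import math

def bloom_filter_size_bits(expected_insertions: int, fpp: float) -> int:
    """Textbook Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return math.ceil(-expected_insertions * math.log(fpp) / (math.log(2) ** 2))

def optimal_num_hashes(expected_insertions: int, size_bits: int) -> int:
    """Optimal hash-function count: k = (m / n) * ln 2."""
    return max(1, round(size_bits / expected_insertions * math.log(2)))

# The ID_SET defaults above: 5,000,000 expected insertions at fpp=0.03.
bits = bloom_filter_size_bits(5_000_000, 0.03)
print(bits // 8, "bytes,", optimal_num_hashes(5_000_000, bits), "hash functions")
```

By this estimate the default filter needs roughly 4.5 MB, which fits under the default sizeThresholdInBytes of 8388608 (8 MB).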

    hashtag
    IN_ID_SET

    IN_ID_SET(columnName, base64EncodedIdSet)

    This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.

    hashtag
    IN_SUBQUERY

    IN_SUBQUERY(columnName, subQuery)

    This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.

    hashtag
    IN_PARTITIONED_SUBQUERY

    IN_PARTITIONED_SUBQUERY(columnName, subQuery)

    This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.

    This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
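A toy model (hypothetical data, not Pinot internals) shows why the partitioned variant helps: each server only needs to encode the ids in its own partitions, so each per-server IdSet is smaller, yet together they cover the full id space.

```python
# Toy model: 1,000 ids spread across 2 servers, partitioned by id % 2.
ids = set(range(1000))
servers = {
    0: {i for i in ids if i % 2 == 0},  # server 0 holds partition 0
    1: {i for i in ids if i % 2 == 1},  # server 1 holds partition 1
}

# IN_SUBQUERY-style: a single IdSet over every matching id.
global_idset = set(ids)

# IN_PARTITIONED_SUBQUERY-style: each server builds an IdSet over its own
# rows only, so each encoded set is half the size of the global one.
per_server_idsets = {s: set(rows) for s, rows in servers.items()}

print(len(global_idset), sorted(len(v) for v in per_server_idsets.values()))
```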

    circle-info

    The query passed to IN_SUBQUERY can be run on any table - they aren't restricted to the table used in the parent query.

    The query passed to IN_PARTITIONED_SUBQUERY must be run on the same table as the parent query.

    hashtag
    Examples

    hashtag
    Create IdSet

    You can create an IdSet of the values in the yearID column by running the following:

    idset(yearID)

    When creating an IdSet for values in non-INT/LONG columns, we can configure the expectedInsertions:

    idset(playerName)
    idset(playerName)

    We can also configure the fpp parameter:

    idset(playerName)

    hashtag
    Filter by values in IdSet

    We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:

    hashtag
    Filter by values not in IdSet

    To return rows for yearIDs not in the IdSet, run the following:

    hashtag
    Filter on broker

    To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:

    To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:

    hashtag
    Filter on server

    To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:

    To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:

    Physical Optimizer

    Describes the new Multistage Engine Physical Query Optimizer

    circle-exclamation

    The Physical Optimizer is an optional query optimizer for the multi-stage engine, included in Pinot 1.4 and currently in Beta. This Beta label applies to the Physical Optimizer specifically, not to the core multi-stage engine, which is generally available.

    We have added a new query optimizer in the Multistage Engine that computes and tracks precise Data Distribution across the entire plan before running some critical optimizations like Sort Pushdown, Aggregate Split/Pushdown, etc.

    One of the biggest features of this Optimizer is that it can eliminate Shuffles or simplify Exchanges, when applicable, for arbitrarily complex queries, without requiring any Query Hints.

    To enable this Optimizer for your MSE query, you can use the following Query Options:

    hashtag
    Key Features

    The examples below are based on the COLOCATED_JOIN Quickstart.

    hashtag
    Automatic Colocated Joins and Shuffle Simplification

    Consider the query below which consists of 3 Joins. With the new query optimizer, the entire query can run without any cross-server data exchange, since the data is partitioned by userUUID into a compatible number of partitions (see the "Setting Up Table Data Distribution" section below).

    The query plan for this query is shown below. You can see that the entire query leverages IDENTITY_EXCHANGE, which is a 1:1 Exchange as defined in Exchange Types below.

    hashtag
    Shuffle Simplification with Different Servers / Partition Count

    The new optimizer can simplify shuffles even if:

    • The Servers used by either side of a Join are different

    • The Partition Count for the join inputs are different

    In the example below, we have a Join performed across two tables: orange (left) and green (right).

    The orange table has 4 partitions and the green table has 2 partitions. The servers selected for the Orange and Green tables are [S0, S1] and [S0, S2] respectively. The Join is performed in the servers [S0, S1], because Physical Optimizer by default uses the same Workers as the leftmost input operator.

    If the hash-function used for partitioning the two tables is the same, we can leverage an Identity Exchange and skip re-partitioning the data on either side of the join. This is because S0 will consist of records from partitions P0 and P2 of the Orange table, which together contain all records that would make up partition P0 modulo 2, i.e. (P0 ∪ P2) mod 4 = (P0) mod 2.

    Note that Identity Exchange does not imply that the servers in the sender and receiver will be the same. It only implies that there will be a 1:1 mapping from senders to receivers. In the example below, the data transfer from S2 to S1 will be over the network.
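The partition-compatibility argument can be checked directly: for any hash function, a key falls in partition 0 or 2 of a 4-way split exactly when it falls in partition 0 of a 2-way split. A quick sketch, where Python's built-in hash stands in for the table's configured partition function (Pinot would use e.g. Murmur):

```python
# Illustrative only: hash() stands in for the table's partition function.
keys = [f"user-{i}" for i in range(10_000)]

# Orange table: 4 partitions; green table: 2 partitions.
four_way = {k: hash(k) % 4 for k in keys}
two_way = {k: hash(k) % 2 for k in keys}

# Keys in partitions 0 and 2 of the 4-way split...
lhs = {k for k, p in four_way.items() if p in (0, 2)}
# ...are exactly the keys in partition 0 of the 2-way split,
# since h % 4 in {0, 2} holds if and only if h % 2 == 0.
rhs = {k for k, p in two_way.items() if p == 0}

print(lhs == rhs)
```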


    hashtag
    Automatically Skip Aggregate Exchange

    To evaluate something like GROUP BY userUUID accurately you would need to distribute records based on the userUUID column. The old query optimizer would add a Partitioning Exchange under each Aggregate, unless one used the query hint is_partitioned_by_group_by_keys.

    The Physical Optimizer can detect when data is already partitioned by the required column, and will automatically skip adding an Exchange. This has two advantages:

    • We avoid unnecessary Data Exchanges

    • We avoid splitting the Aggregate, since by default when an Aggregate exists on top of an Exchange, a copy of the Aggregate is added under the Exchange (unless is_skip_leaf_stage_group_by query hint is set)

    This optimization can be seen in action in the query example shared above. Since data is already partitioned by userUUID, all aggregations are run in DIRECT mode, i.e. without splitting the aggregate into multiple aggregates.

    hashtag
    Segment / Server Pruning

    Similar to the Single Stage Engine, if you have enabled segmentPrunerTypes in your table's Routing config, the Physical Optimizer will prune segments and servers using time, partition or other pruner types for the Leaf Stage. e.g. the following query will only select segments which satisfy the following constraint:

    If partitioning is done in a way that segments corresponding to a given partition are present on only 1 server, then the entire query above will run within a single server, simulating shard-local execution from other systems.
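A toy model of partition-based pruning (illustrative only; the hash function and segment layout below are made up, not Pinot's implementation): each segment is tagged with the partition of its rows, and an equality filter on the partition column eliminates every segment whose partition cannot match.

```python
# Toy model of partition-based segment pruning.
NUM_PARTITIONS = 4

def partition_of(key: str) -> int:
    # Stand-in for the table's configured hash function (e.g. Murmur).
    return sum(key.encode()) % NUM_PARTITIONS

# Each segment records which partition its rows belong to.
segments = [{"name": f"seg_{i}", "partition": i % NUM_PARTITIONS}
            for i in range(12)]

# For WHERE userUUID = 'user-1', only segments whose partition matches
# the key's partition can contain matching rows; the rest are pruned.
target = partition_of("user-1")
selected = [s["name"] for s in segments if s["partition"] == target]
print(target, selected)
```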

    hashtag
    Solve Constant Queries in Pinot Broker

    Apache Calcite is capable of detecting Filter Expressions that will always evaluate to False. In such cases, the query plan may not have any Table Scans at all. Physical Optimizer solves such queries within the Broker itself, without involving any servers.

    hashtag
    Worker Assignment

    At present, Worker Assignment follows these simple rules:

    • Leaf Stage will have workers assigned based on Table Scan and Filters, using the Routing configs set in the Table Config.

    • Other Stages will use the same workers as the left-most input stage.

    • Some Plan Nodes, such as Sort(fetch=..), may require data to be collected in a single Worker. In such a case, that stage will be run on a single Worker, which will be randomly selected from one of the input workers.

    hashtag
    Limitations

    Some of the features of the existing MSE query optimizer are not yet available in the Physical Optimizer. We aim to add support for most of these in Pinot 1.5:

    • Spools.

    • Dynamic filters for semi-join

    Stream Ingestion with CLP

    Support for encoding fields with CLP during ingestion.

    circle-exclamation

    This is an experimental feature. Configuration options and usage may change frequently until it is stabilized.

    When performing stream ingestion of JSON records using Kafka, users can encode specific fields with CLP by using a CLP-specific StreamMessageDecoder.

    CLP is a compressor designed to encode unstructured log messages in a way that makes them more compressible while retaining the ability to search them. It does this by decomposing the message into three fields:

    Amazon S3

    This guide shows you how to import data from files stored in Amazon S3.

    Enable the file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:

    hashtag
    S3A URI scheme support

    Starting in Pinot 1.3.0, the pinot-s3 plugin supports both the s3:// and s3a:// URI schemes.

    Querying Pinot

    A practical entry point for querying Pinot.

    Pinot queries run through the broker and are written in SQL. This page is the wayfinding layer for people who want to query data, understand which engine to use, and know where to look when a query needs tuning.

    hashtag
    How to start

    1. Write the query in Pinot SQL.
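Once written, a query can be submitted to a broker over HTTP as a JSON body. A minimal standard-library sketch; the broker address and the /query/sql endpoint follow the quick-start defaults (substitute your own broker host and port), and nothing is sent until urlopen is called:

```python
import json
import urllib.request

# Assumed quick-start default; substitute your broker address.
BROKER = "http://localhost:8099"

def make_query_request(sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a Pinot SQL query."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{BROKER}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = make_query_request("SELECT COUNT(*) FROM transcript")
print(req.full_url, req.get_method())
# urllib.request.urlopen(req) would execute the query against a running broker.
```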

    Pinot Data Explorer

    Pinot Data Explorer is a user-friendly interface in Apache Pinot for interactive data exploration, querying, and visualization.

    Once you have set up a cluster, you can start exploring the data and the APIs using the Pinot Data Explorer.

    Navigate to http://localhost:9000 in your browser to open the Data Explorer UI.

    hashtag
    Cluster Manager

    The first screen that you'll see when you open the Pinot Data Explorer is the Cluster Manager. The Cluster Manager provides a UI to operate and manage your cluster, giving you an overview of tenants, instances, tables, and their current status.

    bin/kafka-topics.sh --create --bootstrap-server localhost:9876 \
        --replication-factor 1 --partitions 1 --topic transcript-topic
    docker exec \
      -t kafka \
      /opt/kafka/bin/kafka-topics.sh \
      --bootstrap-server kafka:9092 \
      --partitions=1 --replication-factor=1 \
      --create --topic transcript-topic
    /tmp/pinot-quick-start/transcript-table-realtime.json
    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestampInEpoch",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.broker.list": "localhost:9876",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": { "customConfigs": {} }
    }
    /tmp/pinot-quick-start/transcript-table-realtime.json
    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestampInEpoch",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.broker.list": "kafka:9092",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": { "customConfigs": {} }
    }
    bin/pinot-admin.sh AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
        -exec
    docker run --rm -ti \
        --network=pinot-demo \
        -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
        --name pinot-streaming-table-creation \
        apachepinot/pinot:${PINOT_VERSION} AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
        -controllerHost pinot-controller \
        -controllerPort 9000 \
        -exec
    bin/kafka-console-producer.sh \
        --bootstrap-server localhost:9876 \
        --topic transcript-topic < /tmp/pinot-quick-start/rawdata/transcript.json
    docker exec -t kafka /opt/kafka/bin/kafka-console-producer.sh \
        --bootstrap-server localhost:9092 \
        --topic transcript-topic < /tmp/pinot-quick-start/rawdata/transcript.json
    /tmp/pinot-quick-start/rawdata/transcript.json
    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestampInEpoch":1571900400000}
    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestampInEpoch":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestampInEpoch":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestampInEpoch":1572418800000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestampInEpoch":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestampInEpoch":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestampInEpoch":1572678000000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestampInEpoch":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestampInEpoch":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestampInEpoch":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestampInEpoch":1572854400000}
    {"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestampInEpoch":1572854400000}
    SELECT * FROM transcript


    ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc=

    AwIBBQAAAAL/////////////////////

    AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8=

    AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8=


    # executionFrameworkSpec: Defines ingestion jobs to be running.
    executionFrameworkSpec:
    
      # name: execution framework name
      name: 'spark'
    
      # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
    
      # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
    
      # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
    
      #segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface
      segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
    
      # extraConfigs: extra configs for execution framework.
      extraConfigs:
    
        # stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
        stagingDir: your/local/dir/staging
    spark.driver.extraClassPath =>
    pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
    spark.driver.extraJavaOptions =>
    -Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins
    export PINOT_VERSION=1.4.0 #set to the Pinot version you have installed
    export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin
    
    spark-submit \
    --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
    --master local --deploy-mode client \
    --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
    --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
    --conf "spark.executor.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
    local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar -jobSpecFile /path/to/spark_job_spec.yaml
    spark-submit \
    --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
    --master yarn --deploy-mode cluster \
    --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
    --conf "spark.driver.extraClassPath=pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar:pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
    --conf "spark.executor.extraClassPath=pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar:pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
    --jars "${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar,${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
    --files s3://path/to/spark_job_spec.yaml \
    local://pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar -jobSpecFile spark_job_spec.yaml
    SELECT ID_SET(yearID)
    FROM baseballStats
    WHERE teamID = 'WS1'
    SELECT ID_SET(playerName, 'expectedInsertions=10')
    FROM baseballStats
    WHERE teamID = 'WS1'
    SELECT ID_SET(playerName, 'expectedInsertions=100')
    FROM baseballStats
    WHERE teamID = 'WS1'
    SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
    FROM baseballStats
    WHERE teamID = 'WS1'
    SELECT yearID, count(*) 
    FROM baseballStats 
    WHERE IN_ID_SET(
     yearID,   
     'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
      ) = 1 
    GROUP BY yearID
    SELECT yearID, count(*) 
    FROM baseballStats 
    WHERE IN_ID_SET(
      yearID,   
      'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
      ) = 0 
    GROUP BY yearID
    SELECT yearID, count(*) 
    FROM baseballStats 
    WHERE IN_SUBQUERY(
      yearID, 
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 1
    GROUP BY yearID  
    SELECT yearID, count(*) 
    FROM baseballStats 
    WHERE IN_SUBQUERY(
      yearID, 
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 0
    GROUP BY yearID  
    SELECT yearID, count(*) 
    FROM baseballStats 
    WHERE IN_PARTITIONED_SUBQUERY(
      yearID, 
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 1
    GROUP BY yearID  
    SELECT yearID, count(*) 
    FROM baseballStats 
    WHERE IN_PARTITIONED_SUBQUERY(
      yearID, 
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 0
    GROUP BY yearID  
    SET useMultistageEngine=true;
    SET usePhysicalOptimizer=true;
    SET useMultistageEngine = true;
    SET usePhysicalOptimizer = true;
    
    WITH filtered_users AS (
      SELECT 
        userUUID
      FROM userAttributes
      WHERE userUUID NOT IN (
        SELECT 
          userUUID
        FROM userGroups
          WHERE groupUUID = 'group-1'
      )
      AND userUUID IN (
        SELECT
          userUUID
        FROM userGroups
          WHERE groupUUID = 'group-2'
      )
    )
    SELECT 
      userUUID,
      SUM(tripAmount)
    FROM userFactEvents
    WHERE
      userUUID IN (
        SELECT userUUID FROM filtered_users
      )
    GROUP BY userUUID
    PhysicalExchange(exchangeStrategy=[SINGLETON_EXCHANGE])
      PhysicalAggregate(group=[{1}], agg#0=[$SUM0($0)], aggType=[DIRECT])
        PhysicalJoin(condition=[=($1, $2)], joinType=[semi])
          PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
            PhysicalProject(tripAmount=[$7], userUUID=[$10])
              PhysicalTableScan(table=[[default, userFactEvents]])
          PhysicalJoin(condition=[=($0, $1)], joinType=[semi])
            PhysicalProject(userUUID=[$0])
              PhysicalFilter(condition=[IS NOT TRUE($3)])
                PhysicalJoin(condition=[=($1, $2)], joinType=[left])
                  PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
                    PhysicalProject(userUUID=[$6], userUUID0=[$6])
                      PhysicalTableScan(table=[[default, userAttributes]])
                  PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
                    PhysicalAggregate(group=[{0}], agg#0=[MIN($1)], aggType=[DIRECT])
                      PhysicalProject(userUUID=[$4], $f1=[true])
                        PhysicalFilter(condition=[=($3, _UTF-8'group-1')])
                          PhysicalTableScan(table=[[default, userGroups]])
            PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
              PhysicalProject(userUUID=[$4])
                PhysicalFilter(condition=[=($3, _UTF-8'group-2')])
                  PhysicalTableScan(table=[[default, userGroups]])
    segmentPartition = Murmur("user-1") % numPartitions
    SET useMultistageEngine = true;
    SET usePhysicalOptimizer = true;
    
    WITH user_events AS (
      SELECT
        productCode, tripAmount
      FROM
        userFactEvents
      WHERE
        userUUID = 'user-1'
      ORDER BY
        ts
      DESC
      LIMIT 100
    )
    SELECT
      productCode,
      SUM(tripAmount)
    FROM
      user_events
    GROUP BY productCode
        
    SET useMultistageEngine = true;
    SET usePhysicalOptimizer = true;
    
    SELECT
      COUNT(*)
    FROM
      userFactEvents
    WHERE
      userUUID = 'user-1' AND userUUID = 'user-2'
    export PINOT_VERSION=1.4.0
    export PINOT_IMAGE=apachepinot/pinot:${PINOT_VERSION}
    export ZK_IMAGE=zookeeper:3.9.5
    export KAFKA_IMAGE=apache/kafka:4.0.0
    docker pull apachepinot/pinot:${PINOT_VERSION}
    docker-compose.yml
    version: '3.7'
    
    services:
      pinot-zookeeper:
        image: ${ZK_IMAGE:-zookeeper:3.9.5}
        container_name: "pinot-zookeeper"
        restart: unless-stopped
        ports:
          - "2181:2181"
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181
          ZOOKEEPER_TICK_TIME: 2000
        networks:
          - pinot-demo
        healthcheck:
          test: ["CMD", "zkServer.sh", "status"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 10s
    
      pinot-controller:
        image: ${PINOT_IMAGE:-apachepinot/pinot:1.4.0}
        command: "StartController -zkAddress pinot-zookeeper:2181"
        container_name: "pinot-controller"
        restart: unless-stopped
        ports:
          - "9000:9000"
        environment:
          JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log"
        depends_on:
          pinot-zookeeper:
            condition: service_healthy
        networks:
          - pinot-demo
        healthcheck:
          test: ["CMD-SHELL", "curl -f http://localhost:9000/health || exit 1"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 10s
    
      pinot-broker:
        image: ${PINOT_IMAGE:-apachepinot/pinot:1.4.0}
        command: "StartBroker -zkAddress pinot-zookeeper:2181"
        container_name: "pinot-broker"
        restart: unless-stopped
        ports:
          - "8099:8099"
        environment:
          JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log"
        depends_on:
          pinot-controller:
            condition: service_healthy
        networks:
          - pinot-demo
        healthcheck:
          test: ["CMD-SHELL", "curl -f http://localhost:8099/health || exit 1"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 10s
    
      pinot-server:
        image: ${PINOT_IMAGE:-apachepinot/pinot:1.4.0}
        command: "StartServer -zkAddress pinot-zookeeper:2181"
        container_name: "pinot-server"
        restart: unless-stopped
        ports:
          - "8098:8098"
        environment:
          JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log"
        depends_on:
          pinot-broker:
            condition: service_healthy
        networks:
          - pinot-demo
        healthcheck:
          test: ["CMD-SHELL", "curl -f http://localhost:8097/health/readiness || exit 1"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 10s
    
      pinot-minion:
        image: ${PINOT_IMAGE:-apachepinot/pinot:1.4.0}
        command: "StartMinion -zkAddress pinot-zookeeper:2181"
        restart: unless-stopped
        container_name: "pinot-minion"
        ports:
          - "6000:6000"
        depends_on:
          - pinot-broker
        networks:
          - pinot-demo
    
      pinot-kafka:
        image: ${KAFKA_IMAGE:-apache/kafka:4.0.0}
        container_name: "kafka"
        restart: unless-stopped
        ports:
          - "9092:9092"
        environment:
          KAFKA_NODE_ID: 1
          KAFKA_PROCESS_ROLES: broker,controller
          KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
          KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
          KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
          KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
          CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
        networks:
          - pinot-demo
        healthcheck:
          test: ["CMD-SHELL", "/opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server kafka:9092"]
          interval: 30s
          timeout: 10s
          retries: 5
          start_period: 10s
        deploy:
          replicas: ${KAFKA_REPLICAS:-0}
    
    networks:
      pinot-demo:
        name: pinot-demo
        driver: bridge
    docker compose --project-name pinot-demo up
    export KAFKA_REPLICAS=1
    docker compose --project-name pinot-demo up
    docker network create -d bridge pinot-demo
    docker run \
        --network=pinot-demo \
        --name pinot-zookeeper \
        --restart always \
        -p 2181:2181 \
        -d ${ZK_IMAGE}
    docker run --rm -ti \
        --network=pinot-demo \
        --name pinot-controller \
        -p 9000:9000 \
        -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log" \
        -d ${PINOT_IMAGE} StartController \
        -zkAddress pinot-zookeeper:2181
    docker container ls -a
    docker run --rm -ti \
        --network=pinot-demo \
        --name pinot-broker \
        -p 8099:8099 \
        -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log" \
        -d ${PINOT_IMAGE} StartBroker \
        -zkAddress pinot-zookeeper:2181
    docker run --rm -ti \
        --network=pinot-demo \
        --name pinot-server \
        -p 8098:8098 \
        -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log" \
        -d ${PINOT_IMAGE} StartServer \
        -zkAddress pinot-zookeeper:2181
    docker run --rm -ti \
        --network=pinot-demo \
        --name pinot-minion \
        -p 6000:6000 \
        -d ${PINOT_IMAGE} StartMinion \
        -zkAddress pinot-zookeeper:2181
    docker run --rm -ti \
        --network pinot-demo --name=kafka \
        -e KAFKA_NODE_ID=1 \
        -e KAFKA_PROCESS_ROLES=broker,controller \
        -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093 \
        -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092 \
        -e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
        -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT \
        -e KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafka:9093 \
        -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
        -e CLUSTER_ID=MkU3OEVBNTcwNTJENDM2Qk \
        -p 9092:9092 \
        -d ${KAFKA_IMAGE}

• the message's static text, called a log type;

  • repetitive variable values, called dictionary variables; and

  • non-repetitive variable values (called encoded variables since we encode them specially if possible).

  • Searches are similarly decomposed into queries on the individual fields.

    circle-info

    Although CLP is designed for log messages, other unstructured text like file paths may also benefit from its encoding.

    For example, consider this JSON record:

    If the user specifies the fields message and logPath should be encoded with CLP, then the StreamMessageDecoder will output:

In the fields with the _logtype suffix, \x11 is a placeholder for an integer variable, \x12 is a placeholder for a dictionary variable, and \x13 is a placeholder for a float variable. In message_encodedVars, the float variable 0.335 is encoded as an integer using CLP's custom encoding.
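To make the placeholder scheme concrete, here is a deliberately simplified sketch of the decomposition idea. The whitespace tokenization and classification rules below are illustrative assumptions; real CLP uses a far more sophisticated tokenizer and variable encoding.

```python
import re

INT_RE = re.compile(r"^-?\d+$")
FLOAT_RE = re.compile(r"^-?\d+\.\d+$")

def clp_decompose(message: str):
    # Simplified, whitespace-based tokenization (an assumption for
    # illustration; real CLP splits on many more delimiters).
    logtype_parts, dictionary_vars, encoded_vars = [], [], []
    for token in message.split(" "):
        if INT_RE.match(token):
            encoded_vars.append(token)
            logtype_parts.append("\x11")  # integer variable placeholder
        elif FLOAT_RE.match(token):
            encoded_vars.append(token)
            logtype_parts.append("\x13")  # float variable placeholder
        elif any(c.isdigit() for c in token):
            dictionary_vars.append(token)
            logtype_parts.append("\x12")  # dictionary variable placeholder
        else:
            logtype_parts.append(token)  # static text stays in the log type
    return " ".join(logtype_parts), dictionary_vars, encoded_vars
```

For a message like "task task_12 took 0.335 seconds", the log type keeps the static text while the repetitive token task_12 becomes a dictionary variable and 0.335 becomes an encoded (float) variable.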

    All remaining fields are processed in the same way as they are in org.apache.pinot.plugin.inputformat.json.JSONRecordExtractor. Specifically, fields in the table's schema are extracted from each record and any remaining fields are dropped.

    hashtag
    Configuration

    hashtag
    Table Index

    Assuming the user wants to encode message and logPath as in the example, they should change/add the following settings to their tableIndexConfig (we omit irrelevant settings for brevity):

    • stream.kafka.decoder.prop.fieldsForClpEncoding is a comma-separated list of names for fields that should be encoded with CLP.

    • We use variable-length dictionaries for the logtype and dictionary variables since their length can vary significantly.
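Putting those notes together, a tableIndexConfig fragment might look roughly like the following sketch. This is a hedged illustration: only CLP-related settings are shown, and the exact column suffixes and nesting should be verified against your Pinot version.

```json
{
  "tableIndexConfig": {
    "streamConfigs": {
      "stream.kafka.decoder.prop.fieldsForClpEncoding": "message,logPath"
    },
    "varLengthDictionaryColumns": [
      "message_logtype",
      "message_dictionaryVars",
      "logPath_logtype",
      "logPath_dictionaryVars"
    ]
  }
}
```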

    hashtag
    Schema

    For the table's schema, users should configure the CLP-encoded fields as follows (we omit irrelevant settings for brevity):

    • We use the maximum possible length for the logtype and dictionary variable columns.

    • The dictionary and encoded variable columns are multi-valued columns.

    hashtag
    Searching and decoding CLP-encoded fields

    To decode CLP-encoded fields, use CLPDECODE.

To search CLP-encoded fields, you can combine CLPDECODE with LIKE. Note that this may decrease performance when querying a large number of rows.

    We are working to integrate efficient searches on CLP-encoded columns as another UDF. The development of this feature is being tracked in this design docarrow-up-right.

    hashtag
    CLP Forward Index V2

    Starting in Pinot 1.3.0, the CLP forward index was upgraded to V2 (CLPMutableForwardIndexV2), which is now the default for CLP-encoded columns during real-time ingestion. Key improvements include:

    hashtag
    Dynamic encoding with cardinality monitoring

    V2 monitors dictionary cardinality during ingestion and dynamically switches encoding modes:

    • CLP dictionary encoding: Used when log type and dictionary variable cardinality remains below a configurable threshold relative to the document count.

    • Raw string fallback: When cardinality exceeds the threshold (docs/cardinality ratio drops below 10), V2 automatically falls back to a raw string forward index to avoid the memory and I/O overhead of maintaining a large dictionary.
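The fallback decision can be sketched as follows. This is a simplified illustration of the heuristic described above (the function name and threshold handling are assumptions), not Pinot's actual implementation:

```python
def clp_encoding_mode(num_docs: int, cardinality: int, min_ratio: float = 10.0) -> str:
    # Stay on CLP dictionary encoding while the docs/cardinality ratio is
    # at or above min_ratio; otherwise fall back to a raw string index.
    if cardinality == 0:
        return "clp-dictionary"  # nothing ingested yet
    return "clp-dictionary" if num_docs / cardinality >= min_ratio else "raw-string"
```

For example, 10,000 documents with 100 distinct log types (ratio 100) stay dictionary-encoded, while 1,000 documents with 500 distinct log types (ratio 2) trigger the raw string fallback.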

    hashtag
    Improved compression

    V2 uses fixed-byte encoding with Zstandard chunk compression instead of V1's uncompressed fixed-bit encoding. This significantly improves compression ratios for most real-world log data.

    hashtag
    Compression codec options

    You can select the compression codec for CLP-encoded columns using the compressionCodec in fieldConfig:

    Codec
    Description

    CLPV2

    CLP V2 with default ZStandard compression

    CLPV2_ZSTD

    CLP V2 with explicit ZStandard compression

    CLPV2_LZ4

    CLP V2 with LZ4 compression

CLP

Legacy CLP forward index (V1, the default before Pinot 1.3.0)

    Example field config:

    hashtag
    Immutable CLP Forward Index

    When mutable (real-time) segments are converted to immutable segments, V2 directly copies the mutable dictionary and index data without re-encoding, eliminating the serialization/deserialization overhead present in V1. The resulting immutable forward index is memory-mapped for efficient random access during queries.

Pinot supports both the s3:// and s3a:// URI schemes. Both schemes use the same underlying AWS SDK v2 client and identical configuration — the only difference is the URI prefix. This allows Pinot to integrate with Hadoop-based ecosystems and tools that standardize on the s3a:// scheme.

    To use the s3a:// scheme, specify it in your deep store paths and file system configuration:
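For example, a controller-side configuration might look like the following sketch. Bucket name, region, and paths are placeholders, and the factory class registration is assumed to mirror the standard s3 scheme setup:

```properties
# Register the S3 filesystem implementation for the s3a:// scheme (controller shown)
pinot.controller.storage.factory.class.s3a=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3a.region=us-west-2

# Use an s3a:// URI for the deep store path
controller.data.dir=s3a://my-bucket/pinot/controller-data
```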

    All configuration properties documented below work identically for both the s3 and s3a schemes.

    circle-info

By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you instead specify -Dplugins.include, you need to list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-3.0...

    You can configure the S3 file system using the following options:

    Configuration
    Description

    region

    The AWS Data center region in which the bucket is located

    accessKey

(Optional) AWS access key required for authentication. Use this only for testing purposes, since the key is not stored securely.

    secretKey

(Optional) AWS secret key required for authentication. Use this only for testing purposes, since the key is not stored securely.

Each of these properties should be prefixed by pinot.[node].storage.factory.s3., where node is either controller or server depending on which component you are configuring.

    e.g.
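A minimal sketch for the controller node, with an illustrative region value:

```properties
pinot.controller.storage.factory.s3.region=us-west-2
```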

    S3 Filesystem supports authentication using the DefaultCredentialsProviderChainarrow-up-right. The credential provider looks for the credentials in the following order -

    • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)

    • Java System Properties - aws.accessKeyId and aws.secretKey

    • Web Identity Token credentials from the environment or container

    • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI

• Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access it

    • Instance profile credentials delivered through the Amazon EC2 metadata service

    You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.

    hashtag
    Checksum validation

    circle-info

    Checksum configuration is available starting in Pinot 1.4.

    Starting with AWS SDK 2.30.0, the S3 client enables request and response checksum validation by default. Pinot exposes configuration properties to control this behavior.

    hashtag
    Request and response checksums

    By default, Pinot sets both requestChecksumCalculation and responseChecksumValidation to WHEN_SUPPORTED, which means the S3 client calculates checksums on uploads and validates them on downloads whenever the API supports it. This provides data integrity verification for segment files stored in your deep store.

    If you want to disable automatic checksums and only use them when the S3 API strictly requires it, set both properties to WHEN_REQUIRED:
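Following the pinot.[node].storage.factory.s3. convention described above, a server-side sketch would be (controller and minion are analogous):

```properties
pinot.server.storage.factory.s3.requestChecksumCalculation=WHEN_REQUIRED
pinot.server.storage.factory.s3.responseChecksumValidation=WHEN_REQUIRED
```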

    Value
    Behavior

    WHEN_SUPPORTED

    Calculate/validate checksums whenever the API supports it (default)

    WHEN_REQUIRED

    Only calculate/validate checksums when the API requires it

    hashtag
    LegacyMd5Plugin for S3-compatible stores

    Some S3-compatible object stores (e.g. MinIO, Ceph, or older AWS configurations) require the legacy Content-MD5 header on requests. After the AWS SDK 2.30.0 upgrade, these stores may return errors like:

    To restore the pre-2.30.0 MD5 checksum behavior, enable the useLegacyMd5Plugin option:
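For example, on the controller (a sketch following the same configuration convention):

```properties
pinot.controller.storage.factory.s3.useLegacyMd5Plugin=true
```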

    This adds the LegacyMd5Plugin to the S3 client, which sends the Content-MD5 header that these stores expect.

    circle-exclamation

    Only enable useLegacyMd5Plugin if your S3-compatible store requires the legacy MD5 header. For standard AWS S3, the default checksum behavior is recommended.

    hashtag
    Examples

    hashtag
    Job spec

    hashtag
    Controller config

    hashtag
    Server config

    hashtag
    Minion config


• Decide whether the single-stage engine is enough or whether you need multi-stage features such as joins and subqueries.

  • Use query options to control runtime behavior.

  • Inspect the plan or result shape when you need to debug performance.

hashtag
    What matters most

    Pinot SQL uses the Apache Calcite parser with the MYSQL_ANSI dialect. In practice, that means you should pay attention to identifier quoting, literal quoting, and engine-specific capabilities.

    If you are debugging a slow or surprising query, the most useful follow-up pages are:

    • SQL syntax

    • Query options

    • Query quotas

    hashtag
    When to use which engine

    Single-stage execution is the default path for straightforward filtering, aggregation, and top-K style queries.

    Use multi-stage execution when you need features that are not available in single-stage mode, such as:

    • joins

    • subqueries

    • window functions

    • more complex distributed query shapes

    As a rule of thumb: use SSE for simple filtering, aggregation, and top-K queries; use MSE when your query shape requires joins, subqueries, window functions, or other advanced relational operators. For a detailed comparison, see SSE vs MSE.

    hashtag
    Next step

    Read SQL syntax for the query language itself, then move to Query options or Explain plan when you need control or diagnostics.

    hashtag
    Related pages

    • Querying & SQL controls

    • SQL syntax

    • Explain plan

    hashtag
    Identifier vs Literal

    In Pinot SQL:

• Double quotes (") are used to force string identifiers, e.g. column names

• Single quotes (') are used to enclose string literals. If the string literal itself contains a single quote, escape it with another single quote, e.g. '''Pinot''' to match the string literal 'Pinot'

    Misusing those might cause unexpected query results, like the following examples:

    • WHERE a='b' means the predicate on the column a equals to a string literal value 'b'

    • WHERE a="b" means the predicate on the column a equals to the value of the column b

    If your column names use reserved keywords (e.g. timestamp or date) or special characters, you will need to use double quotes when referring to them in queries.

    Note: Define decimal literals within quotes to preserve precision.
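A short illustration of these quoting rules, using a hypothetical table and columns:

```sql
-- "timestamp" is double-quoted because it is a reserved keyword used as a column name;
-- 'foo' is a string literal, and the quoted '123.456789' preserves decimal precision
SELECT "timestamp", count(*)
FROM myTable
WHERE a = 'foo' AND price = '123.456789'
GROUP BY "timestamp"
```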

    hashtag
    Example Queries

    hashtag
    Selection

    hashtag
    Aggregation

    hashtag
    Grouping on Aggregation

    hashtag
    Ordering on Aggregation

    hashtag
    Filtering

    For performant filtering of IDs in a list, see Filtering with IdSet.

    hashtag
    Filtering with NULL predicate

    hashtag
    Selection (Projection)

    hashtag
    Ordering on Selection

    hashtag
    Pagination on Selection

    Note that results might not be consistent if the ORDER BY column has the same value in multiple rows.

    hashtag
    Wild-card match (in WHERE clause only)

    The example below counts rows where the column airlineName starts with U:
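A query of that shape (the table name is illustrative):

```sql
SELECT count(*)
FROM myTable
WHERE REGEXP_LIKE(airlineName, '^U.*')
```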

Note: REGEXP_LIKE also supports case-insensitive search by passing the i flag as the third parameter.

    hashtag
    Case-When Statement

    Pinot supports the CASE-WHEN-ELSE statement, as shown in the following two examples:

    hashtag
    UDF

    Pinot doesn't currently support injecting functions. Functions have to be implemented within Pinot, as shown below:

    For more examples, see Transform Function in Aggregation Grouping.

    hashtag
    BYTES column

    Pinot supports queries on BYTES column using hex strings. The query response also uses hex strings to represent bytes values.

    The query below fetches all the rows for a given UID:
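For instance (the table name and hex value are placeholders):

```sql
SELECT *
FROM myTable
WHERE UID = 'c8b6b9a4'
```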

    Pinot Cluster Manager

    If you want to view the contents of a server, click on its instance name. You'll then see the following:

    Pinot Server

    hashtag
    Table management

    To view a table, click on its name from the tables list. From the table detail screen, you can edit or delete the table, edit or adjust its schema, and perform several other operations.

    baseballStats Table

    For example, if we want to add yearID to the list of inverted indexes, click on Edit Table, add the extra column, and click Save:

    Edit Table

    hashtag
    Pause and resume consumption

    For real-time tables, the table detail screen includes a Pause/Resume Consumption button (#15657arrow-up-right). This lets you pause ingestion on a real-time table directly from the UI without issuing REST API calls, and resume it when ready. This is useful during maintenance windows or when you need to temporarily halt data ingestion.

    hashtag
    Consuming segments info

    A Consuming Segments Info button (#15623arrow-up-right) is available on real-time tables, providing a quick view of all currently consuming segments. This shows details such as the partition, current offset, and consumption state, making it easier to monitor real-time ingestion health.

    hashtag
    Reset segment

    The UI now supports a Reset Segment operation (#16078arrow-up-right), allowing you to reset a segment directly from the table detail screen. This is helpful when a segment is stuck in an error state and needs to be re-processed.

    hashtag
    Segment state filter

    A segment state filter (#16085arrow-up-right) has been added to the table detail screen. You can filter segments by their state (e.g., ONLINE, CONSUMING, ERROR) to quickly locate segments that need attention, which is especially valuable for tables with a large number of segments.

    hashtag
    Table rebalance

    The table detail screen also provides access to table rebalance operations. Several UI fixes and improvements (#15511arrow-up-right) have been made to improve the reliability and usability of the rebalance workflow, including better parameter validation and progress display.

    hashtag
    Logical table management

Starting with Pinot 1.4, the Data Explorer includes a logical table management UI (#17878arrow-up-right). Logical tables are collections of physical tables (REALTIME and OFFLINE) that can be queried as a single unified table.

    The logical tables listing is accessible from the main Tables page, alongside physical tables and schemas. From there you can:

    • Browse all logical tables in the cluster with search support.

    • View details of a logical table, including its configuration, the list of physical tables it maps to, and metadata.

    • Edit a logical table's configuration.

    • Delete a logical table with a confirmation dialog.

    For more information about logical tables, see the Logical Table Support section in the 1.4.0 release notes.

    hashtag
    Query Console

    Navigate to Query Consolearrow-up-right to see the querying interface. The Query Console lets you run SQL queries against your Pinot cluster and view the results interactively.

We can see our baseballStats table listed on the left (you will see meetupRSVP or airlineStats if you used the streaming or the hybrid quick start). Click on the table name to display the table's column names and their data types.

    You can also execute a sample query select * from baseballStats limit 10 by typing it in the text box and clicking the Run Query button.

    Cmd + Enter can also be used to run the query when focused on the console.

    Here are some sample queries you can try:

    Pinot uses SQL for querying. For the complete syntax reference, see the SQL Syntax and Operators Reference. For query options, examples, and engine details, see Querying Pinot.

    hashtag
    Time-series query execution

    The Query Console also supports time-series query execution (#16305arrow-up-right), introduced as part of the Time Series Engine beta. This feature provides a dedicated interface for running and visualizing time-series queries using languages such as PromQL. It connects to a Prometheus-compatible /query_range endpoint (#16286arrow-up-right) exposed by the Pinot controller, letting you explore time-series data and inspect query execution plans directly from the UI.

    hashtag
    REST API

    The Pinot Admin UIarrow-up-right contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management including health check, instances management, schema and table management, data segments management.

Let's check out the tables in this cluster by going to Table -> List all tables in clusterarrow-up-right, click Try it out, and then click Execute. We can see the baseballStats table listed here. We can also see the exact cURL call made to the controller API.

    List all tables in cluster

    You can look at the configuration of this table by going to Tables -> Get/Enable/Disable/Drop a tablearrow-up-right, click Try it out, type baseballStats in the table name, and then click Execute.

    Let's check out the schemas in the cluster by going to Schema -> List all schemas in the clusterarrow-up-right, click Try it out, and then click Execute. We can see a schema called baseballStats in this list.

    List all schemas in the cluster

    Take a look at the schema by going to Schema -> Get a schemaarrow-up-right, click Try it out, type baseballStats in the schema name, and then click Execute.

    baseballStats Schema

    Finally, let's check out the data segments in the cluster by going to Segment -> List all segmentsarrow-up-right, click Try it out, type in baseballStats in the table name, and then click Execute. There's 1 segment for this table, called baseballStats_OFFLINE_0.

    To learn how to upload your own data and schema, see Batch Ingestion or Stream ingestion.


    Segment

    Discover the segment component in Apache Pinot for efficient data storage and querying within Pinot clusters, enabling optimized data processing and analysis.

    Pinot tables are stored in one or more independent shards called segments. A small table may be contained by a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ingestion). Segments have time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.

Pinot scales by breaking table data into these smaller chunks, known as segments (similar to shards or partitions in relational databases). Segments can be seen as time-based partitions.

    A segment is a horizontal shard representing a chunk of table data with some number of rows. The segment stores data for all columns of the table. Each segment packs the data in a columnar fashion, along with the dictionaries and indices for the columns. The segment is laid out in a columnar format so that it can be directly mapped into memory for serving queries.

Columns can be single-valued or multi-valued, and the following types are supported: STRING, BOOLEAN, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, and BYTES. BIG_DECIMAL is supported only as a single-valued type.

Columns may be declared to be metric or dimension (or specifically as a time dimension) in the schema. Columns can have default null values. For example, the default null value of an integer column can be 0. The default value for bytes columns must be hex-encoded before it's added to the schema.

Pinot uses dictionary encoding to store values as dictionary IDs. Columns may be configured as “no-dictionary” columns, in which case raw values are stored. Dictionary IDs are encoded using the minimum number of bits for efficient storage (e.g. a column with a cardinality of 3 will use only 2 bits for each dictionary ID).
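The bit-width rule can be sketched as follows; this is an illustration of the sizing math described above, not Pinot's actual code:

```python
def bits_per_dictionary_id(cardinality: int) -> int:
    # Minimum number of bits needed to address `cardinality` distinct
    # dictionary entries; e.g. a column with cardinality 3 needs 2 bits.
    if cardinality <= 1:
        return 1
    return (cardinality - 1).bit_length()
```

So bits_per_dictionary_id(3) returns 2, matching the example above, while a cardinality of 17 would need 5 bits.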

    A forward index is built for each column and compressed for efficient memory use. In addition, you can optionally configure inverted indices for any set of columns. Inverted indices take up more storage, but improve query performance. Specialized indexes like Star-Tree index are also supported. For more details, see .

    hashtag
    Creating a segment

Once the table is configured, we can load some data. Loading data involves generating Pinot segments from raw data and pushing them to the Pinot cluster. Data can be loaded in batch mode or streaming mode. For more details, see the page.

    hashtag
    Load data in batch

    hashtag
    Prerequisites

    Below are instructions to generate and push segments to Pinot via standalone scripts. For a production setup, you should use frameworks such as Hadoop or Spark. For more details on setting up data ingestion jobs, see

    hashtag
    Job Spec YAML

To generate a segment, we first need to create a job spec YAML file. This file contains all the information regarding the data format, input data location, and Pinot cluster coordinates. Note that this assumes the controller is running, so the job can fetch the table config and schema; if not, you will have to configure the spec to point at their location. For full configurations, see .

    hashtag
    Create and push segment

    To create and push the segment in one go, use the following:

    Sample Console Output

Alternatively, you can create and then push separately by changing the jobType to SegmentCreation or SegmentTarPush.

    hashtag
    Templating Ingestion Job Spec

    The Ingestion job spec supports templating with Groovy Syntax.

    This is convenient if you want to generate one ingestion job template file and schedule it on a daily basis with extra parameters updated daily.

    e.g. you could set inputDirURI with parameters to indicate the date, so that the ingestion job only processes the data for a particular date. Below is an example that templates the date for input and output directories.
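A sketch of such a templated spec fragment (the paths are placeholders):

```yaml
inputDirURI: 'file:///path/to/input/${year}/${month}/${day}'
outputDirURI: 'file:///path/to/output/${year}/${month}/${day}'
```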

You can pass in arguments containing values for ${year}, ${month}, ${day} when kicking off the ingestion job, e.g. -values year=2014 month=01 day=03.

This ingestion job then only generates segments for the date 2014-01-03.

    hashtag
    Load data in streaming

    Prerequisites

    Below is an example of how to publish sample data to your stream. As soon as data is available to the real-time stream, it starts getting consumed by the real-time servers.

    hashtag
    Kafka

Run the command below to stream JSON data into the Kafka topic flights-realtime:

    Flink

    Batch ingestion of data into Apache Pinot using Apache Flink.

    Apache Pinot supports using Apache Flink as a processing framework to generate and upload segments. The Pinot distribution includes a PinotSinkFunctionarrow-up-right that can be integrated into Flink applications (streaming or batch) to directly write data as segments into Pinot tables.

    The PinotSinkFunction supports offline tables, realtime tables, and upsert tables (full upsert only). Data is buffered in memory and flushed as segments when the configured threshold is reached, then uploaded to the Pinot cluster.

    hashtag
    Maven Dependency

    To use the Pinot Flink Connector in your Flink job, add the following dependency to your pom.xml:
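The dependency block would look roughly like this. The artifact coordinates are an assumption based on the connector's name; verify them against the Maven repository for your Pinot version:

```xml
<dependency>
  <groupId>org.apache.pinot</groupId>
  <artifactId>pinot-flink-connector</artifactId>
  <version>1.5.0-SNAPSHOT</version>
</dependency>
```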

    Replace 1.5.0-SNAPSHOT with the Pinot version you're using. For the latest stable version, check the .

    Note: The connector transitively includes dependencies for:

    • pinot-controller - For controller client APIs

    • pinot-segment-writer-file-based - For segment generation

• flink-streaming-java and flink-java - Flink core dependencies

    hashtag
    Offline Table Ingestion

    hashtag
    Quick Start Example

    hashtag
    Table Configuration

    The PinotSinkFunction uses the TableConfig to determine batch ingestion settings for segment generation and upload. Here's an example table configuration:

    Required configurations:

    • outputDirURI - Directory where segments are written before upload

    • push.controllerUri - Pinot controller URL for segment upload

For a complete executable example, refer to FlinkQuickStart.java.

    hashtag
    Realtime Table Ingestion

    hashtag
    Non-Upsert Realtime Tables

    For standard realtime tables without upsert, use the same approach as offline tables, but specify REALTIME as the table type:

    hashtag
    Upsert Tables

    hashtag
    Full Upsert Tables

The Flink connector supports backfilling full upsert tables, where each record contains all columns. The uploaded segments will correctly participate in upsert semantics based on the comparison column value.

    Requirements:

    1. Partitioning: Data must be partitioned using the same strategy as the upstream stream (e.g., Kafka)

    2. Parallelism: Flink job parallelism must match the number of upstream stream/table partitions

3. Comparison Column: The values of the comparison column must have an ordering consistent with the upstream stream. This ensures that Pinot can correctly resolve which record is the latest for a given key. See the Pinot upsert comparison column docs for important considerations.
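Requirement 1 — matching partitioning — can be sketched in Python, assuming both the upstream producer and the backfill job use a simple modulo partitioner (Kafka's default partitioner actually hashes keys with murmur2, so in practice the backfill must mirror whatever partitioner the producer really used):

```python
def modulo_partition(key: int, num_partitions: int) -> int:
    # Both the upstream producer and the Flink backfill job must use the
    # SAME partitioning function for upsert segments to line up correctly.
    return key % num_partitions

NUM_PARTITIONS = 4  # must equal the stream's partition count AND the Flink parallelism

keys = [101, 202, 303, 404]  # hypothetical primary-key values
stream_side = [modulo_partition(k, NUM_PARTITIONS) for k in keys]
backfill_side = [modulo_partition(k, NUM_PARTITIONS) for k in keys]
# The two assignments agree for every key: [1, 2, 3, 0]
```

If the two sides disagree for even one key, records for that key land on servers that do not host its stream-consumed partition, and upsert resolution breaks for that key.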

    Example:

    How Partitioning Works:

    When uploading segments for upsert tables, Pinot uses a special segment naming convention that encodes the partition ID. The format is:

    Example: flink__myTable__0__1724045187__1

    Each Flink subtask generates segments for a specific partition based on its subtask index. The segments are then assigned to the same server instances that handle that partition for stream-consumed segments, ensuring correct upsert behavior across all segments.
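The naming convention can be composed and parsed mechanically; a small Python sketch (the helper names are illustrative, and parsing assumes the table name itself contains no double underscores):

```python
def build_uploaded_segment_name(prefix: str, table: str, partition_id: int,
                                upload_time_ms: int, sequence_id: int) -> str:
    # Mirrors the {prefix}__{tableName}__{partitionId}__{uploadTimeMs}__{sequenceId}
    # convention described above.
    return f"{prefix}__{table}__{partition_id}__{upload_time_ms}__{sequence_id}"

def parse_partition_id(segment_name: str) -> int:
    # The partition ID is the third double-underscore-separated component.
    return int(segment_name.split("__")[2])

name = build_uploaded_segment_name("flink", "myTable", 0, 1724045187, 1)
# name == "flink__myTable__0__1724045187__1", matching the example above
```

Encoding the partition ID in the name is what lets Pinot route an uploaded segment to the same servers that host that partition's stream-consumed segments.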

    Configuration Options:

    You can customize segment generation using additional constructor parameters:

    hashtag
    Partial Upsert Tables

    WARNING: Flink-based upload is not recommended for partial upsert tables.

In partial upsert tables, uploaded segments may contain only a subset of columns or an intermediate row for a primary key. If the uploaded row is not in its final state and subsequent updates arrive via the stream, the partial upsert merger may produce inconsistent results between replicas. This can lead to data inconsistency that is difficult to detect and resolve.

    For partial upsert tables, prefer stream-based ingestion only or ensure uploaded data represents the final state for each primary key.

    hashtag
    Advanced Configuration

    hashtag
    Segment Flush Control

    Control when segments are flushed and uploaded:

    hashtag
    Segment Naming

    Customize segment naming and upload time for better organization:

    hashtag
    Additional Resources

• Design Proposal - Original design motivation

    • PR #13107 - Externally partitioned segments for upsert tables

    • PR #13837 - Flink connector enhancements for upsert backfill

    Complex Type (Array, Map) Handling

    Complex type handling in Apache Pinot.

Commonly, ingested data has a complex structure. For example, Avro schemas have records and arrays, while JSON supports objects and arrays.

Apache Pinot's data model supports primitive data types (including int, long, float, double, BigDecimal, string, bytes) and limited multi-value types, such as an array of primitive types. Simple data types allow Pinot to build fast indexing structures for good query performance, but they do require some handling of the complex structures.

    There are two options for complex type handling:

    • Convert the complex-type data into a JSON string and then build a JSON index.

    • Use the built-in complex-type handling rules in the ingestion configuration.
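The first option boils down to serializing the complex value into a string column at ingestion time, then building a JSON index over it. Conceptually (field names borrowed from the example below; the serialization itself is done by Pinot's jsonFormat transform, not user code):

```python
import json

# A record with a complex "group" field, as it might arrive from the stream.
record = {"group": {"group_id": 1, "group_topics": [{"urlkey": "a"}]}}

# Option 1: persist the complex field as a JSON string column (group_json),
# which a JSON index can then accelerate lookups on.
transformed = {"group_json": json.dumps(record["group"], separators=(",", ":"))}
# transformed["group_json"] == '{"group_id":1,"group_topics":[{"urlkey":"a"}]}'
```

The structure is preserved verbatim, at the cost of needing JSON extraction functions at query time.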

    On this page, we'll show how to handle these complex-type structures with each of these two approaches. We will process some example data, consisting of the field group from the .

    This object has two child fields and the child group is a nested array with elements of object type.

    hashtag
    JSON indexing

Apache Pinot provides a powerful JSON index to accelerate value lookup and filtering for the column. To convert an object such as group with complex type to JSON, add the following to your table configuration.

The transformConfigs config transforms the object group into a JSON string group_json, which is then indexed according to the jsonIndexColumns configuration. To read the full spec, see .

    Also, note that group is a reserved keyword in SQL and therefore needs to be quoted in transformFunction.

    circle-info

The columnName can't use the same name as any of the fields in the source JSON data. For example, if our source data contains the field group and we want to transform the data in that field before persisting it, the destination column name needs to be something different, like group_json.

    circle-info

Note that you do not need to worry about the maxLength of the field group_json in the schema, because the JSON data type does not have a maxLength and will not be truncated. This is true even though JSON is stored as a string internally.

    The schema will look like this:

    For the full specification, see .

With this, you can start to query the nested fields under group. For more details about the supported JSON functions, see .

    hashtag
    Ingestion configurations

    Though JSON indexing is a handy way to process the complex types, there are some limitations:

    • It’s not performant to group by or order by a JSON field, because JSON_EXTRACT_SCALAR is needed to extract the values in the GROUP BY and ORDER BY clauses, which invokes the function evaluation.

• It does not work with Pinot's multi-value column functions such as DISTINCTCOUNTMV.

Alternatively, from Pinot 0.8 onward, you can use the complex-type handling rules in the ingestion configuration to flatten and unnest the complex structure and convert it into primitive types. You can then reduce the complex-type data into a flattened Pinot table and query it via SQL. With the built-in processing rules, you do not need to write ETL jobs in another compute framework such as Flink or Spark.

    To process this complex type, you can add the configuration complexTypeConfig to the ingestionConfig. For example:

With complexTypeConfig, all the map objects will be flattened to direct fields automatically. And with fieldsToUnnest, a record with a nested collection will unnest into multiple records. For instance, the example at the beginning will transform into two rows with this configuration example.

    Note that:

• The nested field group_id under group is flattened to group.group_id. The delimiter defaults to . (period); you can choose another delimiter by specifying the delimiter configuration under complexTypeConfig. This flattening rule also applies to maps in the collections to be unnested.

    You can find the full specifications of the table config and the table schema .

    You can then query the table with primitive values using the following SQL query:

    circle-info

    . is a reserved character in SQL, so you need to quote the flattened columns in the query.
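To make the flattening and unnesting rules concrete, here is a minimal Python sketch of the idea. This is a simplification for illustration only — the real logic lives in Pinot's ingestion transformer and also handles nested unnesting, delimiter configuration, and the collectionNotUnnestedToJson rules:

```python
def flatten(record: dict, delimiter: str = ".") -> dict:
    """Flatten nested maps into top-level keys joined by the delimiter,
    e.g. {"group": {"group_id": 1}} -> {"group.group_id": 1}."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in flatten(value, delimiter).items():
                flat[f"{key}{delimiter}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

def unnest(record: dict, field: str) -> list:
    """Turn one record with a collection field into one record per element."""
    rows = []
    for element in record.get(field, []):
        row = {k: v for k, v in record.items() if k != field}
        row[field] = element  # element is a map; flatten() prefixes its keys
        rows.append(row)
    return rows

record = {"group_id": 1, "group_topics": [{"urlkey": "a"}, {"urlkey": "b"}]}
# Unnest the collection first, then flatten each resulting row:
rows = [flatten(r) for r in unnest(record, "group_topics")]
# rows == [{"group_id": 1, "group_topics.urlkey": "a"},
#          {"group_id": 1, "group_topics.urlkey": "b"}]
```

One input record with a two-element nested array becomes two flat rows, which is exactly the two-row output described for the example above.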

    hashtag
    Infer the Pinot schema from the Avro schema and JSON data

    When there are complex structures, it can be challenging and tedious to figure out the Pinot schema manually. To help with schema inference, Pinot provides utility tools to take the Avro schema or JSON data as input and output the inferred Pinot schema.

    To infer the Pinot schema from Avro schema, you can use a command like this:

Note that you can pass in configurations like fieldsToUnnest, similar to the ones in complexTypeConfig. This simulates the complex-type handling rules on the Avro schema and writes the inferred Pinot schema to the file specified in outputDir.

Similarly, you can use a command like the following to infer the Pinot schema from a file of JSON objects.

    You can check out an example of this run in this .
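As a rough illustration of what the inference does for scalar fields (a simplification — the real tool also applies the unnesting rules, distinguishes INT from LONG, and handles time columns; the sample field names below are hypothetical):

```python
def infer_pinot_type(value):
    """Rough mapping from a JSON scalar to a Pinot data type."""
    if isinstance(value, bool):   # check bool before int: bool is an int subtype
        return "BOOLEAN"
    if isinstance(value, int):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"complex value needs flattening/unnesting first: {value!r}")

sample = {"group_id": 12345, "score": 1.5, "name": "meetup"}
schema = {field: infer_pinot_type(value) for field, value in sample.items()}
# schema == {"group_id": "LONG", "score": "DOUBLE", "name": "STRING"}
```

Complex values (maps and arrays) are the interesting part in practice, which is why the tool accepts the same fieldsToUnnest style options as the ingestion config.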

    Dimension Table

    Batch ingestion of data into Apache Pinot using dimension tables.

Dimension tables are a special kind of offline table designed for join-like enrichment of fact data at query time. They are used together with the LOOKUP function (single-stage engine) or joins (multi-stage engine) to decorate query results with reference data.

    hashtag
    When to use dimension tables

    Use a dimension table when you need to enrich a large fact table with attributes from a small, relatively static reference dataset at query time. Common examples include:

    Ingest from Apache Pulsar

    This guide shows you how to ingest a stream of records from an Apache Pulsar topic into a Pinot table.

Pinot supports consuming data from Apache Pulsar via the pinot-pulsar plugin. You need to enable this plugin so that the Pulsar-specific libraries are present in the classpath.

    Enable the Pulsar plugin with the following config at the time of Pinot setup: -Dplugins.include=pinot-pulsar

    circle-info

    The pinot-pulsar

    Ingest Records with Dynamic Schemas

    Storing records with dynamic schemas in a table with a fixed schema.

    Some domains (e.g., logging) generate records where each record can have a different set of keys, whereas Pinot tables have a relatively static schema. For records with varying keys, it's impractical to store each field in its own table column. However, most (if not all) fields may be important, so fields should not be dropped unnecessarily.

Additionally, search patterns on such a table can be complex and change frequently. Exact match, range queries, prefix/suffix match, wildcard search, and aggregation functions could be used on any old or newly created keys or values.

    hashtag
    SchemaConformingTransformer

    {
      "timestamp": 1672531200000,
      "message": "INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds. 8 tasks remaining.",
      "logPath": "/mnt/data/application_123/container_15/stdout"
    }
    {
      "timestamp": 1672531200000,
      "message_logtype": "INFO Task \\x12 assigned to container: [ContainerID:\\x12], operation took \\x13 seconds. \\x11 tasks remaining.",
      "message_dictionaryVars": [
        "task_12",
        "container_15"
      ],
      "message_encodedVars": [
        1801439850948198735,
        8
      ],
      "logPath_logtype": "/mnt/data/\\x12/\\x12/stdout",
      "logPath_dictionaryVars": [
        "application_123",
        "container_15"
      ],
      "logPath_encodedVars": []
    }
    {
      "tableIndexConfig": {
        "streamConfigs": {
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder",
          "stream.kafka.decoder.prop.fieldsForClpEncoding": "message,logPath"
        },
        "varLengthDictionaryColumns": [
          "message_logtype",
          "message_dictionaryVars",
          "logPath_logtype",
          "logPath_dictionaryVars"
        ]
      }
    }
    {
      "dimensionFieldSpecs": [
        {
          "name": "message_logtype",
          "dataType": "STRING",
          "maxLength": 2147483647
        },
        {
          "name": "message_encodedVars",
          "dataType": "LONG",
          "singleValueField": false
        },
        {
          "name": "message_dictionaryVars",
          "dataType": "STRING",
          "maxLength": 2147483647,
          "singleValueField": false
        },
        {
      "name": "logPath_logtype",
          "dataType": "STRING",
          "maxLength": 2147483647
        },
        {
      "name": "logPath_encodedVars",
          "dataType": "LONG",
          "singleValueField": false
        },
        {
      "name": "logPath_dictionaryVars",
          "dataType": "STRING",
          "maxLength": 2147483647,
          "singleValueField": false
        }
      ]
    }
    {
      "fieldConfigList": [
        {
          "name": "message",
          "encodingType": "RAW",
          "compressionCodec": "CLPV2_ZSTD"
        }
      ]
    }
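The CLP-style split of a log message into a logtype plus dictionary and encoded variables, shown in the example above, can be approximated with a toy encoder. This is heavily simplified — the real CLP encoder tokenizes differently (it extracts "container_15" without surrounding punctuation) and packs floats into a custom 64-bit encoding rather than storing them directly:

```python
import re

def clp_like_encode(message: str):
    """Crude CLP-style split: replace each whitespace-delimited token that
    contains a digit with a placeholder, collecting non-numeric tokens as
    dictionary variables and numbers as encoded variables."""
    dictionary_vars, encoded_vars = [], []

    def replace(match: re.Match) -> str:
        token = match.group(0)
        try:
            if "." in token:
                encoded_vars.append(float(token))
                return "\\x13"  # float placeholder
            encoded_vars.append(int(token))
            return "\\x11"      # integer placeholder
        except ValueError:
            dictionary_vars.append(token)
            return "\\x12"      # dictionary-variable placeholder

    logtype = re.sub(r"\S*\d\S*", replace, message)
    return logtype, dictionary_vars, encoded_vars

logtype, dict_vars, enc_vars = clp_like_encode(
    "INFO Task task_12 assigned to container: [ContainerID:container_15], "
    "operation took 0.335 seconds. 8 tasks remaining.")
# dict_vars captures tokens like "task_12"; enc_vars captures 0.335 and 8
```

The repetitive logtype column compresses extremely well, which is the core of CLP's storage savings.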
    -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3
    controller.data.dir=s3a://path/to/data/directory/
    pinot.controller.storage.factory.class.s3a=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3a.region=us-east-1
    pinot.controller.segment.fetcher.protocols=file,http,s3a
    pinot.controller.segment.fetcher.s3a.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.controller.storage.factory.s3.region=ap-southeast-1
    pinot.controller.storage.factory.s3.requestChecksumCalculation=WHEN_REQUIRED
    pinot.controller.storage.factory.s3.responseChecksumValidation=WHEN_REQUIRED
    Missing required content hash for this request: Content-MD5 or x-amz-content-sha256
    pinot.controller.storage.factory.s3.useLegacyMd5Plugin=true
    executionFrameworkSpec:
        name: 'standalone'
        segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
        segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
        segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: 's3://pinot-bucket/pinot-ingestion/batch-input/'
    outputDirURI: 's3://pinot-bucket/pinot-ingestion/batch-output/'
    overwriteOutput: true
    pinotFSSpecs:
        - scheme: s3
          className: org.apache.pinot.plugin.filesystem.S3PinotFS
          configs:
            region: 'ap-southeast-1'
    recordReaderSpec:
        dataFormat: 'csv'
        className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
        configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
        tableName: 'students'
    pinotClusterSpecs:
        - controllerURI: 'http://localhost:9000'
    controller.data.dir=s3://path/to/data/directory/
    controller.local.temp.dir=/path/to/local/temp/directory
    controller.enable.split.commit=true
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=ap-southeast-1
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.server.storage.factory.s3.region=ap-southeast-1
    pinot.server.storage.factory.s3.httpclient.maxConnections=50
    pinot.server.storage.factory.s3.httpclient.socketTimeout=30s
    pinot.server.storage.factory.s3.httpclient.connectionTimeout=2s
    pinot.server.storage.factory.s3.httpclient.connectionTimeToLive=0s
    pinot.server.storage.factory.s3.httpclient.connectionAcquisitionTimeout=10s
    pinot.server.segment.fetcher.protocols=file,http,s3
    pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.minion.storage.factory.s3.region=ap-southeast-1
    pinot.minion.segment.fetcher.protocols=file,http,s3
    pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    SET useMultistageEngine = true;
    SELECT city, COUNT(*)
    FROM stores
    GROUP BY city
    LIMIT 10;
-- defaults to LIMIT 10
    SELECT * 
    FROM myTable 
    
    SELECT * 
    FROM myTable 
    LIMIT 100
    SELECT "date", "timestamp"
    FROM myTable 
    SELECT COUNT(*), MAX(foo), SUM(bar) 
    FROM myTable
    SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
    FROM myTable
    GROUP BY bar, baz 
    LIMIT 50
    SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
    FROM myTable
    GROUP BY bar, baz 
    ORDER BY bar, MAX(foo) DESC 
    LIMIT 50
    SELECT COUNT(*) 
    FROM myTable
      WHERE foo = 'foo'
      AND bar BETWEEN 1 AND 20
      OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
    SELECT COUNT(*) 
    FROM myTable
      WHERE foo IS NOT NULL
      AND foo = 'foo'
      AND bar BETWEEN 1 AND 20
      OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
    SELECT * 
    FROM myTable
      WHERE quux < 5
      LIMIT 50
    SELECT foo, bar 
    FROM myTable
      WHERE baz > 20
      ORDER BY bar DESC
      LIMIT 100
    SELECT foo, bar 
    FROM myTable
      WHERE baz > 20
      ORDER BY bar DESC
      LIMIT 50, 100
    SELECT COUNT(*) 
    FROM myTable
      WHERE REGEXP_LIKE(airlineName, '^U.*')
      GROUP BY airlineName LIMIT 10
    SELECT
        CASE
          WHEN price > 30 THEN 3
          WHEN price > 20 THEN 2
          WHEN price > 10 THEN 1
          ELSE 0
        END AS price_category
    FROM myTable
    SELECT
      SUM(
        CASE
          WHEN price > 30 THEN 30
          WHEN price > 20 THEN 20
          WHEN price > 10 THEN 10
          ELSE 0
        END) AS total_cost
    FROM myTable
    SELECT COUNT(*)
    FROM myTable
    GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')
    SELECT * 
    FROM myTable
    WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'
    select playerName, max(hits)
    from baseballStats
    group by playerName
    order by max(hits) desc
    select sum(hits), sum(homeRuns), sum(numberOfGames)
    from baseballStats
    where yearID > 2010
    select *
    from baseballStats
    order by league

    Legacy V1 (uncompressed, pass-through)

    endpoint

    (Optional) Override endpoint for s3 client.

    disableAcl

If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.

    serverSideEncryption

(Optional) The server-side encryption algorithm used when storing objects in Amazon S3 (currently aws:kms is supported); set to null to disable SSE.

    ssekmsKeyId

    (Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.

    ssekmsEncryptionContext

    (Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.

    requestChecksumCalculation

    (Optional) Controls whether checksums are calculated for request payloads. Default: WHEN_SUPPORTED. Options: WHEN_SUPPORTED, WHEN_REQUIRED.

    responseChecksumValidation

    (Optional) Controls whether checksums are validated on response payloads. Default: WHEN_SUPPORTED. Options: WHEN_SUPPORTED, WHEN_REQUIRED.

    useLegacyMd5Plugin

    (Optional) When set to true, uses the LegacyMd5Plugin to restore pre-2.30.0 MD5 checksum behavior. Default: false.

    enableCrossRegionAccess

(Optional) Enables copying objects between two buckets that are in different regions. Defaults to true if not configured.

    job-spec.yml
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    
    jobType: SegmentCreationAndTarPush
    inputDirURI: 'examples/batch/baseballStats/rawdata'
    includeFileNamePattern: 'glob:**/*.csv'
    excludeFileNamePattern: 'glob:**/*.tmp'
    outputDirURI: 'examples/batch/baseballStats/segments'
    overwriteOutput: true
    
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
      configs:
    
    tableSpec:
      tableName: 'baseballStats'
      schemaURI: 'http://localhost:9000/tables/baseballStats/schema'
      tableConfigURI: 'http://localhost:9000/tables/baseballStats'
      
    segmentNameGeneratorSpec:
    
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'
    
    pushJobSpec:
      pushParallelism: 2
      pushAttempts: 2
      pushRetryIntervalMillis: 1000
    docker run \
        --network=pinot-demo \
        --name pinot-data-ingestion-job \
        ${PINOT_IMAGE} LaunchDataIngestionJob \
        -jobSpecFile examples/docker/ingestion-job-specs/airlineStats.yaml
    SegmentGenerationJobSpec:
    !!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
    excludeFileNamePattern: null
    executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
      segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
      segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
    includeFileNamePattern: glob:**/*.avro
    inputDirURI: examples/batch/airlineStats/rawdata
    jobType: SegmentCreationAndTarPush
    outputDirURI: examples/batch/airlineStats/segments
    overwriteOutput: true
    pinotClusterSpecs:
    - {controllerURI: 'http://pinot-controller:9000'}
    pinotFSSpecs:
    - {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
    pushJobSpec: {pushAttempts: 2, pushParallelism: 1, pushRetryIntervalMillis: 1000,
      segmentUriPrefix: null, segmentUriSuffix: null}
    recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.avro.AvroRecordReader,
      configClassName: null, configs: null, dataFormat: avro}
    segmentNameGeneratorSpec: null
    tableSpec: {schemaURI: 'http://pinot-controller:9000/tables/airlineStats/schema',
      tableConfigURI: 'http://pinot-controller:9000/tables/airlineStats', tableName: airlineStats}
    
    Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
    Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
    Finished building StatsCollector!
    Collected stats for 403 documents
    Created dictionary for INT column: FlightNum with cardinality: 386, range: 14 to 7389
    Using fixed bytes value dictionary for column: Origin, size: 294
    Created dictionary for STRING column: Origin with cardinality: 98, max length in bytes: 3, range: ABQ to VPS
    Created dictionary for INT column: Quarter with cardinality: 1, range: 1 to 1
    Created dictionary for INT column: LateAircraftDelay with cardinality: 50, range: -2147483648 to 303
    ......
    ......
    Pushing segment: airlineStats_OFFLINE_16085_16085_29 to location: http://pinot-controller:9000 for table airlineStats
    Sending request: http://pinot-controller:9000/v2/segments?tableName=airlineStats to controller: a413b0013806, version: Unknown
    Response for pushing table airlineStats segment airlineStats_OFFLINE_16085_16085_29 to location http://pinot-controller:9000 - 200: {"status":"Successfully uploaded segment: airlineStats_OFFLINE_16085_16085_29 of table: airlineStats"}
    Pushing segment: airlineStats_OFFLINE_16084_16084_30 to location: http://pinot-controller:9000 for table airlineStats
    Sending request: http://pinot-controller:9000/v2/segments?tableName=airlineStats to controller: a413b0013806, version: Unknown
    Response for pushing table airlineStats segment airlineStats_OFFLINE_16084_16084_30 to location http://pinot-controller:9000 - 200: {"status":"Successfully uploaded segment: airlineStats_OFFLINE_16084_16084_30 of table: airlineStats"}
    bin/pinot-admin.sh LaunchDataIngestionJob \
        -jobSpecFile examples/batch/airlineStats/ingestionJobSpec.yaml
    inputDirURI: 'examples/batch/airlineStats/rawdata/${year}/${month}/${day}'
    outputDirURI: 'examples/batch/airlineStats/segments/${year}/${month}/${day}'
    docker run \
        --network=pinot-demo \
        --name pinot-data-ingestion-job \
        ${PINOT_IMAGE} LaunchDataIngestionJob \
        -jobSpecFile examples/docker/ingestion-job-specs/airlineStats.yaml
        -values year=2014 month=01 day=03
    docker run \
      --network pinot-demo \
      --name=loading-airlineStats-data-to-kafka \
      ${PINOT_IMAGE} StreamAvroIntoKafka \
      -avroFile examples/stream/airlineStats/sample_data/airlineStats_data.avro \
      -kafkaTopic flights-realtime -kafkaBrokerList kafka:9092 -zkAddress pinot-zookeeper:2181/kafka
    bin/pinot-admin.sh StreamAvroIntoKafka \
      -avroFile examples/stream/airlineStats/sample_data/airlineStats_data.avro \
      -kafkaTopic flights-realtime -kafkaBrokerList localhost:19092 -zkAddress localhost:2181/kafka
    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-flink-connector</artifactId>
      <version>1.5.0-SNAPSHOT</version>
    </dependency>
    // Set up Flink environment and data source
    StreamExecutionEnvironment execEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    execEnv.setParallelism(2);
    
    // Configure row type
    RowTypeInfo typeInfo = new RowTypeInfo(
        new TypeInformation[]{Types.FLOAT, Types.FLOAT, Types.STRING, Types.STRING},
        new String[]{"lon", "lat", "address", "name"});
    
    DataStream<Row> srcRows = execEnv.addSource(new FlinkKafkaConsumer<Row>(...));
    
    // Create a ControllerRequestClient to fetch Pinot schema and table config
    HttpClient httpClient = HttpClient.getInstance();
    ControllerRequestClient client = new ControllerRequestClient(
        ControllerRequestURLBuilder.baseUrl(DEFAULT_CONTROLLER_URL), httpClient);
    
    // Fetch Pinot schema
    Schema schema = PinotConnectionUtils.getSchema(client, "starbucksStores");
    // Fetch Pinot table config
    TableConfig tableConfig = PinotConnectionUtils.getTableConfig(client, "starbucksStores", "OFFLINE");
    
    // Create Flink Pinot Sink
    srcRows.addSink(new PinotSinkFunction<>(
        new FlinkRowGenericRowConverter(typeInfo),
        tableConfig,
        schema));
    execEnv.execute();
    {
      "tableName": "starbucksStores_OFFLINE",
      "tableType": "OFFLINE",
      "segmentsConfig": {
        // ...
      },
      "tenants": {
        // ...
      },
      "tableIndexConfig": {
        // ...
      },
      "ingestionConfig": {
        "batchIngestionConfig": {
          "segmentIngestionType": "APPEND",
          "segmentIngestionFrequency": "HOURLY",
          "batchConfigMaps": [
            {
              "outputDirURI": "file:///tmp/pinotoutput",
              "overwriteOutput": "false",
              "push.controllerUri": "http://localhost:9000"
            }
          ]
        }
      }
    }
    // Same setup as offline table example above...
    
    // Fetch table config for realtime table
    TableConfig tableConfig = PinotConnectionUtils.getTableConfig(client, "myTable", "REALTIME");
    
    // Same sink configuration
    srcRows.addSink(new PinotSinkFunction<>(
        new FlinkRowGenericRowConverter(typeInfo),
        tableConfig,
        schema));
    execEnv.execute();
    // Set up Flink environment
    StreamExecutionEnvironment execEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    execEnv.setParallelism(2); // MUST match number of partitions in stream/table
    
    // Configure row type matching your upsert table schema
    RowTypeInfo typeInfo = new RowTypeInfo(
        new TypeInformation[]{Types.INT, Types.STRING, Types.STRING, Types.FLOAT, Types.LONG, Types.BOOLEAN},
        new String[]{"playerId", "name", "game", "score", "timestampInEpoch", "deleted"});
    
    DataStream<Row> srcRows = execEnv.addSource(new FlinkKafkaConsumer<Row>(...));
    
    // Fetch schema and table config (same as offline table example)
    // HttpClient httpClient = HttpClient.getInstance();
    // ControllerRequestClient client = ...
    Schema schema = PinotConnectionUtils.getSchema(client, "myUpsertTable");
    TableConfig tableConfig = PinotConnectionUtils.getTableConfig(client, "myUpsertTable", "REALTIME");
    
    // IMPORTANT: Partition data by primary key using the SAME logic as the stream
    srcRows.partitionCustom(
        (Partitioner<Integer>) (key, partitions) -> key % partitions,
        r -> (Integer) r.getField("playerId"))  // Primary key field
      .addSink(new PinotSinkFunction<>(
          new FlinkRowGenericRowConverter(typeInfo),
          tableConfig,
          schema));
    execEnv.execute();
    {prefix}__{tableName}__{partitionId}__{uploadTimeMs}__{sequenceId}
    new PinotSinkFunction<>(
        recordConverter,
        tableConfig,
        schema,
        segmentFlushMaxNumRecords,  // Default: 500,000, number of rows per segment
        executorPoolSize,            // Default: 5, number of threads to use to upload segment
        segmentNamePrefix,           // Default: "flink"
        segmentUploadTimeMs          // Default: current time, upload time value to encode in segment name
    )
    // Same setup as previous examples...
    
    long segmentFlushMaxNumRecords = 1000000; // Flush after 1M records
    int executorPoolSize = 10; // Thread pool size for async uploads
    
    srcRows.addSink(new PinotSinkFunction<>(
        new FlinkRowGenericRowConverter(typeInfo),
        tableConfig,
        schema,
        segmentFlushMaxNumRecords,
        executorPoolSize
    ));
    // Same setup as previous examples...
    
    String segmentNamePrefix = "flink_job_daily";
    Long segmentUploadTimeMs = 1724045185000L; // Group segments by upload run time
    
    srcRows.addSink(new PinotSinkFunction<>(
        new FlinkRowGenericRowConverter(typeInfo),
        tableConfig,
        schema,
        DEFAULT_SEGMENT_FLUSH_MAX_NUM_RECORDS,
        DEFAULT_EXECUTOR_POOL_SIZE,
        segmentNamePrefix,
        segmentUploadTimeMs
    ));
• The nested array group_topics under group is unnested into the top level, converting the output into a collection of two rows. Note the handling of the nested field within group_topics, and the eventual top-level field group.group_topics.urlkey. All the collections to unnest must be included in the fieldsToUnnest configuration.
  • Collections not specified in fieldsToUnnest will be serialized into JSON strings, except for arrays of primitive values, which are ingested as multi-value columns by default. The behavior is defined by the collectionNotUnnestedToJson config, which takes the following values:

    • NON_PRIMITIVE - Converts the array to a multi-value column. (default)

    • ALL - Converts the array of primitive values to JSON string.

    • NONE - Does not do any conversion.


    Looking up a human-readable team name from a team ID.

  • Enriching clickstream events with product catalog attributes.

  • Decorating transaction records with customer or store metadata.

  • If any of the following apply, a regular offline or real-time table is a better fit:

    • The reference data is large (hundreds of millions of rows or multiple gigabytes).

    • The data changes frequently and requires real-time ingestion.

    • You need time-based partitioning, retention policies, or a hybrid table setup.

    • You need to query the reference data with complex aggregations independently.

    hashtag
    How dimension tables work

    When a table is marked as a dimension table, Pinot replicates all of its segments to every server in the tenant. On each server the data is loaded into an in-memory hash map keyed by the table's primary key, which enables constant-time lookups during query execution.

    Because the data is fully replicated and held in memory, dimension tables must be small enough to fit comfortably in each server's heap. They are not intended for large datasets.
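Conceptually, the per-server structure behaves like an ordinary hash map keyed by the primary key. The following Python sketch is illustrative only (Pinot's actual implementation is in Java); the sample rows follow the dimBaseballTeams example used later on this page:

```python
# Illustrative sketch: a dimension table held as a hash map keyed by primary key.
rows = [
    {"teamID": "NYA", "teamName": "New York Yankees"},
    {"teamID": "BOS", "teamName": "Boston Red Sox"},
]

# Built once when segments are loaded; keyed on the schema's primaryKeyColumns.
lookup_map = {(row["teamID"],): row for row in rows}

def lookup(team_id, column):
    """Constant-time primary-key lookup, like the LOOKUP UDF performs per fact row."""
    row = lookup_map.get((team_id,))
    return row[column] if row is not None else None

print(lookup("NYA", "teamName"))  # -> New York Yankees
```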

    hashtag
    Memory loading modes

    Pinot supports two loading modes controlled by the disablePreload setting in dimensionTableConfig:

    • Fast lookup (default, disablePreload: false): higher memory usage, faster lookups. All rows are fully materialized into an in-memory hash map (Object[] -> Object[]). Every column value is stored in the map for constant-time retrieval.

    • Memory-optimized (disablePreload: true): lower memory usage, slightly slower lookups. Only a lightweight primary-key index is kept in the map, and row values are read from the underlying segment on demand.

    Choose the memory-optimized mode when the dimension table is relatively large and you want to reduce heap pressure, at the cost of slightly slower lookups.

    hashtag
    Size limits and memory considerations

    • Cluster-level maximum size: The controller configuration property controller.dimTable.maxSize sets the maximum storage quota allowed for any single dimension table. The default is 200 MB. Table creation fails if the requested quota.storage exceeds this limit.

    • Heap impact: In fast-lookup mode, the entire table is materialized in Java heap on every server. A table that is 100 MB on disk may consume significantly more memory after deserialization. Monitor server heap usage when adding or growing dimension tables.

    • Replication overhead: Because every server in the tenant holds a full copy, adding a dimension table multiplies its memory footprint by the number of servers.

    circle-exclamation

    As a guideline, keep dimension tables under a few hundred thousand rows and well under the controller.dimTable.maxSize limit. Tables that approach or exceed available heap will cause out-of-memory errors on servers.

    hashtag
    Configuration

    hashtag
    Table configuration

    Mark a table as a dimension table by setting the following properties in the table config:

    Property
    Required
    Description

    isDimTable

    Yes

    Set to true to designate the table as a dimension table.

    ingestionConfig.batchIngestionConfig.segmentIngestionType

    Yes

    Must be set to REFRESH. Dimension tables use segment replacement rather than append semantics so that the in-memory hash map is rebuilt with the latest data.

    hashtag
    Schema configuration

    Dimension table schemas use dimensionFieldSpecs instead of metricFieldSpecs. A primaryKeyColumns array is required -- it defines the key used for lookups.

    hashtag
    Example table configuration

    hashtag
    Example schema configuration

    hashtag
    Querying with the LOOKUP function

    The primary way to use a dimension table is through the LOOKUP UDF in the single-stage query engine. This function performs a primary-key lookup against the dimension table and returns a column value.

    hashtag
    Syntax

    • dimTable -- name of the dimension table (string literal).

    • dimColToLookUp -- column to retrieve from the dimension table (string literal).

    • dimJoinKey / factJoinKey -- pairs of join keys: the dimension table column name (string literal) and the corresponding fact table column expression.

    hashtag
    Single-key lookup

    hashtag
    Composite-key lookup

    When the dimension table has a composite primary key, provide multiple key pairs in the same order as primaryKeyColumns in the schema:

    hashtag
    Multi-stage engine

    In the multi-stage engine (MSE), use a standard JOIN with the lookup join strategy hint instead of the LOOKUP UDF:

    For details, see lookup join strategy.

    hashtag
    Refresh and update strategies

    Because dimension tables use segmentIngestionType: REFRESH, uploading a new segment replaces the existing segment and triggers a full reload of the in-memory hash map on every server. There is no incremental update mechanism.

    Typical refresh patterns:

    • Scheduled batch job: Run a periodic ingestion job (e.g., daily or hourly) that rebuilds the segment from the source of truth and uploads it to Pinot.

    • On-demand refresh: Trigger a segment upload through the Pinot REST API whenever the reference data changes.

    circle-info

    During a refresh, the old hash map remains active for lookups until the new one is fully loaded. There is no query downtime during a refresh, but there is a brief period where the old data is served.

    hashtag
    Handling duplicate primary keys

    When multiple segments contain the same primary key, the default behavior is last-loaded-segment-wins (segments are ordered by creation time). Set errorOnDuplicatePrimaryKey: true in dimensionTableConfig to fail fast if duplicates are detected. With REFRESH ingestion, there is typically only one segment, so duplicates across segments are uncommon.
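The default last-loaded-segment-wins behavior can be pictured as applying segments in creation-time order, with later entries overwriting earlier ones. This is an illustrative sketch only (segment names are hypothetical, not Pinot internals):

```python
# Segments ordered by creation time, oldest first (hypothetical names).
segments = [
    ("seg_creation_1", {"NYA": "Yankees (stale row)"}),
    ("seg_creation_2", {"NYA": "New York Yankees"}),
]

lookup_map = {}
for _segment_name, rows in segments:
    # Later-loaded segments overwrite duplicate primary keys.
    lookup_map.update(rows)

print(lookup_map["NYA"])  # the last-loaded segment's row wins
```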

    hashtag
    Performance best practices

    • Keep tables small. Dimension tables are loaded entirely into memory on every server. Target thousands to low hundreds of thousands of rows.

    • Use narrow schemas. Include only the columns needed for lookups to reduce memory consumption.

    • Choose the right loading mode. Use fast lookup (default) for the best query performance. Switch to memory-optimized mode (disablePreload: true) only if heap usage is a concern.

    • Set a storage quota. Always configure quota.storage to prevent accidentally uploading oversized data.

    • Minimize refresh frequency. Each refresh triggers a full reload of the hash map. Avoid refreshing more often than necessary.

    • Monitor server heap. After adding a dimension table, check server JVM heap metrics to confirm adequate headroom.

    hashtag
    Limitations

    • Offline only. Dimension tables must be offline tables. They cannot be real-time or hybrid tables.

    • Full replication. All segments are replicated to every server in the tenant, so memory usage scales with the number of servers.

    • No incremental updates. The entire segment must be replaced on each refresh; row-level updates are not supported.

    • Primary key required. The schema must define primaryKeyColumns. Lookups without a primary key are not supported.

    • Single-stage LOOKUP UDF limitations. Dimension table column references in the LOOKUP function must be string literals, not column identifiers, because they reference a table that is not part of the query's FROM clause.

    • No time-based partitioning or retention. Dimension tables do not support segment retention policies or time-based partitioning.

    The Pinot-Pulsar connector plugin is included in the official binary distribution since Pinot 0.11.0. If you are running an older version, you can download the plugin from the Apache Pinot external repository and add it to the plugins directory.

    hashtag
    Set up Pulsar table

    Here is a sample Pulsar stream config. You can use the streamConfigs section from this sample and make changes for your corresponding table.

    hashtag
    Pulsar configuration options

    You can change the following Pulsar-specific configurations for your tables.

    Property
    Description

    streamType

    This should be set to "pulsar"

    stream.pulsar.topic.name

    Your pulsar topic name

    stream.pulsar.bootstrap.servers

    Comma-separated broker list for Apache Pulsar

    stream.pulsar.metadata.populate

    Set to true to extract Pulsar message metadata into Pinot table columns (see Extract record headers as Pinot table columns below)

    hashtag
    Authentication

    The Pinot-Pulsar connector supports authentication using security tokens. To generate a token, follow the instructions in Pulsar documentationarrow-up-right. Once generated, add the following property to streamConfigs to add an authentication token for each request:

    hashtag
    OAuth2 Authentication

    The Pinot-Pulsar connector supports authentication using OAuth2, for example, if connecting to a StreamNative Pulsar cluster. For more information, see how to Configure OAuth2 authentication in Pulsar clientsarrow-up-right. Once configured, you can add the following properties to streamConfigs:

    hashtag
    TLS support

    The Pinot-Pulsar connector also supports TLS for encrypted connections. Follow the official Pulsar documentationarrow-up-right to enable TLS on your Pulsar cluster. Once done, enable TLS in the Pulsar connector by providing the location of the trust certificate file generated in the previous step.

    Also, make sure to change the broker URL from pulsar://localhost:6650 to pulsar+ssl://localhost:6650 so that secure connections are used.
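Putting both changes together, a TLS-enabled streamConfigs fragment could look like the following sketch (host names are placeholders; the certificate path is the one generated when enabling TLS on the cluster):

```json
"streamConfigs": {
  "streamType": "pulsar",
  "stream.pulsar.topic.name": "<your pulsar topic name>",
  "stream.pulsar.bootstrap.servers": "pulsar+ssl://localhost:6650,pulsar+ssl://localhost:6651",
  "stream.pulsar.tlsTrustCertsFilePath": "/path/to/ca.cert.pem"
}
```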

    For other table and stream configurations, head over to the Table configuration reference.

    hashtag
    Supported Pulsar versions

    Pinot currently relies on Pulsar client version 4.0.x. Make sure the Pulsar broker is compatible with this client version.

    hashtag
    Extract record headers as Pinot table columns

    Pinot's Pulsar connector supports automatically extracting record headers and metadata into the Pinot table columns. Pulsar supports a large amount of per-record metadata. Reference the official Pulsar documentationarrow-up-right for the meaning of the metadata fields.

    The following table shows the mapping for record header/metadata to Pinot table column names:

    Pulsar Message
    Pinot table Column
    Comments
    Available By Default

    key : String

    __key : String

    Yes

    properties : Map<String, String>

    Each header key is listed as a separate column: __header$HeaderKeyName : String

    To enable metadata extraction in a Pulsar table, set the stream config stream.pulsar.metadata.populate to true. The fields eventTime, publishTime, brokerPublishTime, and key are populated by default. To extract additional fields from the Pulsar message, set the stream.pulsar.metadata.fields config to a comma-separated list of fields to populate. The fields are referenced by their field name in the Pulsar message. For example, setting:

    will make the __metadata$messageId, __metadata$messageIdBytes, __metadata$eventTime, and __metadata$topicName fields available for mapping to columns in the Pinot schema.

    In addition, if you want to use any of these columns in your table, you must list them explicitly in your table's schema.

    For example, if you want to add only the message ID and key as dimension columns in your Pinot table, they can be listed in the schema as follows:

    Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.

    circle-info

    Remember to follow the schema evolution guidelines when updating schema of an existing table!

    Apache Pulsararrow-up-right
    the Apache Pinot external repositoryarrow-up-right
    The SchemaConformingTransformerarrow-up-right is a RecordTransformerarrow-up-right that can transform records with dynamic schemas so that they can be ingested into a table with a static schema. The transformer takes record fields that don't exist in the schema and stores them in a catchall field. It also builds a __mergedTextIndex field and uses Lucene to support text search.

    For example, consider this record:

    Let's say the table's schema contains the following fields:

    • arrayField

    • mapField

    • nestedFields

    • nestedFields.stringField

    • json_data

    • json_data_no_idx

    • __mergedTextIndex

    Without this transformer, the stringField field and the fields ending with _noIdx would be dropped, and the storage of the mapField and nestedFields fields would rely on the global setup in complexTransformers, without granular customization. With this transformer, however, the record is transformed into the following:

    Notice that there are 3 reserved (and configurable) fields: json_data, json_data_no_idx and __mergedTextIndex. The transformer does the following:

    • Flattens nested fields all the way to the leaf node and:

      • Applies special treatment where necessary, according to the config

      • If the key path matches the schema, puts the data into the dedicated field

      • Otherwise, puts it into json_data or json_data_no_idx depending on its key suffix

    • For keys in dedicated columns or json_data, puts the pairs into __mergedTextIndex in the form "Begin Anchor + value + Separator + key + End Anchor" to power text matching.

    • Provides additional functionality via configuration:

      • Drop fields (fieldPathsToDrop)

      • Preserve subtrees without flattening (fieldPathsToPreserveInput and fieldPathsToPreserveInputWithIndex)

    hashtag
    Table Configurations

    hashtag
    SchemaConformingTransformer Configuration

    To use the transformer, add the schemaConformingTransformerConfig option in the ingestionConfig section of your table configuration, as shown in the following example.

    For example:

    Available configuration options are listed in SchemaConformingTransformerConfigarrow-up-right.

    hashtag
    Configuration of reserved fields

    Index configs for the 3 reserved columns can be set as follows:

    Specifically, a customized JSON index can be configured according to the JSON index indexPaths option.

    hashtag
    Power the text search

    hashtag
    Schema Design

    With the help of the SchemaConformingTransformer, all data can be kept even without specifying dedicated columns in the table schema. However, to optimize storage and various query patterns, dedicated columns should be created based on usage:

    • Fields with frequent exact match query, e.g. region, log_level, runtime_env

    • Fields with range query, e.g. timestamp

    • High frequency fields from messages

      • Reduce json index size

      • Optimize group by queries

    hashtag
    Text Search

    After each key/value pair is put into the __mergedTextIndex field, a luceneAnalyzerClass is needed to tokenize the document and a luceneQueryParserClass to query by tokens. Some common search patterns and their queries are:

    • Exact key/value match TEXT_MATCH(__mergedTextIndex, '"value:key"')

    • Wildcard value search in a key TEXT_MATCH(__mergedTextIndex, '/.* value .*:key/')

    • Key exists check TEXT_MATCH(__mergedTextIndex, '/.*:key/')

    • Global value exact match TEXT_MATCH(__mergedTextIndex, '/"value"/')

    • Global value wildcard match TEXT_MATCH(__mergedTextIndex, '/.* value .*/')

    The luceneAnalyzerClass and luceneQueryParserClass usually need to use a similar delimiter set. They also need to account for the anchor characters shown below.

    In the given example, each key/value pair would be stored as "\u0002value\u001ekey\u0003". Prefix and suffix matches on the key or value need to be adjusted accordingly in the luceneQueryParserClass.
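The encoding of one pair can be sketched as follows, using the anchor characters above (illustrative Python, not Pinot's implementation):

```python
BEGIN_ANCHOR = "\u0002"  # marks the start of an entry
SEPARATOR = "\u001e"     # separates value from key
END_ANCHOR = "\u0003"    # marks the end of an entry

def merged_text_index_entry(key, value):
    """Encode one key/value pair as stored in __mergedTextIndex."""
    return f"{BEGIN_ANCHOR}{value}{SEPARATOR}{key}{END_ANCHOR}"

print(repr(merged_text_index_entry("key", "value")))  # '\x02value\x1ekey\x03'
```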

    Ingestion Aggregations

    Many data analytics use-cases only need aggregated data. For example, data used in charts can be aggregated down to one row per time bucket per dimension combination.

    Doing this results in much less storage and better query performance. Configuring this for a table is done via the Aggregation Config in the table config.

    circle-exclamation

    Note that ingestion aggregation only works with real-time Pinot tables, and aggregation is performed at segment level. Cross-segment aggregation still requires query-time processing.

    hashtag
    Aggregation Config

    The aggregation config controls the aggregations that happen during real-time data ingestion. Offline aggregations must be handled separately.

    Below is a description of the config, which is defined in the ingestion config of the table config.

    hashtag
    Requirements

    The following are required for ingestion aggregation to work:

    • Ingestion aggregation config is effective only for real-time tables. (There is no ingestion-time aggregation support for offline tables; aggregations for the offline data flow must be pre-processed using batch processing engines like Spark/MapReduce.)

    • type must be lowLevel.

    • All metrics must have aggregation configs.

    hashtag
    Example Scenario

    Here is an example of sales data, where only the daily sales aggregates per product are needed.

    You can also find this example when running the RealtimeQuickStart, which includes a table called dailySales.

    hashtag
    Example Input Data

    hashtag
    Schema

    Note that the schema only reflects the final table structure.

    hashtag
    Table Config

    From the aggregation config example below, note that price exists in the input data while total_sales exists in the Pinot schema.
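Conceptually, the rollup that happens while the consuming segment ingests rows can be sketched like this (the field names follow the example on this page; the events and the code are illustrative, not Pinot internals):

```python
from collections import defaultdict

# Raw input events: (product_name, price, daysSinceEpoch)
events = [
    ("car", 1500.0, 18000),
    ("car", 2200.0, 18000),
    ("truck", 300.0, 18000),
]

# One aggregated row per (product_name, daysSinceEpoch) group:
#   sales_count = COUNT(*), total_sales = SUM(price)
groups = defaultdict(lambda: {"sales_count": 0, "total_sales": 0.0})
for product, price, day in events:
    group = groups[(product, day)]
    group["sales_count"] += 1
    group["total_sales"] += price

print(groups[("car", 18000)])  # {'sales_count': 2, 'total_sales': 3700.0}
```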

    hashtag
    Example Final Table


    product_name
    sales_count
    total_sales
    daysSinceEpoch

    hashtag
    Allowed Aggregation Functions

    function name
    notes

    hashtag
    Frequently Asked Questions

    hashtag
    Why not use a Startree?

    Startrees can only be added to real-time segments after the segment has sealed, and creating startrees is CPU-intensive. Ingestion aggregation works for consuming segments and uses no additional CPU.

    Startrees take additional memory to store, while ingestion aggregation stores less data than the original dataset.

    hashtag
    When to not use ingestion aggregation?

    If the original rows in non-aggregated form are needed, then ingestion-aggregation cannot be used.

    hashtag
    What if I already use the aggregateMetrics setting?

    The aggregateMetrics setting works the same as ingestion aggregation, but only allows the SUM function.

    The current changes are backward compatible, so there is no need to change your table config unless you need a different aggregation function.

    hashtag
    Does this config work for offline data?

    Ingestion Aggregation only works for real-time ingestion. For offline data, the offline process needs to generate the aggregates separately.

    hashtag
    Why do all metrics need to be aggregated?

    If a metric isn't aggregated then it will result in more than one row per unique set of dimensions.

    hashtag
    Why does no data show up when AggregationConfigs is enabled?

    1. Check whether ingestion works normally without AggregationConfigs, to isolate the problem.

    2. Check the Pinot server log for any warning or error messages, especially those related to the class MutableSegmentImpl and the method aggregateMetrics.

    3. For JSON data, make sure you don't double-quote numbers: quoted numbers are parsed as strings internally, so value-based aggregation (e.g., sum) cannot be applied. Using the above example, data ingestion would not work with row:
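A quick way to see this failure mode: a quoted number deserializes as a string, so SUM-style value aggregation cannot be applied to it (illustrative sketch):

```python
import json

good = json.loads('{"product_name": "car", "price": 1500.0}')
bad = json.loads('{"product_name": "car", "price": "1500.0"}')

print(type(good["price"]))  # <class 'float'> -> can be summed
print(type(bad["price"]))   # <class 'str'>  -> value-based aggregation (e.g. sum) fails
```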

    Grouping Algorithm

    In this guide we will learn about the heuristics used for trimming results in Pinot's grouping algorithm (used when processing GROUP BY queries) to make sure that the server doesn't run out of memory.

    hashtag
    SSE (Single-Stage Engine)

    ![](../../.gitbook/assets/Screenshot 2025-07-22 at 17.39.21.png)

    Group by results approximation at various stages of SSE query execution

    hashtag
    Within segment

    When grouping rows within a segment, Pinot keeps a maximum of numGroupsLimit groups per segment. This value is set to 100,000 by default and can be configured by the pinot.server.query.executor.num.groups.limit property.

    If the number of groups of a segment reaches this value, the extra groups will be ignored and the results returned may not be completely accurate. The numGroupsLimitReached property will be set to true in the query response if the value is reached.

    hashtag
    Trimming tail groups

    After the inner segment groups have been computed, the Pinot query engine optionally trims tail groups. Tail groups are ones that have a lower rank based on the ORDER BY clause used in the query.

    When segment group trim is enabled, the query engine trims the tail groups and keeps only max(minSegmentGroupTrimSize, 5 * LIMIT) groups, where LIMIT is the maximum number of records returned by the query (usually set via the LIMIT clause). Pinot keeps at least 5 * LIMIT groups when trimming tail groups to ensure the accuracy of results. Trimming is performed only when both an ordering and a limit are specified.

    This value can be overridden on a query by query basis by passing the following option:

    hashtag
    Cross segments

    Once grouping has been done within each segment, Pinot merges the segment results, trims the tail groups, and keeps max(minServerGroupTrimSize, 5 * LIMIT) groups if it gets more groups than that.

    minServerGroupTrimSize is set to 5,000 by default and can be adjusted by configuring the pinot.server.query.executor.min.server.group.trim.size property. Cross segments trim can be disabled by setting the property to -1.

    When cross segments trim is enabled, the server will trim the tail groups before sending the results back to the broker. To reduce memory usage while merging per-segment results, it will also trim the tail groups when the number of groups reaches the trimThreshold.

    trimThreshold is the upper bound of groups allowed in a server for each query to protect servers from running out of memory. To avoid too frequent trimming, the actual trim size is bounded to trimThreshold / 2. Combining this with the above equation, the actual trim size for a query is calculated as min(max(minServerGroupTrimSize, 5 * LIMIT), trimThreshold / 2).

    This configuration is set to 1,000,000 by default and can be adjusted by configuring the pinot.server.query.executor.groupby.trim.threshold property.

    A higher threshold reduces the amount of trimming done, but consumes more heap memory. If the threshold is set to more than 1,000,000,000, the server will only trim the groups once before returning the results to the broker.

    This value can be overridden on a query by query basis by passing the following option:
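The formulas above can be combined into a small arithmetic sketch, using the defaults stated on this page:

```python
def server_trim_size(limit, min_server_group_trim_size=5_000, trim_threshold=1_000_000):
    """Actual cross-segment trim size:
    min(max(minServerGroupTrimSize, 5 * LIMIT), trimThreshold / 2)."""
    return min(max(min_server_group_trim_size, 5 * limit), trim_threshold // 2)

print(server_trim_size(10))       # default LIMIT of 10 -> 5000 groups kept
print(server_trim_size(100_000))  # a huge LIMIT is capped at trimThreshold / 2
```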

    hashtag
    At Broker

    When the broker performs the final merge of the groups returned by the various servers, another level of trimming takes place: the tail groups are trimmed and max(minBrokerGroupTrimSize, 5 * LIMIT) groups are retained.

    The default value of minBrokerGroupTrimSize is 5,000. This can be adjusted by configuring the pinot.broker.min.group.trim.size property.

    hashtag
    GROUP BY behavior

    Pinot sets a default LIMIT of 10 if one isn't defined and this applies to GROUP BY queries as well. Therefore, if no limit is specified, Pinot will return 10 groups.

    Pinot will trim tail groups based on the ORDER BY clause to reduce the memory footprint and improve the query performance. It keeps at least 5 * LIMIT groups so that the results give good enough approximation in most cases. The configurable min trim size can be used to increase the groups kept to improve the accuracy but has a larger extra memory footprint.

    hashtag
    HAVING behavior

    If the query has a HAVING clause, it is applied on the merged GROUP BY results that already have the tail groups trimmed. If the HAVING clause is the opposite of the ORDER BY order, groups matching the condition might already be trimmed and not returned. e.g.

    Increase min trim size to keep more groups in these cases.

    hashtag
    Examples

    For a simple keyed aggregation query such as:

    a simplified execution plan, showing where trimming happens, looks like:

    For the sake of brevity, the plan above doesn't mention that the actual number of groups left is min(trim_value, 5 * LIMIT).

    hashtag
    MSE (Multi-Stage Engine)

    Compared to the SSE, the MSE uses a similar algorithm, but there are notable differences:

    • MSE doesn't implicitly limit number of query results (to 10)

    • MSE doesn't limit number of groups when aggregating cross-segment data

    • MSE doesn't trim results by default in any stage

    The default MSE algorithm is shown on the following diagram:

    ![](../../.gitbook/assets/Screenshot 2025-07-22 at 17.43.44.png)

    Default MSE group by results approximation

    Apart from limiting the number of groups at segment level, a similar limit is applied at the intermediate stage. Since the multi-stage engine (MSE) allows subqueries, an execution plan can contain an arbitrary number of stages doing intermediate aggregation between the leaf (bottom-most) and top-most stages, and each stage can be implemented with many instances of AggregateOperator (shown as PinotLogicalAggregate in output). The operator limits the number of distinct groups to 100,000 by default, which can be overridden with the numGroupsLimit option or the num_groups_limit aggregate hint. The limit applies to a single operator instance, meaning the next stage could receive a total of num_instances * num_groups_limit groups.
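As a quick arithmetic check of the worst case described above (the instance count here is hypothetical):

```python
num_groups_limit = 100_000  # default limit per AggregateOperator instance
num_instances = 4           # hypothetical number of operator instances in a stage

# Upper bound on the distinct groups the next stage can receive from this stage:
max_groups_received = num_instances * num_groups_limit
print(max_groups_received)  # 400000
```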

    It is possible to enable group limiting and trimming at other stages with:

    • is_enable_group_trim hint - it enables trimming at all SSE/MSE levels and group limiting at cross-segment level. minSegmentGroupTrimSize value needs to be set separately. Default value: false

    • mse_min_group_trim_size hint - triggers sorting and trimming of group by results at intermediate stage. Requires is_enable_group_trim hint. Default value: 5000

    When the above hints are used, query processing looks as follows:

    ![](../../.gitbook/assets/Screenshot 2025-07-22 at 17.39.42.png)

    Group by results trimming at various stages of MSE query execution utilizing SSE in leaf stage

    The actual processing depends on the query, which may not contain an SSE leaf-stage aggregate component and may instead rely on AggregateOperator at all levels. Moreover, since trimming relies on order and limit propagation, it may not happen in a subquery if the ORDER BY column(s) are not available.

    hashtag
    Examples

    • If the hints are applied to the query mentioned in the SSE examples above, that is:

      then the execution plan should be as follows:

      In the plan above, trimming happens in three operators: GroupBy, CombineGroupBy and AggregateOperator (which is the physical implementation of PinotLogicalAggregate).

    hashtag
    Configuration Parameters

    Parameter
    Default
    Query Override
    Description

    (*) SSQ - Single-Stage Query

    (**) MSQ - Multi-Stage Query

    Upload Pinot Segment Using CLI

    Upload existing Pinot segments to a controller.

    This guide explains how to upload already-built Pinot segments to a Pinot controller, which REST endpoint to call, and when to use tar push, URI push, or metadata push.

    Use this flow when your segment .tar.gz files already exist outside Pinot, for example when migrating from an old cluster, backfilling from another system, or re-registering segments that already live in deep storage.

    Before you upload, do the following:

    1. Create the target table, or confirm one exists that matches the segment you want to upload.

    Lookup UDF Join


    circle-info

    Lookup UDF Join is only supported with the single-stage query engine (v1). Lookup joins can be executed using the lookup join strategy in the multi-stage query engine. For more information about using JOINs with the multi-stage query engine, see JOINs.

    The lookup UDF is used to get dimension data via primary key from a dimension table, enabling decoration-join functionality. The lookup UDF can only be used with dimension tables in Pinot.

    json_meetupRsvp_realtime_table_config.json
    {
        "ingestionConfig":{
          "transformConfigs": [
            {
              "columnName": "group_json",
              "transformFunction": "jsonFormat(\"group\")"
            }
          ],
        },
        ...
        "tableIndexConfig": {
        "loadMode": "MMAP",
        "noDictionaryColumns": [
          "group_json"
        ],
        "jsonIndexColumns": [
          "group_json"
        ]
      },
    
    }
    json_meetupRsvp_realtime_table_schema.json
    {
      {
          "name": "group_json",
          "dataType": "JSON",
        }
        ...
    }
    complexTypeHandling_meetupRsvp_realtime_table_config.json
    {
      "ingestionConfig": {    
        "complexTypeConfig": {
          "delimiter": ".",
          "fieldsToUnnest": ["group.group_topics"],
          "collectionNotUnnestedToJson": "NON_PRIMITIVE"
        }
      }
    }
    SELECT "group.group_topics.urlkey", 
           "group.group_topics.topic_name", 
           "group.group_id" 
    FROM meetupRsvp
    LIMIT 10
    bin/pinot-admin.sh AvroSchemaToPinotSchema \
      -timeColumnName fields.hoursSinceEpoch \
      -avroSchemaFile /tmp/test.avsc \
      -pinotSchemaName myTable \
      -outputDir /tmp/test \
      -fieldsToUnnest entries
    bin/pinot-admin.sh JsonToPinotSchema \
      -timeColumnName hoursSinceEpoch \
      -jsonFile /tmp/test.json \
      -pinotSchemaName myTable \
      -outputDir /tmp/test \
      -fieldsToUnnest payload.commits
    {
      "OFFLINE": {
        "tableName": "dimBaseballTeams_OFFLINE",
        "tableType": "OFFLINE",
        "segmentsConfig": {
        },
        "ingestionConfig": {
          "batchIngestionConfig": {
            "segmentIngestionType": "REFRESH"
          }
        },
        "quota": {
          "storage": "200M"
        },
        "isDimTable": true,
        "dimensionTableConfig": {
          "disablePreload": false,
          "errorOnDuplicatePrimaryKey": false
        }
      }
    }
    {
      "schemaName": "dimBaseballTeams",
      "primaryKeyColumns": ["teamID"],
      "dimensionFieldSpecs": [
        {
          "dataType": "STRING",
          "name": "teamID"
        },
        {
          "dataType": "STRING",
          "name": "teamName"
        },
        {
          "dataType": "STRING",
          "name": "teamAddress"
        }
      ]
    }
    LOOKUP('dimTable', 'dimColToLookUp', 'dimJoinKey1', factJoinKey1 [, 'dimJoinKey2', factJoinKey2 ]*)
    SELECT
      playerName,
      teamID,
      LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName,
      LOOKUP('dimBaseballTeams', 'teamAddress', 'teamID', teamID) AS teamAddress
    FROM baseballStats
    LIMIT 10
    SELECT
      customerId,
      LOOKUP('billing', 'city', 'customerId', customerId, 'creditHistory', creditHistory) AS city
    FROM transactions
    LIMIT 10
    SELECT /*+ lookupJoinStrategy(dim_billing) */
      t.customerId,
      b.city
    FROM transactions t
    JOIN billing b
      ON t.customerId = b.customerId
    LIMIT 10
    {
      "tableName": "pulsarTable",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "pulsar",
          "stream.pulsar.topic.name": "<your pulsar topic name>",
          "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650,pulsar://localhost:6651",
          "stream.pulsar.consumer.prop.auto.offset.reset" : "smallest",
          "stream.pulsar.fetch.timeout.millis": "30000",
          "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
          "realtime.segment.flush.threshold.rows": "1000000",
          "realtime.segment.flush.threshold.time": "6h"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    "stream.pulsar.authenticationToken":"your-auth-token"
    "stream.pulsar.issuerUrl": "https://auth.streamnative.cloud"
    "stream.pulsar.credsFilePath": "file:///path/to/private_creds_file
    "stream.pulsar.audience": "urn:sn:pulsar:test:test-cluster"
    "stream.pulsar.tlsTrustCertsFilePath": "/path/to/ca.cert.pem"
    
    "streamConfigs": {
      ...
            "stream.pulsar.metadata.populate": "true",
            "stream.pulsar.metadata.fields": "messageId,messageIdBytes,eventTime,topicName",
      ...
    }
      "dimensionFieldSpecs": [
        {
          "name": "__key",
          "dataType": "STRING"
        },
        {
          "name": "__metadata$messageId",
          "dataType": "STRING"
        },
        ...
      ],
    {
      "arrayField":[0, 1, 2, 3],
      "stringField":"a",
      "intField_noIndex":9,
      "string_noIndex":"z",
      "message": "a",
      "mapField":{
        "arrayField":[0, 1, 2, 3],
        "stringField":"a",
        "intField_noIndex":9,
        "string_noIndex":"z"
      },
      "mapField_noIndex":{
        "arrayField":[0, 1, 2, 3],
        "stringField":"a",
      },
      "nestedFields":{
        "arrayField":[0, 1, 2, 3],
        "stringField":"a",
        "intField_noIndex":9,
        "string_noIndex":"z",
        "mapField":{
          "arrayField":[0, 1, 2, 3],
          "stringField":"a",
          "intField_noIndex":9,
          "string_noIndex":"z"
        }
      }
    }
    {
      "arrayField":[0, 1, 2, 3],
      "nestedFields.stringField":"a",
      "json_data":{
        "stringField":"a",
        "mapField":{
          "arrayField":[0, 1, 2, 3],
          "stringField":"a",
          "stringField":"aA_123"
        },
        "nestedFields":{
          "arrayField":[0, 1, 2, 3],
          "mapField":{
            "arrayField":[0, 1, 2, 3],
            "stringField":"a"
          }
        }
      },
      "json_data_no_idx":{
        "intField_noIndex":9,
        "string_noIndex":"z",
        "mapField":{
          "intField_noIndex":9,
          "string_noIndex":"z"
        },
        "mapField_noIndex":{
          "arrayField":[0, 1, 2, 3],
          "stringField":"a",
        },
        "nestedFields":{
          "intField_noIndex":9,
          "string_noIndex":"z",
          "mapField":{
            "intField_noIndex":9,
            "string_noIndex":"z"
          }
        }
      },
      "__mergedTextIndex": [
        // To be explained in following sections
      ]
    }
    "schemaConformingTransformerConfig": {
      "enableIndexableExtras": true,
      "indexableExtrasField": "json_data",
      "enableUnindexableExtras": true,
      "unindexableExtrasField": "json_data_no_idx",
      "unindexableFieldSuffix": "_noindex",
      "fieldPathsToDrop": [],
      "fieldPathsToSkipStorage": [
        "message"
      ],
      "columnNameToJsonKeyPathMap": {},
      "mergedTextIndexField": "__mergedTextIndex",
      "useAnonymousDotInFieldNames": true,
      "optimizeCaseInsensitiveSearch": false,
      "reverseTextIndexKeyValueOrder": true,
      "mergedTextIndexDocumentMaxLength": 32766,
      "mergedTextIndexBinaryDocumentDetectionMinLength": 512,
      "mergedTextIndexPathToExclude": [
        "_timestampMillisNegative",
        "__mergedTextIndex",
        "_timestampMillis"
      ],
      "fieldsToDoubleIngest": [],
      "jsonKeyValueSeparator": "\u001e",
      "mergedTextIndexBeginOfDocAnchor": "\u0002",
      "mergedTextIndexEndOfDocAnchor": "\u0003",
      "fieldPathsToPreserveInput": [],
      "fieldPathsToPreserveInputWithIndex": []
    }
    "fieldConfigList": [
      {
        "name": "json_data",
        "encodingType": "RAW",
        "indexTypes": [],
        "compressionCodec": "LZ4",
        "indexes": null,
        "properties": {
          "rawIndexWriterVersion": "4"
        },
        "tierOverwrites": null
      },
      {
        "name": "json_data_no_idx",
        "encodingType": "RAW",
        "indexTypes": [],
        "compressionCodec": "ZSTANDARD",
        "indexes": null,
        "properties": {
          "rawIndexWriterVersion": "4"
        },
        "tierOverwrites": null
      },
      {
        "name": "__mergedTextIndex",
        "encodingType": "RAW",
        "indexType": "TEXT",
        "indexTypes": [
          "TEXT"
        ],
        "compressionCodec": "LZ4",
        "indexes": null,
        "properties": {
          "enableQueryCacheForTextIndex": "false",
          "luceneAnalyzerClass": <analyzerClass>,
          "luceneAnalyzerClassArgTypes": <>,
          "luceneAnalyzerClassArgs": <>,
          "luceneMaxBufferSizeMB": "50",
          "luceneQueryParserClass": <parserClass>,
          "luceneUseCompoundFile": "true",
          "noRawDataForTextIndex": "true",
          "rawIndexWriterVersion": "4"
        },
        "tierOverwrites": null
      }
    ]
    
    "jsonIndexConfigs": {
      "json_data": {
        "disabled": false,
        "maxLevels": 3,
        "excludeArray": true,
        "disableCrossArrayUnnest": true,
        "maxValueLength": 1000,
        "skipInvalidJson": true
      }
    }
    "jsonKeyValueSeparator": "\u001e",
    "mergedTextIndexBeginOfDocAnchor": "\u0002",
    "mergedTextIndexEndOfDocAnchor": "\u0003",
  • Skip storing fields while still indexing them (message in the example): fieldPathsToSkipStorage

  • Skip indexing fields: unindexableFieldSuffix

  • Optimize case-insensitive search: optimizeCaseInsensitiveSearch

  • Map an input key path to a schema name with customizations: columnNameToJsonKeyPathMap

  • Support anonymous dots in field names, {'a.b': 'c'} vs {'a': {'b': 'c'}}: useAnonymousDotInFieldNames

  • Truncate values by length: mergedTextIndexDocumentMaxLength

  • Double ingestion to support schema evolution: fieldsToDoubleIngest

  • Memory-optimized

    true

    Lower

    Slightly slower

    Only the primary key and a segment/docId reference are stored in the hash map. Column values are read from the segment on each lookup. This trades lookup speed for lower heap usage.

    quota.storage

    Recommended

    Storage quota for the table. Must not exceed the cluster-level controller.dimTable.maxSize (default 200 MB).

    dimensionTableConfig.disablePreload

    No

    Set to true to use memory-optimized mode (store only primary key and segment reference instead of full rows). Defaults to false (fast lookup).

    dimensionTableConfig.errorOnDuplicatePrimaryKey

    No

    Set to true to fail segment loading if duplicate primary keys are detected across segments. Defaults to false (last-loaded segment wins).

    set to true to populate metadata

    stream.pulsar.metadata.fields

    set to comma separated list of metadata fields

    Yes

    publishTime : Long

    __metadata$publishTime : String

    publish time as determined by the producer

    Yes

    brokerPublishTime: Optional

    __metadata$brokerPublishTime : String

    publish time as determined by the broker

    Yes

    eventTime : Long

    __metadata$eventTime : String

    Yes

    messageId : MessageId -> String

    __metadata$messageId : String

    String representation of the MessageId field. The format is ledgerId:entryId:partitionIndex

    messageId : MessageId -> bytes

    __metadata$messageBytes : String

    Base64 encoded version of the bytes returned from calling MessageId.toByteArray()

    producerName : String

    __metadata$producerName : String

    schemaVersion : byte[]

    __metadata$schemaVersion : String

    Base64 encoded value

    sequenceId : Long

    __metadata$sequenceId : String

    orderingKey : byte[]

    __metadata$orderingKey : String

    Base64 encoded value

    size : Integer

    __metadata$size : String

    topicName : String

    __metadata$topicName : String

    index : String

    __metadata$index : String

    redeliveryCount : Integer

    __metadata$redeliveryCount : String

    MSE doesn't aggregate results in the broker, pushing final aggregation processing to the server(s).
    Aggregating over the result of a join, e.g.

    should produce the following execution plan:

    in which there is no leaf-stage SSE operator and all aggregation stages are implemented with the MSE operator PinotLogicalAggregate.

  • pinot.server.query.executor.num.groups.limit (default: 100,000; query override: SET numGroupsLimit = value;): The maximum number of groups allowed per segment.

  • pinot.server.query.executor.min.segment.group.trim.size (default: -1, disabled; query override: SET minSegmentGroupTrimSize = value;): The minimum number of groups to keep when trimming groups at the segment level.

  • pinot.server.query.executor.min.server.group.trim.size (default: 5,000; query override: SET minServerGroupTrimSize = value;): The minimum number of groups to keep when trimming groups at the server level.

  • pinot.server.query.executor.groupby.trim.threshold (default: 1,000,000; query override: SET groupTrimThreshold = value;): The number of groups that triggers the server-level trim.

  • pinot.broker.min.group.trim.size (default: 5,000; query override: SET minBrokerGroupTrimSize = value;): The minimum number of groups to keep when trimming groups at the broker. Applies only to SSQ(*).

  • pinot.broker.mse.enable.group.trim (default: false, disabled; query hint: /*+ aggOptions(is_enable_group_trim='value') */): Enable group trim for the query (if possible). Applies only to MSQ(**).

  • pinot.server.query.executor.mse.min.group.trim.size (default: 5,000; query hint: /*+ aggOptions(mse_min_group_trim_size='value') */ or SET mseMinGroupTrimSize = value;): The number of groups to keep when trimming groups at the intermediate stage. Applies only to MSQ(**).

  • pinot.server.query.executor.max.execution.threads (default: -1, use all execution threads; query override: SET maxExecutionThreads = value;): The maximum number of execution threads (parallelism of segment processing) used per query.

    SELECT * 
    FROM ...
    OPTION(minSegmentGroupTrimSize=value)
    SELECT * 
    FROM ...
    OPTION(groupTrimThreshold=value)
    SELECT SUM(colA) 
    FROM myTable 
    GROUP BY colB 
    HAVING SUM(colA) < 100 
    ORDER BY SUM(colA) DESC 
    LIMIT 10
    SELECT i, j, count(*) AS cnt
    FROM tab
    GROUP BY i, j
    ORDER BY i ASC, j ASC
    LIMIT 3;
    BROKER_REDUCE(sort:[i, j],limit:10) <- sort and trim groups to minBrokerGroupTrimSize
      COMBINE_GROUP_BY <- sort and trim groups to minServerGroupTrimSize
        PLAN_START
          GROUP_BY <- limit to numGroupsLimit, then sort and trim to minSegmentGroupTrimSize
            PROJECT(i, j)
              DOC_ID_SET
                FILTER_MATCH_ENTIRE_SEGMENT
    SELECT /*+ aggOptions(is_enable_group_trim='true', mse_min_group_trim_size='10') */        
    i, j, count(*) as cnt
     FROM myTable
     GROUP BY i, j
     ORDER BY i ASC, j ASC
     LIMIT 3
    LogicalSort
      PinotLogicalSortExchange(distribution=[hash])
        LogicalSort
          PinotLogicalAggregate <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
            PinotLogicalExchange(distribution=[hash[0, 1]])
              LeafStageCombineOperator(table=[mytable])
                StreamingInstanceResponse
                  CombineGroupBy <- aggregate up to minSegmentGroupTrimSize groups
                    GroupBy <- aggregate up to numGroupsLimit groups, optionally sort and trim to minSegmentGroupTrimSize
                      Project
                        DocIdSet
                          FilterMatchEntireSegment
    select /*+  aggOptions(is_enable_group_trim='true', mse_min_group_trim_size='3') */ 
           t1.i, t1.j, count(*) as cnt
    from tab t1
    join tab t2 on 1=1
    group by t1.i, t1.j
    order by t1.i asc, t1.j asc
    limit 5
    LogicalSort
      PinotLogicalSortExchange(distribution=[hash])
        LogicalSort
          PinotLogicalAggregate(aggType=[FINAL]) <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
            PinotLogicalExchange(distribution=[hash[0, 1]])
              PinotLogicalAggregate(aggType=[LEAF]) <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
                LogicalJoin(condition=[true])
                  PinotLogicalExchange(distribution=[random])
                    LeafStageCombineOperator(table=[mytable])
                      ...
                        FilterMatchEntireSegment
                  PinotLogicalExchange(distribution=[broadcast])
                    LeafStageCombineOperator(table=[mytable])
                      ...
                        FilterMatchEntireSegment

    All metrics must be noDictionaryColumns.

  • aggregatedFieldName must be in the Pinot schema and originalFieldName must not exist in Pinot schema
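An ingestion aggregation config following these rules might look like the sketch below (column names are hypothetical; the destination columns must exist in the schema as noDictionaryColumns metrics, and the source columns must not):

```json
"ingestionConfig": {
  "aggregationConfigs": [
    {
      "columnName": "total_price",
      "aggregationFunction": "SUM(price)"
    },
    {
      "columnName": "trip_count",
      "aggregationFunction": "COUNT(*)"
    }
  ]
}
```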

  • 18193

    truck

    1

    700.00

    18199

    car

    2

    3200.00

    18200

    truck

    1

    800.00

    18202

    car

    3

    3700.00

    18202

    DISTINCTCOUNTHLL

    Specify as DISTINCTCOUNTHLL(field, log2m), default is 12. See for how to define log2m. Cannot be changed later, a new field must be used. The schema for the output field should be BYTES type.

    DISTINCTCOUNTHLLPLUS

    Specify as DISTINCTCOUNTHLLPLUS(field, s, p). See for how to define s and p, they cannot be changed later. The schema for the output field should be BYTES type.

    SUMPRECISION

    Specify as SUMPRECISION(field, precision), precision must be defined. Used to compute the maximum possible size of the field. Cannot be changed later, a new field must be used. The schema for the output field should be BIG_DECIMAL type.

    {"customerID":205,"product_name": "car","price":"1500.00","timestamp":1571900400000}
    The major issue with this record is that the price value is double-quoted, so it is a string rather than a number and won't show up. Below is a sample stacktrace:

    car

    2

    2800.00

    18193

    truck

    1

    MAX

    MIN

    SUM

    COUNT

    Merge/Rollup Task
    Stream ingestion

    2200.00

    Specify as COUNT(*)

  • Create a table configuration or confirm one exists that matches the segment you want to upload.

  • If needed, upload the schema and table configs.

  • Make sure the controller can read the segment source:

      • For tar push, the client must be able to stream the segment tar file to the controller.

      • For URI push and metadata push, the controller must be able to access the URI scheme you use. For PinotFS-backed schemes such as HDFS, S3, GCS, and ADLS, configure the matching Pinot file system. For custom schemes, implement a segment fetcher.

    hashtag
    Controller upload endpoints

    The controller exposes three upload endpoints:

    Endpoint
    Use case
    Content type
    Notes

    POST /v2/segments

    Preferred single-segment upload endpoint

    multipart/form-data or application/json

    Recommended for tar push, URI push, and metadata push

    POST /segments

    Legacy single-segment upload endpoint

    POST /segments/batchUpload

    Batch metadata push of multiple segments

    multipart/form-data

    Metadata push only

    /v2/segments is the endpoint to use by default. The legacy /segments endpoint is still present for backward compatibility, but its JSON-based URI push path keeps the original DOWNLOAD_URI instead of moving the segment into a Pinot-chosen final location, so new integrations should use /v2/segments.

    hashtag
    Common request options

    hashtag
    Query parameters

    All three upload modes use the same query parameters:

    Query parameter
    Required
    Default
    Description

    tableName

    Recommended for single upload, required for batch upload

    None

    Table name to upload into. Pinot can sometimes derive it from the segment metadata, but you should pass it explicitly.

    tableType

    No

    Example:
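A sketch of the query parameters on an upload call (host, port, table, and file names are placeholders):

```bash
curl -X POST "http://localhost:9000/v2/segments?tableName=myTable&tableType=OFFLINE" \
  -F "segment=@myTable_0.tar.gz"
```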

    hashtag
    Headers

    Header
    Required
    Applies to
    Description

    UPLOAD_TYPE

    No for tar push, yes for URI and metadata push

    All uploads

    SEGMENT (default), URI, or METADATA

    DOWNLOAD_URI

    Yes for URI push and metadata push

    hashtag
    Push modes

    hashtag
    Tar push

    Tar push is the original and default upload mode. Use it when the client can stream the full segment tar file to the controller.

    Request shape

    • Endpoint: POST /v2/segments

    • Content type: multipart/form-data

    • Headers: UPLOAD_TYPE omitted or set to SEGMENT

    • Body: one multipart file part containing the segment .tar.gz

    What the controller does

    1. Stores the uploaded segment in the controller's segment directory or deep store.

    2. Extracts segment metadata.

    3. Adds or refreshes the segment in the target table.

    Example:
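A minimal tar push sketch, assuming a controller at localhost:9000 and a locally built segment tarball (all names are placeholders):

```bash
# Tar push: stream the local segment tarball as a multipart file part
curl -X POST \
  -H "UPLOAD_TYPE: SEGMENT" \
  -F "segment=@/path/to/myTable_2024-01-01_2024-01-01_0.tar.gz" \
  "http://localhost:9000/v2/segments?tableName=myTable&tableType=OFFLINE"
```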

    If you prefer the Pinot CLI, pinot-admin.sh UploadSegment uses tar push for local segment directories:
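For example (controller address and paths are placeholders):

```bash
bin/pinot-admin.sh UploadSegment \
  -controllerHost localhost \
  -controllerPort 9000 \
  -segmentDir /path/to/segment/output \
  -tableName myTable
```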

    hashtag
    URI push

    URI push is best when the segment tar file already exists in deep storage or another controller-readable remote system.

    Request shape

    • Endpoint: POST /v2/segments

    • Content type: application/json

    • Headers:

      • UPLOAD_TYPE: URI

      • DOWNLOAD_URI: <segment-tar-uri>

    • Body: empty JSON payload is fine; the controller uses the headers

    What the controller does

    1. Downloads the segment tar from DOWNLOAD_URI.

    2. Stores it in the controller's segment directory or deep store.

    3. Extracts metadata.

    4. Adds or refreshes the segment in the table.

    Example:
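A URI push sketch, assuming the segment tarball already sits at a controller-readable S3 URI (host and URIs are placeholders):

```bash
# URI push: the controller downloads the tarball itself from DOWNLOAD_URI
curl -X POST \
  -H "Content-Type: application/json" \
  -H "UPLOAD_TYPE: URI" \
  -H "DOWNLOAD_URI: s3://my-bucket/segments/myTable_0.tar.gz" \
  "http://localhost:9000/v2/segments?tableName=myTable&tableType=OFFLINE"
```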

    Use URI push only when the controller can resolve the URI scheme. If the source is on HDFS, S3, GCS, ADLS, or a custom system, configure Pinot with the appropriate Pinot file system or segment fetcher.

    hashtag
    Metadata push

    Metadata push is the most controller-efficient option when the segment tar already exists in a reachable storage system.

    Instead of uploading the full segment tar, the client uploads segment metadata and tells the controller where the tar already lives.

    Request shape

    • Endpoint: POST /v2/segments

    • Content type: multipart/form-data

    • Headers:

      • UPLOAD_TYPE: METADATA

      • DOWNLOAD_URI: <segment-tar-uri>

      • Optional: COPY_SEGMENT_TO_DEEP_STORE: true

    • Body: one multipart file part containing the metadata tarball for the segment

    The metadata tarball contains the segment metadata files, typically creation.meta and metadata.properties.

    What the controller does

    1. Reads the uploaded metadata bundle.

    2. Uses DOWNLOAD_URI as the segment download location.

    3. Adds or refreshes the segment in the table without downloading the full tar just to inspect metadata.

    If you set COPY_SEGMENT_TO_DEEP_STORE: true, the controller copies the segment from DOWNLOAD_URI into Pinot deep store and stores the final deep-store URI in segment metadata. This is useful when the ingestion job writes to a staging location instead of the final deep-store path.

    Example:
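A sketch of the two steps: building the metadata-only tarball, then pushing it. The paths, file contents, and URIs are placeholders; real creation.meta and metadata.properties files come out of segment generation.

```shell
# Build the metadata-only tarball the controller expects for metadata push
mkdir -p /tmp/myTable_0_meta
cd /tmp/myTable_0_meta
printf 'placeholder' > creation.meta
printf 'segment.name=myTable_0\n' > metadata.properties
tar -czf myTable_0.metadata.tar.gz creation.meta metadata.properties
ls myTable_0.metadata.tar.gz

# Then push the metadata bundle, pointing DOWNLOAD_URI at the full segment
# tarball that already lives in deep storage:
# curl -X POST \
#   -H "UPLOAD_TYPE: METADATA" \
#   -H "DOWNLOAD_URI: s3://my-bucket/segments/myTable_0.tar.gz" \
#   -H "COPY_SEGMENT_TO_DEEP_STORE: true" \
#   -F "segmentMetadata=@myTable_0.metadata.tar.gz" \
#   "http://localhost:9000/v2/segments?tableName=myTable"
```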

    COPY_SEGMENT_TO_DEEP_STORE is only useful for metadata push. The staging URI and Pinot deep store should use the same storage scheme because the copy happens through PinotFS.

    hashtag
    Batch metadata push

    If you need to metadata-push many segments in one call, use POST /segments/batchUpload.

    Request shape

    • Endpoint: POST /segments/batchUpload

    • Content type: multipart/form-data

    • Query parameters: tableName and tableType are required

    • Header: UPLOAD_TYPE: METADATA

    • Body: one multipart part containing an uber tarball with:

      • each segment's creation.meta

      • each segment's metadata.properties

    This endpoint is only for metadata push.

    hashtag
    Job types and Pinot Admin mapping

    If you are pushing from a batch ingestion job, the jobType maps to controller upload mode like this:

    Job type
    Push mode
    Controller endpoint

    SegmentTarPush or SegmentCreationAndTarPush

    Tar push

    POST /v2/segments

    SegmentUriPush or SegmentCreationAndUriPush

    URI push

    POST /v2/segments

    SegmentMetadataPush or SegmentCreationAndMetadataPush

    Metadata push

    POST /v2/segments

    For ingestion jobs, define the push behavior in the ingestion job spec. Example:
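A sketch of the relevant job spec fields (values are examples; the full spec contains many more fields):

```yaml
jobType: SegmentCreationAndMetadataPush
pushJobSpec:
  # Copy the segment from the staging URI into Pinot deep store
  copyToDeepStoreForMetadataPush: true
  pushParallelism: 2
  pushAttempts: 2
```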

    Then launch it with:
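Assuming the spec is saved locally (the path is a placeholder):

```bash
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/ingestion-job-spec.yaml
```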

    hashtag
    Choosing the right mode

    Mode
    Use it when
    Tradeoff

    Tar push

    The client has the segment tar locally and can upload it directly

    Largest payload sent to controller

    URI push

    The segment tar already exists at a controller-readable URI

    Controller still downloads the full segment tar

    For production clusters with deep store configured, SegmentCreationAndMetadataPush is generally the preferred ingestion-job mode.

    Create a schema configuration
    hashtag
    Syntax

    The UDF function syntax is listed as below:

    • dimTable Name of the dim table to perform the lookup on.

    • dimColToLookUp The column name of the dim table to be retrieved to decorate our result.

    • dimJoinKey The column name on which we want to perform the lookup i.e. the join column name for dim table.

    • factJoinKey The column name on which we want to perform the lookup against e.g. the join column name for fact table

    Note that:

    1. All the dim-table-related expressions are expressed as literal strings. This is a limitation of the LOOKUP UDF syntax: we cannot express a column identifier that doesn't exist in the query's main table, i.e., the fact table.

    2. The syntax definition [, 'dimJoinKey2', factJoinKey2 ]* indicates that if the dim table has multiple partition columns, multiple join key pairs should be expressed.

    hashtag
    Examples

    Here are some examples:

    hashtag
    Single-partition-key-column Example

    Consider the table baseballStats

    Column
    Type

    playerID

    STRING

    yearID

    INT

    teamID

    STRING

    and dim table dimBaseballTeams

    Column
    Type

    teamID

    STRING

    teamName

    STRING

    teamAddress

    STRING

    Several acceptable queries are:

    hashtag
    Dim-Fact LOOKUP example

    playerName
    teamID
    teamName
    teamAddress

    David Allan

    BOS

    Boston Red Caps/Beaneaters (from 1876–1900) or Boston Red Sox (since 1953)

    4 Jersey Street, Boston, MA

    David Allan

    hashtag
    Self LOOKUP example

    teamID
    nameFromLocal
    nameFromLookup

    ANA

    Anaheim Angels

    Anaheim Angels

    ARI

    Arizona Diamondbacks

    Arizona Diamondbacks

    hashtag
    Complex-partition-key-columns Example

    Consider a single dimension table with schema:

    BILLING SCHEMA

    Column
    Type

    customerId

    INT

    creditHistory

    STRING

    firstName

    STRING

    hashtag
    Self LOOKUP example

    customerId
    missedPayment
    lookedupCity

    341

    Paid

    Palo Alto

    374

    Paid

    Mountain View

    hashtag
    Usage FAQ

    • The return type of the UDF will be that of the dimColToLookUp column.

    • When multiple primary key columns are used for the dimension table (e.g., a composite primary key), ensure that the order of keys appearing in the lookup() UDF matches the order defined in primaryKeyColumns in the dimension table schema.

    query hints
    JOINs
    a dimension table

    Logical Table

    Learn about Logical Tables in Apache Pinot, which provide a unified query interface over multiple physical tables for flexible data organization.

    A logical table in Pinot provides a unified query interface over multiple physical tables. Instead of querying individual tables separately, users can query a single logical table that transparently routes the query to all underlying physical tables and aggregates the results.

    hashtag
    Overview

    Logical tables are useful for:

    • Geographic/Regional partitioning: Split data by region (e.g., ordersUS, ordersEU, ordersAPAC) while providing a unified orders table for queries

    • Table partitioning strategies: Organize data across multiple physical tables based on business logic

    • Time-based table splitting: Combine historical and recent data from different physical tables

    circle-info

    Logical tables require that all underlying physical tables share the same schema structure. A schema with the same name as the logical table must be created before creating the logical table.

    hashtag
    How It Works

    When you query a logical table, Pinot:

    1. Resolves the logical table name to its list of physical tables

    2. Routes the query to all relevant physical tables (both offline and realtime)

    3. Aggregates results from all physical tables

    For hybrid logical tables (containing both offline and realtime physical tables), Pinot uses a configurable time boundary strategy to determine which segments to query from each table type, avoiding duplicate data.

    hashtag
    Segment Pruning Optimization

    Pinot performs automatic cross-table segment pruning when querying logical tables. Instead of pruning segments independently for each physical table, segment pruning operates once across all physical tables collectively. This optimization is particularly beneficial for queries using ORDER BY with LIMIT, where the SelectionQuerySegmentPruner can now prune segments across the entire logical table.

    For example, with a logical table spanning three physical tables (US, EU, APAC), a query like:
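For instance (the column name is illustrative):

```sql
SELECT *
FROM orders
ORDER BY orderTime DESC
LIMIT 10
```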

    Previously, the pruner would prune segments within each physical table independently, potentially returning more segments than necessary. Now, pruning happens across all physical tables together, allowing the pruner to identify and return only the minimum set of segments needed to satisfy the query requirements.

    Key benefits:

    • Improved query performance by reducing segments processed

    • Automatic optimization with no configuration changes required

    • Particularly effective for ORDER BY + LIMIT queries across logical tables

    hashtag
    Logical Table Configuration

    A logical table configuration defines the mapping between the logical table and its physical tables.

    hashtag
    Configuration Properties

    Property
    Description
    Required

    hashtag
    Example Configuration

    hashtag
    Hybrid Logical Table Configuration

    For logical tables that combine both offline and realtime physical tables:

    hashtag
    Creating a Logical Table

    hashtag
    Step 1: Create the Schema

    Create a schema that matches the structure of your physical tables:

    Upload the schema:
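For example, assuming the schema is saved as orders_schema.json and the controller runs at localhost:9000 (both placeholders):

```bash
curl -X POST -H "Content-Type: application/json" \
  -d @orders_schema.json \
  "http://localhost:9000/schemas"
```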

    hashtag
    Step 2: Create the Logical Table
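With the schema in place, the logical table config can be posted to the controller. The payload file name and host are placeholders; the create call is assumed to target the /logicalTables resource that also serves GET, PUT, and DELETE:

```bash
curl -X POST -H "Content-Type: application/json" \
  -d @orders_logical_table.json \
  "http://localhost:9000/logicalTables"
```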

    hashtag
    Managing Logical Tables

    hashtag
    List Logical Tables

    hashtag
    Get Logical Table Configuration

    hashtag
    Update Logical Table

    hashtag
    Delete Logical Table

    circle-exclamation

    Deleting a logical table only removes the logical table configuration. The underlying physical tables and their data are not affected.

    hashtag
    Querying Logical Tables

    Query a logical table just like any other Pinot table:
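For example, against the unified orders logical table (column names are hypothetical):

```sql
SELECT customerRegion, SUM(orderTotal) AS total
FROM orders
GROUP BY customerRegion
ORDER BY total DESC
LIMIT 10
```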

    Logical tables work with both the single-stage and multi-stage query engines.

    hashtag
    Time Boundary Configuration

    For hybrid logical tables that contain both offline and realtime physical tables, you must configure a time boundary strategy to avoid querying duplicate data.

    hashtag
    Available Strategies

    Strategy
    Description

    hashtag
    Configuration Example

    The includedTables parameter specifies which physical tables should be considered when computing the time boundary.

    hashtag
    Query Configuration

    Logical tables support query-level configurations:

    Property
    Description

    hashtag
    Quota Configuration

    Apply rate limiting to logical tables:

    circle-info

    Storage quota (quota.storage) is not supported for logical tables since they don't store data directly.

    hashtag
    Managing Logical Tables via the Controller UI

    The Pinot Controller UI provides full CRUD management for logical tables, accessible directly from the main Tables page.

    hashtag
    Accessing Logical Tables

    1. Open the Controller UI (default: http://<controller-host>:9000).

    2. Navigate to Tables in the left sidebar.

    3. The Tables page displays physical tables and logical tables in separate sections.

    hashtag
    Supported Operations

    Operation
    Description
    circle-info

    All operations are also available via the REST API at /logicalTables/{tableName} using GET, PUT, and DELETE.

    hashtag
    Quick Start Example

    Try the logical table quickstart to see the feature in action:

    This quickstart:

    1. Creates three physical tables: ordersUS_OFFLINE, ordersEU_OFFLINE, and ordersAPAC_OFFLINE

    2. Creates a logical table orders that unifies all three

    hashtag
    Validation Rules

    When creating or updating a logical table, Pinot validates:

    • Table name does not end with _OFFLINE or _REALTIME

    • All physical tables exist (unless marked as multiCluster)

    • Physical tables are in the same database as the logical table

    hashtag
    Limitations

    • All physical tables must have compatible schemas

    • Storage quota is not supported

    • Physical tables in the same logical table should ideally have consistent indexing for optimal query performance

    hashtag
    Pluggable LogicalTableConfig Serialization

    By default, LogicalTableConfig is serialized to and deserialized from ZooKeeper using a built-in JSON format. For advanced use cases requiring a custom storage format, implement LogicalTableConfigSerDe and register it via LogicalTableConfigSerDeProvider.

    hashtag
    When to Use This

    • You need a compact binary format for deployments with a very large number of logical tables

    • Your ZooKeeper schema requires a specific non-default encoding

    • You are integrating Pinot with an external metadata system with its own serialization requirements

    hashtag
    Implementation

    Step 1: Implement the LogicalTableConfigSerDe interface:

    Step 2: Implement LogicalTableConfigSerDeProvider to return your custom SerDe.

    Step 3: Register the provider using the Java Service Provider Interface (SPI) by creating the file:

    containing the fully-qualified class name of your provider implementation.

    circle-info

    This is an advanced extension point for specialized deployments. Most users should rely on the default JSON-based serialization.

    hashtag
    See Also

    Explain Plan

    Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The EXPLAIN PLAN FOR syntax can be used to obtain the execution plan of a query, which can be useful to further optimize them.

    circle-exclamation

    The explain plan output format is still under development and may change in future releases. This under-development label applies to the explain plan output format specifically, not to the core multi-stage engine, which is generally available. Pinot explain plans are human-readable and intended for debugging and optimization purposes. Keep this in mind especially when using the explain plan in automated scripts or tools: the plans, even those returned as tables or JSON, are not guaranteed to be stable across releases.

    Pinot supports different type of explain plans depending on the query engine and the granularity or details we want to obtain.

    hashtag
    Different plans for different segments

    Segments are the basic unit of data storage and processing in Pinot. When a query is executed, it is executed on each segment and the results are merged together. Not all segments have the same data distribution, indexes, and so on, so the query engine may execute the query differently on different segments. Cases include:

    • Segments that were not refreshed since indexes were added or removed on the table config.

    • Realtime segments that are being ingested, where some indexes (like range indexes) cannot be used.

    • Data distribution, especially min and max values for columns, which can affect the query plan.

    Given that a Pinot query can touch thousands of segments, Pinot tries to minimize the number of different plans shown when explaining a query. By default, Pinot analyzes the plan for each segment and returns a simplified plan. How this simplification is done depends on the query engine; you can read more about that below.

    There is a verbose mode that shows the plan for each segment. It is activated by setting the explainPlanVerbose query option to true, prefixing SET explainPlanVerbose=true; to the explain plan statement.

    hashtag
    Explain on multi-stage query engine

    Reflecting the more complex nature of the multi-stage query engine, its explain plan can be customized to focus on different aspects of query execution.

    There are 3 different types of explain plans for the multi-stage query engine:

    Mode
    Syntax by default
    Syntax if segment plan is enabled
    Description
    circle-info

    The syntax used to select each explain plan mode is confusing and may change in the future.

    hashtag
    Segment plan

    The segment plan is a detailed representation of the query execution plan that includes segment-specific information, such as data distribution and indexes.

    This mode was introduced in Pinot 1.3.0 and is planned to become the default in a future release. Until then, it can be enabled by setting the explainAskingServers query option to true, prefixing SET explainAskingServers=true; to the explain plan statement. Alternatively, it can be enabled by default by setting the broker configuration pinot.query.multistage.explain.include.segment.plan to true.

    Regardless of how it is activated, once this mode is enabled the EXPLAIN PLAN FOR syntax will include segment information.

    hashtag
    Verbose and brief mode

    As explained in Different plans for different segments, by default Pinot tries to minimize the number of different plans shown when explaining a query. In the multi-stage engine, brief mode still includes every distinct plan, but equivalent plans are aggregated. For example, if the same plan is executed on 100 segments, brief mode shows it only once, and stats like the number of docs are summed across those segments.

    In verbose mode, one plan is shown per segment, including the segment name and all segment-specific information. This is useful for finding which segments are not using indexes, or which segments have a different data distribution.
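The brief-mode aggregation described above can be sketched as follows. This is an illustrative sketch: the plan strings and stat names are made up, not actual Pinot output.

```python
from collections import defaultdict

# Per-segment plans as (plan string, per-segment stats). Equivalent plans
# are collapsed into one entry with their stats summed, which is what
# brief mode does conceptually.
segment_plans = [
    ("FILTER_RANGE_INDEX -> PROJECT", {"numDocs": 100}),
    ("FILTER_RANGE_INDEX -> PROJECT", {"numDocs": 250}),
    ("FILTER_FULL_SCAN -> PROJECT", {"numDocs": 40}),  # e.g. a consuming segment
]

def aggregate_brief(plans):
    merged = defaultdict(lambda: {"numDocs": 0, "numSegments": 0})
    for plan, stats in plans:
        merged[plan]["numDocs"] += stats["numDocs"]
        merged[plan]["numSegments"] += 1
    return dict(merged)

brief = aggregate_brief(segment_plans)
# Two distinct plans remain: the shared plan appears once with numDocs
# summed to 350 across 2 segments, plus the full-scan plan.
```

Verbose mode would instead keep one entry per segment, tagged with the segment name.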

    hashtag
    Example

    Returns

    hashtag
    Logical Plan

    The logical plan is a high-level representation of the query execution plan. This plan is calculated on the broker without asking the servers for their segment specific plans. This means that the logical plan does not include the segment specific information, like data distribution, indexes, etc.

    In Pinot 1.3.0, the logical plan is enabled by default and can be obtained by using EXPLAIN PLAN FOR syntax. Optionally, the segment plan can be enabled by default, in which case the logical plan can be obtained by using EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR syntax.

    circle-info

    The recommended way to ask for the logical plan is EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR, since this syntax is available in all versions of Pinot, regardless of configuration.

    hashtag
    Example:

    Returns:

    hashtag
    Workers plan

    circle-info

    There has been some discussion about how to name this explain mode, and it may change in future versions. The term worker leaks an implementation detail that is not explained elsewhere in the user documentation.

    The workers plan is a detailed representation of the query execution plan that includes information on how the query is distributed among different servers and the workers inside them. This plan does not include segment-specific information (data distribution, indexes, and so on), and it is probably the least useful of the plans for typical use cases.

    Its main use is to reduce data shuffling between workers by verifying that, for example, a join is executed in a colocated fashion.

    hashtag
    Example

    Returns:

    hashtag
    Interpreting multi-stage explain plans

    Multi-stage plans are more complex than single-stage plans. This section explains how to interpret them.

    You can use the EXPLAIN PLAN syntax to obtain the logical plan of a query. There are different formats for the output, but all of them represent the logical plan of the query.

    The query

    Can produce the following output:

    Each node in the tree represents an operation, and each operator has attributes. For example the LogicalJoin operator has a condition attribute that specifies the join condition and a joinType.

    hashtag
    Understanding indexed references

    Expressions like $2 are indexed references into the input row for each operator. To understand them, look at the operator's children to see which attributes are being referenced, usually starting from the leaf operators.

    For example, LogicalTableScan always returns the whole row of the table, so its attributes are the columns of the table:

    The LogicalProject operator selects columns o_custkey and o_shippriority (at positions $5 and $10 in the table row) and generates a row with two columns. The PinotLogicalExchange distributes rows using hash[0], meaning the hash of the first column from LogicalProject — which is o_custkey.

    hashtag
    Virtual rows in joins

    The LogicalJoin operator receives rows from two upstream stages. The virtual row seen by the join is the concatenation of the left-hand side plus the right-hand side.

    In the example above, the left stage sends [c_address, c_custkey] and the right stage sends [o_custkey, o_shippriority]. The join sees a row with columns [c_address, c_custkey, o_custkey, o_shippriority]. The condition =($1, $2) joins on c_custkey and o_custkey. The join passes through all columns unchanged, so its downstream LogicalProject selecting $0 and $3 produces [c_address, o_shippriority].
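The resolution of indexed references against the join's virtual row can be sketched in a few lines. The column names come from the example plan; the resolution logic itself is illustrative, not Pinot internals.

```python
# Output columns of the two upstream stages feeding the LogicalJoin.
left_columns = ["c_address", "c_custkey"]
right_columns = ["o_custkey", "o_shippriority"]

# The virtual row seen by the join is left + right, so $0..$3 index into it.
join_columns = left_columns + right_columns

def ref(i):
    """Resolve an indexed reference like $i against the virtual row."""
    return join_columns[i]

# Join condition =($1, $2) compares c_custkey with o_custkey.
condition = (ref(1), ref(2))

# The downstream LogicalProject selecting $0 and $3 keeps
# c_address and o_shippriority.
projected = [ref(0), ref(3)]
```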

    hashtag
    LogicalSort without ORDER BY

    A LogicalSort operator can appear even when the SQL query has no ORDER BY. In relational algebra, a sort node is used to express LIMIT. When no sort condition is specified, no actual sorting is performed — only the row limit is applied.

    hashtag
    Explain on single stage query engine

    circle-info

    The explain plan for the single-stage query engine is described in depth in explain-plan.md.

    The explain plan for the single-stage query engine is simpler and less customizable, and returns the information in a tabular format. For example, the query EXPLAIN PLAN FOR SELECT playerID, playerName FROM baseballStats.

    Returns the following table:

    The Operator column describes the operator that Pinot will run, while the Operator_Id and Parent_Id columns show the parent-child relationship between operators, which forms the execution tree: each operator feeds its results into the operator referenced by its Parent_Id.

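The Operator_Id and Parent_Id columns can be turned back into a tree programmatically. A sketch with hypothetical rows follows; the operator names and the root marker (-1) are illustrative, not actual Pinot output.

```python
from collections import defaultdict

# Hypothetical rows from the tabular EXPLAIN PLAN output, as
# (Operator, Operator_Id, Parent_Id) tuples.
rows = [
    ("BROKER_REDUCE(limit:10)", 1, -1),
    ("COMBINE_SELECT", 2, 1),
    ("SELECT(selectList:playerID, playerName)", 3, 2),
]

def build_tree(rows):
    """Index children by Parent_Id so the execution tree can be walked."""
    children = defaultdict(list)
    for op, op_id, parent_id in rows:
        children[parent_id].append((op, op_id))
    return children

tree = build_tree(rows)
root = tree[-1][0]  # the only operator attached to the root marker
```

Walking from the root through `tree` reproduces the pipeline: the broker reduce consumes the combine operator, which consumes the select.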

    Quick Start Examples

    This section describes quick start commands that launch all Pinot components in a single process.

    Pinot ships with QuickStart commands that launch Pinot components in a single process and import pre-built datasets. These quick start examples are a good place to start if you're new to Pinot. The examples begin with Batch Processing, after the following notes:

    • Prerequisites

      You must have either Docker or a local Pinot installation. The examples are available for each option and work the same. The decision of which to choose depends on your installation preference and how you generally like to work. If you don't know which to choose, using Docker will make your cleanup easier after you are done with the examples.

    {
      "tableConfig": {
        "tableName": "...",
        "ingestionConfig": {
          "aggregationConfigs": [{
            "columnName": "aggregatedFieldName",
            "aggregationFunction": "<aggregationFunction>(<originalFieldName>)"
          }]
        }
      }
    }
    {"customerID":205,"product_name": "car","price":1500.00,"timestamp":1571900400000}
    {"customerID":206,"product_name": "truck","price":2200.00,"timestamp":1571900400000}
    {"customerID":207,"product_name": "car","price":1300.00,"timestamp":1571900400000}
    {"customerID":208,"product_name": "truck","price":700.00,"timestamp":1572418800000}
    {"customerID":209,"product_name": "car","price":1100.00,"timestamp":1572505200000}
    {"customerID":210,"product_name": "car","price":2100.00,"timestamp":1572505200000}
    {"customerID":211,"product_name": "truck","price":800.00,"timestamp":1572678000000}
    {"customerID":212,"product_name": "car","price":800.00,"timestamp":1572678000000}
    {"customerID":213,"product_name": "car","price":1900.00,"timestamp":1572678000000}
    {"customerID":214,"product_name": "car","price":1000.00,"timestamp":1572678000000}
    {
      "schemaName": "dailySales",
      "dimensionFieldSpecs": [
        {
          "name": "product_name",
          "dataType": "STRING"
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "sales_count",
          "dataType": "LONG"
        },
        {
          "name": "total_sales",
          "dataType": "DOUBLE"
        }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "daysSinceEpoch",
          "dataType": "LONG",
          "format": "1:MILLISECONDS:EPOCH",
          "granularity": "1:MILLISECONDS"
        }
      ]
    }
    {
      "tableName": "daily_sales",
      "ingestionConfig": {
        "transformConfigs": [
          {
            "columnName": "daysSinceEpoch",
            "transformFunction": "toEpochDays(\"timestamp\")"
          }
        ],
        "aggregationConfigs": [
          {
            "columnName": "total_sales",
            "aggregationFunction": "SUM(price)"
          },
          {
            "columnName": "sales_count", 
            "aggregationFunction": "COUNT(*)"
          }
        ]
      },
      "tableIndexConfig": {
        "noDictionaryColumns": [
          "sales_count",
          "total_sales"
        ]
      }
    }
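As a sanity check, the toEpochDays transform plus the SUM/COUNT aggregations configured above can be simulated in plain Python (an illustrative sketch, not Pinot code), using the first three sample records:

```python
from collections import defaultdict

records = [
    {"product_name": "car",   "price": 1500.0, "timestamp": 1571900400000},
    {"product_name": "truck", "price": 2200.0, "timestamp": 1571900400000},
    {"product_name": "car",   "price": 1300.0, "timestamp": 1571900400000},
]

def to_epoch_days(ts_millis):
    # toEpochDays: milliseconds since epoch -> whole days since epoch
    return ts_millis // 86_400_000

# Group by the dimension and time columns, then apply the aggregations.
agg = defaultdict(lambda: {"total_sales": 0.0, "sales_count": 0})
for r in records:
    key = (r["product_name"], to_epoch_days(r["timestamp"]))
    agg[key]["total_sales"] += r["price"]  # SUM(price)
    agg[key]["sales_count"] += 1           # COUNT(*)

# Three raw records collapse into two stored rows:
#   ("car", 18193)   -> total_sales 2800.0, sales_count 2
#   ("truck", 18193) -> total_sales 2200.0, sales_count 1
```

Pinot performs the equivalent rollup at ingestion time, so only the aggregated rows are stored in the segment.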
    2024/11/04 00:24:27.760 ERROR [RealtimeSegmentDataManager_dailySales__0__0__20241104T0824Z] [dailySales__0__0__20241104T0824Z] Caught exception while indexing the record at offset: 9 , row: {
      "fieldToValueMap" : {
        "price" : "1000.00",
        "daysSinceEpoch" : 18202,
        "sales_count" : 0,
        "total_sales" : 0.0,
        "product_name" : "car",
        "timestamp" : 1572678000000
      },
      "nullValueFields" : [ "sales_count", "total_sales" ]
    }
    java.lang.ClassCastException: class java.lang.String cannot be cast to class java.lang.Number (java.lang.String and java.lang.Number are in module java.base of loader 'bootstrap')
    	at org.apache.pinot.segment.local.aggregator.SumValueAggregator.applyRawValue(SumValueAggregator.java:25) ~[classes/:?]
    	at org.apache.pinot.segment.local.indexsegment.mutable.MutableSegmentImpl.aggregateMetrics(MutableSegmentImpl.java:855) ~[classes/:?]
    	at org.apache.pinot.segment.local.indexsegment.mutable.MutableSegmentImpl.index(MutableSegmentImpl.java:577) ~[classes/:?]
    	at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.processStreamEvents(RealtimeSegmentDataManager.java:641) ~[classes/:?]
    	at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.consumeLoop(RealtimeSegmentDataManager.java:477) ~[classes/:?]
    	at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager$PartitionConsumer.run(RealtimeSegmentDataManager.java:734) ~[classes/:?]
    	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
    
    pinot-admin.sh AddTable \
      -tableConfigFile /path/to/table-config.json \
      -schemaFile /path/to/table-schema.json -exec
    POST /v2/segments?tableName=myTable&tableType=OFFLINE&enableParallelPushProtection=false&allowRefresh=true
    curl -X POST "http://localhost:9000/v2/segments?tableName=myTable&tableType=OFFLINE" \
      -F "file=@/path/to/myTable_2024-01-01_2024-01-02_0.tar.gz"
    pinot-admin.sh UploadSegment \
      -controllerHost localhost \
      -controllerPort 9000 \
      -segmentDir /path/to/local/dir \
      -tableName myTable
    curl -X POST "http://localhost:9000/v2/segments?tableName=myTable&tableType=OFFLINE" \
      -H "Content-Type: application/json" \
      -H "UPLOAD_TYPE: URI" \
      -H "DOWNLOAD_URI: s3://bucket/pinot-segments/myTable_2024-01-01_2024-01-02_0.tar.gz" \
      -d '{}'
    curl -X POST "http://localhost:9000/v2/segments?tableName=myTable&tableType=OFFLINE" \
      -H "UPLOAD_TYPE: METADATA" \
      -H "DOWNLOAD_URI: s3://staging-bucket/segments/myTable_2024-01-01_2024-01-02_0.tar.gz" \
      -H "COPY_SEGMENT_TO_DEEP_STORE: true" \
      -F "file=@/path/to/myTable_2024-01-01_2024-01-02_0.metadata.tar.gz"
    executionFrameworkSpec:
      name: standalone
      segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
      segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
      segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner
      segmentMetadataPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner
    
    jobType: SegmentCreationAndMetadataPush
    
    pinotClusterSpecs:
      - controllerURI: http://localhost:9000
    
    pushJobSpec:
      pushAttempts: 2
      pushRetryIntervalMillis: 1000
      copyToDeepStoreForMetadataPush: true
    pinot-admin.sh LaunchDataIngestionJob \
      -jobSpecFile /path/to/job-spec.yaml
    lookupUDFSpec:
        LOOKUP
        '('
        '''dimTable'''
        '''dimColToLookup'''
        [ '''dimJoinKey''', factJoinKey ]*
        ')'
    SELECT
      playerName,
      teamID,
      LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName,
      LOOKUP('dimBaseballTeams', 'teamAddress', 'teamID', teamID) AS teamAddress
    FROM baseballStats
    SELECT 
      teamID, 
      teamName AS nameFromLocal,
      LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS nameFromLookup
    FROM dimBaseballTeams
    select 
      customerId,
      missedPayment, 
      LOOKUP('billing', 'city', 'customerId', customerId, 'creditHistory', creditHistory) AS lookedupCity 
    from billing
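Semantically, LOOKUP behaves like a keyed dictionary lookup into the dimension table. A minimal sketch using values from the dimBaseballTeams example (illustrative, not Pinot internals):

```python
# Dimension table keyed by its primary key column (teamID); values taken
# from the dimBaseballTeams example rows.
dim_baseball_teams = {
    "SEA": {
        "teamName": "Seattle Mariners (since 1977) or Seattle Pilots (1969)",
        "teamAddress": "1250 First Avenue South, Seattle, WA",
    },
}

def lookup(dim_table, dim_col_to_lookup, join_key):
    """Sketch of LOOKUP('dimTable', 'dimColToLookup', 'dimJoinKey', factJoinKey)."""
    row = dim_table.get(join_key)
    return row[dim_col_to_lookup] if row is not None else None  # no match -> null

fact_row = {"playerName": "David Allan", "teamID": "SEA"}
team_name = lookup(dim_baseball_teams, "teamName", fact_row["teamID"])
missing = lookup(dim_baseball_teams, "teamName", "XYZ")  # unmatched key -> None
```

With multiple join keys (as in the billing example), the dictionary key would be the tuple of all join column values.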

    an all_segments_metadata file mapping segment names to DOWNLOAD_URI values

    multipart/form-data or application/json

    Still supported, but prefer /v2/segments

    POST /segments/batchUpload

    Batch metadata push

    multipart/form-data

    Only supports metadata push for multiple segments

    OFFLINE

    OFFLINE or REALTIME

    enableParallelPushProtection

    No

    false

    Reject concurrent uploads for the same segment

    allowRefresh

    No

    true

    Allow an existing segment to be refreshed instead of failing the upload

    URI push, metadata push

    Source URI of the segment tar file

    COPY_SEGMENT_TO_DEEP_STORE

    No

    Metadata push

    If true, controller copies the segment from the source URI into Pinot deep store and rewrites the stored download URI

    CRYPTER

    No

    All uploads

    Crypter class name if the uploaded payload is encrypted

    SegmentMetadataPush or SegmentCreationAndMetadataPush

    Metadata push

    POST /v2/segments

    SegmentMetadataPush with batchSegmentUpload: true

    Batch metadata push

    POST /segments/batchUpload

    Metadata push

    The segment tar already exists remotely and you want the lightest controller-side registration path

    Requires a metadata bundle and a valid DOWNLOAD_URI

    league

    STRING

    playerName

    STRING

    playerStint

    INT

    numberOfGames

    INT

    numberOfGamesAsBatter

    INT

    AtBatting

    INT

    runs

    INT

    CHA

    null

    null

    David Allan

    SEA

    Seattle Mariners (since 1977) or Seattle Pilots (1969)

    1250 First Avenue South, Seattle, WA

    David Allan

    SEA

    Seattle Mariners (since 1977) or Seattle Pilots (1969)

    1250 First Avenue South, Seattle, WA

    ATL

    Atlanta Braves

    Atlanta Braves

    BAL

    Baltimore Orioles (original- 1901–1902 current- since 1954)

    Baltimore Orioles (original- 1901–1902 current- since 1954)

    lastName

    STRING

    isCarOwner

    BOOLEAN

    city

    STRING

    maritalStatus

    STRING

    buildingType

    STRING

    missedPayment

    STRING

    billingMonth

    STRING

    398

    Paid

    Palo Alto

    427

    Paid

    Cupertino

    435

    Paid

    Cupertino

    function reference
    function reference
    Returns a unified result set to the client
    Single-table behavior remains unchanged

    Map of physical table names to their configurations

    Yes

    refOfflineTableName

    Reference offline table for table config metadata

    Required if offline tables exist

    refRealtimeTableName

    Reference realtime table for table config metadata

    Required if realtime tables exist

    query

    Query configuration (timeout, response size limits, etc.)

    No

    quota

    Quota configuration for rate limiting

    No

    timeBoundaryConfig

    Time boundary configuration for hybrid tables

    Required for hybrid logical tables

    Click a logical table name to open its detail page, which shows:

    • Current configuration (JSON)

    • Physical table mappings

    Demonstrates queries on both physical and logical tables
  • Schema with the same name as the logical table exists

  • Broker tenant exists

  • Reference table names (refOfflineTableName, refRealtimeTableName) are set correctly

  • Time boundary config is provided for hybrid tables

  • tableName

    Name of the logical table

    Yes

    brokerTenant

    The broker tenant to use for routing

    Yes

    min

    Uses the minimum time boundary from the specified tables

    timeoutMs

    Query timeout in milliseconds

    disableGroovy

    Disable Groovy functions in queries

    maxServerResponseSizeBytes

    Maximum response size from each server

    maxQueryResponseSizeBytes

    List

    View all logical tables with search and filter

    View

    Inspect the logical table's configuration and physical table assignments

    Update

    Edit the logical table configuration in-place

    Delete

    Table Configuration
    Schema Configuration

    physicalTableConfigMap

    Maximum total query response size

    Remove a logical table from the cluster

    SELECT * FROM orders ORDER BY createdTime DESC LIMIT 10
    {
      "tableName": "orders",
      "brokerTenant": "DefaultTenant",
      "physicalTableConfigMap": {
        "ordersUS_OFFLINE": {},
        "ordersEU_OFFLINE": {},
        "ordersAPAC_OFFLINE": {}
      },
      "refOfflineTableName": "ordersUS_OFFLINE"
    }
    {
      "tableName": "events",
      "brokerTenant": "DefaultTenant",
      "physicalTableConfigMap": {
        "eventsHistorical_OFFLINE": {},
        "eventsRecent_OFFLINE": {},
        "eventsLive_REALTIME": {}
      },
      "refOfflineTableName": "eventsHistorical_OFFLINE",
      "refRealtimeTableName": "eventsLive_REALTIME",
      "timeBoundaryConfig": {
        "boundaryStrategy": "min",
        "parameters": {
          "includedTables": ["eventsRecent_OFFLINE"]
        }
      }
    }
    {
      "schemaName": "orders",
      "dimensionFieldSpecs": [
        { "name": "orderId", "dataType": "STRING" },
        { "name": "customerId", "dataType": "STRING" },
        { "name": "region", "dataType": "STRING" },
        { "name": "productId", "dataType": "STRING" },
        { "name": "status", "dataType": "STRING" }
      ]
    }
    curl -F schemaName=@orders_schema.json localhost:9000/schemas
    curl -X POST -H 'Content-Type: application/json' \
      -d '{
        "tableName": "orders",
        "brokerTenant": "DefaultTenant",
        "physicalTableConfigMap": {
          "ordersUS_OFFLINE": {},
          "ordersEU_OFFLINE": {},
          "ordersAPAC_OFFLINE": {}
        },
        "refOfflineTableName": "ordersUS_OFFLINE"
      }' \
      http://localhost:9000/logicalTables
    curl http://localhost:9000/logicalTables
    curl http://localhost:9000/logicalTables/{tableName}
    curl -X PUT -H 'Content-Type: application/json' \
      -d '{
        "tableName": "orders",
        "brokerTenant": "DefaultTenant",
        "physicalTableConfigMap": {
          "ordersUS_OFFLINE": {},
          "ordersEU_OFFLINE": {},
          "ordersAPAC_OFFLINE": {},
          "ordersANZ_OFFLINE": {}
        },
        "refOfflineTableName": "ordersUS_OFFLINE"
      }' \
      http://localhost:9000/logicalTables/orders
    curl -X DELETE http://localhost:9000/logicalTables/{tableName}
    -- Query the logical table
    SELECT COUNT(*) FROM orders
    
    -- Filter by region
    SELECT orderId, customerId, region, status
    FROM orders
    WHERE region = 'us'
    LIMIT 10
    
    -- Aggregate across all regions
    SELECT region, COUNT(*) as orderCount
    FROM orders
    GROUP BY region
    ORDER BY region
    {
      "timeBoundaryConfig": {
        "boundaryStrategy": "min",
        "parameters": {
          "includedTables": ["eventsRecent_OFFLINE"]
        }
      }
    }
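The min boundary strategy configured above can be sketched as follows, assuming each included table exposes a time boundary in epoch milliseconds. The table names reuse the example; the boundary values are made up.

```python
# Hypothetical per-table time boundaries (epoch millis), e.g. derived from
# each table's latest ingested time.
table_boundaries = {
    "eventsRecent_OFFLINE": 1_700_000_000_000,
    "eventsHistorical_OFFLINE": 1_690_000_000_000,
}

def min_boundary(boundaries, included_tables):
    # "min" strategy: the effective boundary is the minimum across the
    # tables listed in includedTables.
    return min(boundaries[t] for t in included_tables)

# With only eventsRecent_OFFLINE included, as in the config above:
boundary = min_boundary(table_boundaries, ["eventsRecent_OFFLINE"])
```

The broker then uses this boundary to split a query's time range between the offline and realtime physical tables.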
    {
      "tableName": "orders",
      "brokerTenant": "DefaultTenant",
      "physicalTableConfigMap": { ... },
      "refOfflineTableName": "ordersUS_OFFLINE",
      "query": {
        "timeoutMs": 30000,
        "disableGroovy": true,
        "maxServerResponseSizeBytes": 1000000,
        "maxQueryResponseSizeBytes": 5000000
      }
    }
    {
      "tableName": "orders",
      "brokerTenant": "DefaultTenant",
      "physicalTableConfigMap": { ... },
      "refOfflineTableName": "ordersUS_OFFLINE",
      "quota": {
        "maxQueriesPerSecond": 100
      }
    }
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type LOGICAL_TABLE
    ./bin/pinot-admin.sh QuickStart -type LOGICAL_TABLE
    public class MyCustomSerDe implements LogicalTableConfigSerDe {
        @Override
        public byte[] serialize(LogicalTableConfig config) { /* ... */ }
    
        @Override
        public LogicalTableConfig deserialize(byte[] bytes) { /* ... */ }
    }
    META-INF/services/org.apache.pinot.spi.config.table.logical.LogicalTableConfigSerDeProvider

    Simplest multi-stage plan. No index or data shuffle information.

    Workers plan

    EXPLAIN IMPLEMENTATION PLAN FOR

    EXPLAIN IMPLEMENTATION PLAN FOR

    Used to understand data shuffle between servers. Note: The name of this mode is open to discussion and may change in the future.

    Segment plan

    SET explainAskingServers=true; EXPLAIN PLAN FOR

    EXPLAIN PLAN FOR

    Includes the segment specific information (like indexes).

    Logical plan

    EXPLAIN PLAN FOR or EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR

    explain-plan.md

    EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR

    Pinot versions in examples

    The Docker-based examples on this page use pinot:latest, which instructs Docker to pull and use the most recent release of Apache Pinot. If you prefer to use a specific release instead, you can designate it by replacing latest with the release number, like this: pinot:0.12.1.

    The local install-based examples that are run using the launcher scripts will use the Apache Pinot version you installed.

  • Stopping a running example

    To stop a running example, enter Ctrl+C in the same terminal where you ran the docker run command to start the example.

  • circle-exclamation

    macOS Monterey Users

    By default the AirPlay Receiver server runs on port 7000, which is also the port used by the Pinot Server in the Quick Start. You may see the following error when running these examples:

    If you disable the AirPlay Receiver and try again, you shouldn't see this error message anymore.

    hashtag
    Command Options

    All QuickStart commands support the following optional parameters in addition to -type:

    Option
    Aliases
    Description

    -type

    The quickstart type to run (see sections below).

    -tmpDir

    -quickstartDir, -dataDir

    Directory to store quickstart data. Use this to persist data across restarts so that tables and segments are reloaded from disk instead of being regenerated.

    Example: Persist data across restarts

    Example: Use an external ZooKeeper and custom config

    Example: Load custom tables into an empty cluster

    hashtag
    Batch Processing

    This example demonstrates how to do batch processing with Pinot. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates the baseballStats table

    • Launches a standalone data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.

    • Issues sample queries to Pinot

    hashtag
    Batch JSON

    This example demonstrates how to import and query JSON documents in Pinot. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates the githubEvents table

    • Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.

    • Issues sample queries to Pinot

    hashtag
    Batch with complex data types

    This example demonstrates how to do batch processing in Pinot where the data items have complex fields that need to be unnested. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates the githubEvents table

    • Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.

    • Issues sample queries to Pinot

    hashtag
    Streaming

    This example demonstrates how to do stream processing with Pinot. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates meetupRsvp table

    • Launches a meetup stream

    • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot.

    • Issues sample queries to Pinot

    hashtag
    Streaming JSON

    This example demonstrates how to do stream processing with JSON documents in Pinot. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates meetupRsvp table

    • Launches a meetup stream

    • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot

    • Issues sample queries to Pinot

    hashtag
    Streaming with minion cleanup

    This example demonstrates how to do stream processing in Pinot with RealtimeToOfflineSegmentsTask and MergeRollupTask minion tasks continuously optimizing segments as data gets ingested. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, Pinot Minion, and Pinot Server.

    • Creates githubEvents table

    • Launches a GitHub events stream

    • Publishes data to a Kafka topic githubEvents that is subscribed to by Pinot.

    • Issues sample queries to Pinot

    hashtag
    Streaming with complex data types

    This example demonstrates how to do stream processing in Pinot where the stream contains items that have complex fields that need to be unnested. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, Pinot Minion, and Pinot Server.

    • Creates meetupRsvp table

    • Launches a meetup stream

    • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot.

    • Issues sample queries to Pinot

    hashtag
    Upsert

    This example demonstrates how to do stream processing with upsert with Pinot. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates meetupRsvp table

    • Launches a meetup stream

    • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot

    • Issues sample queries to Pinot

    hashtag
    Upsert JSON

    This example demonstrates how to do stream processing with upsert with JSON documents in Pinot. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates meetupRsvp table

    • Launches a meetup stream

    • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot

    • Issues sample queries to Pinot

    hashtag
    Hybrid

    This example demonstrates how to do hybrid stream and batch processing with Pinot. The command:

    1. Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    2. Creates airlineStats table

    3. Launches a standalone data ingestion job that builds segments under a given directory of Avro files for the airlineStats table and pushes the segments to the Pinot Controller.

    4. Launches a stream of flights stats

    5. Publishes data to a Kafka topic airlineStatsEvents that is subscribed to by Pinot.

    6. Issues sample queries to Pinot

    hashtag
    Join

    This example demonstrates how to do joins in Pinot using the Lookup UDF. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server in the same container.

    • Creates the baseballStats table

    • Launches a data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.

    • Creates the dimBaseballTeams table

    • Launches a data ingestion job that builds one segment for a given CSV data file for the dimBaseballTeams table and pushes the segment to the Pinot Controller.

    • Issues sample queries to Pinot

    hashtag
    Logical Table

    This example demonstrates how to use logical tables in Pinot, which provide a unified query interface over multiple physical tables. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, Pinot Server, and Pinot Minion.

    • Creates three physical tables (ordersUS_OFFLINE, ordersEU_OFFLINE, ordersAPAC_OFFLINE) representing regional order data

    • Creates a logical table (orders) that provides a unified view over all regional tables

    • Issues sample queries to both physical and logical tables

    For more details on logical tables, see Logical Table.

    hashtag
    Empty

    This example starts a bare Pinot cluster with no tables or data loaded. Use this when you want to set up your own tables and schemas from scratch. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • No tables or data are created

    hashtag
    Multi-Stage Query Engine

    This example demonstrates the multi-stage query engine with self-joins, dimension table joins, and vector distance queries. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates the baseballStats table and a fine food reviews table

    • Launches data ingestion jobs to build segments and push them to the Pinot Controller.

    • Issues sample multi-stage queries including joins and vector distance queries

    hashtag
    Partial Upsert

    This example demonstrates how to do stream processing with partial upsert in Pinot, where individual fields can be updated independently while preserving other column values. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates a realtime table with partial upsert enabled

    • Publishes data to a Kafka topic that is subscribed to by Pinot

    • Issues sample queries to Pinot

    hashtag
    Geospatial

    This example demonstrates geospatial indexing and query capabilities in Pinot. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates a table with geospatial indexes

    • Launches a data ingestion job and pushes segments to the Pinot Controller.

    • Issues sample geospatial queries to Pinot

    hashtag
    Null Handling

    This example demonstrates null value handling features in Pinot. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates a table containing null values

    • Launches a data ingestion job and pushes segments to the Pinot Controller.

    • Issues sample queries demonstrating IS NULL, IS NOT NULL, and aggregate behavior with nulls

    hashtag
    TPC-H

    This example loads the 8 TPC-H benchmark tables (customer, lineitem, nation, orders, part, partsupp, region, supplier) for multi-stage query testing. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates all 8 TPC-H tables

    • Launches data ingestion jobs to build segments for each table and pushes them to the Pinot Controller.

    • Issues sample TPC-H benchmark queries using the multi-stage query engine

    hashtag
    Colocated Join

    This example demonstrates colocated join operations using the multi-stage query engine with various partition configurations and parallelism hints. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates tables with matching partition configurations for colocated joins

    • Launches data ingestion jobs and pushes segments to the Pinot Controller.

    • Issues sample colocated join queries

    hashtag
    Lookup Join

    This example demonstrates the lookup join strategy using dimension tables with the multi-stage query engine. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates fact and dimension tables

    • Launches data ingestion jobs and pushes segments to the Pinot Controller.

    • Issues sample lookup join queries

    hashtag
    Auth

    This example demonstrates how to run Pinot with basic authentication enabled. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server with basic auth configured.

    • Creates tables and loads data with authentication enabled

    • Issues sample authenticated queries to Pinot

    hashtag
    Sorted Column

    This example demonstrates sorted column indexing in Pinot with a generated dataset containing sorted columns. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates a table with sorted column configuration

    • Generates a 100,000-row dataset and ingests it into Pinot

    • Issues sample queries demonstrating sorted index performance

    hashtag
    Timestamp Index

    This example demonstrates timestamp index functionality, showing timestamp extraction at different granularities and dateTrunc bucketing. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates the airlineStats table with timestamp indexes

    • Launches a data ingestion job and pushes segments to the Pinot Controller.

    • Issues sample queries demonstrating timestamp extraction and bucketing

    hashtag
    GitHub Events

    This example sets up a streaming demo using GitHub events data. The command:

    • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

    • Creates a pullRequestMergedEvents realtime table

    • Publishes GitHub event data to a Kafka topic that is subscribed to by Pinot

    • Issues sample analytical queries on the GitHub event data

    hashtag
    Multi-Cluster

    This example demonstrates cross-cluster querying via logical tables by initializing two independent Pinot clusters. The command:

    • Starts two independent Pinot clusters, each with their own Zookeeper, Controller, Broker, and Server.

    • Creates physical tables in each cluster

    • Creates a logical table that spans both clusters

    • Issues sample cross-cluster queries

    hashtag
    Batch with Multi-Directory (Tiered Storage)

    This example demonstrates multi-directory (tiered storage) support with hot and cold tiers. The command:

    • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server with tiered storage configured.

    • Creates the airlineStats table with hot and cold storage tiers

    • Launches a data ingestion job and pushes segments to the Pinot Controller.

    • Issues sample queries that run across storage tiers

    hashtag
    Time Series

    circle-info

    For production use, you should ideally implement your own Time Series Language Plugin. The one included in the Pinot distribution is only for demonstration purposes.

This example demonstrates Pinot's Time Series Engine, which supports running pluggable Time Series Query Languages via a Language Plugin architecture. The default Pinot binary includes a toy Time Series Query Language that uses the same name as Uber's language, "m3ql". You can try the following query as an example:

To run these quickstart examples, you must have either installed Pinot locally or have Docker installed if you want to use the Pinot Docker image.

    Null Value Support

    circle-exclamation

    For historical reasons, null support is disabled in Apache Pinot by default. This is expected to be changed in future versions.

When null support is disabled, all columns are treated as not null: predicates like IS NOT NULL evaluate to true, IS NULL evaluates to false, and aggregation functions like COUNT, SUM, AVG, and MODE treat every value as not null.

    For example, the predicate in the query below matches all records.

    To handle null values in your data, you must:

1. Configure Pinot to store null values in your data before ingesting the data (see Store nulls at ingestion time).

2. Use one of the null handling modes at query time. By default Pinot uses basic null support, where only IS NULL and IS NOT NULL predicates are supported, but advanced null handling can also be enabled.

Null handling in Pinot therefore has three modes: disabled (the default), basic (enabled at ingestion time), and advanced (enabled at query time).

    hashtag
    How Pinot stores null values

Pinot always stores column values in a forward index. A forward index never stores null values but has to store a value for each row. Therefore, independent of the null handling configuration, Pinot always stores a default value for null rows in the forward index. The default value used for a column can be specified in the schema by setting the defaultNullValue field spec. The default value depends on the data type of the column.

    circle-info

Remember that in the JSON schema definition, defaultNullValue must always be a String. If the column type is not String, Pinot will convert that value to the column type automatically.
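For instance, a schema might declare default null values as follows (a minimal sketch; the schema and field names are illustrative). Note that both values are written as Strings, even for the INT column:

```json
{
  "schemaName": "myTable",
  "dimensionFieldSpecs": [
    {
      "name": "country",
      "dataType": "STRING",
      "defaultNullValue": "unknown"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "totalVisits",
      "dataType": "INT",
      "defaultNullValue": "-1"
    }
  ]
}
```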

    hashtag
    Disabled null handling

By default, Pinot does not store null values at all: whenever a null value is ingested, Pinot stores the default null value (defined above) instead.

To store null values, the table has to be configured to do so, as explained below.

    hashtag
    Store nulls at ingestion time

    When null storing is enabled, Pinot creates a new index called the null index or null vector index. This index stores the document IDs of the rows that have null values for the column.

    triangle-exclamation

Although null storing can be enabled after data has been ingested, data ingested before this mode is enabled will not have a null index and will therefore be treated as not null.

    Null support is configured per table. You can configure one table to store nulls, and configure another table to not store nulls. There are two ways to define null storing support in Pinot:

1. Column based null storing, where each column in a table is configured as nullable or not nullable. We recommend enabling null storing support by column. This is the only way to support null handling in the multi-stage query engine.

2. Table based null storing, where all columns in the table are considered nullable. This is how null values were handled before Pinot 1.1.0, and it is now deprecated.

    circle-info

Remember that column based null storing has priority over table based null storing. If both modes are enabled, column based null storing is used.

    hashtag
    Column based null storing

    We recommend configuring column based null storing, which lets you specify null handling per column and supports null handling in the multi-stage query engine.

    To enable column based null handling:

1. Set enableColumnBasedNullHandling to true in the schema configuration before ingesting data.

2. Then specify which columns are not nullable using the notNull field spec, which defaults to false.
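The two steps above can be sketched in a schema like this (a minimal example; the schema and column names are illustrative):

```json
{
  "schemaName": "myTable",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    {
      "name": "nullableColumn",
      "dataType": "STRING"
    },
    {
      "name": "notNullableColumn",
      "dataType": "STRING",
      "notNull": true
    }
  ]
}
```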

    hashtag
    Table based null storing

This is the only way to enable null storing in Pinot before 1.1.0, but it has been deprecated since then. Table based null storing is more expensive in terms of disk space and query performance than column based null storing. It also cannot support null handling in the multi-stage query engine.

    When table based null storing is enabled, all columns will be considered nullable. To enable this mode you need to:

1. Enable the nullHandlingEnabled configuration in the table's tableIndexConfig.

2. Disable enableColumnBasedNullHandling in the schema.

    circle-exclamation

Remember that the nullHandlingEnabled table configuration enables table based null storing, while enableNullHandling is the query option that enables advanced null handling at query time. See the sections below for more information.

    As an example:
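A minimal sketch of a table config fragment that enables table based null storing (the schema for such a table must not set enableColumnBasedNullHandling):

```json
{
  "tableIndexConfig": {
    "nullHandlingEnabled": true
  }
}
```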

    hashtag
    Null handling at query time

To enable basic null handling at query time, configure Pinot to store nulls at ingestion time. Advanced null handling support can optionally be enabled on top of that.

    circle-info

    The multi-stage query engine requires column based null storing. Tables with table based null storing are considered not nullable.

If you are converting from null support for the single-stage query engine, you can modify your schema to set enableColumnBasedNullHandling. There is no need to change your table config to remove nullHandlingEnabled or set it to false; in fact, we recommend keeping it as true to make it clear that the table may contain nulls. Also, when converting:

    • No reingestion is needed.

    • If the columns are changed from nullable to not nullable and there is a value that was previously null, the default value will be used instead.

    hashtag
    Basic null support

Basic null support is automatically enabled when null values are stored on a segment (see Store nulls at ingestion time).

In this mode, Pinot is able to handle simple predicates like IS NULL or IS NOT NULL. Other transformation functions (like CASE, COALESCE, +, etc.) and aggregation functions (like COUNT, SUM, AVG, etc.) will use the default value specified in the schema for null values.

For example, consider a table with columns rowId and col1 where some rows have a null col1, and the default value for col1 is 1. A query that selects col1 returns the stored default value 1 for the null rows, while IS NULL and IS NOT NULL predicates correctly match or exclude those rows. Aggregation queries over col1, however, do not behave as expected: neither COUNT nor MODE ignores the null rows; instead they read the default value (in this case 1) stored in the forward index.
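To make this concrete, here is a sketch under assumed data (a table myTable with three rows whose col1 values are null, 2, and 3, and defaultNullValue 1; the table and column names are illustrative):

```sql
-- Basic null support: IS NULL / IS NOT NULL consult the null index,
-- so the null row is excluded here and only the rows with 2 and 3 match.
SELECT col1 FROM myTable WHERE col1 IS NOT NULL;

-- Aggregations in basic mode read the forward index defaults instead of
-- skipping nulls: COUNT counts all 3 rows, and MODE sees the value 1
-- stored in place of the null.
SELECT COUNT(col1), MODE(col1) FROM myTable;
```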

    hashtag
    Advanced null handling support

    Advanced null handling has two requirements:

1. Segments must store null values (see Store nulls at ingestion time).

2. The query must enable null handling by setting the enableNullHandling query option to true.

The latter can be done in one of the following ways:

• Add SET enableNullHandling=true at the beginning of the query.

    • If using JDBC, set the connection option enableNullHandling=true (either in the URL or as a property).

    Alternatively, if you want to enable advanced null handling for all queries by default, the broker configuration pinot.broker.query.enable.null.handling can be set to true. Individual queries can override this to false using the enableNullHandling query option if required.
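A sketch of the per-query form (continuing the assumed myTable example from above):

```sql
-- Enable the advanced null handling execution path for this query only.
SET enableNullHandling = true;
-- With advanced null handling, COUNT(col1) ignores rows where col1 is null,
-- following standard SQL semantics.
SELECT COUNT(col1) FROM myTable;
```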

    circle-exclamation

    Even though they have similar names, the nullHandlingEnabled table configuration and the enableNullHandling query option are different. Remember that the nullHandlingEnabled table configuration modifies how segments are stored and the enableNullHandling query option modifies how queries are executed.

    When the enableNullHandling option is set to true, the Pinot query engine uses a different execution path that interprets nulls in a standard SQL way. This means that IS NULL and IS NOT NULL predicates will evaluate to true or false according to whether a null is detected (like in basic null support mode) but also aggregation functions like COUNT, SUM, AVG, MODE, etc. will deal with null values as expected (usually ignoring null values).

In this mode, some indexes may not be usable, and queries may be significantly more expensive. The performance degradation affects all columns in the table, including columns in the query that do not contain null values, and it happens even when the table uses column based null storing.

    hashtag
Example queries

    hashtag
    Select Query

    hashtag
    Filter Query

    hashtag
    Aggregate Query

    hashtag
    Aggregate Filter Query

    hashtag
    Group By Query

    hashtag
    Order By Query

    hashtag
    Transform Query

    hashtag
    Appendix: Workarounds to handle null values without storing nulls

    If you're not able to generate the null index for your use case, you may filter for null values using a default value specified in your schema or a specific value included in your query.

    circle-info

    The following example queries work when the null value is not used in a dataset. Unexpected values may be returned if the specified null value is a valid value in the dataset.

    hashtag
    Filter for default null value(s) specified in your schema

1. Specify a default null value (defaultNullValue) in your schema for dimension fields (dimensionFieldSpecs), metric fields (metricFieldSpecs), and date time fields (dateTimeFieldSpecs).

    2. Ingest the data.

    hashtag
    Filter for a specific value in your query

    Filter for a specific value in your query that will not be included in the dataset. For example, to calculate the average age, use -1 to indicate the value of Age is null.

    • Rewrite the following query:

    • To cover null values as follows:
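A sketch of this rewrite (assuming a table myTable where -1 is the sentinel for a null Age):

```sql
-- Original query: the sentinel -1 values skew the average.
SELECT AVG(Age) FROM myTable;

-- Rewritten to exclude the sentinel null value.
SELECT AVG(Age) FROM myTable WHERE Age <> -1;
```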

    Table

    Explore the table component in Apache Pinot, a fundamental building block for organizing and managing data in Pinot clusters, enabling effective data processing and analysis.

Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Columns have the same name and data type, known as the table's schema.

    Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
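For illustration, a minimal schema file might look like this (the schema, field names, and time format are illustrative):

```json
{
  "schemaName": "orders",
  "dimensionFieldSpecs": [
    { "name": "orderId", "dataType": "STRING" },
    { "name": "region", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "amount", "dataType": "DOUBLE" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "orderTime",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:TIMESTAMP",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```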

    Pinot table types include:

    • real-time: Ingests data from a streaming source like Apache Kafka®

    -- SET explainAskingServers=true is required if
    -- pinot.query.multistage.explain.include.segment.plan is false,
    -- optional otherwise
    SET explainAskingServers=true;
    EXPLAIN PLAN FOR
    SELECT DISTINCT deviceOS, groupUUID
    FROM userAttributes AS a
    JOIN userGroups AS g
    ON a.userUUID = g.userUUID
    WHERE g.groupUUID = 'group-1'
    LIMIT 100
    Execution Plan
    LogicalSort(offset=[0], fetch=[100])
      PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
        LogicalSort(fetch=[100])
          PinotLogicalAggregate(group=[{0, 1}])
            PinotLogicalExchange(distribution=[hash[0, 1]])
              PinotLogicalAggregate(group=[{0, 2}])
                LogicalJoin(condition=[=($1, $3)], joinType=[inner])
                  PinotLogicalExchange(distribution=[hash[1]])
                    LeafStageCombineOperator(table=[userAttributes])
                      StreamingInstanceResponse
                        StreamingCombineSelect
                          SelectStreaming(table=[userAttributes], totalDocs=[10000])
                            Project(columns=[[deviceOS, userUUID]])
                              DocIdSet(maxDocs=[40000])
                                FilterMatchEntireSegment(numDocs=[10000])
                  PinotLogicalExchange(distribution=[hash[1]])
                    LeafStageCombineOperator(table=[userGroups])
                      StreamingInstanceResponse
                        StreamingCombineSelect
                          SelectStreaming(table=[userGroups], totalDocs=[2478])
                            Project(columns=[[groupUUID, userUUID]])
                              DocIdSet(maxDocs=[50000])
                                FilterInvertedIndex(predicate=[groupUUID = 'group-1'], indexLookUp=[inverted_index], operator=[EQ])
                          SelectStreaming(segment=[userGroups_OFFLINE_4], table=[userGroups], totalDocs=[4])
                            Project(columns=[[groupUUID, userUUID]])
                              DocIdSet(maxDocs=[10000])
                                FilterEmpty
                          SelectStreaming(segment=[userGroups_OFFLINE_6], table=[userGroups], totalDocs=[4])
                            Project(columns=[[groupUUID, userUUID]])
                              DocIdSet(maxDocs=[10000])
                                FilterMatchEntireSegment(numDocs=[4])
    -- The WITHOUT IMPLEMENTATION qualifier can be used to ensure the logical plan is used
    -- It can be used in any version of Pinot even when segment plan is enabled by default
    EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR 
    SELECT DISTINCT deviceOS, groupUUID
    FROM userAttributes AS a
    JOIN userGroups AS g
    ON a.userUUID = g.userUUID
    WHERE g.groupUUID = 'group-1'
    LIMIT 100
    Execution Plan
    LogicalSort(offset=[0], fetch=[100])
      PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
        LogicalSort(fetch=[100])
          PinotLogicalAggregate(group=[{0, 1}])
            PinotLogicalExchange(distribution=[hash[0, 1]])
              PinotLogicalAggregate(group=[{0, 2}])
                LogicalJoin(condition=[=($1, $3)], joinType=[inner])
                  PinotLogicalExchange(distribution=[hash[1]])
                    LogicalProject(deviceOS=[$4], userUUID=[$6])
                      LogicalTableScan(table=[[default, userAttributes]])
                  PinotLogicalExchange(distribution=[hash[1]])
                    LogicalProject(groupUUID=[$3], userUUID=[$4])
                      LogicalFilter(condition=[=($3, _UTF-8'group-1')])
                        LogicalTableScan(table=[[default, userGroups]])
    EXPLAIN IMPLEMENTATION PLAN FOR
    SELECT DISTINCT deviceOS, groupUUID
    FROM userAttributes AS a
    JOIN userGroups AS g
    ON a.userUUID = g.userUUID
    WHERE g.groupUUID = 'group-1'
    LIMIT 100
    0]@192.168.0.98:54196|[0] MAIL_RECEIVE(BROADCAST_DISTRIBUTED)
    ├── [1]@192.168.0.98:54227|[3] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]} (Subtree Omitted)
    ├── [1]@192.168.0.98:54220|[2] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]} (Subtree Omitted)
    ├── [1]@192.168.0.98:54214|[1] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]} (Subtree Omitted)
    └── [1]@192.168.0.98:54206|[0] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]}
        └── [1]@192.168.0.98:54206|[0] SORT LIMIT 100
            └── [1]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                ├── [2]@192.168.0.98:54227|[3] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]} (Subtree Omitted)
                ├── [2]@192.168.0.98:54220|[2] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]} (Subtree Omitted)
                ├── [2]@192.168.0.98:54214|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]} (Subtree Omitted)
                └── [2]@192.168.0.98:54206|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]}
                    └── [2]@192.168.0.98:54206|[0] SORT LIMIT 100
                        └── [2]@192.168.0.98:54206|[0] AGGREGATE_FINAL
                            └── [2]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                                ├── [3]@192.168.0.98:54227|[3] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                ├── [3]@192.168.0.98:54220|[2] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                ├── [3]@192.168.0.98:54214|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                └── [3]@192.168.0.98:54206|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]}
                                    └── [3]@192.168.0.98:54206|[0] AGGREGATE_LEAF
                                        └── [3]@192.168.0.98:54206|[0] JOIN
                                            ├── [3]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                                            │   ├── [4]@192.168.0.98:54227|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                            │   └── [4]@192.168.0.98:54214|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]}
                                            │       └── [4]@192.168.0.98:54214|[0] PROJECT
                                            │           └── [4]@192.168.0.98:54214|[0] TABLE SCAN (userAttributes) null
                                            └── [3]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                                                ├── [5]@192.168.0.98:54227|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                                └── [5]@192.168.0.98:54214|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]}
                                                    └── [5]@192.168.0.98:54214|[0] PROJECT
                                                        └── [5]@192.168.0.98:54214|[0] FILTER
                                                            └── [5]@192.168.0.98:54214|[0] TABLE SCAN (userGroups) null
    explain plan for
    select customer.c_address, orders.o_shippriority
    from customer
    join orders
        on customer.c_custkey = orders.o_custkey
    limit 10
    LogicalSort(offset=[0], fetch=[10])
      PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
        LogicalSort(fetch=[10])
          LogicalProject(c_address=[$0], o_shippriority=[$3])
            LogicalJoin(condition=[=($1, $2)], joinType=[inner])
              PinotLogicalExchange(distribution=[hash[1]])
                LogicalProject(c_address=[$4], c_custkey=[$6])
                  LogicalTableScan(table=[[default, customer]])
              PinotLogicalExchange(distribution=[hash[0]])
                LogicalProject(o_custkey=[$5], o_shippriority=[$10])
                  LogicalTableScan(table=[[default, orders]])
    +---------------------------------------------|------------|---------|
    | Operator                                    | Operator_Id|Parent_Id|
    +---------------------------------------------|------------|---------|
    |BROKER_REDUCE(limit:10)                      | 1          | 0       |
    |COMBINE_SELECT                               | 2          | 1       |
    |PLAN_START(numSegmentsForThisPlan:1)         | -1         | -1      |
    |SELECT(selectList:playerID, playerName)      | 3          | 2       |
    |TRANSFORM_PASSTHROUGH(playerID, playerName)  | 4          | 3       |
    |PROJECT(playerName, playerID)                | 5          | 4       |
    |DOC_ID_SET                                   | 6          | 5       |
    |FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)      | 7          | 6       |
    +---------------------------------------------|------------|---------|
    BROKER_REDUCE(limit:10)
    └── COMBINE_SELECT
        └── PLAN_START(numSegmentsForThisPlan:1)
            └── SELECT(selectList:playerID, playerName)
                └── TRANSFORM_PASSTHROUGH(playerID, playerName)
                    └── PROJECT(playerName, playerID)
                        └── DOC_ID_SET
                            └── FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)
    Failed to start a Pinot [SERVER]
    java.lang.RuntimeException: java.net.BindException: Address already in use
    	at org.apache.pinot.core.transport.QueryServer.start(QueryServer.java:103) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
    	at org.apache.pinot.server.starter.ServerInstance.start(ServerInstance.java:158) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
    	at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:110) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da2113
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type batch
    ./bin/pinot-admin.sh QuickStart -type batch
    pinot-admin QuickStart -type batch
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type batch_json_index
    ./bin/pinot-admin.sh QuickStart -type batch_json_index
    pinot-admin QuickStart -type batch_json_index
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type batch_complex_type
    ./bin/pinot-admin.sh QuickStart -type batch_complex_type
    pinot-admin QuickStart -type batch_complex_type
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type stream
    ./bin/pinot-admin.sh QuickStart -type stream
    pinot-admin QuickStart -type stream
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type stream_json_index
    ./bin/pinot-admin.sh QuickStart -type stream_json_index
    pinot-admin QuickStart -type stream_json_index
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type realtime_minion
    ./bin/pinot-admin.sh QuickStart -type realtime_minion
    pinot-admin QuickStart -type realtime_minion
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type stream_complex_type
    ./bin/pinot-admin.sh QuickStart -type stream_complex_type
    pinot-admin QuickStart -type stream_complex_type
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type upsert
    ./bin/pinot-admin.sh QuickStart -type upsert
    pinot-admin QuickStart -type upsert
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type upsert_json_index
    ./bin/pinot-admin.sh QuickStart -type upsert_json_index
    pinot-admin QuickStart -type upsert_json_index
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type hybrid
    ./bin/pinot-admin.sh QuickStart -type hybrid
    pinot-admin QuickStart -type hybrid
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type join
    ./bin/pinot-admin.sh QuickStart -type join
    pinot-admin QuickStart -type join
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type LOGICAL_TABLE
    ./bin/pinot-admin.sh QuickStart -type LOGICAL_TABLE
    pinot-admin QuickStart -type LOGICAL_TABLE
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type EMPTY
    ./bin/pinot-admin.sh QuickStart -type EMPTY
    pinot-admin QuickStart -type EMPTY
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type MULTI_STAGE
    ./bin/pinot-admin.sh QuickStart -type MULTI_STAGE
    pinot-admin QuickStart -type MULTI_STAGE
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type PARTIAL_UPSERT
    ./bin/pinot-admin.sh QuickStart -type PARTIAL_UPSERT
    pinot-admin QuickStart -type PARTIAL_UPSERT
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type GEOSPATIAL
    ./bin/pinot-admin.sh QuickStart -type GEOSPATIAL
    pinot-admin QuickStart -type GEOSPATIAL
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type NULL_HANDLING
    ./bin/pinot-admin.sh QuickStart -type NULL_HANDLING
    pinot-admin QuickStart -type NULL_HANDLING
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type TPCH
    ./bin/pinot-admin.sh QuickStart -type TPCH
    pinot-admin QuickStart -type TPCH
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type COLOCATED_JOIN
    ./bin/pinot-admin.sh QuickStart -type COLOCATED_JOIN
    pinot-admin QuickStart -type COLOCATED_JOIN
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type LOOKUP_JOIN
    ./bin/pinot-admin.sh QuickStart -type LOOKUP_JOIN
    pinot-admin QuickStart -type LOOKUP_JOIN
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type AUTH
    ./bin/pinot-admin.sh QuickStart -type AUTH
    pinot-admin QuickStart -type AUTH
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type SORTED
    ./bin/pinot-admin.sh QuickStart -type SORTED
    pinot-admin QuickStart -type SORTED
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type TIMESTAMP
    ./bin/pinot-admin.sh QuickStart -type TIMESTAMP
    pinot-admin QuickStart -type TIMESTAMP
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type GITHUB_EVENTS
    ./bin/pinot-admin.sh QuickStart -type GITHUB_EVENTS
    pinot-admin QuickStart -type GITHUB_EVENTS
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type MULTI_CLUSTER
    ./bin/pinot-admin.sh QuickStart -type MULTI_CLUSTER
    pinot-admin QuickStart -type MULTI_CLUSTER
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type BATCH_MULTIDIR
    ./bin/pinot-admin.sh QuickStart -type BATCH_MULTIDIR
    pinot-admin QuickStart -type BATCH_MULTIDIR
    docker run \
        -p 9000:9000 \
        apachepinot/pinot:latest QuickStart \
        -type time_series
    ./bin/pinot-admin.sh QuickStart -type time_series
    pinot-admin QuickStart -type time_series
    # First run: quickstart generates data in the specified directory
    ./bin/pinot-admin.sh QuickStart -type batch -dataDir /tmp/pinot-quick-start
    
    # Subsequent runs: quickstart reloads existing data from disk
    ./bin/pinot-admin.sh QuickStart -type batch -dataDir /tmp/pinot-quick-start
    ./bin/pinot-admin.sh QuickStart -type batch \
        -zkAddress localhost:2181 \
        -configFile /path/to/pinot-quickstart.conf
    ./bin/pinot-admin.sh QuickStart -type EMPTY \
        -bootstrapTableDir /path/to/my-table-dir
    fetch{table="meetupRsvp_REALTIME",filter="",ts_column="__metadata$recordTimestamp",ts_unit="MILLISECONDS",value="1"}
    | sum{rsvp_count}
    | transformNull{0}
    | keepLastValue{}

-bootstrapTableDir: A list of directories, each containing a table schema, table config, and raw data. Use this with -type EMPTY or -type GENERIC to load your own tables into the quickstart cluster.

-configFile (alias: -configFilePath): Path to a properties file that overrides default Pinot configuration values (controller, broker, server, etc.).

-zkAddress (aliases: -zkUrl, -zkExternalAddress): URL for an external ZooKeeper instance (e.g. localhost:2181) instead of using the default embedded instance.

-kafkaBrokerList: Kafka broker list for streaming quickstarts (e.g. localhost:9092). Use this to connect to an external Kafka cluster instead of the embedded one.

The following summarizes query behavior under each null handling mode:

Operation
Null handling disabled
Basic null support
Advanced null support

IS NULL
always false
depends on data
depends on data

IS NOT NULL
always true
depends on data
depends on data

Transformation functions
use default value
use default value
null aware

Aggregations
use default value
use default value
null aware

To filter out the specified default null value, for example, you could write a query like the following:

select count(*) from my_table where column IS NOT NULL

Pinot can store nulls at ingestion time and then interpret them at query time under one of two null handling modes: the basic support mode or advanced null handling support. There are two ways of storing nulls at ingestion time:

• Column based null storing: enabled per column by setting enableColumnBasedNullHandling in the schema. This is the mode required for advanced null handling support and the multi-stage query engine.

• Table based null storing: enabled for all columns at once by setting tableIndexConfig.nullHandlingEnabled in the table config. The forward index still stores the configured default value for null rows; the null information is tracked separately.

Advanced null handling is enabled per query via a query option, and requires that nulls were stored at ingestion time.

  • offline: Loads data from a batch source

  • hybrid: Loads data from both a batch source and a streaming source

  • Pinot breaks a table into multiple segments and stores these segments in a deep-store such as Hadoop Distributed File System (HDFS) as well as Pinot servers.

    In the Pinot cluster, a table is modeled as a Helix resourcearrow-up-right and each segment of a table is modeled as a Helix Partitionarrow-up-right.

    circle-info

    Table naming in Pinot follows typical naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.

    Pinot supports the following types of tables:

    Type
    Description

    Offline

    Offline tables ingest pre-built Pinot segments from external data stores and are generally used for batch ingestion.

    Real-time

    Real-time tables ingest data from streams (such as Kafka) and build segments from the consumed data.

    Hybrid

    Hybrid Pinot tables have both real-time as well as offline tables under the hood. By default, all tables in Pinot are hybrid.

    circle-info

    The user querying the database does not need to know the type of the table. They only need to specify the table name in the query.

For example, regardless of whether we have an offline table myTable_OFFLINE, a real-time table myTable_REALTIME, or a hybrid table containing both of these, the query is simply: select count(*) from myTable

    Table configuration is used to define the table properties, such as name, type, indexing, routing, and retention. It is written in JSON format and is stored in Zookeeper, along with the table schema.

    Use the following properties to make your tables faster or leaner:

    • Segment

    • Indexing

    • Tenants

    hashtag
    Segments

    A table is comprised of small chunks of data known as segments. Learn more about how Pinot creates and manages segments here.

    For offline tables, segments are built outside of Pinot and uploaded using a distributed executor such as Spark or Hadoop. For details, see Batch Ingestion.

    For real-time tables, segments are built in a specific interval inside Pinot. You can tune the following for the real-time segments.

    hashtag
    Flush

    The Pinot real-time consumer ingests the data, creates the segment, and then flushes the in-memory segment to disk. Pinot allows you to configure when to flush the segment in the following ways:

    • Number of consumed rows: After consuming the specified number of rows from the stream, Pinot will persist the segment to disk.

    • Number of rows per segment: Pinot learns and then estimates the number of rows that need to be consumed. The learning phase starts by setting the number of rows to 100,000 (this value can be changed) and adjusts it to reach the appropriate segment size. Because Pinot corrects the estimate as it goes along, the segment size might go significantly over the correct size during the learning phase. You should set this value to optimize the performance of queries.

    • Max time duration to wait: Pinot consumers wait for the configured time duration after which segments are persisted to the disk.
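These thresholds are set through the streamConfigs section of the table config. A minimal sketch (values are illustrative; setting the row threshold to 0 switches to size-based flushing):

```json
{
  "streamConfigs": {
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "6h",
    "realtime.segment.flush.threshold.segment.size": "200M"
  }
}
```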

Replicas

A segment can have multiple replicas to provide higher availability. You can configure the number of replicas for a table segment using the CLI.

Completion Mode

By default, if the in-memory segment in the non-winner server is equivalent to the committed segment, then the non-winner server builds and replaces the segment. If the available segment is not equivalent to the committed segment, the server just downloads the committed segment from the controller.

    However, in certain scenarios, the segment build can get very memory-intensive. In these cases, you might want to enforce the non-committer servers to just download the segment from the controller instead of building it again. You can do this by setting completionMode: "DOWNLOAD" in the table configuration.

    For details, see Completion Config.
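A sketch of the corresponding table config fragment, assuming completionConfig sits under segmentsConfig as in the table config reference:

```json
{
  "segmentsConfig": {
    "completionConfig": {
      "completionMode": "DOWNLOAD"
    }
  }
}
```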

    Download Scheme

A Pinot server might fail to download segments from the deep store, such as HDFS, after segment completion. However, you can configure servers to download these segments from peer servers instead of the deep store. Currently, only HTTP and HTTPS download schemes are supported. More methods, such as gRPC/Thrift, are planned to be added in the future.

    For more details about peer segment download during real-time ingestion, refer to this design doc on bypass deep store for segment completion.arrow-up-right

    hashtag
    Indexing

    You can create multiple indices on a table to increase the performance of the queries. The following types of indices are supported:

• Forward Index

  • Dictionary-encoded forward index with bit compression

  • Raw value forward index

  • Sorted forward index with run-length encoding

• Inverted Index

  • Bitmap inverted index

  • Sorted inverted index

    For more details on each indexing mechanism and corresponding configurations, see Indexing.

Set up Bloom filters on columns to make queries faster. You can also keep segments in off-heap instead of on-heap memory for faster queries.

    hashtag
    Pre-aggregation

    Aggregate the real-time stream data as it is consumed to reduce segment sizes. We add the metric column values of all rows that have the same values for all dimension and time columns and create a single row in the segment. This feature is only available on REALTIME tables.

    The only supported aggregation is SUM. The columns to pre-aggregate need to satisfy the following requirements:

    • All metrics should be listed in noDictionaryColumns.

    • No multi-value dimensions

    • All dimension columns are treated to have a dictionary, even if they appear as noDictionaryColumns in the config.

    The following table config snippet shows an example of enabling pre-aggregation during real-time ingestion:
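A minimal sketch of such a snippet (the column names are hypothetical): aggregateMetrics turns on the pre-aggregation, and the metric columns are listed in noDictionaryColumns per the requirements above.

```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["clicks", "impressions"],
    "aggregateMetrics": true
  }
}
```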

    hashtag
    Tenants

    Each table is associated with a tenant. A segment resides on the server, which has the same tenant as itself. For details, see Tenant.

    Optionally, override if a table should move to a server with different tenant based on segment status. The example below adds a tagOverrideConfig under the tenants section for real-time tables to override tags for consuming and completed segments.
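A sketch of that tenants section (the tenant names are placeholders):

```json
{
  "tenants": {
    "broker": "brokerTenantName",
    "server": "serverTenantName",
    "tagOverrideConfig": {
      "realtimeConsuming": "serverTenantName_REALTIME",
      "realtimeCompleted": "serverTenantName_OFFLINE"
    }
  }
}
```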

    In the above example, the consuming segments will still be assigned to serverTenantName_REALTIME hosts, but once they are completed, the segments will be moved to serverTenantName_OFFLINE.

You can specify the full name of any tag in this section. For example, you could decide that completed segments for this table should be in Pinot servers tagged as allTables_COMPLETED. To learn more, see the Moving Completed Segments section.

    hashtag
    Hybrid table

    A hybrid table is a table composed of two tables, one offline and one real-time, that share the same name. In a hybrid table, offline segments can be pushed periodically. The retention on the offline table can be set to a high value because segments are coming in on a periodic basis, whereas the retention on the real-time part can be small.

    Once an offline segment is pushed to cover a recent time period, the brokers automatically switch to using the offline table for segments for that time period and use the real-time table only for data not available in the offline table.

    To learn how time boundaries work for hybrid tables, see Broker.

    A typical use case for hybrid tables is pushing deduplicated, cleaned-up data into an offline table every day while consuming real-time data as it arrives. Data can remain in offline tables for as long as a few years, while the real-time data would be cleaned every few days.

    hashtag
    Examples

    Create a table config for your data, or see examplesarrow-up-right for all possible batch/streaming tables.

    Prerequisites

    • Set up the cluster

    • Create broker and server tenants

    hashtag
    Offline table creation

    Sample console output

    Check out the table config in the Rest APIarrow-up-right to make sure it was successfully uploaded.

    hashtag
    Streaming table creation

    Start Kafka

    Create a Kafka topic

    Create a streaming table

    Sample output

    Start Kafka-Zookeeper

    Start Kafka

    Create stream table

    Check out the table config in the Rest APIarrow-up-right to make sure it was successfully uploaded.

    hashtag
    Logical table

    A logical table provides a unified query interface over multiple physical tables. This is useful for geographic partitioning, table sharding strategies, or creating abstraction layers over complex table hierarchies.

    For details, see Logical Table.

    hashtag
    Hybrid table creation

    To create a hybrid table, you have to create the offline and real-time tables individually. You don't need to create a separate hybrid table.

    schema

    HDFS

    This guide shows you how to configure HDFS for use with Pinot, including data import and deep storage.

    Enable the Hadoop distributed file system (HDFS)arrow-up-right using the pinot-hdfs plugin. In the controller or server, add the config:
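A sketch of the relevant JVM options, assuming the default plugin directory layout:

```
-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs
```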

    circle-info

By default Pinot loads all plugins, so you can just drop this plugin in the plugins directory. Also, if you specify -Dplugins.include, you need to list all the plugins you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-3.0...

The HDFS implementation provides the following options:

• hadoop.conf.path: Absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml and core-site.xml.

• hadoop.write.checksum: Create checksum while pushing an object. Default is false.

Each of these properties should be prefixed by pinot.[node].storage.factory.hdfs. where node is either controller or server, depending on the component being configured.
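For example, for a controller the properties might look like this (the paths are placeholders):

```
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/etc/hadoop/conf
pinot.controller.storage.factory.hdfs.hadoop.write.checksum=true
```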

The Kerberos configs should be used only if your Hadoop installation is secured with Kerberos. Refer to the Hadoop documentation for information on how to secure Hadoop using Kerberos.

    You must provide proper Hadoop dependencies jars from your Hadoop installation to your Pinot startup scripts.

    hashtag
    Push HDFS segment to Pinot Controller

    To push HDFS segment files to Pinot controller, send the HDFS path of your newly created segment files to the Pinot Controller. The controller will download the files.

This example curl request tells the controller to download the segment files into the proper table:
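A sketch of such a request, assuming a controller listening on localhost:9000 and a hypothetical segment path in HDFS:

```shell
curl -X POST \
    -H "UPLOAD_TYPE:URI" \
    -H "DOWNLOAD_URI:hdfs://namenode:8020/path/to/segments/mytable_0.tar.gz" \
    -H "content-type:application/json" \
    -d '' localhost:9000/segments
```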

    hashtag
    Examples

    hashtag
    Job spec

    Standalone Job:

    Hadoop Job:

    hashtag
    Controller config

    hashtag
    Server config

    hashtag
    Minion config

    hashtag
    HDFS as deep storage

    To use HDFS as deep storage, configure each Pinot component with the HDFS plugin and the appropriate storage factory and segment fetcher properties. The sections below provide complete configuration and startup examples for each component.

    hashtag
    Server setup

    hashtag
    Configuration

    hashtag
    Executable

    hashtag
    Controller setup

    hashtag
    Configuration

    hashtag
    Executable

    hashtag
    Broker setup

    hashtag
    Configuration

    hashtag
    Executable

    hashtag
    Kerberos authentication

    When using HDFS with Kerberos security enabled, Pinot provides two ways to authenticate:

    hashtag
    1. Automatic authentication (recommended)

    By configuring the storage.factory Kerberos properties shown above, Pinot will automatically handle Kerberos authentication using the specified keytab and principal. This eliminates the need for manual kinit commands and ensures continuous authentication even after ticket expiration.

    hashtag
    Why these properties are required

    The storage.factory Kerberos properties serve a critical purpose in Pinot's HDFS integration:

    For Controller:

    • The controller uses controller.data.dir to store segment metadata and other data in HDFS

    • When controller.data.dir points to an HDFS path (e.g., hdfs://namenode:8020/pinot/data), the HadoopPinotFS plugin needs Kerberos credentials to access it

    For Server:

    • The server uses HadoopPinotFS for various HDFS operations including segment downloads and deep storage access

    • When servers need to access segments stored in HDFS deep storage, they require valid Kerberos credentials

    • The storage.factory properties provide persistent authentication that survives across server restarts and ticket expirations

    hashtag
    Understanding the two sets of Kerberos properties

    You may notice two sets of Kerberos properties in the configuration:

    1. storage.factory properties (recommended):

      • pinot.controller.storage.factory.hdfs.hadoop.kerberos.principal

      • pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab

    hashtag
    Benefits of automatic authentication

    • Eliminates the need to run kinit commands manually, reducing operational overhead and human error

    • Kerberos tickets typically expire after 24 hours (configurable); with keytab-based authentication, Pinot automatically renews tickets internally, preventing service disruptions

    • Keytab files provide secure, long-term credentials without storing passwords in scripts or configuration

    hashtag
    2. Manual authentication (legacy)

    Alternatively, you can manually authenticate using kinit before starting Pinot components:
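For example (the keytab path, principal, and config file are placeholders):

```shell
# Obtain a Kerberos ticket from the keytab before starting the component
kinit -kt /etc/security/keytabs/pinot.keytab pinot/host.example.com@EXAMPLE.COM
klist    # confirm a valid ticket was granted
./bin/pinot-admin.sh StartServer -configFileName conf/pinot-server.conf
```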

    Limitations of manual authentication:

    • Ticket expiration: Kerberos tickets typically expire after 24 hours, requiring re-authentication

    • Service interruption: If tickets expire while Pinot is running, HDFS operations will fail until re-authentication

    • Operational burden: Requires monitoring and manual intervention, especially problematic for 24/7 production systems

    circle-exclamation

    Manual authentication is not recommended for production environments. Always use the storage.factory Kerberos properties for production deployments.

    hashtag
    Troubleshooting

    hashtag
    HDFS FileSystem issues

If you receive an error that says No FileSystem for scheme "hdfs", the problem is likely a class loading issue.

    To fix, try adding the following property to core-site.xml:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>

Then add /opt/pinot/lib/hadoop-common-<release-version>.jar to the classpath.

    hashtag
    Kerberos authentication issues

    hashtag
    Error: "Failed to authenticate with Kerberos"

    Possible causes:

    1. Incorrect keytab path: Ensure the keytab file path is absolute and accessible by the Pinot process

    2. Wrong principal name: Verify the principal name matches the one in the keytab file

3. Keytab file permissions: The keytab file must be readable by the user running Pinot (typically chmod 400 or chmod 600)

    Solution:
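A sketch of the checks (the keytab path is a placeholder):

```shell
# List the principals stored in the keytab and compare against the configured principal
klist -kt /etc/security/keytabs/pinot.keytab
# Make sure the file is readable only by the user running Pinot
chmod 600 /etc/security/keytabs/pinot.keytab
ls -l /etc/security/keytabs/pinot.keytab
```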

    hashtag
    Error: "GSSException: No valid credentials provided"

    Cause: This typically occurs when the storage.factory Kerberos properties are not set, the keytab file path is incorrect or the file doesn't exist, or the Kerberos configuration (krb5.conf) is not properly configured.

    Solution:

    1. Verify all storage.factory Kerberos properties are correctly set in the configuration

    2. Ensure the keytab file exists and has correct permissions

3. Check that /etc/krb5.conf (or $JAVA_HOME/jre/lib/security/krb5.conf) is properly configured and points to the correct KDC

    hashtag
    Error: "Unable to obtain Kerberos password" or "Clock skew too great"

    Cause: Time synchronization issue between Pinot server and Kerberos KDC.

    Solution:

    Kerberos requires clock synchronization within 5 minutes (default) between client and KDC.
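For example, on a host that uses chrony for time synchronization (commands assume chrony is installed):

```shell
chronyc tracking        # show the current offset from the configured time source
sudo chronyc makestep   # force an immediate correction if the clock is skewed
```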

    hashtag
    Error: "HDFS operation fails after running for several hours"

    Cause: This typically indicates that manual kinit was used instead of storage.factory properties, and Kerberos tickets have expired (default 24 hours).

    Solution:

    1. Configure storage.factory Kerberos properties to enable automatic ticket renewal

    2. Remove any manual kinit from startup scripts

    3. Restart Pinot components to apply the configuration

    hashtag
    Verifying Kerberos configuration

    To verify your Kerberos setup is working correctly:
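A sketch of an end-to-end check (the keytab path, principal, and HDFS URI are placeholders):

```shell
# 1. Authenticate with the same keytab and principal Pinot is configured with
kinit -kt /etc/security/keytabs/pinot.keytab pinot/host.example.com@EXAMPLE.COM
# 2. Confirm a valid ticket was granted
klist
# 3. Confirm the credentials can actually reach the HDFS paths Pinot uses
hadoop fs -ls hdfs://namenode:8020/pinot/data
```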

    hashtag
    Best practices

    1. Use absolute paths for keytab files in configuration

    2. Secure keytab files with appropriate permissions (400 or 600)

    3. Use service principals (e.g., pinot/hostname@REALM) rather than user principals for production

    Architecture

    Understand how the components of Apache Pinot™ work together to create a scalable OLAP database that can deliver low-latency, high-concurrency queries at scale.

    Apache Pinot™ is a distributed OLAP database designed to serve real-time, user-facing use cases, which means handling large volumes of data and many concurrent queries with very low query latencies. Pinot supports the following requirements:

    • Ultra low-latency queries (as low as 10ms P95)

    • High query concurrency (as many as 100,000 queries per second)

    • High data freshness (streaming data available for query immediately upon ingestion)

    • Large data volume (up to petabytes)

    hashtag
    Distributed design principles

    To accommodate large data volumes with stringent latency and concurrency requirements, Pinot is designed as a distributed database that supports the following requirements:

    • Highly available: Pinot has no single point of failure. When tables are configured for replication, and a node goes down, the cluster is able to continue processing queries.

• Horizontally scalable: Operators can scale a Pinot cluster by adding new nodes when the workload increases. There are even two node types (brokers and servers) to scale query volume, query complexity, and data size independently.

• Immutable data: Data stored in Pinot is immutable. Segments are written once and never modified in place; updates are supported through mechanisms like upserts rather than in-place edits.

    hashtag
    Core components

As described in the Pinot concepts documentation, Pinot has four node types: controller, broker, server, and minion.

    hashtag
    Apache Helix and ZooKeeper

Distributed systems do not maintain themselves, and in fact require sophisticated scheduling and resource management to function. Pinot uses Apache Helix for this purpose. Helix exists as an independent project, but it was designed by the original creators of Pinot for Pinot's own cluster management purposes, so the architectures of the two systems are well-aligned. Helix takes the form of a process on the controller, plus embedded agents on the brokers and servers. It uses Apache ZooKeeper as a fault-tolerant, strongly consistent, durable state store.

    Helix maintains a picture of the intended state of the cluster, including the number of servers and brokers, the configuration and schema of all tables, connections to streaming ingest sources, currently executing batch ingestion jobs, the assignment of table segments to the servers in the cluster, and more. All of these configuration items are potentially mutable quantities, since operators routinely change table schemas, add or remove streaming ingest sources, begin new batch ingestion jobs, and so on. Additionally, physical cluster state may change as servers and brokers fail or suffer network partition. Helix works constantly to drive the actual state of the cluster to match the intended state, pushing configuration changes to brokers and servers as needed.

    There are three physical node types in a Helix cluster:

    • Participant: These nodes do things, like store data or perform computation. Participants host resources, which are Helix's fundamental storage abstraction. Because Pinot servers store segment data, they are participants.

• Spectator: These nodes see things, observing the evolving state of the participants through events pushed to the spectator. Because Pinot brokers need to know which servers host which segments, they are spectators.

• Controller: This node observes and manages the participants, driving the actual state of the cluster toward the intended state. The Pinot controller hosts the Helix controller.

    In addition, Helix defines two logical components to express its storage abstraction:

    • Partition. A unit of data storage that lives on at least one participant. Partitions may be replicated across multiple participants. A Pinot segment is a partition.

    • Resource. A logical collection of partitions, providing a single view over a potentially large set of data stored across a distributed system. A Pinot table is a resource.

    In summary, the Pinot architecture maps onto Helix components as follows:

Pinot Component
Helix Component

Segment
Partition

Table
Resource

Controller
Helix Controller

Server
Participant

Broker
Spectator

Minion
Participant

    Helix uses ZooKeeper to maintain cluster state. ZooKeeper sends Helix spectators notifications of changes in cluster state (which correspond to changes in ZNodes). Zookeeper stores the following information about the cluster:

    Resource
    Stored Properties

Because ZooKeeper is a first-class citizen of a Pinot cluster, operators may use its well-known ZNode structure for operations and troubleshooting purposes. Be advised that this structure can change in future Pinot releases.

    hashtag
    Controller

The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time and batch data). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

    hashtag
    Fault tolerance

    Only one controller can be active at a time, so when multiple controllers are present in a cluster, they elect a leader. When that controller instance becomes unavailable, the remaining instances automatically elect a new leader. Leader election is achieved using Apache Helix. A Pinot cluster can serve queries without an active controller, but it can't perform any metadata-modifying operations, like adding a table or consuming a new segment.

    hashtag
    Controller REST interface

The controller provides a REST interface that allows read and write access to all logical storage resources (e.g., servers, brokers, tables, and segments). The same interface backs the controller's web-based admin tool.

    hashtag
    Broker

The broker's responsibility is to route queries to the appropriate server instances, or in the case of multi-stage queries, to compute a complete query plan and distribute it to the servers required to execute it. The broker collects and merges the responses from all servers into a final result, then sends the result back to the requesting client. The broker exposes an HTTP endpoint that accepts SQL queries in JSON format and returns the response in JSON.

    Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured strategy for a table. The default strategy is to balance the query load across all available servers.

    circle-info

    Advanced routing strategies are available, such as replica-aware routing, partition-based routing, and minimal server selection routing.

    hashtag
    Query processing

Every query processed by a broker uses either the single-stage engine or the multi-stage engine. For single-stage queries, the broker does the following:

    • Computes query routes based on the routing strategy defined in the configuration.

• Computes the list of segments to query on each server. (See the routing documentation for further details on this process.)

    • Sends the query to each of those servers for local execution against their segments.

    For multi-stage queries, the broker performs the following:

    • Computes a query plan that runs on multiple sets of servers. The servers selected for the first stage are selected based on the segments required to execute the query, which are determined in a process similar to single-stage queries.

    • Sends the relevant portions of the query plan to one or more servers in the cluster for each stage of the query plan.

• The servers that received query plans each execute their part of the query. For more details on this process, read about the multi-stage query engine.

    hashtag
    Server

Servers host segments on locally attached storage and process queries on those segments. By convention, operators speak of "real-time" and "offline" servers, although there is no difference in the server process itself or even its configuration that distinguishes between the two. This is merely a convention reflected in the assignment strategy to confine the two different kinds of workloads to two groups of physical instances, since the performance-limiting factors differ between the two kinds of workloads. For example, offline servers might optimize for larger storage capacity, whereas real-time servers might optimize for memory and CPU cores.

    hashtag
    Offline servers

Offline servers host segments created by ingesting batch data. The controller writes these segments to the offline server according to the table's replication factor and segment assignment strategy. Typically, the controller writes new segments to the deep store, and affected servers download the segment from deep store. The controller then notifies brokers that a new segment exists and is available to participate in queries.

    Because offline tables tend to have long retention periods, offline servers tend to scale based on the size of the data they store.

    hashtag
    Real-time servers

    Real-time servers ingest data from streaming sources, like Apache Kafka®, Apache Pulsar®, or AWS Kinesis. Streaming data ends up in conventional segment files just like batch data, but is first accumulated in an in-memory data structure known as a consuming segment. Each message consumed from a streaming source is written immediately to the relevant consuming segment, and is available for query processing from the consuming segment immediately, since consuming segments participate in query processing as first-class citizens. Consuming segments get flushed to disk periodically based on a completion threshold, which can be calculated by row count, ingestion time, or segment size. A flushed segment on a real-time table is called a completed segment, and is functionally equivalent to a segment created during offline ingest.

    Real-time servers tend to be scaled based on the rate at which they ingest streaming data.

    hashtag
    Minion

A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.

    Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function without minions, they are typically present to support routine tasks like ingesting batch data.

    hashtag
    Data ingestion overview

Pinot tables exist in two varieties: offline (or batch) and real-time. Offline tables contain data from batch sources like CSV, Avro, or Parquet files, and real-time tables contain data from streaming sources like Apache Kafka®, Apache Pulsar®, or AWS Kinesis.

    hashtag
    Offline (batch) ingest

Pinot ingests batch data using an ingestion job, which follows a process like this:

1. The job transforms a raw data source (such as a CSV file) into segments. This is a potentially complex process resulting in a file that is typically several hundred megabytes in size.

2. The job then transfers the segment file to the cluster's deep store and notifies the controller that a new segment exists.

3. The controller (in its capacity as a Helix controller) updates the ideal state of the cluster in its cluster metadata map.

    hashtag
    Real-time ingest

    Ingestion is established at the time a real-time table is created, and continues as long as the table exists. When the controller receives the metadata update to create a new real-time table, the table configuration specifies the source of the streaming input data—often a topic in a Kafka cluster. This kicks off a process like this:

    1. The controller picks one or more servers to act as direct consumers of the streaming input source.

    2. The controller creates consuming segments for the new table. It does this by creating an entry in the global metadata map for a new consuming segment for each of the real-time servers selected in step 1.

    3. Through Helix functionality on the controller and the relevant servers, the servers proceed to create consuming segments in memory and establish a connection to the streaming input source. When this input source is Kafka, each server acts as a Kafka consumer directly, with no other components involved in the integration.

    Complex Type Examples (Unnest)

    Additional examples that demonstrate handling of complex types.


    hashtag
    Unnest Root Level Collection

In this example, we will look at un-nesting JSON records that are batched together under a single key at the root level. We will make use of the complexTypeConfig to persist the individual student records as separate rows in Pinot.
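Un-nesting is driven by the complexTypeConfig inside the table's ingestionConfig. A minimal sketch, assuming the root-level collection is named students:

```json
"ingestionConfig": {
  "complexTypeConfig": {
    "fieldsToUnnest": ["students"]
  }
}
```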

    Batch Ingestion Guide

    Batch ingestion of data into Apache Pinot.

With batch ingestion you create a table using data already present in a file system such as S3. This is particularly useful when you want to use Pinot to query across large datasets with minimal latency or to test out new features using a simple data file.

    hashtag
    Choosing a Batch Ingestion Mode

    Pinot provides several batch ingestion modes. Use the table below to pick the one that fits your environment and data scale.

    select count(*) from my_table where column IS NOT NULL
    {
      "schemaName": "my_table",
      "enableColumnBasedNullHandling": true,
      "dimensionFieldSpecs": [
        {
          "name": "notNullColumn",
          "dataType": "STRING",
          "notNull": true
        },
        {
          "name": "explicitNullableColumn",
          "dataType": "STRING",
          "notNull": false
        },
        {
          "name": "implicitNullableColumn",
          "dataType": "STRING"
        }
      ]
    }
    {
      "tableIndexConfig": {
        "nullHandlingEnabled": true
      }
    }
    select $docId as rowId, col1 from my_table where col1 IS NOT NULL
    select $docId as rowId, col1 + 1 as result from my_table
    select $docId as rowId, col1 from my_table where col1 = 1
    select count(col1)  as count, mode(col1) as mode from my_table
        select count(*) from my_table where column <> 'default_null_value'
        select avg(Age) from my_table
        select avg(Age) from my_table WHERE Age <> -1
    select count(*)
    from myTable
    docker run \
        --network=pinot-demo \
        --name pinot-batch-table-creation \
        ${PINOT_IMAGE} AddTable \
        -schemaFile examples/batch/airlineStats/airlineStats_schema.json \
        -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json \
        -controllerHost pinot-controller \
        -controllerPort 9000 \
        -exec
    Executing command: AddTable -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json -schemaFile examples/batch/airlineStats/airlineStats_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
    Sending request: http://pinot-controller:9000/schemas to controller: a413b0013806, version: Unknown
    {"status":"Table airlineStats_OFFLINE succesfully added"}
    bin/pinot-admin.sh AddTable \
        -schemaFile examples/batch/airlineStats/airlineStats_schema.json \
        -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json \
        -exec
    # add schema
    curl -F schemaName=@airlineStats_schema.json  localhost:9000/schemas
    
    # add table
    curl -i -X POST -H 'Content-Type: application/json' \
        -d @airlineStats_offline_table_config.json localhost:9000/tables
    docker run \
        --network pinot-demo --name=kafka \
        -e KAFKA_NODE_ID=1 \
        -e KAFKA_PROCESS_ROLES=broker,controller \
        -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093 \
        -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092 \
        -e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
        -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT \
        -e KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafka:9093 \
        -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
        -e CLUSTER_ID=MkU3OEVBNTcwNTJENDM2Qk \
        -d apache/kafka:4.0.0
    docker exec \
      -t kafka \
      /opt/kafka/bin/kafka-topics.sh \
      --bootstrap-server kafka:9092 \
      --partitions=1 --replication-factor=1 \
      --create --topic flights-realtime
    docker run \
        --network=pinot-demo \
        --name pinot-streaming-table-creation \
        ${PINOT_IMAGE} AddTable \
        -schemaFile examples/stream/airlineStats/airlineStats_schema.json \
        -tableConfigFile examples/docker/table-configs/airlineStats_realtime_table_config.json \
        -controllerHost pinot-controller \
        -controllerPort 9000 \
        -exec
    Executing command: AddTable -tableConfigFile examples/docker/table-configs/airlineStats_realtime_table_config.json -schemaFile examples/stream/airlineStats/airlineStats_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
    Sending request: http://pinot-controller:9000/schemas to controller: 8fbe601012f3, version: Unknown
    {"status":"Table airlineStats_REALTIME succesfully added"}
    bin/pinot-admin.sh StartZookeeper -zkPort 2181
    bin/pinot-admin.sh  StartKafka -zkAddress=localhost:2181/kafka -port 19092
    pinot-table-realtime.json
        "tableIndexConfig": { 
          "noDictionaryColumns": ["metric1", "metric2"],
          "aggregateMetrics": true,
          ...
        }
{
  "broker": "brokerTenantName",
  "server": "serverTenantName",
  "tagOverrideConfig" : {
    "realtimeConsuming" : "serverTenantName_REALTIME",
    "realtimeCompleted" : "serverTenantName_OFFLINE"
  }
}
{
  "OFFLINE": {
        "tableName": "pinotTable", 
        "tableType": "OFFLINE", 
        "segmentsConfig": {
          ... 
        }, 
        "tableIndexConfig": { 
          ... 
        },  
        "tenants": {
          "broker": "myBrokerTenant", 
          "server": "myServerTenant"
        },
        "metadata": {
          ...
        }
      },
      "REALTIME": { 
        "tableName": "pinotTable", 
        "tableType": "REALTIME", 
        "segmentsConfig": {
          ...
        }, 
        "tableIndexConfig": { 
          ... 
          "streamConfigs": {
            ...
          },  
        },  
        "tenants": {
          "broker": "myBrokerTenant", 
          "server": "myServerTenant"
        },
        "metadata": {
        ...
        }
      }
    }
    -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs
  • hadoop.kerberos.principle
  • hadoop.kerberos.keytab

  • Without storage.factory Kerberos properties, the controller would fail to read/write to HDFS, causing segment upload and metadata operations to fail
  • These properties enable the HadoopPinotFS plugin to programmatically authenticate using the keytab file

  • pinot.server.storage.factory.hdfs.hadoop.kerberos.principal

  • pinot.server.storage.factory.hdfs.hadoop.kerberos.keytab

  • Purpose: These properties configure Kerberos authentication for the HadoopPinotFS storage factory, which handles controller and server deep storage operations and general HDFS filesystem operations through the storage factory.

    Why needed: The storage factory is initialized at startup and used throughout the component's lifecycle for HDFS access. Without these properties, any HDFS operation through the storage factory would fail with authentication errors.

  • segment.fetcher properties (legacy, for backward compatibility):

    • pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle (note: typo "principle" instead of "principal" maintained for compatibility)

    • pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab

    • pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle

    • pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab

    Purpose: These configure Kerberos for the segment fetcher component specifically.

    Why both are needed: While there is some functional overlap, having both ensures complete coverage of all HDFS access patterns, backward compatibility with existing deployments, and independent operation of the segment fetcher.

  • Automation challenges: Difficult to integrate into automated deployment pipelines

  • Ensure the Kerberos client configuration (typically /etc/krb5.conf) is properly configured with your Kerberos realm settings

  • Monitor Kerberos ticket expiration in logs to ensure automatic renewal is working

  • Keep keytab files backed up in secure locations

  • Test configuration in a non-production environment first

  • Hadoop in secure mode documentationarrow-up-right
    bin/pinot-admin.sh AddTable \
        -schemaFile examples/stream/airlineStats/airlineStats_schema.json \
        -tableConfigFile examples/stream/airlineStats/airlineStats_realtime_table_config.json \
        -exec
    export HADOOP_HOME=/local/hadoop/
    export HADOOP_VERSION=2.7.1
    export HADOOP_GUAVA_VERSION=11.0.2
    export HADOOP_GSON_VERSION=2.2.4
    export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
    curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file.
    executionFrameworkSpec:
        name: 'standalone'
        segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
        segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
        segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: 'hdfs:///path/to/input/directory/'
    outputDirURI: 'hdfs:///path/to/output/directory/'
    includeFileNamePath: 'glob:**/*.csv'
    overwriteOutput: true
    pinotFSSpecs:
        - scheme: hdfs
          className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
          configs:
            hadoop.conf.path: 'path/to/conf/directory/'
    recordReaderSpec:
        dataFormat: 'csv'
        className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
        configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
        tableName: 'students'
    pinotClusterSpecs:
        - controllerURI: 'http://localhost:9000'
    executionFrameworkSpec:
        name: 'hadoop'
        segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
        segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
        segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
        extraConfigs:
          stagingDir: 'hdfs:///path/to/staging/directory/'
    jobType: SegmentCreationAndTarPush
    inputDirURI: 'hdfs:///path/to/input/directory/'
    outputDirURI: 'hdfs:///path/to/output/directory/'
    includeFileNamePath: 'glob:**/*.csv'
    overwriteOutput: true
    pinotFSSpecs:
        - scheme: hdfs
          className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
          configs:
            hadoop.conf.path: '/etc/hadoop/conf/'
    recordReaderSpec:
        dataFormat: 'csv'
        className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
        configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
        tableName: 'students'
    pinotClusterSpecs:
        - controllerURI: 'http://localhost:9000'
    controller.data.dir=hdfs://path/to/data/directory/
    controller.local.temp.dir=/path/to/local/temp/directory
    controller.enable.split.commit=true
    pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
    pinot.controller.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
    pinot.controller.segment.fetcher.protocols=file,http,hdfs
    pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
    pinot.server.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
    pinot.server.segment.fetcher.protocols=file,http,hdfs
    pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
    storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
    storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory
    segment.fetcher.protocols=file,http,hdfs
    segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
    pinot.server.instance.enable.split.commit=true
    pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
    pinot.server.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
    # For server, instructing the HadoopPinotFS plugin to use the specified keytab and principal when accessing HDFS paths
    pinot.server.storage.factory.hdfs.hadoop.kerberos.principle=<hdfs-principle>
    pinot.server.storage.factory.hdfs.hadoop.kerberos.keytab=<hdfs-keytab>
    pinot.server.segment.fetcher.protocols=file,http,hdfs
    pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
    pinot.set.instance.id.to.hostname=true
    pinot.server.instance.dataDir=/path/in/local/filesystem/for/pinot/data/server/index
    pinot.server.instance.segmentTarDir=/path/in/local/filesystem/for/pinot/data/server/segment
    pinot.server.grpc.enable=true
    pinot.server.grpc.port=8090
    export HADOOP_HOME=/path/to/hadoop/home
    export HADOOP_VERSION=2.7.1
    export HADOOP_GUAVA_VERSION=11.0.2
    export HADOOP_GSON_VERSION=2.2.4
    export GC_LOG_LOCATION=/path/to/gc/log/file
    export PINOT_VERSION=0.10.0
    export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
    export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
    export ZOOKEEPER_ADDRESS=localhost:2181
    
    
    export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
    export JAVA_OPTS="-Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-server.log"
    ${PINOT_DISTRIBUTION_DIR}/bin/start-server.sh  -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName ${SERVER_CONF_DIR}/server.conf
    controller.data.dir=hdfs://path/in/hdfs/for/controller/segment
    controller.local.temp.dir=/tmp/pinot/
    controller.zk.str=<ZOOKEEPER_HOST:ZOOKEEPER_PORT>
    controller.enable.split.commit=true
    controller.access.protocols.http.port=9000
    controller.helix.cluster.name=PinotCluster
    pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
    pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
    # For controller, instructing the HadoopPinotFS plugin to use the specified keytab and principal when accessing the HDFS path defined in controller.data.dir
    pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle=<hdfs-principle>
    pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab=<hdfs-keytab>
    pinot.controller.segment.fetcher.protocols=file,http,hdfs
    pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
    pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
    controller.vip.port=9000
    controller.port=9000
    pinot.set.instance.id.to.hostname=true
    pinot.server.grpc.enable=true
    export HADOOP_HOME=/path/to/hadoop/home
    export HADOOP_VERSION=2.7.1
    export HADOOP_GUAVA_VERSION=11.0.2
    export HADOOP_GSON_VERSION=2.2.4
    export GC_LOG_LOCATION=/path/to/gc/log/file
    export PINOT_VERSION=0.10.0
    export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
    export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
    export ZOOKEEPER_ADDRESS=localhost:2181
    
    
    export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
    export JAVA_OPTS="-Xms8G -Xmx12G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-controller.log"
    ${PINOT_DISTRIBUTION_DIR}/bin/start-controller.sh -configFileName ${SERVER_CONF_DIR}/controller.conf
    pinot.set.instance.id.to.hostname=true
    pinot.server.grpc.enable=true
    export HADOOP_HOME=/path/to/hadoop/home
    export HADOOP_VERSION=2.7.1
    export HADOOP_GUAVA_VERSION=11.0.2
    export HADOOP_GSON_VERSION=2.2.4
    export GC_LOG_LOCATION=/path/to/gc/log/file
    export PINOT_VERSION=0.10.0
    export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
    export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
    export ZOOKEEPER_ADDRESS=localhost:2181
    
    
    export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
    export JAVA_OPTS="-Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-broker.log"
    ${PINOT_DISTRIBUTION_DIR}/bin/start-broker.sh -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName  ${SERVER_CONF_DIR}/broker.conf
    kinit -kt <your kerberos keytab> <your kerberos principal>
    # Verify keytab contains the correct principal
    klist -kt /path/to/your.keytab
    
    # Test authentication manually
kinit -kt /path/to/your.keytab your-principal@YOUR-REALM.COM
    
    # Check if authentication succeeded
    klist
    # Check time synchronization
    date
    # Ensure NTP is running and synchronized
    sudo systemctl status ntpd
    # Or for chrony
    sudo systemctl status chronyd
    # 1. Test keytab authentication
kinit -kt /path/to/your.keytab your-principal@YOUR-REALM.COM
    
    # 2. Verify you can list HDFS directories
    hdfs dfs -ls /
    
    # 3. Check Pinot logs for authentication messages
    tail -f /path/to/pinot/logs/pinot-controller.log | grep -i kerberos
    tail -f /path/to/pinot/logs/pinot-server.log | grep -i kerberos
    
    # 4. Look for successful authentication messages like:
    # "Login successful for user <principal> using keytab file <keytab-path>"
  • Immutable data: Pinot assumes all stored data is immutable, which helps simplify the parts of the system that handle data storage and replication. However, Pinot still supports upserts on streaming entity data and background purges of data to comply with data privacy regulations.
  • Dynamic configuration changes: Operations like adding new tables, expanding a cluster, ingesting data, modifying an existing table, and adding indexes do not impact query availability or performance.

    Controller: This node observes and manages the state of participant nodes. The controller is responsible for coordinating all state transitions in the cluster and ensures that state constraints are satisfied while maintaining cluster stability.

    Broker

    A Helix Spectator that observes the cluster for changes in the state of segments and servers. To support multi-tenancy, brokers are also modeled as Helix Participants.

    Minion

    Helix Participant that performs computation rather than storing data

    Receives the results from each server and merges them.

  • Sends the query result to the client.

  • The broker receives a complete result set from the final stage of the query, which is always a single server.

  • The broker sends the query result to the client.

  • The controller then assigns the segment to one or more "offline" servers (depending on replication factor) and notifies them that new segments are available.

  • The servers then download the newly created segments directly from the deep store.

  • The cluster's brokers, which watch for state changes as Helix spectators, detect the new segments and update their segment routing tables accordingly. The cluster is now able to query the new offline segments.

  • Through Helix functionality on the controller and all of the cluster's brokers, the brokers become aware of the consuming segments, and begin including them in query routing immediately.

  • The consuming servers simultaneously begin consuming messages from the streaming input source, storing them in the consuming segment.

  • When a server decides its consuming segment is complete, it commits the in-memory consuming segment to a conventional segment file, uploads it to the deep store, and notifies the controller.

  • The controller and the server create a new consuming segment to continue real-time ingestion.

  • The controller marks the newly committed segment as online. Brokers then discover the new segment through the Helix notification mechanism, allowing them to route queries to it in the usual fashion.

  • Segment

    Helix Partition

    Table

    Helix Resource

    Controller

    Helix Controller or Helix agent that drives the overall state of the cluster

    Server

    Controller

    - Controller that is assigned as the current leader

    Servers and Brokers

    - List of servers and brokers - Configuration of all current servers and brokers - Health status of all current servers and brokers

    Tables

    - List of tables - Table configurations - Table schema - List of the table's segments

    Segment

    Pinot's Zookeeper Browser UI

    Helix Participant

    - Exact server locations of a segment - State of each segment (online/offline/error/consuming) - Metadata about each segment

    hashtag
    Sample JSON record

    hashtag
    Pinot Schema

    The Pinot schema for this example would look as follows.

    hashtag
    Pinot Table Configuration

    The Pinot table configuration for this schema would look as follows.

    hashtag
    Data in Pinot

    Post ingestion, the student records would appear as separate records in Pinot. Note that the nested field scores is captured as a JSON field.

    Unnested Student Records

    hashtag
    Unnest sibling collections

In this example, we will look at un-nesting the sibling collections "student" and "teacher".

    hashtag
    Sample JSON Record

    hashtag
    Pinot Schema

    hashtag
    Pinot Table configuration

    hashtag
    Data in Pinot

    Unnested student records

    hashtag
    Unnest nested collection

In this example, we will look at un-nesting the nested collection "students.grades".

    hashtag
    Sample JSON Record

    hashtag
    Pinot Schema

    hashtag
    Pinot Table configuration

    hashtag
    Data in Pinot

    Unnest Nested Collection

    hashtag
    Unnest Multi Level Array

In this example, we will look at un-nesting the array "finalExam", which is located within the array "students".

    hashtag
    Sample JSON Record

    hashtag
    Pinot Schema

    hashtag
    Pinot Table configuration

    hashtag
    Data in Pinot

    Unnested Multi Level Array

    hashtag
    Convert inner collections

In this example, the inner collection "grades" is converted into a multi-value string column.

    hashtag
    Sample JSON Record

    hashtag
    Pinot Schema

    hashtag
    Pinot Table configuration

    hashtag
    Data in Pinot

    Converted Inner Collection

    hashtag
    Primitive Array Converted to JSON String

In this example, the array of primitives "extra_curricular" is converted to a JSON string.

    hashtag
    Sample JSON Record

    hashtag
    Pinot Schema

    hashtag
    Pinot Table configuration

    hashtag
    Data in Pinot

    Primitives Converted to JSON

    hashtag
    Unnest JsonArrayString collections

In this example, the field data is of STRING type, and its content is a string-encoded JSON array.

    In this case, the Unnest won't happen automatically on a STRING field.

Users need to first convert the STRING field to an ARRAY or MAP field, then perform the unnest.

    Here are the steps:

1. Use enrichmentConfigs to create the intermediate column recordArray with the function: jsonStringToListOrMap(data_for_unnesting)

2. Configure complexTypeConfig to unnest the intermediate field recordArray to generate the field recordArray||name

    hashtag
    Sample Record

    hashtag
    Pinot Schema

    circle-exclamation

    Note the field to ingest is recordArray||name not data_for_unnesting||name

    hashtag
    Pinot Table Configuration

    hashtag
    Data in Pinot

    hashtag
    Decision Guide
    Mode
    Best For
    Infrastructure
    Data Scale
    Status

    Standalone

    Dev/test, small jobs, scripted pipelines

    None (single JVM)

    Up to a few GB

    Recommended for dev

    hashtag
    When to Use Each Mode

    Standalone is the simplest option and requires no distributed computing framework. It runs segment generation in a single JVM process, making it ideal for development, testing, and small production jobs where data volumes are modest (up to a few GB). It is also well suited for scripted CI/CD pipelines.

    Spark 3 is the recommended choice for production batch ingestion at scale. It distributes segment generation across a Spark 3.x cluster, enabling you to process datasets ranging from gigabytes to terabytes and beyond. If you are setting up a new Spark-based pipeline, use this mode.

    Hadoop uses MapReduce to generate segments on a Hadoop cluster. It is considered legacy and is primarily useful if you have existing MapReduce infrastructure and pipelines that you cannot migrate away from.

    Flink is a good fit for organizations that already run Apache Flink. It supports both batch and streaming modes and is especially useful for backfilling offline tables or bootstrapping upsert tables, since the Flink connector can write partitioned segments that participate correctly in upsert semantics.

    LaunchDataIngestionJob is a CLI convenience wrapper that invokes the Standalone runner under the hood. Use it when you want to trigger ingestion from a shell command or cron job without writing custom code.

    hashtag
    Maven Artifact Coordinates

    All artifacts use the group ID org.apache.pinot. Replace ${pinot.version} with your Pinot release version.

    Mode
    Artifact ID
    Notes

    Standalone

    pinot-batch-ingestion-standalone

    Included in the Pinot binary distribution

    Spark 3

    pinot-batch-ingestion-spark-3

    Located in plugins-external/pinot-batch-ingestion/

    Example Maven dependency for Spark 3:
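The snippet below sketches that dependency, using the group ID and artifact ID from the table above:

```xml
<dependency>
  <groupId>org.apache.pinot</groupId>
  <artifactId>pinot-batch-ingestion-spark-3</artifactId>
  <version>${pinot.version}</version>
</dependency>
```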


    hashtag
    Getting Started

To ingest data from a filesystem, perform the following steps, which are described in more detail on this page:

    1. Create schema configuration

    2. Create table configuration

    3. Upload schema and table configs

    4. Upload data

    Batch ingestion currently supports the following mechanisms to upload the data:

    • Standalone

    • Hadoop

    • Spark

    Here's an example using standalone local processing.

    First, create a table using the following CSV data.

    hashtag
    Create schema configuration

In our data, the only column on which aggregations can be performed is score, and timestampInEpoch is the only timestamp column. So, in our schema, we keep score as a metric and timestampInEpoch as the timestamp column.

    Here, we have also defined two extra fields: format and granularity. The format specifies the formatting of our timestamp column in the data source. Currently, it's in milliseconds, so we've specified 1:MILLISECONDS:EPOCH.
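Put together, the timestamp column's dateTimeFieldSpec might look like the sketch below; the column name comes from the data above, and the granularity value is illustrative:

```json
"dateTimeFieldSpecs": [
  {
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
]
```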

    hashtag
    Create table configuration

    We define a table transcript and map the schema created in the previous step to the table. For batch data, we keep the tableType as OFFLINE.
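A minimal sketch of such a table config; the replication value is illustrative, and schemaName must match the schema created in the previous step:

```json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "replication": "1",
    "timeColumnName": "timestampInEpoch"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "metadata": {}
}
```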

    hashtag
    Upload schema and table configs

    Now that we have both the configs, upload them and create a table by running the following command:
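The upload can be done with the AddTable command shown earlier in this guide; a sketch, assuming the schema and table config are saved as transcript_schema.json and transcript_table_config.json:

```shell
bin/pinot-admin.sh AddTable \
    -schemaFile transcript_schema.json \
    -tableConfigFile transcript_table_config.json \
    -exec
```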

Check out the table config and schema in the Rest API to make sure they were successfully uploaded.

    hashtag
    Upload data

    We now have an empty table in Pinot. Next, upload the CSV file to this empty table.

    A table is composed of multiple segments. The segments can be created in the following three ways:

• Minion-based ingestion

• Upload API

• Ingestion jobs

    hashtag
    Minion-based ingestion

    Refer to SegmentGenerationAndPushTask

    hashtag
    Upload API

    There are 2 controller APIs that can be used for a quick ingestion test using a small file.

    triangle-exclamation

    When these APIs are invoked, the controller has to download the file and build the segment locally.

Hence, these APIs are NOT meant for production environments or for large input files.

    hashtag
    /ingestFromFile

    This API creates a segment using the given file and pushes it to Pinot. All steps happen on the controller.

    Example usage:

To upload a JSON file data.json to a table called foo_OFFLINE, use the following command.

    Note that query params need to be URLEncoded. For example, {"inputFormat":"json"} in the command below needs to be converted to %7B%22inputFormat%22%3A%22json%22%7D.
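As a quick check of the encoding, here's a sketch using Python's standard library (not part of Pinot) that produces the URL-encoded form shown above:

```python
from urllib.parse import quote

config = '{"inputFormat":"json"}'
# Encode every reserved character, as required for the query parameter
encoded = quote(config, safe="")
print(encoded)  # %7B%22inputFormat%22%3A%22json%22%7D
```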

The batchConfigMapStr can be used to pass in additional properties needed for decoding the file. For example, in the case of CSV, you may need to provide the delimiter.

    hashtag
    /ingestFromURI

    This API creates a segment using file at the given URI and pushes it to Pinot. Properties to access the FS need to be provided in the batchConfigMap. All steps happen on the controller. Example usage:

    hashtag
    Ingestion jobs

    Segments can be created and uploaded using tasks known as DataIngestionJobs. A job also needs a config of its own. We call this config the JobSpec.

    For our CSV file and table, the JobSpec should look like this:

    For more detail, refer to Ingestion job spec.

    Now that we have the job spec for our table transcript, we can trigger the job using the following command:

    Once the job successfully finishes, head over to the \[query console] and start playing with the data.

    hashtag
    Segment push job type

    There are 3 ways to upload a Pinot segment:

    • Segment tar push

    • Segment URI push

    • Segment metadata push

    hashtag
    Segment tar push

    This is the original and default push mechanism.

Tar push requires the segment to be stored locally, or somewhere it can be opened as an InputStream on PinotFS, so that the entire segment tar file can be streamed to the controller.

    The push job will:

    1. Upload the entire segment tar file to the Pinot controller.

    Pinot controller will:

1. Save the segment into the controller segment directory (local or any PinotFS).

    2. Extract segment metadata.

    3. Add the segment to the table.

    hashtag
    Segment URI push

    This push mechanism requires the segment tar file stored on a deep store with a globally accessible segment tar URI.

URI push is lightweight on the client side, but the controller side requires the same amount of work as tar push.

    The push job will:

    1. POST this segment tar URI to the Pinot controller.

    Pinot controller will:

    1. Download segment from the URI and save it to controller segment directory (local or any PinotFS).

    2. Extract segment metadata.

    3. Add the segment to the table.

    hashtag
    Segment metadata push

    This push mechanism also requires the segment tar file stored on a deep store with a globally accessible segment tar URI.

Metadata push is lightweight on the controller side; no deep store download is involved on the controller side.

    The push job will:

    1. Download the segment based on URI.

    2. Extract metadata.

    3. Upload metadata to the Pinot Controller.

    Pinot Controller will:

    1. Add the segment to the table based on the metadata.

hashtag
Segment metadata push with copyToDeepStore

This extends the original segment metadata push for cases where the segments are pushed to a location that is not used as the deep store. The ingestion job can still do a metadata push but asks the Pinot controller to copy the segments into the deep store. These use cases usually arise when the ingestion jobs don't have direct access to the deep store but still want to use metadata push for its efficiency, so they use a staging location to keep the segments temporarily.

NOTE: the staging location and the deep store have to use the same storage scheme, for example both on S3. This is because the copy is done via the PinotFS.copyDir interface, which assumes a single scheme. It also means the copy happens on the storage system side, so segments don't need to go through the Pinot controller at all.

    To make this work, grant Pinot controllers access to the staging location. For example on AWS, this may require adding an access policy like this example for the controller EC2 instances:

    Then use metadata push to add one extra config like this one:

    hashtag
    Consistent data push and rollback

Pinot supports atomic updates only at the segment level. This means that when data consisting of multiple segments is pushed to a table, the segments are replaced one at a time, and queries to the broker during this upload phase may produce inconsistent results due to interleaving of old and new data.

See Consistent Push and Rollback for how to enable this feature.

    hashtag
    Segment fetchers

When Pinot segment files are created in external systems (Hadoop, Spark, etc.), there are several ways to push that data to the Pinot controller and server:

1. Push segments to a shared NFS and let Pinot pull the segment files from that NFS location. See Segment URI Push.

2. Push segments to a web server and let Pinot pull the segment files from the web server with an HTTP/HTTPS link. See Segment URI Push.

3. Push segments to PinotFS (HDFS/S3/GCS/ADLS) and let Pinot pull the segment files from a PinotFS URI. See Segment URI Push and Segment Metadata Push.

4. Push segments to other systems and implement your own segment fetcher to pull data from those systems.

The first three options are supported out of the box within the Pinot package. As long as your remote jobs send the Pinot controller the corresponding URI to the files, it will pick up the files and allocate them to the proper Pinot servers and brokers. To enable Pinot support for PinotFS, you'll need to provide PinotFS configuration and proper Hadoop dependencies.

    hashtag
    Persistence

By default, Pinot does not come with a storage layer, so the data you send won't survive a system crash. To persistently store the generated segments, you will need to change the controller and server configs to add deep storage. Check out File systems for all the info and related configs.
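As an illustration, the controller side of an S3 deep store setup might look like the following controller config fragment. The keys follow the Pinot S3 deep store docs; the bucket name and region are placeholders, and the File systems page is the authoritative reference:

```properties
controller.data.dir=s3://my-pinot-bucket/pinot-data/controller-data
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```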

    hashtag
    Tuning

    hashtag
    Standalone

Since Pinot is written in Java, you can set the following basic Java configurations to tune the segment runner job:

    • Log4j2 file location with -Dlog4j2.configurationFile

    • Plugin directory location with -Dplugins.dir=/opt/pinot/plugins

    • JVM props, like -Xmx8g -Xms4G

If you are using Docker, you can set these under the JAVA_OPTS variable.
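For example, the flags above can be passed through JAVA_OPTS when launching the ingestion job in Docker. This is a sketch: the image tag, mounted paths, and job spec location are placeholders:

```shell
docker run --rm \
  -e JAVA_OPTS="-Xms4G -Xmx8g -Dlog4j2.configurationFile=/opt/pinot/conf/log4j2.xml -Dplugins.dir=/opt/pinot/plugins" \
  -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
  apachepinot/pinot:latest LaunchDataIngestionJob \
  -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yaml
```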

    hashtag
    Hadoop

    You can set -D mapreduce.map.memory.mb=8192 to set the mapper memory size when submitting the Hadoop job.

    hashtag
    Spark

    You can add config spark.executor.memory to tune the memory usage for segment creation when submitting the Spark job.
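For instance, the executor memory can be raised at submit time. This is a sketch with placeholder paths; the main class name follows the Pinot Spark ingestion docs:

```shell
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --conf spark.executor.memory=8g \
  "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  -jobSpecFile /path/to/spark_job_spec.yaml
```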

    Supported Data Formats

    This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.

    Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent doing serialization-deserialization and speed up the ingestion.

    hashtag
    Configuring input formats

    To change the input format, adjust the recordReaderSpec config in the ingestion job specification.

    The configuration consists of the following keys:

    • dataFormat: Name of the data format to consume.

    • className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.

    hashtag
    Supported input formats

    Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.

    hashtag
    CSV

    CSV Record Reader supports the following configs:

    • fileFormat: default, rfc4180, excel, tdf, mysql

    circle-info

Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config. multiValueDelimiter: ''

    hashtag
    Avro

The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, the Avro record reader only supports primitive types. To enable support for the rest of the Avro data types, set enableLogicalTypes to true.

We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in org.apache.avro.Conversions.

    Avro Data Type
    Pinot Data Type
    Comment

    hashtag
    JSON

    hashtag
    Thrift

    circle-info

Thrift requires a class generated from the .thrift file to parse the data. The .class file should be available in Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.

    hashtag
    Parquet

Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the Parquet file footer and, if present, uses the Avro reader.

    You can change the record reader manually in case of a misconfiguration.

    circle-exclamation

    For the support of DECIMAL and other parquet native data types, always use ParquetNativeRecordReader.

For ParquetAvroRecordReader, you can refer to the Avro section above for the type conversions.

    hashtag
    ORC

    ORC record reader supports the following data types -

    ORC Data Type
    Java Data Type
    circle-info

    In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.

    hashtag
    Protocol Buffers

The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using the following command:

    hashtag
    Apache Arrow

The Arrow input format plugin supports reading data in the Apache Arrow IPC streaming format. This is useful for ingesting data from systems that produce Arrow-formatted output.

    circle-check

    The pinot-arrow plugin is included in the standard Pinot binary distribution (tarball and Docker image). The ArrowMessageDecoder is available out of the box, and no additional installation steps are required to use Apache Arrow format for data ingestion.

    For stream ingestion, the Arrow decoder converts Arrow columnar batches to Pinot rows:

    Configuration properties:

    Property
    Default
    Description

    The decoder handles Arrow type conversions automatically: Text → String, LocalDateTime → Timestamp, Arrow Maps → flattened Map<String, Object>, and Arrow Lists → List<Object>. Dictionary-encoded columns are also supported.

    Ingestion Transformations

    Raw source data often needs to undergo some transformations before it is pushed to Pinot.

    Transformations include extracting records from nested objects, applying simple transform functions on certain columns, filtering out unwanted columns, as well as more advanced operations like joining between datasets.

    A preprocessing job is usually needed to perform these operations. In streaming data sources, you might write a Samza job and create an intermediate topic to store the transformed data.

    For simple transformations, this can result in inconsistencies in the batch/stream data source and increase maintenance and operator overhead.

To make things easier, Pinot supports transformations that can be applied via the table config.
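As an illustration (the column names here are hypothetical), such a transformation is declared under ingestionConfig in the table config, one transformConfig entry per derived column:

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "timestampInEpoch",
        "transformFunction": "fromDateTime(dateStr, 'yyyy-MM-dd HH:mm:ss')"
      }
    ]
  }
}
```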

    //This is an example ZNode config for EXTERNAL VIEW in Helix
    {
      "id" : "baseballStats_OFFLINE",
      "simpleFields" : {
        ...
      },
      "mapFields" : {
        "baseballStats_OFFLINE_0" : {
          "Server_10.1.10.82_7000" : "ONLINE"
        }
      },
      ...
    }
    // Query: select count(*) from baseballStats limit 10
    
    // RESPONSE
    // ========
    {
        "resultTable": {
            "dataSchema": {
                "columnDataTypes": ["LONG"],
                "columnNames": ["count(*)"]
            },
            "rows": [
                [97889]
            ]
        },
        "exceptions": [],
        "numServersQueried": 1,
        "numServersResponded": 1,
        "numSegmentsQueried": 1,
        "numSegmentsProcessed": 1,
        "numSegmentsMatched": 1,
        "numConsumingSegmentsQueried": 0,
        "numDocsScanned": 97889,
        "numEntriesScannedInFilter": 0,
        "numEntriesScannedPostFilter": 0,
        "numGroupsLimitReached": false,
        "totalDocs": 97889,
        "timeUsedMs": 5,
        "segmentStatistics": [],
        "traceInfo": {},
        "minConsumingFreshnessTimeMs": 0
    }
    {
      "students": [
        {
          "firstName": "Jane",
          "id": "100",
          "scores": {
            "physics": 91,
            "chemistry": 93,
            "maths": 99
          }
        },
        {
          "firstName": "John",
          "id": "101",
          "scores": {
            "physics": 97,
            "chemistry": 98,
            "maths": 99
          }
        },
        {
          "firstName": "Jen",
          "id": "102",
          "scores": {
            "physics": 96,
            "chemistry": 95,
            "maths": 100
          }
        }
      ]
    }
    {
      "schemaName": "students001",
      "enableColumnBasedNullHandling": false,
      "dimensionFieldSpecs": [
        {
          "name": "students.firstName",
          "dataType": "STRING",
          "notNull": false,
          "fieldType": "DIMENSION"
        },
        {
          "name": "students.id",
          "dataType": "STRING",
          "notNull": false,
          "fieldType": "DIMENSION"
        },
        {
          "name": "students.scores",
          "dataType": "JSON",
          "notNull": false,
          "fieldType": "DIMENSION"
        }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "ts",
          "fieldType": "DATE_TIME",
          "dataType": "LONG",
          "format": "1:MILLISECONDS:EPOCH",
          "granularity": "1:MILLISECONDS"
        }
      ],
      "metricFieldSpecs": []
    }
    {
        "ingestionConfig": {
          "complexTypeConfig": {
            "fieldsToUnnest": [
              "students"
            ]
          }
      }
    }
    {
      "student": [
        {
          "name": "John"
        },
        {
          "name": "Jane"
        }
      ],
      "teacher": [
        {
          "physics": "Kim"
        },
        {
          "chemistry": "Lu"
        },
        {
          "maths": "Walsh"
        }
      ]
    }
    {
      "schemaName": "students002",
      "enableColumnBasedNullHandling": false,
      "dimensionFieldSpecs": [
        {
          "name": "student.name",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        },
        {
          "name": "teacher.physics",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        },
        {
          "name": "teacher.chemistry",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        },
        {
          "name": "teacher.maths",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        }
      ]
    }
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "student",
          "teacher"
        ]
      }
    {
      "students": [
        {
          "name": "Jane",
          "grades": [
            {
              "physics": "A+"
            },
            {
              "maths": "A-"
            }
          ]
        },
        {
          "name": "John",
          "grades": [
            {
              "physics": "B+"
            },
            {
              "maths": "B-"
            }
          ]
        }
      ]
    }
    {
      "schemaName": "students003",
      "enableColumnBasedNullHandling": false,
      "dimensionFieldSpecs": [
        {
          "name": "students.name",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        },
        {
          "name": "students.grades.physics",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        },
        {
          "name": "students.grades.maths",
          "dataType": "STRING",
          "fieldType": "DIMENSION",
          "notNull": false
        }
      ]
    }
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "students",
          "students.grades"
        ]
      }
    {
      "students": [
        {
          "name": "John",
          "grades": {
            "finalExam": [
              {
                "physics": "A+"
              },
              {
                "maths": "A-"
              }
            ]
          }
        },
        {
          "name": "Jane",
          "grades": {
            "finalExam": [
              {
                "physics": "B+"
              },
              {
                "maths": "B-"
              }
            ]
          }
        }
      ]
    }
    {
        "schemaName": "students004",
        "enableColumnBasedNullHandling": false,
        "dimensionFieldSpecs": [
          {
            "name": "students.name",
            "dataType": "STRING",
            "notNull": false,
            "fieldType": "DIMENSION"
          },
          {
            "name": "students.grades.finalExam.physics",
            "dataType": "STRING",
            "notNull": false,
            "fieldType": "DIMENSION"
          },
          {
            "name": "students.grades.finalExam.maths",
            "dataType": "STRING",
            "notNull": false,
            "fieldType": "DIMENSION"
          }
        ]
      }
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "students",
          "students.grades.finalExam"
        ]
      }
    {
      "students": [
        {
          "name": "John",
          "grades": [
            {
              "physics": "A+"
            },
            {
              "maths": "A"
            }
          ]
        },
        {
          "name": "Jane",
          "grades": [
            {
              "physics": "B+"
            },
            {
              "maths": "B-"
            }
          ]
        }
      ]
    }
    {
        "schemaName": "students005",
        "enableColumnBasedNullHandling": false,
        "dimensionFieldSpecs": [
          {
            "name": "students.name",
            "dataType": "STRING",
            "notNull": false,
            "fieldType": "DIMENSION"
          },
          {
            "name": "students.grades",
            "dataType": "STRING",
            "notNull": false,
            "isSingleValue": false,
            "fieldType": "DIMENSION"
          }
        ]
      }
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "students"
        ]
      }
    {
      "students": [
        {
          "name": "John",
          "extra_curricular": [
            "piano", "soccer"
          ]
        },
        {
          "name": "Jane",
          "extra_curricular": [
            "violin", "music"
          ]
        }
      ]
    }
    {
        "schemaName": "students006",
        "enableColumnBasedNullHandling": false,
        "dimensionFieldSpecs": [
          {
            "name": "students.name",
            "dataType": "STRING",
            "notNull": false,
            "fieldType": "DIMENSION"
          },
          {
            "name": "students.extra_curricular",
            "dataType": "JSON",
            "notNull": false,
            "fieldType": "DIMENSION"
          }
        ]
      }
        "complexTypeConfig": {
          "fieldsToUnnest": [
            "students"
          ], 
          "collectionNotUnnestedToJson": "ALL"
        }
    "enrichmentConfigs": [
      {
        "enricherType": "generateColumn",
        "properties": {"fieldToFunctionMap":{"recordArray":"jsonStringToListOrMap(data_for_unnesting)"}},
        "preComplexTypeTransform": true
      }
    ],
    "complexTypeConfig": {
      "fieldsToUnnest": [
        "recordArray"
      ],
      "delimiter": "||"
    },
    {
      "key": "value",
      "data_for_unnesting": [
        {
          "name": "record1"
        },
        {
          "name": "record2"
        },
        {
          "name": "record3"
        }
      ],
      "event_time": "2025-04-24T20:45:56.721936"
    }
    {
      "schemaName": "testUnnest",
      "enableColumnBasedNullHandling": true,
      "dimensionFieldSpecs": [
        {
          "name": "key",
          "dataType": "STRING",
          "fieldType": "DIMENSION"
        },
        {
          "name": "recordArray||name",
          "dataType": "STRING",
          "fieldType": "DIMENSION"
        }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "event_time",
          "dataType": "LONG",
          "fieldType": "DATE_TIME",
          "format": "EPOCH|MILLISECONDS|1",
          "granularity": "MILLISECONDS|1"
        }
      ]
    }
    {
      "tableName": "testUnnest_OFFLINE",
      "tableType": "OFFLINE",
      "segmentsConfig": {
        "deletedSegmentsRetentionPeriod": "0d",
        "segmentPushType": "APPEND",
        "timeColumnName": "event_time",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "180",
        "minimizeDataMovement": false,
        "replication": "1"
      },
      "tenants": {
        "broker": "DefaultTenant",
        "server": "DefaultTenant"
      },
      "tableIndexConfig": {
        "aggregateMetrics": false,
        "optimizeDictionary": false,
        "autoGeneratedInvertedIndex": false,
        "enableDefaultStarTree": false,
        "nullHandlingEnabled": true,
        "skipSegmentPreprocess": false,
        "optimizeDictionaryType": false,
        "enableDynamicStarTreeCreation": false,
        "columnMajorSegmentBuilderEnabled": true,
        "createInvertedIndexDuringSegmentGeneration": true,
        "optimizeDictionaryForMetrics": false,
        "noDictionarySizeRatioThreshold": 0,
        "loadMode": "MMAP",
        "rangeIndexVersion": 2,
        "invertedIndexColumns": [
          "key"
        ],
        "varLengthDictionaryColumns": [
          "key"
        ]
      },
      "metadata": {},
      "ingestionConfig": {
        "transformConfigs": [],
        "enrichmentConfigs": [
          {
            "enricherType": "generateColumn",
            "properties": {"fieldToFunctionMap":{"recordArray":"jsonStringToListOrMap(data_for_unnesting)"}},
            "preComplexTypeTransform": true
          }
        ],
        "continueOnError": true,
        "rowTimeValueCheck": true,
        "complexTypeConfig": {
          "fieldsToUnnest": [
            "recordArray"
          ],
          "delimiter": "||"
        },
        "retryOnSegmentBuildPrecheckFailure": false,
        "segmentTimeValueCheck": false
      },
      "isDimTable": false
    }
    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-batch-ingestion-spark-3</artifactId>
      <version>${pinot.version}</version>
    </dependency>
    studentID,firstName,lastName,gender,subject,score,timestampInEpoch
    200,Lucy,Smith,Female,Maths,3.8,1570863600000
    200,Lucy,Smith,Female,English,3.5,1571036400000
    201,Bob,King,Male,Maths,3.2,1571900400000
    202,Nick,Young,Male,Physics,3.6,1572418800000
    {
      "schemaName": "transcript",
      "dimensionFieldSpecs": [
        {
          "name": "studentID",
          "dataType": "INT"
        },
        {
          "name": "firstName",
          "dataType": "STRING"
        },
        {
          "name": "lastName",
          "dataType": "STRING"
        },
        {
          "name": "gender",
          "dataType": "STRING"
        },
        {
          "name": "subject",
          "dataType": "STRING"
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "score",
          "dataType": "FLOAT"
        }
      ],
      "dateTimeFieldSpecs": [{
        "name": "timestampInEpoch",
        "dataType": "LONG",
        "format" : "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }]
    }
    {
      "tableName": "transcript",
      "tableType": "OFFLINE",
      "segmentsConfig": {
        "replication": 1,
        "timeColumnName": "timestampInEpoch",
        "timeType": "MILLISECONDS",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": 365
      },
      "tenants": {
        "broker":"DefaultTenant",
        "server":"DefaultTenant"
      },
      "tableIndexConfig": {
        "loadMode": "MMAP"
      },
      "ingestionConfig": {
        "batchIngestionConfig": {
          "segmentIngestionType": "APPEND",
          "segmentIngestionFrequency": "DAILY"
        },
        "continueOnError": true,
        "rowTimeValueCheck": true,
        "segmentTimeValueCheck": false
    
      },
      "metadata": {}
    }
bin/pinot-admin.sh AddTable \
  -tableConfigFile /path/to/table-config.json \
  -schemaFile /path/to/table-schema.json -exec
curl -X POST -F file=@data.json \
      -H "Content-Type: multipart/form-data" \
      "http://localhost:9000/ingestFromFile?tableNameWithType=foo_OFFLINE&
      batchConfigMapStr={"inputFormat":"json"}"
curl -X POST -F file=@data.csv \
      -H "Content-Type: multipart/form-data" \
      "http://localhost:9000/ingestFromFile?tableNameWithType=foo_OFFLINE&
    batchConfigMapStr={
      "inputFormat":"csv",
      "recordReader.prop.delimiter":"|"
    }"
    curl -X POST "http://localhost:9000/ingestFromURI?tableNameWithType=foo_OFFLINE
    &batchConfigMapStr={
      "inputFormat":"json",
      "input.fs.className":"org.apache.pinot.plugin.filesystem.S3PinotFS",
      "input.fs.prop.region":"us-central",
      "input.fs.prop.accessKey":"foo",
      "input.fs.prop.secretKey":"bar"
    }
    &sourceURIStr=s3://test.bucket/path/to/json/data/data.json"
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
      segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
    
    # Recommended to set jobType to SegmentCreationAndMetadataPush for production environment where Pinot Deep Store is configured  
    jobType: SegmentCreationAndTarPush
    
    inputDirURI: '/tmp/pinot-quick-start/rawdata/'
    includeFileNamePattern: 'glob:**/*.csv'
    outputDirURI: '/tmp/pinot-quick-start/segments/'
    overwriteOutput: true
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: 'transcript'
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'
    pushJobSpec:
      pushAttempts: 2
      pushRetryIntervalMillis: 1000
bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yaml
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListAllMyBuckets",
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::metadata-push-staging",
                    "arn:aws:s3:::metadata-push-staging/*"
                ]
            }
        ]
    }
    ...
    jobType: SegmentCreationAndMetadataPush
    ...
    outputDirURI: 's3://metadata-push-staging/stagingDir/'
    ...
    pushJobSpec:
      copyToDeepStoreForMetadataPush: true
    ...
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    key1: 'value1'
    key2: 'value2'

Batch ingestion modes at a glance:

• Spark 3: Production batch ingestion; requires a Spark 3.x cluster; data sizes from GB to TB+; recommended for production.

• Hadoop: Existing MapReduce pipelines; requires a Hadoop cluster; data sizes from GB to TB+; legacy.

• Flink: Streaming-first orgs, backfill, upsert bootstrap; requires a Flink cluster; data sizes from GB to TB+; active.

• LaunchDataIngestionJob: CLI wrapper for Standalone; no cluster required; data sizes up to a few GB; convenience tool.

Plugin modules per mode:

• Hadoop: pinot-batch-ingestion-hadoop, located in plugins-external/pinot-batch-ingestion/.

• Flink: pinot-flink-connector, located in pinot-connectors/.

• Common: pinot-batch-ingestion-common, a shared library used by all modes.

  • configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configs.

  • configs: Key-value pairs for format-specific configurations. This field is optional.

  • header: Header of the file. The column names should be separated by the delimiter mentioned in the configuration.

  • delimiter: The character separating the columns.

  • multiValueDelimiter: The character separating multiple values in a single column. This can be used to split a column into a list.

  • skipHeader: Skip the header record in the file. Boolean.

  • ignoreEmptyLines: Ignore empty lines (instead of filling them with default values). Boolean.

  • ignoreSurroundingSpaces: Ignore spaces around column names and values. Boolean.

  • quoteCharacter: Single character used for quotes in CSV files.

  • recordSeparator: Character used to separate records in the input file. Default is \n or \r\n depending on the platform.

  • nullStringValue: String value that represents null in CSV files. Default is empty string.
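To make delimiter and multiValueDelimiter concrete, here's a small sketch (plain Python with made-up values, not Pinot code) of how a multi-value column splits:

```python
# One CSV record using ';' as the column delimiter
row = "200;Lucy;piano-soccer"
student_id, name, hobbies = row.split(";")

# With multiValueDelimiter '-', the last column becomes a list of values
hobby_list = hobbies.split("-")
print(hobby_list)  # ['piano', 'soccer']
```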

    Avro to Pinot data type conversions:

    • INT → INT
    • LONG → LONG
    • FLOAT → FLOAT
    • DOUBLE → DOUBLE
    • BOOLEAN → BOOLEAN
    • STRING → STRING
    • ENUM → STRING
    • BYTES → BYTES
    • FIXED → BYTES
    • MAP → JSON
    • ARRAY → JSON
    • RECORD → JSON
    • UNION → JSON
    • DECIMAL → BYTES
    • UUID → STRING
    • DATE → STRING (yyyy-MM-dd format)
    • TIME_MILLIS → STRING (HH:mm:ss.SSS format)
    • TIME_MICROS → STRING (HH:mm:ss.SSSSSS format)
    • TIMESTAMP_MILLIS → TIMESTAMP
    • TIMESTAMP_MICROS → TIMESTAMP

    Parquet to Pinot data type conversions:

    • INT96 → LONG (Parquet INT96 converts nanoseconds to the Pinot INT64 type of milliseconds)
    • INT64 → LONG
    • INT32 → INT
    • FLOAT → FLOAT
    • DOUBLE → DOUBLE
    • BINARY → BYTES
    • FIXED-LEN-BYTE-ARRAY → BYTES
    • DECIMAL → DOUBLE
    • ENUM → STRING
    • UTF8 → STRING
    • REPEATED → MULTIVALUE/MAP (represented as MV; if the Parquet original type is LIST, it is converted to a MULTIVALUE column, otherwise a MAP column)

    ORC to Java data type mapping:

    • BOOLEAN → String
    • SHORT → Integer
    • INT → Integer
    • LONG → Integer
    • FLOAT → Float
    • DOUBLE → Double
    • STRING → String
    • VARCHAR → String
    • CHAR → String
    • LIST → Object[]
    • MAP → Map<Object, Object>
    • DATE → Long
    • TIMESTAMP → Long
    • BINARY → byte[]
    • BYTE → Integer

    Arrow decoder configuration properties:

    • arrow.allocator.limit (default 268435456, i.e. 256 MB): Memory limit for Arrow's off-heap allocator, in bytes.

    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    configs:
      fileFormat: 'default' #should be one of default, rfc4180, excel, tdf, mysql
      header: 'columnName separated by delimiter'
      delimiter: ','
      multiValueDelimiter: '-'
    dataFormat: 'avro'
    className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
    configs:
        enableLogicalTypes: true
    dataFormat: 'json'
    className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
    dataFormat: 'thrift'
    className: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader'
    configs:
      thriftClass: 'ParserClassName'
    dataFormat: 'parquet'
    className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
    dataFormat: 'parquet'
    className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
    dataFormat: 'orc'
    className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
    dataFormat: 'proto'
    className: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReader'
    configs:
      descriptorFile: 'file:///path/to/sample.desc'
    protoc --include_imports --descriptor_set_out=/absolute/path/to/output.desc /absolute/path/to/input.proto
    dataFormat: 'arrow'
    stream.kafka.decoder.class.name=org.apache.pinot.plugin.inputformat.arrow.ArrowMessageDecoder
    circle-exclamation

    If a new column is added to your table or schema configuration during ingestion, incorrect data may appear in the consuming segment(s). To ensure accurate values are reloaded, see how to add a new column during ingestion.

    hashtag
    Transformation functions

    Pinot supports the following functions:

    • Groovy functions

    • Built-in functions

    circle-exclamation

    A transformation function cannot mix Groovy and built-in functions; only use one type of function at a time.

    hashtag
    Groovy functions

    Groovy functions can be defined using the syntax:

    Any valid Groovy expression can be used.

    ⚠️ Enabling Groovy

    Allowing executable Groovy in ingestion transformation can be a security vulnerability. To enable Groovy for ingestion, set the following controller configuration:

    controller.disable.ingestion.groovy=false

    If not set, Groovy for ingestion transformation is disabled by default.

    hashtag
    Built-in Pinot functions

    All the functions defined in this directory annotated with @ScalarFunction (for example, toEpochSeconds) are supported ingestion transformation functions.

    Below are some commonly used built-in Pinot functions for ingestion transformations.

    hashtag
    DateTime functions

    These functions enable time transformations.

    toEpochXXX

    Converts from epoch milliseconds to a higher granularity.

    Function name
    Description

    toEpochSeconds

    Converts epoch millis to epoch seconds. Usage:"toEpochSeconds(millis)"

    toEpochMinutes

    Converts epoch millis to epoch minutes. Usage: "toEpochMinutes(millis)"

    toEpochHours

    Converts epoch millis to epoch hours. Usage: "toEpochHours(millis)"

    toEpochXXXRounded

    Converts from epoch milliseconds to another granularity, rounding to the nearest rounding bucket. For example, 1588469352000 (2020-05-03 01:29:12 UTC) is 26474489 minutesSinceEpoch, and toEpochMinutesRounded(1588469352000, 10) = 26474480 (2020-05-03 01:20:00 UTC).

    Function Name
    Description

    toEpochSecondsRounded

    Converts epoch millis to epoch seconds, rounding to the nearest rounding bucket. Usage: "toEpochSecondsRounded(millis, 30)"

    toEpochMinutesRounded

    Converts epoch millis to epoch minutes, rounding to the nearest rounding bucket. Usage: "toEpochMinutesRounded(millis, 10)"

    toEpochHoursRounded

    Converts epoch millis to epoch hours, rounding to the nearest rounding bucket. Usage: "toEpochHoursRounded(millis, 6)"

    fromEpochXXX

    Converts from an epoch granularity to milliseconds.

    Function Name
    Description

    fromEpochSeconds

    Converts from epoch seconds to milliseconds. Usage: "fromEpochSeconds(secondsSinceEpoch)"

    fromEpochMinutes

    Converts from epoch minutes to milliseconds. Usage: "fromEpochMinutes(minutesSinceEpoch)"

    fromEpochHours

    Converts from epoch hours to milliseconds. Usage: "fromEpochHours(hoursSinceEpoch)"

    Simple date format

    Converts simple date format strings to milliseconds and vice versa, per the provided pattern string.

    Function name
    Description

    toDateTime

    Converts from milliseconds to a formatted date time string, as per the provided pattern. Usage: "toDateTime(millis, 'yyyy-MM-dd')"

    fromDateTime

    Converts a formatted date time string to milliseconds, as per the provided pattern. Usage: "fromDateTime(dateTimeStr, 'EEE MMM dd HH:mm:ss ZZZ yyyy')"

    circle-info

    Note

    Letters that are not part of the SimpleDateFormat pattern legend (https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) need to be escaped. For example:

    "transformFunction": "fromDateTime(dateTimeStr, 'yyyy-MM-dd''T''HH:mm:ss')"

    hashtag
    JSON functions

    Function name
    Description

    json_format

    Converts a JSON/Avro complex object to a string. The resulting JSON string can then be queried using the jsonExtractScalar function. Usage: "json_format(jsonMapField)"
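    As a sketch, a transformConfigs entry that stores a complex source field as a JSON string (the column names payload and payloadJson are hypothetical):

    ```json
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "payloadJson",
          "transformFunction": "json_format(payload)"
        }]
    }
    ```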

    hashtag
    Types of transformation

    hashtag
    Filtering

    Records can be filtered as they are ingested. A filter function can be specified in the filterConfigs in the ingestionConfigs of the table config.

    If the expression evaluates to true, the record will be filtered out. The expressions can use any of the transform functions described in the previous section.

    Consider a table that has a column timestamp. If you want to filter out records that are older than timestamp 1589007600000, you could apply the following function:

    Consider a table that has a string column campaign and a multi-value double column prices. If you want to filter out records where campaign is 'X' or 'Y' and the sum of all elements in prices is less than 100, you could apply the following function:

    Filter config also supports SQL-like expressions of built-in scalar functions for filtering records (from v0.11.0). Example:

    hashtag
    Column transformation

    Transform functions can be defined on columns in the ingestion config of the table config.

    For example, imagine that our source data contains the prices and timestamp fields. We want to extract the maximum price and store that in the maxPrices field and convert the timestamp into the number of hours since the epoch and store it in the hoursSinceEpoch field. You can do this by applying the following transformation:

    Below are some examples of commonly used functions.

    hashtag
    String concatenation

    Concat firstName and lastName to get fullName

    hashtag
    Find an element in an array

    Find max value in array bids

    hashtag
    Time transformation

    Convert timestamp from MILLISECONDS to HOURS

    hashtag
    Column name change

    Change name of the column from user_id to userId

    hashtag
    Rename fields from a Kafka JSON message

    Kafka JSON payloads often use keys that aren’t great Pinot column names. Common examples are keys containing -, such as event-id.

    Map the source key to a schema-friendly column using transformConfigs. Reference the source key with a quoted identifier.

    circle-info

    Add the destination columns (for example, event_id) to your Pinot schema.

    hashtag
    Extract value from a column containing space

    Pinot doesn't support columns that have spaces, so if a source data column has a space, we'll need to store that value in a column with a supported name. To extract the value from first Name into the column firstName, run the following:

    hashtag
    Ternary operation

    If eventType is IMPRESSION, set impression to 1; similarly for CLICK.

    hashtag
    AVRO Map

    Store an AVRO Map in Pinot as two multi-value columns. Sort the keys, to maintain the mapping. 1) The keys of the map as map_keys 2) The values of the map as map_values

    hashtag
    Chaining transformations

    Transformations can be chained. This means that you can use a field created by a transformation in another transformation function.

    For example, we might have the following JSON document in the data field of our source data:

    We can apply one transformation to extract the userId and then another one to pull out the numerical part of the identifier:

    hashtag
    Flattening

    There are 2 kinds of flattening:

    hashtag
    One record into many

    This is not yet natively supported. You can write a custom Decoder/RecordReader if you want to use this. Once the Decoder generates the multiple GenericRows from the provided input record, set a List<GenericRow> into the destination GenericRow under the key $MULTIPLE_RECORDS_KEY$. The segment generation drivers treat this as a special case and handle the multiple-records case.

    hashtag
    Extract attributes from complex objects

    Feature TBD

    hashtag
    Add a new column during ingestion

    If a new column is added to table or schema configuration during ingestion, incorrect data may appear in the consuming segment(s).

    To ensure accurate values are reloaded, do the following:

    1. Pause consumption (and wait for pause status success): $ curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption

    2. Apply new table or schema configurations.

    3. Reload segments using the Pinot Controller API or Pinot Admin Console.

    4. Resume consumption: $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption

    table config

    Stream Ingestion Guide

    This guide shows you how to ingest a stream of records into a Pinot table.

    Apache Pinot lets users consume data from streams and push it directly into the database. This process is called stream ingestion. Stream ingestion makes it possible to query data within seconds of publication.

    Stream ingestion provides support for checkpoints for preventing data loss.

    To set up Stream ingestion, perform the following steps, which are described in more detail in this page:

    1. Create schema configuration

    2. Create table configuration

    3. Create ingestion configuration

    4. Upload table and schema spec

    Here's an example where we assume the data to be ingested is in the following format:

    hashtag
    Create schema configuration

    The schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamp. For more details on schema configuration, see the schema reference.

    For our sample data, the schema configuration looks like this:

    hashtag
    Create table configuration with ingestion configuration

    The next step is to create a table where all the ingested data will flow and can be queried. For details about each table component, see the table configuration reference.

    The table configuration contains an ingestion configuration (ingestionConfig), which specifies how to ingest streaming data into Pinot. For details, see the ingestion configuration reference.

    hashtag
    Example table config with ingestionConfig

    For our sample data and schema, the table config will look like this:

    hashtag
    Example ingestionConfig for multi-topics ingestion

    Pinot supports ingesting data from multiple topics. This feature is currently in beta and only supports multiple Kafka topics; support for other stream types is planned. For our sample data and schema, assume that we duplicate the data to 2 topics, transcript-topic1 and transcript-topic2. If we want to ingest from both topics, the table config will look like this:

    With multi-topic ingestion (for details, refer to the design doc):

    • All transform functions apply to both topics' ingestion.

    • Existing instance assignment strategies work as usual.

    • Partition changes are still handled in the same way.

    hashtag
    Upload schema and table config

    Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.

    hashtag
    Tune the stream config

    hashtag
    Throttle stream consumption

    There are some scenarios where the message rate in the input stream can come in bursts which can lead to long GC pauses on the Pinot servers or affect the ingestion rate of other real-time tables on the same server. If this happens to you, throttle the consumption rate during stream ingestion to better manage overall performance.

    There are two independent throttling mechanisms available:

    1. Message-rate–based throttling (table level, records/sec)

    2. Byte-rate–based throttling (server level, bytes/sec)

    Both mechanisms can be enabled simultaneously.

    hashtag
    Message-rate–based throttling (table level)

    Stream consumption throttling can be tuned using the stream config topic.consumption.rate.limit which indicates the upper bound on the message rate for the entire topic.

    Here is the sample configuration on how to configure the consumption throttling:

    Some things to keep in mind while tuning this config are:

    • Since this configuration applies to the entire topic, internally this rate is divided by the number of partitions in the topic and applied to each partition's consumer. It doesn't take the replication factor into account. Example: topic.consumption.rate.limit = 1000, num partitions in Kafka topic = 4, replication factor in table = 3. Pinot will impose a fixed limit of 1000 / 4 = 250 records per second on each partition.

    • In case of a multi-tenant deployment (where you have more than one table in the same server instance), make sure that the rate limit on one table doesn't starve the rate limiting of another table. When there is more than one table on the same server (which is likely), you may need to re-tune the throttling threshold for all the streaming tables.
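    The per-partition arithmetic from the example above, as a quick sketch:

    ```python
    # Effective per-partition limit for topic.consumption.rate.limit (illustrative).
    topic_rate_limit = 1000   # records/sec for the whole topic
    num_partitions = 4        # Kafka partitions; replication factor is ignored
    per_partition_limit = topic_rate_limit / num_partitions
    print(per_partition_limit)  # 250.0 records/sec enforced on each partition consumer
    ```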

    Once throttling is enabled for a table, you can verify by searching for a log that looks similar to:

    In addition, you can monitor the consumption rate utilization with the metric CONSUMPTION_QUOTA_UTILIZATION.

    Note that any configuration change for topic.consumption.rate.limit in the stream config will NOT take effect immediately. The new configuration will be picked up from the next consuming segment. In order to enforce the new configuration, you need to trigger the forceCommit API. Refer to Pause Stream Ingestion for more details.

    hashtag
    Byte-rate–based throttling (server level)

    In addition to message-rate throttling, Pinot supports byte-based stream consumption throttling at the server level.

    This throttling mechanism limits the total number of bytes consumed per second by a Pinot server, across all real-time tables and partitions hosted on that server.

    When to use byte-based throttling

    Byte-based throttling is especially useful when:

    • Message sizes vary significantly

    • Ingestion pressure is driven by payload size rather than record count

    • You want to cap network, direct memory, or disk IO usage at the server level

    Configuration

    Byte-based throttling is configured via cluster config, not via table or stream configs.

    Config key

    pinot.server.consumption.rate.limit.bytes

    The value is specified in bytes per second.

    Updating the configuration

    The configuration can be updated dynamically using the Cluster Config API.

    This limits each Pinot server to consume at most 3,000,000 bytes/sec (~3 MB/sec) across all real-time tables.

    Example using curl
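    A sketch of updating the cluster config with curl, assuming the controller's Cluster Config endpoint (POST /cluster/configs) and a controller reachable at localhost:9000 (substitute your own host and port):

    ```bash
    curl -X POST "http://localhost:9000/cluster/configs" \
      -H "Content-Type: application/json" \
      -d '{"pinot.server.consumption.rate.limit.bytes": "3000000"}'
    ```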

    How byte-based throttling works

    • The byte rate limit is enforced per server

    • The limit applies collectively to all consuming partitions and tables hosted on that server

    • This throttling is independent of table-level message-rate throttling

    Interaction with message-rate throttling

    If both throttles are enabled:

    • Table-level topic.consumption.rate.limit controls records/sec per table

    • Server-level pinot.server.consumption.rate.limit.bytes controls bytes/sec per server

    • Pinot enforces both limits

    This allows precise control when both message count and payload size matter.

    Dynamic updates and propagation

    • Byte-based throttling is updated dynamically via the Cluster Config Change Listener

    • No server restart is required

    • Changes take effect automatically as servers receive the updated cluster config

    Verifying throttling

    Once enabled, Pinot logs messages indicating that a server-level byte consumption limiter has been applied.

    You can also monitor throttling behavior using the metric:

    This metric reflects how close the server is to its configured consumption quota.

    hashtag
    Custom ingestion support

    You can also write an ingestion plugin if the platform you are using is not supported out of the box. For a walkthrough, see Stream Ingestion Plugin.

    hashtag
    Pause stream ingestion

    There are some scenarios in which you may want to pause real-time ingestion while your table remains available for queries. For example, if there is a problem with the stream ingestion, you may want queries to keep executing on the already ingested data while you troubleshoot the issue. In these scenarios, first issue a Pause request to a Controller host. After you finish troubleshooting the stream, issue another request to the Controller to resume consumption.

    When a Pause request is issued, the controller instructs the real-time servers hosting your table to commit their consuming segments immediately. However, the commit process may take some time to complete. Note that Pause and Resume requests are async. An OK response means that the instructions for pausing or resuming have been successfully sent to the real-time server. If you want to know whether consumption has actually stopped or resumed, issue a pause status request.

    It's worth noting that consuming segments on real-time servers are stored in volatile memory, and their resources are allocated when the consuming segments are first created. These resources cannot be altered if consumption parameters are changed midway through consumption. It may take hours before these changes take effect. Furthermore, if the parameters are changed in an incompatible way (for example, changing the underlying stream with a completely new set of offsets, or changing the stream endpoint from which to consume messages), it will result in the table getting into an error state.

    The pause and resume feature is helpful in these instances. When a pause request is issued by the operator, consuming segments are committed without starting new mutable segments. Instead, new mutable segments are started only when the resume request is issued. This mechanism provides the operators as well as developers with more flexibility. It also enables Pinot to be more resilient to the operational and functional constraints imposed by underlying streams.

    There is another feature called Force Commit which utilizes the primitives of the pause and resume feature. When the operator issues a force commit request, the current mutable segments will be committed and new ones started right away. Operators can now use this feature for all compatible table config parameter changes to take effect immediately.

    (v 0.12.0+) Once submitted, the forceCommit API returns a jobId that can be used to get the current progress of the forceCommit operation. A sample response and status API call:

    circle-info

    The forceCommit request just triggers a regular commit before the consuming segments reach their end criteria, so it follows the same mechanism as a regular commit. It is a one-shot request and is not retried automatically upon failure, but it is idempotent, so you may keep issuing it until it succeeds if needed.

    This API is async: it doesn't wait for the segment commit to complete. A status entry is put in ZK to track when the request was issued and the consuming segments included. The consuming segments tracked in the status entry are compared with the latest IdealState to indicate the progress of the forceCommit. However, this status is not updated or deleted upon commit success or failure, so it can become stale. Currently, the most recent 100 status entries are kept in ZK, and the oldest ones are deleted only when the total number is about to exceed 100.

    For incompatible parameter changes, an option is added to the resume request to handle the case of a completely new set of offsets. Operators can now follow a three-step process: First, issue a pause request. Second, change the consumption parameters. Finally, issue the resume request with the appropriate option. These steps will preserve the old data and allow the new data to be consumed immediately. All through the operation, queries will continue to be served.

    hashtag
    Handle partition changes in streams

    If a Pinot table is configured to consume using a (partition-based) stream type, then it is possible that the partitions of the table change over time. In Kafka, for example, the number of partitions may increase. In Kinesis, the number of partitions may increase or decrease -- some partitions could be merged to create a new one, or existing partitions split to create new ones.

    Pinot runs a periodic task called RealtimeSegmentValidationManager that monitors such changes and starts consumption on new partitions (or stops consumption from old ones) as necessary. Since this is a periodic task run on the controller, it may take some time for Pinot to recognize new partitions and start consuming from them. This may delay the data in new partitions from appearing in the results that Pinot returns.

    If you want to recognize the new partitions sooner, manually trigger the periodic task so that new partitions are picked up immediately.
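    As a sketch, the validation task can be triggered through the controller's periodic task API (endpoint and parameter names may vary by Pinot version; verify against your controller's Swagger UI):

    ```bash
    curl -X GET "http://localhost:9000/periodictask/run?taskname=RealtimeSegmentValidationManager&tableName=transcript_REALTIME"
    ```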

    hashtag
    Infer ingestion status of real-time tables

    Often, it is important to understand the rate of ingestion of data into your real-time table. This is commonly done by looking at the consumption lag of the consumer. The lag itself can be observed in many dimensions. Pinot supports observing consumption lag along the offset dimension and time dimension, whenever applicable (as it depends on the specifics of the connector).

    The ingestion status of a connector can be observed by querying either the /consumingSegmentsInfo API or the table's /debug API, as shown below:

    A sample response from a Kafka-based real-time table is shown below. The ingestion status is displayed for each of the CONSUMING segments in the table.

    Term
    Description

    hashtag
    Monitor real-time ingestion

    Real-time ingestion includes 3 stages of message processing: Decode, Transform, and Index.

    In each of these stages, a failure can happen which may or may not result in an ingestion failure. The following metrics are available to investigate ingestion issues:

    1. Decode stage -> an error here is recorded as INVALID_REALTIME_ROWS_DROPPED

    2. Transform stage -> possible errors here are:

      1. When a message gets dropped due to the transform, it is recorded as REALTIME_ROWS_FILTERED

    There is yet another metric called ROWS_WITH_ERROR which is the sum of all error counts in the 3 stages above.

    Furthermore, the metric REALTIME_CONSUMPTION_EXCEPTIONS gets incremented whenever there is a transient/permanent stream exception seen during consumption.

    These metrics can be used to understand why ingestion failed for a particular table partition before diving into the server logs.

    Groovy({groovy script}, argument1, argument2...argumentN)
    "tableConfig": {
        "tableName": ...,
        "tableType": ...,
        "ingestionConfig": {
            "filterConfig": {
                "filterFunction": "<expression>"
            }
        }
    }
    "ingestionConfig": {
        "filterConfig": {
            "filterFunction": "Groovy({timestamp < 1589007600000}, timestamp)"
        }
    }
    "ingestionConfig": {
        "filterConfig": {
            "filterFunction": "Groovy({(campaign == \"X\" || campaign == \"Y\") && prices.sum() < 100}, prices, campaign)"
        }
    }
    "ingestionConfig": {
        "filterConfig": {
            "filterFunction": "strcmp(campaign, 'X') = 0 OR strcmp(campaign, 'Y') = 0 OR timestamp < 1589007600000"
        }
    }
    { "tableConfig": {
        "tableName": ...,
        "tableType": ...,
        "ingestionConfig": {
            "transformConfigs": [{
              "columnName": "fieldName",
              "transformFunction": "<expression>"
            }]
        },
        ...
    }
    pinot-table-offline.json
    {
    "tableName": "myTable",
    ...
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "maxPrice",
          "transformFunction": "Groovy({prices.max()}, prices)" // groovy function
        },
        {
          "columnName": "hoursSinceEpoch",
          "transformFunction": "toEpochHours(timestamp)" // built-in function
        }]
      }
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "fullName",
          "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
        }]
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "maxBid",
          "transformFunction": "Groovy({bids.max{ it.toBigDecimal() }}, bids)"
        }]
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "hoursSinceEpoch",
          "transformFunction": "Groovy({timestamp/(1000*60*60)}, timestamp)"
        }]
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "userId",
          "transformFunction": "user_id"
        }]
    }
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "event_id",
          "transformFunction": "\"event-id\""
        },
        {
          "columnName": "event_timestamp",
          "transformFunction": "\"event-timestamp\""
        },
        {
          "columnName": "user_id",
          "transformFunction": "\"user-id\""
        }
      ]
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "firstName",
          "transformFunction": "\"first Name \""
        }]
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "impressions",
          "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1: 0}, eventType)"
        },
        {
          "columnName": "clicks",
          "transformFunction": "Groovy({eventType == 'CLICK' ? 1: 0}, eventType)"
        }]
    }
    "ingestionConfig": {
        "transformConfigs": [{
          "columnName": "map2_keys",
          "transformFunction": "Groovy({map2.sort()*.key}, map2)"
        },
        {
          "columnName": "map2_values",
          "transformFunction": "Groovy({map2.sort()*.value}, map2)"
        }]
    }
    {
      "userId": "12345678__foo__othertext"
    }
    "ingestionConfig": {
        "transformConfigs": [
          {
            "columnName": "userOid",
            "transformFunction": "jsonPathString(data, '$.userId')"
          },
          {
            "columnName": "userId",
            "transformFunction": "Groovy({Long.valueOf(userOid.substring(0, 8))}, userOid)"
          }
       ]
    }

    toEpochDays

    Converts epoch millis to epoch days. Usage: "toEpochDays(millis)"

    toEpochDaysRounded

    Converts epoch millis to epoch days, rounding to the nearest rounding bucket. Usage: "toEpochDaysRounded(millis, 7)"

    fromEpochDays

    Converts from epoch days to milliseconds. Usage: "fromEpochDays(daysSinceEpoch)"

    Underlying ingestion still works as LOWLEVEL mode, where
    • transcript-topic1 segments would be named like transcript__0__0__20250101T0000Z

    • transcript-topic2 segments would be named like transcript__10000__0__20250101T0000Z

    The pinot.server.consumption.rate.limit setting must be configured in the server's instance configuration, not in the table configuration. This setting establishes a maximum consumption rate that applies collectively to all table partitions hosted on a single server. When both this server-level setting and the topic.consumption.rate.limit setting are specified, the server configuration has lower priority.


    Multiple real-time tables coexist on the same server

    Consumption is throttled as soon as either limit is reached

    When the transform pipeline sets the $INCOMPLETE_RECORD_KEY$ key in the message, it is recorded as INCOMPLETE_REALTIME_ROWS_CONSUMED, but only when the continueOnError configuration is enabled. If continueOnError is not enabled, the ingestion fails.

  • Index stage -> When there is a failure at this stage, the ingestion typically stops and marks the partition as ERROR.

  • currentOffsetsMap

    Current consuming offset position per partition

    latestUpstreamOffsetMap

    (Wherever applicable) Latest offset found in the upstream topic partition

    recordsLagMap

    (Whenever applicable) Defines how far behind the current record's offset / pointer is from upstream latest record. This is calculated as the difference between the latestUpstreamOffset and currentOffset for the partition when the lag computation request is made.

    recordsAvailabilityLagMap


    (Whenever applicable) Defines how soon after record ingestion was the record consumed by Pinot. This is calculated as the difference between the time the record was consumed and the time at which the record was ingested upstream.

    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
    {"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}
    /tmp/pinot-quick-start/transcript-schema.json
    {
      "schemaName": "transcript",
      "dimensionFieldSpecs": [
        {
          "name": "studentID",
          "dataType": "INT"
        },
        {
          "name": "firstName",
          "dataType": "STRING"
        },
        {
          "name": "lastName",
          "dataType": "STRING"
        },
        {
          "name": "gender",
          "dataType": "STRING"
        },
        {
          "name": "subject",
          "dataType": "STRING"
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "score",
          "dataType": "FLOAT"
        }
      ],
      "dateTimeFieldSpecs": [{
        "name": "timestamp",
        "dataType": "LONG",
        "format" : "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }]
    }
    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP"
      },
      "metadata": {
        "customConfigs": {}
      },
      "ingestionConfig": {
        "streamIngestionConfig": {
            "streamConfigMaps": [
              {
                "realtime.segment.flush.threshold.rows": "0",
                "stream.kafka.decoder.prop.format": "JSON",
                "key.serializer": "org.apache.kafka.common.serialization.ByteArraySerializer",
                "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
                "streamType": "kafka",
                "value.serializer": "org.apache.kafka.common.serialization.ByteArraySerializer",
                "realtime.segment.flush.threshold.segment.rows": "50000",
                "stream.kafka.broker.list": "localhost:9876",
                "realtime.segment.flush.threshold.time": "3600000",
                "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
                "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
                "stream.kafka.topic.name": "transcript-topic"
              }
            ]
          },
          "transformConfigs": [],
          "continueOnError": true,
          "rowTimeValueCheck": true,
          "segmentTimeValueCheck": false
        },
      "isDimTable": false
    }
    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP"
      },
      "metadata": {
        "customConfigs": {}
      },
      "ingestionConfig": {
        "streamIngestionConfig": {
            "streamConfigMaps": [
              {
                "realtime.segment.flush.threshold.rows": "0",
                "stream.kafka.decoder.prop.format": "JSON",
                "key.serializer": "org.apache.kafka.common.serialization.ByteArraySerializer",
                "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
                "streamType": "kafka",
                "value.serializer": "org.apache.kafka.common.serialization.ByteArraySerializer",
                "realtime.segment.flush.threshold.segment.rows": "50000",
                "stream.kafka.broker.list": "localhost:9876",
                "realtime.segment.flush.threshold.time": "3600000",
                "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
                "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
                "stream.kafka.topic.name": "transcript-topic1"
              },
              {
                "realtime.segment.flush.threshold.rows": "0",
                "stream.kafka.decoder.prop.format": "JSON",
                "key.serializer": "org.apache.kafka.common.serialization.ByteArraySerializer",
                "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
                "streamType": "kafka",
                "value.serializer": "org.apache.kafka.common.serialization.ByteArraySerializer",
                "realtime.segment.flush.threshold.segment.rows": "50000",
                "stream.kafka.broker.list": "localhost:9876",
                "realtime.segment.flush.threshold.time": "3600000",
                "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
                "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
                "stream.kafka.topic.name": "transcript-topic2"
              }
            ]
          },
          "transformConfigs": [],
          "continueOnError": true,
          "rowTimeValueCheck": true,
          "segmentTimeValueCheck": false
        },
      "isDimTable": false
    }
    docker run \
        --network=pinot-demo \
        -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
        --name pinot-streaming-table-creation \
        apachepinot/pinot:latest AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
        -controllerHost pinot-quickstart \
        -controllerPort 9000 \
        -exec
    bin/pinot-admin.sh AddTable \
        -schemaFile /path/to/transcript-schema.json \
        -tableConfigFile /path/to/transcript-table-realtime.json \
        -exec
    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      ...
      "ingestionConfig": {
        "streamIngestionConfig": {
          "streamConfigMaps": [
            {
              "streamType": "kafka",
              "stream.kafka.topic.name": "transcript-topic",
              ...
              "topic.consumption.rate.limit": 1000
            }
          ]
        }
      },
      ...
    }
    A consumption rate limiter is set up for topic <topic_name> in table <tableName> with rate limit: <rate_limit> (topic rate limit: <topic_rate_limit>, partition count: <partition_count>)
    $ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
    curl -X POST
    '{controllerHost}/cluster/configs'
    -H 'Content-Type: application/json'
    -d '{
    "pinot.server.consumption.rate.limit.bytes": "3000000"
    }'
    CONSUMPTION_QUOTA_UTILIZATION
    $ curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption
    $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption
    $ curl -X POST {controllerHost}/tables/{tableName}/pauseStatus
    $ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
    $ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
    {
      "forceCommitJobId": "6757284f-b75b-45ce-91d8-a277bdbc06ae",
      "forceCommitStatus": "SUCCESS",
      "jobMetaZKWriteStatus": "SUCCESS"
    }
    
    $ curl -X GET {controllerHost}/tables/forceCommitStatus/6757284f-b75b-45ce-91d8-a277bdbc06ae
    {
      "jobId": "6757284f-b75b-45ce-91d8-a277bdbc06ae",
      "segmentsForceCommitted": "[\"airlineStats__0__0__20230119T0700Z\",\"airlineStats__1__0__20230119T0700Z\",\"airlineStats__2__0__20230119T0700Z\"]",
      "submissionTimeMs": "1674111682977",
      "numberOfSegmentsYetToBeCommitted": 0,
      "jobType": "FORCE_COMMIT",
      "segmentsYetToBeCommitted": [],
      "tableName": "airlineStats_REALTIME"
    }
    $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=smallest
    $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=largest
    # GET /tables/{tableName}/consumingSegmentsInfo
    curl -X GET "http://<controller_url:controller_admin_port>/tables/meetupRsvp/consumingSegmentsInfo" -H "accept: application/json"
    
    # GET /debug/tables/{tableName}
    curl -X GET "http://localhost:9000/debug/tables/meetupRsvp?type=REALTIME&verbosity=1" -H "accept: application/json"
    {
      "_segmentToConsumingInfoMap": {
        "meetupRsvp__0__0__20221019T0639Z": [
          {
            "serverName": "Server_192.168.0.103_7000",
            "consumerState": "CONSUMING",
            "lastConsumedTimestamp": 1666161593904,
            "partitionToOffsetMap": { // <<-- Deprecated. See currentOffsetsMap for same info
              "0": "6"
            },
            "partitionOffsetInfo": {
              "currentOffsetsMap": {
                "0": "6" // <-- Current consumer position
              },
              "latestUpstreamOffsetMap": {
                "0": "6"  // <-- Upstream latest position
              },
              "recordsLagMap": {
                "0": "0"  // <-- Lag, in terms of #records behind latest
              },
              "recordsAvailabilityLagMap": {
                "0": "2"  // <-- Lag, in terms of time
              }
            }
          }
        ],
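    As a sketch of how this response can be consumed programmatically, the snippet below sums the per-partition recordsLagMap values. The HTTP call to fetch the JSON is omitted, and the sample dict is trimmed to the relevant fields:

```python
# Sketch: compute the total record lag across all consuming segments from a
# consumingSegmentsInfo response (structure as in the example above).
def total_records_lag(info: dict) -> int:
    total = 0
    for consumers in info["_segmentToConsumingInfoMap"].values():
        for consumer in consumers:
            lag_map = consumer["partitionOffsetInfo"]["recordsLagMap"]
            total += sum(int(lag) for lag in lag_map.values())
    return total

sample = {
    "_segmentToConsumingInfoMap": {
        "meetupRsvp__0__0__20221019T0639Z": [
            {"partitionOffsetInfo": {"recordsLagMap": {"0": "0"}}}
        ]
    }
}
print(total_records_lag(sample))  # prints 0 when fully caught up
```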

    Minion

    Explore the minion component in Apache Pinot, empowering efficient data movement and segment generation within Pinot clusters.

    A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.

    Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.

    hashtag

    Starting a minion

    Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a minion:

    docker run \
        --network=pinot-demo \
        --name pinot-minion \
        -d ${PINOT_IMAGE} StartMinion \
        -zkAddress pinot-zookeeper:2181
    bin/pinot-admin.sh StartMinion \
        -zkAddress localhost:2181

    hashtag
    Interfaces

    hashtag
    Pinot task generator

    The Pinot task generator interface defines the APIs for the controller to generate tasks for minions to execute.

    hashtag
    PinotTaskExecutorFactory

    Factory for PinotTaskExecutor which defines the APIs for Minion to execute the tasks.

    hashtag
    MinionEventObserverFactory

    Factory for MinionEventObserver which defines the APIs for task event callbacks on minion.

    hashtag
    Built-in tasks

    Pinot ships with the following built-in Minion tasks:

    Task
    Purpose
    Table Types

    SegmentGenerationAndPushTask
    Batch ingestion: reads raw data files and converts them into Pinot segments
    OFFLINE

    RealtimeToOfflineSegmentsTask
    Converts completed real-time segments into optimized offline segments
    REALTIME to OFFLINE

    hashtag
    SegmentGenerationAndPushTask

    The SegmentGenerationAndPushTask can fetch files from an input folder (e.g. from an S3 bucket) and convert them into segments. It converts one file into one segment and keeps the file name in segment metadata to avoid duplicate ingestion.

    See SegmentGenerationAndPushTask runbook for full configuration details.

    Below is an example task config to put in TableConfig to enable this task. The task is scheduled every 10min to keep ingesting remaining files, with 10 parallel task at max and 1 file per task.

    NOTE: You may want to simply omit "tableMaxNumTasks" because of this caveat: the task generates one segment per file and derives the segment name from the file's time column. If two files have the same time range and are ingested by tasks from different schedules, their segment names may conflict. To avoid this for now, omit "tableMaxNumTasks"; it then defaults to Integer.MAX_VALUE, meaning as many tasks as needed are scheduled to ingest all input files in a single batch. Within one batch, a sequence number suffix guarantees unique segment names, but because that suffix is scoped to a single batch, tasks from different batches can still hit the conflict described above.

    circle-info

    When performing ingestion at scale remember that Pinot will list all of the files contained in the `inputDirURI` every time a `SegmentGenerationAndPushTask` job gets scheduled. This could become a bottleneck when fetching files from a cloud bucket like GCS. To prevent this make `inputDirURI` point to the least number of files possible.

    hashtag
    RealtimeToOfflineSegmentsTask

    See Pinot managed Offline flows for details.

    hashtag
    MergeRollupTask

    See Minion merge rollup task for details.

    hashtag
    PurgeTask

    See PurgeTask runbook for details.

    hashtag
    RefreshSegmentTask

    See RefreshSegmentTask runbook for details.

    hashtag
    UpsertCompactionTask

    See UpsertCompactionTask runbook for details.

    hashtag
    UpsertCompactMergeTask

    See UpsertCompactMergeTask runbook for details.

    hashtag
    Enable tasks

    Tasks are enabled on a per-table basis. To enable a certain task type (e.g. myTask) on a table, update the table config to include the task type:

    Under each enabled task type, custom properties can be configured for the task type.

    There are also two task configs that can be set as cluster configs, as shown below: one controls a task's overall timeout (1 hour by default), and the other controls how many tasks may run concurrently on a single minion worker (1 by default).

    hashtag
    Schedule tasks

    hashtag
    Auto-schedule

    There are two ways to enable task scheduling:

    hashtag
    Controller level schedule for all minion tasks

    Tasks can be scheduled periodically for all task types on all enabled tables. Enable auto task scheduling by configuring the schedule frequency in the controller config with the key controller.task.frequencyPeriod. This takes period strings as values, e.g. 2h, 30m, 1d.
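    For illustration only (this is not Pinot source code), a parser for such period strings might look like:

```python
# Illustrative parser for period strings such as "2h", "30m", "1d",
# the format accepted by controller.task.frequencyPeriod.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def period_to_seconds(period: str) -> int:
    value, unit = int(period[:-1]), period[-1].lower()
    return value * UNIT_SECONDS[unit]

print(period_to_seconds("2h"))   # 7200
print(period_to_seconds("30m"))  # 1800
print(period_to_seconds("1d"))   # 86400
```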

    hashtag
    Per table and task level schedule

    Tasks can also be scheduled based on cron expressions. The cron expression is set in the schedule config for each task type separately. The controller config controller.task.scheduler.enabled must be set to true to enable cron scheduling.

    As shown below, the RealtimeToOfflineSegmentsTask will be scheduled at the first second of every minute, following Quartz cron syntax.

    hashtag
    Manual schedule

    Tasks can be manually scheduled using the following controller rest APIs:

    Rest API
    Description

    POST /tasks/schedule

    Schedule tasks for all task types on all enabled tables

    POST /tasks/schedule?taskType=myTask

    Schedule tasks for the given task type on all enabled tables

    POST /tasks/schedule?tableName=myTable_OFFLINE

    Schedule tasks for all task types on the given table

    POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE

    hashtag
    Schedule task on specific instances

    Tasks can be scheduled on specific instances using the following config at task level:

    By default, the value is minion_untagged for backward compatibility. This allows users to schedule tasks on specific nodes and to isolate tasks across tables and task types.

    Rest API
    Description

    POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE&minionInstanceTag=tag1_MINION

    Schedule tasks for the given task type of the given table on the minion nodes tagged as tag1_MINION.

    hashtag
    Task level advanced configs

    hashtag
    allowDownloadFromServer

    When a task is executed on a segment, the minion node fetches the segment from deepstore. If the deepstore is not accessible, the minion node can download the segment from the server node. This is controlled by the allowDownloadFromServer config in the task config. By default, this is set to false.

    We can also set this config at the minion instance level with pinot.minion.task.allow.download.from.server (default is false). The instance-level config helps enforce this behavior when the number of tables or tasks is high and it should be enabled for all of them. Note: the task-level config overrides the instance-level value.

    hashtag
    Plug-in custom tasks

    To plug in a custom task, implement PinotTaskGenerator, PinotTaskExecutorFactory and MinionEventObserverFactory (optional) for the task type (all of them should return the same string for getTaskType()), and annotate them with the following annotations:

    Implementation
    Annotation

    PinotTaskGenerator

    @TaskGenerator

    PinotTaskExecutorFactory

    @TaskExecutorFactory

    MinionEventObserverFactory

    @EventObserverFactory

    After annotating the classes, place them under a package matching org.apache.pinot.*.plugin.minion.tasks.*, and they will be auto-registered by the controller and minion.

    hashtag
    Example

    See SimpleMinionClusterIntegrationTest, where the TestTask is plugged in.

    hashtag
    Task Manager UI

    In the Pinot UI, there is a Minion Task Manager tab under the Cluster Manager page. From this tab, one can find task-related information for troubleshooting. This information is mainly collected from the Pinot controller, which schedules tasks, and from Helix, which tracks task runtime status. There are also buttons to schedule tasks ad hoc. Below is a brief introduction to the pages under the Minion Task Manager tab.

    This page shows which minion task types have been used, that is, which task types have created their task queues in Helix.


    Clicking into a task type, one can see the tables using that task, along with buttons to stop the task queue, clean up ended tasks, and so on.


    Clicking into any table in this list shows how the task is configured for that table, plus the task metadata in ZK, if any. For example, MergeRollupTask tracks a watermark in ZK. If the task is cron scheduled, the current and next schedules are also shown on this page.


    At the bottom of this page is a list of tasks generated for this table and task type. In this example, one MergeRollup task has been generated and completed.

    Clicking into a task from that list shows its start/end time and the subtasks generated for it (one minion task can have multiple subtasks to process data in parallel). In this example there is a single subtask, showing when it started and stopped and which minion worker it ran on.


    Clicking into this subtask, one can see more details, such as the input task configs and error info if the task failed.


    hashtag
    Task-related metrics

    There is a controller job that runs every 5 minutes by default and emits metrics about Minion tasks scheduled in Pinot. The following metrics are emitted for each task type:

    • NumMinionTasksInProgress: Number of running tasks

    • NumMinionSubtasksRunning: Number of running sub-tasks

    • NumMinionSubtasksWaiting: Number of waiting sub-tasks (unassigned to a minion as yet)

    • NumMinionSubtasksError: Number of error sub-tasks (completed with an error/exception)

    • PercentMinionSubtasksInQueue: Percent of sub-tasks in waiting or running states

    • PercentMinionSubtasksInError: Percent of sub-tasks in error

    The controller also emits metrics about how tasks are cron scheduled:

    • cronSchedulerJobScheduled: Number of cron schedules currently registered to be triggered according to their cron expressions. It's a Gauge.

    • cronSchedulerJobTrigger: Number of cron schedule triggers, as a Meter.

    • cronSchedulerJobSkipped: Number of late cron schedules skipped, as a Meter.

    • cronSchedulerJobExecutionTimeMs: Time used to complete task generation, as a Timer.

    For each task, the minion will emit these metrics:

    • TASK_QUEUEING: Task queueing time (task_dequeue_time - task_inqueue_time), assuming the time drift between helix controller and pinot minion is minor, otherwise the value may be negative

    • TASK_EXECUTION: Task execution time, which is the time spent on executing the task

    • NUMBER_OF_TASKS: Number of tasks in progress on that minion, as a Gauge. Whenever a Minion starts a task, the Gauge is increased by 1; whenever a Minion completes (either succeeded or failed) a task, it is decreased by 1.

    • NUMBER_TASKS_EXECUTED: Number of tasks executed, as a Meter.

    • NUMBER_TASKS_COMPLETED: Number of tasks completed, as a Meter.

    • NUMBER_TASKS_CANCELLED: Number of tasks cancelled, as a Meter.

    • NUMBER_TASKS_FAILED: Number of tasks failed, as a Meter. Different from fatal failure, the task encountered an error which can not be recovered from this run, but it may still succeed by retrying the task.

    • NUMBER_TASKS_FATAL_FAILED: Number of tasks fatal failed, as a Meter. Different from failure, the task encountered an error, which will not be recoverable even with retrying the task.

    Usage: StartMinion
        -help                                                   : Print this message. (required=false)
        -minionHost               <String>                      : Host name for minion. (required=false)
        -minionPort               <int>                         : Port number to start the minion at. (required=false)
        -zkAddress                <http>                        : HTTP address of Zookeeper. (required=false)
        -clusterName              <String>                      : Pinot cluster name. (required=false)
        -configFileName           <Config File Name>            : Minion Starter Config file. (required=false)
    public interface PinotTaskGenerator {
    
      /**
       * Initializes the task generator.
       */
      void init(ClusterInfoAccessor clusterInfoAccessor);
    
      /**
       * Returns the task type of the generator.
       */
      String getTaskType();
    
      /**
       * Generates a list of tasks to schedule based on the given table configs.
       */
      List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs);
    
      /**
       * Returns the timeout in milliseconds for each task, 3600000 (1 hour) by default.
       */
      default long getTaskTimeoutMs() {
        return JobConfig.DEFAULT_TIMEOUT_PER_TASK;
      }
    
      /**
       * Returns the maximum number of concurrent tasks allowed per instance, 1 by default.
       */
      default int getNumConcurrentTasksPerInstance() {
        return JobConfig.DEFAULT_NUM_CONCURRENT_TASKS_PER_INSTANCE;
      }
    
      /**
       * Performs necessary cleanups (e.g. remove metrics) when the controller leadership changes.
       */
      default void nonLeaderCleanUp() {
      }
    }
    public interface PinotTaskExecutorFactory {
    
      /**
       * Initializes the task executor factory.
       */
      void init(MinionTaskZkMetadataManager zkMetadataManager);
    
      /**
       * Returns the task type of the executor.
       */
      String getTaskType();
    
      /**
       * Creates a new task executor.
       */
      PinotTaskExecutor create();
    }
    public interface PinotTaskExecutor {
    
      /**
       * Executes the task based on the given task config and returns the execution result.
       */
      Object executeTask(PinotTaskConfig pinotTaskConfig)
          throws Exception;
    
      /**
       * Tries to cancel the task.
       */
      void cancel();
    }
    public interface MinionEventObserverFactory {
    
      /**
       * Initializes the task executor factory.
       */
      void init(MinionTaskZkMetadataManager zkMetadataManager);
    
      /**
       * Returns the task type of the event observer.
       */
      String getTaskType();
    
      /**
       * Creates a new task event observer.
       */
      MinionEventObserver create();
    }
    public interface MinionEventObserver {
    
      /**
       * Invoked when a minion task starts.
       *
       * @param pinotTaskConfig Pinot task config
       */
      void notifyTaskStart(PinotTaskConfig pinotTaskConfig);
    
      /**
       * Invoked when a minion task succeeds.
       *
       * @param pinotTaskConfig Pinot task config
       * @param executionResult Execution result
       */
      void notifyTaskSuccess(PinotTaskConfig pinotTaskConfig, @Nullable Object executionResult);
    
      /**
       * Invoked when a minion task gets cancelled.
       *
       * @param pinotTaskConfig Pinot task config
       */
      void notifyTaskCancelled(PinotTaskConfig pinotTaskConfig);
    
      /**
       * Invoked when a minion task encounters exception.
       *
       * @param pinotTaskConfig Pinot task config
       * @param exception Exception encountered during execution
       */
      void notifyTaskError(PinotTaskConfig pinotTaskConfig, Exception exception);
    }
      "ingestionConfig": {
        "batchIngestionConfig": {
          "segmentIngestionType": "APPEND",
          "segmentIngestionFrequency": "DAILY",
          "batchConfigMaps": [
            {
              "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
              "input.fs.prop.region": "us-west-2",
              "input.fs.prop.secretKey": "....",
              "input.fs.prop.accessKey": "....",
              "inputDirURI": "s3://my.s3.bucket/batch/airlineStats/rawdata/",
              "includeFileNamePattern": "glob:**/*.avro",
              "excludeFileNamePattern": "glob:**/*.tmp",
              "inputFormat": "avro"
            }
          ]
        }
      },
      "task": {
        "taskTypeConfigsMap": {
          "SegmentGenerationAndPushTask": {
            "schedule": "0 */10 * * * ?",
            "tableMaxNumTasks": "10"
          }
        }
      }
    {
      ...
      "task": {
        "taskTypeConfigsMap": {
          "myTask": {
            "myProperty1": "value1",
            "myProperty2": "value2"
          }
        }
      }
    }
    Use the "POST /cluster/configs" API on the CLUSTER tab in Swagger with this payload:
    {
    	"RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
    	"RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
    }
      "task": {
        "taskTypeConfigsMap": {
          "RealtimeToOfflineSegmentsTask": {
            "bucketTimePeriod": "1h",
            "bufferTimePeriod": "1h",
            "schedule": "0 * * * * ?"
          }
        }
      },
      "task": {
        "taskTypeConfigsMap": {
          "RealtimeToOfflineSegmentsTask": {
            "bucketTimePeriod": "1h",
            "bufferTimePeriod": "1h",
            "schedule": "0 * * * * ?",
            "minionInstanceTag": "tag1_MINION"
          }
        }
      },

    MergeRollupTask
    Merges small segments into larger ones and optionally rolls up data at coarser granularity
    OFFLINE, REALTIME (without upsert/dedup)

    PurgeTask
    Removes or modifies records for data retention and compliance (e.g., GDPR)
    OFFLINE, REALTIME

    RefreshSegmentTask
    Reprocesses segments after table config or schema changes (new indexes, columns, data types)
    OFFLINE, REALTIME

    UpsertCompactionTask
    Compacts individual upsert segments by removing invalidated records
    REALTIME (upsert only)

    UpsertCompactMergeTask
    Merges multiple small upsert segments into larger ones to reduce segment count
    REALTIME (upsert only)

    Schedule tasks for the given task type on the given table


    SQL Reference

    Complete reference for SQL syntax, operators, and clauses supported by Apache Pinot's single-stage engine (SSE) and multi-stage engine (MSE).

    Pinot uses the Apache Calcite SQL parser with the MYSQL_ANSI dialect. This page documents every SQL statement, clause, and operator that Pinot supports, and notes where behavior differs between the single-stage engine (SSE) and the multi-stage engine (MSE).

    circle-info

    To use MSE-only features such as JOINs, subqueries, window functions, and set operations, enable the multi-stage engine with SET useMultistageEngine = true; before your query. See Use the multi-stage query engine for details.


    hashtag
    Supported Statements

    Pinot supports the following top-level statement types:

    Statement
    Description

    hashtag
    SELECT Syntax

    The full syntax for a SELECT statement in Pinot is:
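    A simplified sketch of the clause order (optional clauses in brackets; each clause is covered in the sections below):

```sql
SELECT [DISTINCT] select_expression [, select_expression ...]
FROM table_name
[WHERE predicate]
[GROUP BY expression [, expression ...]]
[HAVING predicate]
[ORDER BY expression [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...]]
[LIMIT count [OFFSET offset]]
```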

    hashtag
    Column Expressions

    A select_expression can be any of the following:

    • * -- all columns

    • A column name: city

    • A qualified column name: myTable.city

    hashtag
    Aliases

    Use AS to assign an alias to any select expression:
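    For example (myTable and its columns are placeholder identifiers):

```sql
SELECT city AS location, COUNT(*) AS totalVisits
FROM myTable
GROUP BY city
```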

    hashtag
    DISTINCT

    Use SELECT DISTINCT to return unique combinations of column values:
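    For instance:

```sql
SELECT DISTINCT city FROM myTable
```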

    circle-exclamation

    In the SSE, DISTINCT is implemented as an aggregation function. DISTINCT * is not supported; you must list specific columns. DISTINCT with GROUP BY is also not supported.


    hashtag
    FROM Clause

    hashtag
    Table References

    The simplest FROM clause references a single table:
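    For example:

```sql
SELECT * FROM myTable
```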

    hashtag
    Subqueries (MSE Only)

    With the multi-stage engine, you can use a subquery as a data source:
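    An illustrative example (column names are placeholders), filtering on the result of an inner aggregation:

```sql
SELECT t.city, t.cnt
FROM (
  SELECT city, COUNT(*) AS cnt
  FROM myTable
  GROUP BY city
) AS t
WHERE t.cnt > 10
```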

    hashtag
    JOINs (MSE Only)

    The multi-stage engine supports the following join types:

    Join Type
    Description

    For detailed join syntax and examples, see the dedicated JOIN documentation.


    hashtag
    WHERE Clause

    The WHERE clause filters rows using predicates. Multiple predicates can be combined with the logical operators AND, OR, and NOT.

    hashtag
    Comparison Operators

    Operator
    Description
    Example

    hashtag
    BETWEEN

    Tests whether a value falls within an inclusive range:

    NOT BETWEEN is also supported:
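    For example, using a placeholder numeric column:

```sql
SELECT * FROM myTable WHERE score BETWEEN 3.0 AND 3.6

SELECT * FROM myTable WHERE score NOT BETWEEN 3.0 AND 3.6
```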

    hashtag
    IN

    Tests whether a value matches any value in a list:

    NOT IN is also supported:
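    For example:

```sql
SELECT * FROM myTable WHERE city IN ('Sunnyvale', 'Oakland')

SELECT * FROM myTable WHERE city NOT IN ('Sunnyvale', 'Oakland')
```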

    circle-info

    For large value lists, consider using IdSet-based filtering for better performance.

    hashtag
    LIKE

    Pattern matching with wildcards. % matches any sequence of characters; _ matches any single character:

    NOT LIKE is also supported.
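    For example:

```sql
SELECT * FROM myTable WHERE city LIKE 'San%'     -- cities starting with "San"

SELECT * FROM myTable WHERE city LIKE 'Oaklan_'  -- exactly one trailing character
```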

    hashtag
    IS NULL / IS NOT NULL

    Tests whether a value is null:
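    For example:

```sql
SELECT * FROM myTable WHERE city IS NULL

SELECT * FROM myTable WHERE city IS NOT NULL
```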

    See the null value support documentation for details on how nulls work in Pinot.

    hashtag
    REGEXP_LIKE

    Filters rows using regular expression matching:
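    For example:

```sql
SELECT * FROM myTable WHERE REGEXP_LIKE(city, '^San.*')
```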

    circle-info

    REGEXP_LIKE supports case-insensitive matching via a third parameter: REGEXP_LIKE(col, pattern, 'i').

    hashtag
    TEXT_MATCH

    Full-text search on columns with a text index:
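    For example (textCol is a placeholder column that must have a text index configured):

```sql
SELECT * FROM myTable WHERE TEXT_MATCH(textCol, '"machine learning"')
```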

    hashtag
    JSON_MATCH

    Predicate matching on columns with a JSON index:
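    For example (jsonCol is a placeholder column with a JSON index; note the doubled single quotes to escape the string literal):

```sql
SELECT * FROM myTable WHERE JSON_MATCH(jsonCol, '"$.name"=''Adam''')
```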

    hashtag
    VECTOR_SIMILARITY

    Approximate nearest-neighbor search on vector-indexed columns:
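    An illustrative sketch (embedding is a hypothetical vector-indexed column; the third argument is the number of nearest neighbors to retrieve):

```sql
SELECT id
FROM myTable
WHERE VECTOR_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3, 0.4], 10)
```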


    hashtag
    GROUP BY

    Groups rows that share values in the specified columns, typically used with aggregation functions:
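    For example:

```sql
SELECT city, COUNT(*) AS cnt, MAX(score) AS maxScore
FROM myTable
GROUP BY city
```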

    Rules:

    • Every non-aggregated column in the SELECT list must appear in the GROUP BY clause.

    • Aggregation functions and non-aggregation columns cannot be mixed in the SELECT list without a GROUP BY.


    hashtag
    HAVING

    Filters groups after aggregation. Use HAVING instead of WHERE when filtering on aggregated values:
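    For example:

```sql
SELECT city, COUNT(*) AS cnt
FROM myTable
GROUP BY city
HAVING COUNT(*) > 100
```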


    hashtag
    ORDER BY

    Sorts the result set by one or more expressions:
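    For example:

```sql
SELECT city, score
FROM myTable
ORDER BY score DESC NULLS LAST, city ASC
LIMIT 20
```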

    hashtag
    Ordering Direction

    • ASC -- ascending order (default)

    • DESC -- descending order

    hashtag
    NULL Ordering

    • NULLS FIRST -- null values appear first

    • NULLS LAST -- null values appear last


    hashtag
    LIMIT / OFFSET

    hashtag
    LIMIT

    Restricts the number of rows returned:
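    For example:

```sql
SELECT * FROM myTable LIMIT 100
```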

    If no LIMIT is specified, Pinot defaults to returning 10 rows for selection queries.

    hashtag
    OFFSET

    Skips a number of rows before returning results. Requires ORDER BY for consistent pagination:

    Pinot also supports the legacy LIMIT offset, count syntax:
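Both forms describe the same window of rows; note that the legacy form puts the offset first. A small Python helper (hypothetical, purely for illustration) that renders both clauses for a given page:

```python
def pagination_clauses(page: int, page_size: int) -> tuple:
    # Page numbering starts at 1; OFFSET skips the rows of earlier pages.
    offset = (page - 1) * page_size
    standard = f"LIMIT {page_size} OFFSET {offset}"
    legacy = f"LIMIT {offset}, {page_size}"  # legacy form: offset, count
    return standard, legacy

print(pagination_clauses(3, 20))  # ('LIMIT 20 OFFSET 40', 'LIMIT 40, 20')
```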


    hashtag
    Logical Operators

    Operator
    Description

    hashtag
    Precedence

    From highest to lowest:

    1. NOT

    2. AND

    3. OR

    Use parentheses to override default precedence:
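Python's boolean operators happen to share this precedence (not binds tighter than and, which binds tighter than or), so the effect of parentheses can be demonstrated directly:

```python
a, b, c = True, False, False

# Without parentheses AND binds tighter, so this parses as: a OR (b AND c)
print(a or b and c)    # True

# Parentheses override the default and force OR to be evaluated first
print((a or b) and c)  # False
```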


    hashtag
    Arithmetic Operators

    Arithmetic expressions can be used in SELECT expressions, WHERE clauses, and other contexts:

    Operator
    Description
    Example

    hashtag
    Type Casting

    Use CAST to convert a value from one type to another:

    hashtag
    Supported Target Types

    Type
    Description

    hashtag
    Set Operations (MSE Only)

    The multi-stage engine supports combining results from multiple queries:

    Operation
    Description

    hashtag
    Window Functions (MSE Only)

    Window functions compute a value across a set of rows related to the current row, without collapsing them into a single output row.

    hashtag
    Syntax

    hashtag
    Frame Clause

    hashtag
    Example

For the full list of supported window functions and detailed syntax, see Window Functions.


    hashtag
    OPTION Clause

    The OPTION clause provides Pinot-specific query hints. These are not standard SQL but allow you to control engine behavior:

    The preferred approach is to use SET statements before the query:

    Common query options include:

    Option
    Description

For the complete list of query options, see Query Options.


    hashtag
    NULL Semantics

    hashtag
    Default Behavior

    By default, Pinot treats null values as the default value for the column type (0 for numeric types, empty string for strings, etc.). This avoids the overhead of null tracking and maintains backward compatibility.

    hashtag
    Nullable Columns

    To enable full null handling:

    1. Mark columns as nullable in the schema (do not set notNull: true).

    2. Enable null handling at query time:

    hashtag
    Three-Valued Logic

    When null handling is enabled, Pinot follows standard SQL three-valued logic:

    Key behaviors with null handling enabled:

    • Comparisons with NULL (e.g., col = NULL) return NULL (not TRUE or FALSE). Use IS NULL / IS NOT NULL instead.

    • NULL IN (...) returns NULL, not FALSE.

For more details, see NULL Semantics.
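Three-valued logic can be sketched in Python with None standing in for NULL. This is an illustration of the semantics only, not Pinot code:

```python
def sql_and(a, b):
    # FALSE dominates AND; otherwise NULL (None) propagates.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    # TRUE dominates OR; otherwise NULL (None) propagates.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def sql_not(a):
    # NOT NULL is NULL.
    return None if a is None else not a

print(sql_and(True, None))   # None: outcome depends on the unknown value
print(sql_or(True, None))    # True: TRUE OR anything is TRUE
print(sql_not(None))         # None
```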


    hashtag
    Identifier and Literal Rules

    • Double quotes (") delimit identifiers (column names, table names). Use double quotes for reserved keywords or special characters: SELECT "timestamp", "date" FROM myTable.

    • Single quotes (') delimit string literals: WHERE city = 'NYC'. Escape an embedded single quote by doubling it: 'it''s'.


    hashtag
    CASE WHEN

    Pinot supports CASE WHEN expressions for conditional logic:

    CASE WHEN can be used inside aggregation functions:

    circle-exclamation

    Aggregation functions inside the ELSE clause are not supported.
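The SUM(CASE WHEN ...) pattern above is conditional aggregation: rows that fail the condition contribute 0 to the sum. In Python terms (the sample rows are made up for illustration):

```python
orders = [
    {"status": "completed", "amount": 120.0},
    {"status": "canceled",  "amount": 75.0},
    {"status": "completed", "amount": 30.0},
]

# Equivalent of: SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END)
completed_revenue = sum(
    o["amount"] if o["status"] == "completed" else 0 for o in orders
)
print(completed_revenue)  # 150.0
```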


    hashtag
    Engine Compatibility Matrix

    The following table summarizes feature support across the single-stage engine (SSE) and multi-stage engine (MSE):

    Feature
    SSE
    MSE

    Ingest from Apache Kafka

    This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.

Before you begin, you should have a local cluster up and running, following the instructions in Set up a cluster.

    circle-info

    This guide uses the Kafka 3.0 connector (kafka30). Pinot also supports a Kafka 4.0 connector for KRaft-mode Kafka clusters. See Kafka Connector Versions for details on choosing the right connector.

    hashtag
    Install and Launch Kafka

    Let's start by downloading Kafka to our local machine.

    To pull down the latest Docker image, run the following command:

Download Kafka from kafka.apache.org/quickstart#quickstart_download and then extract it:

    Next we'll spin up a Kafka broker. Kafka 4.0 uses KRaft mode by default and does not require ZooKeeper:

    Note: The --network pinot-demo flag is optional and assumes that you have a Docker network named pinot-demo that you want to connect the Kafka container to.

    Kafka 4.0 uses KRaft mode by default. Generate a cluster ID and format the storage directory, then start the broker:

    Start Kafka Broker (KRaft mode)

    hashtag
    Data Source

    We're going to generate some JSON messages from the terminal using the following script:

    datagen.py

    If you run this script (python datagen.py), you'll see the following output:

    hashtag
    Ingesting Data into Kafka

    Let's now pipe that stream of messages into Kafka, by running the following command:

    We can check how many messages have been ingested by running the following command:

    Output

And we can print out the messages themselves by running the following command:

    Output

    hashtag
    Schema

    A schema defines what fields are present in the table along with their data types in JSON format.

    Create a file called /tmp/pinot/schema-stream.json and add the following content to it.

    hashtag
    Table Config

    A table is a logical abstraction that represents a collection of related data. It is composed of columns and rows (known as documents in Pinot). The table config defines the table's properties in JSON format.

    Create a file called /tmp/pinot/table-config-stream.json and add the following content to it.

    hashtag
    Create schema and table

    Create the table and schema by running the appropriate command below:

    hashtag
    Querying

Navigate to localhost:9000/#/query and click on the events table to run a query that shows the first 10 rows in this table.

    _Querying the events table_

    hashtag
    Kafka ingestion guidelines

    hashtag
    Kafka connector modules in Pinot

    Pinot ships two Kafka connector modules:

    • pinot-kafka-3.0 -- Uses Kafka client library 3.x (currently 3.9.2). This is the default connector included in Pinot distributions. Consumer factory class: org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory.

    • pinot-kafka-4.0 -- Uses Kafka client library 4.x (currently 4.1.1). This connector drops the ZooKeeper-based Scala dependency and uses the pure-Java Kafka client, suitable for KRaft-mode Kafka clusters. Consumer factory class: org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory.

    circle-info

    The legacy kafka-0.9 and kafka-2.x connector modules have been removed. If you are upgrading from an older Pinot release that used org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory, update your table configs to use one of the current connector classes listed above.

    circle-info

Pinot does not support using high-level Kafka consumers (HLC). Pinot uses low-level consumers to ensure accurate results, reduce operational complexity, improve scalability, and minimize storage overhead.

    hashtag
    Migrating from the kafka-2.x connector

    If your existing table configs reference the removed kafka-2.x connector, update the stream.kafka.consumer.factory.class.name property:

    • From: org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory

    • To (Kafka 3.x): org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory

    • To (Kafka 4.x): org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory

    No other stream config changes are required. The Kafka 3.x connector is compatible with Kafka brokers 2.x and above. The Kafka 4.x connector requires Kafka brokers 4.0 or above.

    hashtag
    Kafka configurations in Pinot

    hashtag
    Use Kafka partition (low) level consumer with SSL

Here is an example config which uses SSL-based authentication to talk to Kafka and the schema registry. Notice there are two sets of SSL options: those starting with ssl. are for the Kafka consumer, and those starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.

    hashtag
    Use Confluent Schema Registry with JSON encoded messages

    If your Kafka messages are JSON-encoded and registered with Confluent Schema Registry, use the KafkaConfluentSchemaRegistryJsonMessageDecoder. This decoder uses the Confluent KafkaJsonSchemaDeserializer to decode messages whose JSON schemas are managed by the registry.

    When to use this decoder

    • Your Kafka producer serializes messages using the Confluent JSON Schema serializer.

    • Your JSON schemas are registered in Confluent Schema Registry.

    • You want schema validation and evolution support for JSON messages.

    If your messages are Avro-encoded and registered with Schema Registry, use KafkaConfluentSchemaRegistryAvroMessageDecoder instead (shown in the SSL example above). If your messages are plain JSON without a schema registry, use JSONMessageDecoder.

    Example table config

    The key configuration properties for this decoder are:

    • stream.kafka.decoder.class.name -- Set to org.apache.pinot.plugin.inputformat.json.confluent.KafkaConfluentSchemaRegistryJsonMessageDecoder.

    • stream.kafka.decoder.prop.schema.registry.rest.url -- The URL of the Confluent Schema Registry.

    Authentication

This decoder supports the same authentication options as the Avro schema registry decoder. You can configure SSL or SASL_SSL authentication for both the Kafka consumer and the Schema Registry client using the stream.kafka.decoder.prop.schema.registry.* properties. See the SSL example and SASL_SSL example above for details.

    For Schema Registry basic authentication, add the following properties:

    circle-info

    This decoder was added in Pinot 1.4. Make sure your Pinot deployment is running version 1.4 or later.

    hashtag
    Consume transactionally-committed messages

    The Kafka 3.x and 4.x connectors support Kafka transactions. The transaction support is controlled by config kafka.isolation.level in Kafka stream config, which can be read_committed or read_uncommitted (default). Setting it to read_committed will ingest transactionally committed messages in Kafka stream only.

    For example,

Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.

    hashtag
    Use Kafka partition (low) level consumer with SASL_SSL

Here is an example config which uses SASL_SSL-based authentication to talk to Kafka and the schema registry. As in the SSL example, there are two sets of options: those for the Kafka consumer, and those starting with stream.kafka.decoder.prop.schema.registry. for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.

    hashtag
    Extract record headers as Pinot table columns

    Pinot's Kafka connector supports automatically extracting record headers and metadata into the Pinot table columns. The following table shows the mapping for record header/metadata to Pinot table column names:

    Kafka Record
    Pinot Table Column
    Description

    In order to enable the metadata extraction in a Kafka table, you can set the stream config metadata.populate to true.

    In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.

For example, if you want to add only the offset and key as dimension columns in your Pinot table, they can be listed in the schema as follows:

Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.
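The column-naming convention for extracted key, metadata, and header values can be sketched as a small Python helper (illustrative only; the names follow the mapping table in this section):

```python
def kafka_metadata_columns(headers: dict) -> list:
    # Fixed columns for the record key and record metadata, plus one
    # __header$<HeaderKeyName> column per Kafka record header.
    cols = [
        "__key",
        "__metadata$offset",
        "__metadata$partition",
        "__metadata$recordTimestamp",
    ]
    cols += [f"__header${name}" for name in headers]
    return cols

print(kafka_metadata_columns({"traceId": "abc"}))
```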

    circle-info

Remember to follow the schema evolution guidelines when updating the schema of an existing table!

    hashtag
    Tell Pinot where to find an Avro schema

There is a standalone utility to generate the schema from an Avro file. See Infer the Pinot schema from the Avro schema and JSON data for details.

To avoid errors like "The Avro schema must be provided", designate the location of the schema in your streamConfigs section. For example, if your current section contains the following:

Then add the key "stream.kafka.decoder.prop.schema" followed by a value that denotes the location of your schema.

    hashtag
    Subset partition ingestion

    By default, a Pinot REALTIME table consumes from all partitions of the configured Kafka topic. In some scenarios you may want a table to consume only a subset of the topic's partitions. The stream.kafka.partition.ids setting lets you specify exactly which Kafka partitions a table should consume.

    When to use subset partition ingestion

    • Split-topic ingestion -- Multiple Pinot tables share the same Kafka topic, and each table is responsible for a different set of partitions. This is useful when the same topic contains logically distinct data partitioned by key, and you want separate tables (or indexes) for each partition group.

    • Multi-table partition assignment -- You want to distribute the partitions of a high-throughput topic across several Pinot tables for workload isolation, independent scaling, or different retention policies.

    • Selective consumption -- You only need data from specific partitions of a topic (for example, partitions that correspond to a particular region or tenant).

    Configuration

    Add stream.kafka.partition.ids to the streamConfigMaps entry in your table config. The value is a comma-separated list of Kafka partition IDs (zero-based integers):

    When this setting is present, Pinot will consume only from the listed partitions. When it is absent or blank, Pinot consumes from all partitions of the topic (the default behavior).

    Example: splitting a topic across two tables

    Suppose you have a Kafka topic called events with two partitions (0 and 1). You can create two Pinot tables, each consuming from one partition:

    Table events_part_0:

    Table events_part_1:

    Validation rules and limitations

    • Partition IDs must be non-negative integers. Negative values will cause a validation error.

    • Non-integer values (e.g. "abc") will cause a validation error.

• Duplicate IDs are silently deduplicated. For example, "0,2,0,5" is treated as "0,2,5".
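A Python sketch of how such a partition-ID list is parsed under the rules above (illustrative only; Pinot's actual validation is implemented in Java):

```python
def parse_partition_ids(raw: str) -> list:
    # An absent or blank value means "consume all partitions".
    if not raw or not raw.strip():
        return []
    ids = set()
    for token in raw.split(","):
        token = token.strip()      # surrounding whitespace is tolerated
        if not token.isdigit():    # rejects negatives and non-integers
            raise ValueError(f"Invalid partition id: {token!r}")
        ids.add(int(token))        # duplicates collapse in the set
    return sorted(ids)

print(parse_partition_ids("0,2,0,5"))     # [0, 2, 5]
print(parse_partition_ids(" 0 , 2 , 5 ")) # [0, 2, 5]
```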

    hashtag
    Use Protocol Buffers (Protobuf) format

    Pinot supports decoding Protocol Buffer messages from Kafka using several decoder options depending on your setup.

    ProtoBufMessageDecoder (descriptor file based)

    Use ProtoBufMessageDecoder when you have a pre-compiled .desc (descriptor) file for your Protobuf schema. This decoder uses dynamic message parsing and does not require compiled Java classes.

    Required stream config properties:

    Property
    Description

    Example streamConfigs:

    ProtoBufCodeGenMessageDecoder (compiled JAR based)

    Use ProtoBufCodeGenMessageDecoder when you have a compiled JAR containing your generated Protobuf Java classes. This decoder uses runtime code generation for improved decoding performance.

    Required stream config properties:

    Property
    Description

    Example streamConfigs:

    KafkaConfluentSchemaRegistryProtoBufMessageDecoder (Confluent Schema Registry)

    Use KafkaConfluentSchemaRegistryProtoBufMessageDecoder when your Protobuf schemas are managed by Confluent Schema Registry. This decoder automatically resolves schemas from the registry at runtime.

    Required stream config properties:

    Property
    Description

    Optional properties:

    Property
    Description

    Example streamConfigs:

    hashtag
    Use Apache Arrow format

    Pinot supports decoding Apache Arrow IPC streaming format messages from Kafka using ArrowMessageDecoder. This is useful when upstream systems produce data serialized in Arrow format.

    Optional stream config properties:

    Property
    Description

    Example streamConfigs:

    circle-info

    The Arrow decoder expects each Kafka message to contain a complete Arrow IPC stream (schema + record batch). Ensure your producer serializes Arrow data in the IPC streaming format.

    hashtag
    Consuming a Subset of Kafka Partitions

    By default, a Pinot realtime table consumes all partitions of a Kafka topic. You can restrict ingestion to a specific subset of partitions using the stream.kafka.partition.ids property. This is useful when:

    • Splitting a single Kafka topic across multiple Pinot tables for independent scaling

    • Multi-tenant scenarios where different tables own different partition ranges

    hashtag
    Configuration

    Add stream.kafka.partition.ids to your streamConfigs with a comma-separated list of partition IDs:

    hashtag
    Notes

    • Partition IDs are validated against actual Kafka topic metadata at startup.

    • Duplicate IDs in the list are automatically deduplicated.

    • The total partition count reported to the broker reflects the full Kafka topic size, ensuring correct query routing across tables sharing the same topic.

    An expression: price * quantity

  • A function call: UPPER(city)

  • An aggregation function: COUNT(*), SUM(revenue)

  • A CASE WHEN expression

  • CROSS JOIN

    Cartesian product of both tables

    SEMI JOIN

    Rows from the left table that have a match in the right table

    ANTI JOIN

    Rows from the left table that have no match in the right table

    ASOF JOIN

    Rows matched by closest value (e.g., closest timestamp)

    LEFT ASOF JOIN

    Like ASOF JOIN but keeps all left rows

    Less than

    WHERE price < 100

    >

    Greater than

    WHERE price > 50

    <=

    Less than or equal to

    WHERE quantity <= 10

    >=

    Greater than or equal to

    WHERE rating >= 4.0

    Aggregate expressions are not allowed inside the GROUP BY clause.

    Multiplication

    price * quantity

    /

    Division

    total / count

    %

    Modulo (remainder)

    id % 10

    BOOLEAN

    Boolean value

    TIMESTAMP

    Timestamp value

    VARCHAR / STRING

    Variable-length string

    BYTES

    Byte array

    JSON

    JSON value

    useStarTree

    Enable or disable star-tree index usage

    skipUpsert

    Query all records in an upsert table, ignoring deletes

    TRUE

    FALSE

    FALSE

    TRUE

    FALSE

    TRUE

    NULL

    NULL

    TRUE

    NULL

    FALSE

    FALSE

    FALSE

    FALSE

    TRUE

    FALSE

    NULL

    FALSE

    NULL

    TRUE

    NULL

    NULL

    NULL

    NULL

    NULL

    NULL NOT IN (...) returns NULL, not TRUE.
  • Aggregate functions like SUM, AVG, MIN, MAX ignore NULL values.

  • COUNT(*) counts all rows; COUNT(col) counts only non-null values.

  • Decimal literals should be enclosed in single quotes to preserve precision.

    Yes

    Yes

    CASE WHEN

    Yes

    Yes

    BETWEEN, IN, LIKE, IS NULL

    Yes

    Yes

    Arithmetic operators (+, -, *, /, %)

    Yes

    Yes

    CAST

    Yes

    Yes

    OPTION / SET query hints

    Yes

    Yes

    EXPLAIN PLAN

    Yes

    Yes

    OFFSET

    Yes

    Yes

    JOINs (INNER, LEFT, RIGHT, FULL, CROSS)

    No

    Yes

    Semi / Anti joins

    No

    Yes

    ASOF / LEFT ASOF joins

    No

    Yes

    Subqueries

    No

    Yes

    Set operations (UNION, INTERSECT, EXCEPT)

    No

    Yes

    Window functions (OVER, PARTITION BY)

    No

    Yes

    Correlated subqueries

    No

    No

    INSERT INTO (from file)

    No

    Yes

    CREATE TABLE / DROP TABLE DDL

    No

    No

    DISTINCT with *

    No

    No

    DISTINCT with GROUP BY

    No

    No

    SELECT

    Query data from one or more tables

    SET

    Set query options for the session (e.g., SET useMultistageEngine = true)

    EXPLAIN PLAN FOR

    Display the query execution plan without running the query

    [INNER] JOIN

    Rows that match in both tables

    LEFT [OUTER] JOIN

    All rows from the left table, matching rows from the right

    RIGHT [OUTER] JOIN

    All rows from the right table, matching rows from the left

    FULL [OUTER] JOIN

    =

    Equal to

    WHERE city = 'NYC'

    <> or !=

    Not equal to

    WHERE status <> 'canceled'

    AND

    True if both conditions are true

    OR

    True if either condition is true

    NOT

    Negates a condition

    +

    Addition

    price + tax

    -

    Subtraction

    total - discount

    INT / INTEGER

    32-bit signed integer

    BIGINT / LONG

    64-bit signed integer

    FLOAT

    32-bit floating point

    DOUBLE

    UNION ALL

    Combine all rows from both queries (including duplicates)

    UNION

    Combine rows from both queries, removing duplicates

    INTERSECT

    Return rows that appear in both queries

    EXCEPT

    timeoutMs

    Query timeout in milliseconds

    useMultistageEngine

    Use the multi-stage engine (true/false)

    enableNullHandling

    Enable three-valued null logic

    maxExecutionThreads

    A

    B

    A AND B

    A OR B

    NOT A

    TRUE

    TRUE

    TRUE

    TRUE

    FALSE

    SELECT, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT

    Yes

    Yes

    DISTINCT

    Yes

    Yes


    All rows from both tables

    <

    *

    64-bit floating point

    Return rows from the first query that do not appear in the second

    Limit CPU threads used by the query

    Aggregation functions

    __metadata$offset : String

    Record metadata - partition : int

    __metadata$partition : String

    Record metadata - recordTimestamp : long

    __metadata$recordTimestamp : String

    .
  • The partition IDs are sorted internally for stable ordering, regardless of the order specified in the config.

  • The configured partition IDs are validated against the actual Kafka topic metadata at table creation time. If a specified partition ID does not exist in the topic, an error is raised.

  • When using subset partition ingestion with multiple tables consuming from the same topic, ensure that the partition assignments do not overlap if you want each record to be consumed by exactly one table. Pinot does not enforce non-overlapping partition assignments across tables.

  • Whitespace around partition IDs and commas is trimmed (e.g., " 0 , 2 , 5 " is valid).

  • When splitting a topic between two tables, configure one with even-numbered IDs and another with odd-numbered IDs (for example, "0,2" and "1,3" for a 4-partition topic).

    Record key: any type

    __key : String

    For simplicity of design, we assume that the record key is always a UTF-8 encoded String

    Record Headers: Map<String, String>

    Each header key is listed as a separate column: __header$HeaderKeyName : String

    For simplicity of design, we directly map the string headers from kafka record to pinot table column

    stream.kafka.decoder.prop.descriptorFile

    Path or URI to the .desc descriptor file. Supports local file paths, HDFS, and other Pinot-supported file systems.

    stream.kafka.decoder.prop.protoClassName

    (Optional) Fully qualified Protobuf message name within the descriptor. If omitted, the first message type in the descriptor is used.

    stream.kafka.decoder.prop.jarFile

    Path or URI to the JAR file containing compiled Protobuf classes.

    stream.kafka.decoder.prop.protoClassName

    Fully qualified Java class name of the Protobuf message (required).

    stream.kafka.decoder.prop.schema.registry.rest.url

    URL of the Confluent Schema Registry.

    stream.kafka.decoder.prop.cached.schema.map.capacity

    Maximum number of cached schemas. Default: 1000.

    stream.kafka.decoder.prop.schema.registry.*

    SSL and authentication options for connecting to Schema Registry (same pattern as the Avro Confluent decoder).

    stream.kafka.decoder.prop.arrow.allocator.limit

    Maximum memory (in bytes) for the Arrow allocator. Default: 268435456 (256 MB).

    docker pull apache/kafka:4.0.0
    tar -xzf kafka_2.13-4.0.0.tgz
    cd kafka_2.13-4.0.0
    docker run --network pinot-demo --name=kafka \
        -e KAFKA_NODE_ID=1 \
        -e KAFKA_PROCESS_ROLES=broker,controller \
        -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093 \
        -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092 \
        -e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
        -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT \
        -e KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafka:9093 \
        -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
        -e CLUSTER_ID=MkU3OEVBNTcwNTJENDM2Qk \
        apache/kafka:4.0.0
    KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
    bin/kafka-storage.sh format --standalone -t $KAFKA_CLUSTER_ID -c config/server.properties
    bin/kafka-server-start.sh config/server.properties

    Record metadata - offset : long

    -- Set a query option, then run a query
    SET useMultistageEngine = true;
    SELECT COUNT(*) FROM myTable WHERE city = 'San Francisco';
    -- View the execution plan
    EXPLAIN PLAN FOR
    SELECT COUNT(*) FROM myTable GROUP BY city;
    SELECT [ DISTINCT ] select_expression [, select_expression ]*
    FROM table_reference
    [ WHERE filter_condition ]
    [ GROUP BY group_expression [, group_expression ]* ]
    [ HAVING having_condition ]
    [ ORDER BY order_expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, ...] ]
    [ LIMIT count ]
    [ OFFSET offset ]
    [ OPTION ( key = value [, key = value ]* ) ]
    SELECT city AS metro_area, COUNT(*) AS total_orders
    FROM orders
    GROUP BY city
    SELECT DISTINCT city, state
    FROM stores
    LIMIT 100
    SELECT * FROM myTable
    SET useMultistageEngine = true;
    SELECT city, avg_revenue
    FROM (
      SELECT city, AVG(revenue) AS avg_revenue
      FROM orders
      GROUP BY city
    ) AS sub
    WHERE avg_revenue > 1000
    SET useMultistageEngine = true;
    SELECT o.order_id, c.name
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    WHERE o.amount > 100
    SELECT * FROM orders
    WHERE amount BETWEEN 100 AND 500
    SELECT * FROM orders
    WHERE amount NOT BETWEEN 100 AND 500
    SELECT * FROM orders
    WHERE city IN ('NYC', 'LA', 'Chicago')
    SELECT * FROM orders
    WHERE status NOT IN ('canceled', 'refunded')
    SELECT * FROM customers
    WHERE name LIKE 'John%'
    SELECT * FROM orders
    WHERE discount IS NOT NULL
    SELECT * FROM airlines
    WHERE REGEXP_LIKE(airlineName, '^U.*')
    SELECT * FROM logs
    WHERE TEXT_MATCH(message, 'error AND timeout')
    SELECT * FROM events
    WHERE JSON_MATCH(payload, '"$.type" = ''click''')
    SELECT * FROM embeddings
    WHERE VECTOR_SIMILARITY(vector_col, ARRAY[0.1, 0.2, 0.3], 10)
    SELECT city, COUNT(*) AS order_count, SUM(amount) AS total
    FROM orders
    GROUP BY city
    SELECT city, COUNT(*) AS order_count
    FROM orders
    GROUP BY city
    HAVING COUNT(*) > 100
    SELECT city, SUM(amount) AS total
    FROM orders
    GROUP BY city
    ORDER BY total DESC
    SELECT city, revenue
    FROM stores
    ORDER BY revenue DESC NULLS LAST
    SELECT * FROM orders LIMIT 50
    SELECT * FROM orders
    ORDER BY created_at DESC
    LIMIT 20 OFFSET 40
    SELECT * FROM orders
    ORDER BY created_at DESC
    LIMIT 40, 20
    SELECT * FROM orders
    WHERE (status = 'completed' OR status = 'shipped')
      AND amount > 100
    SELECT order_id, price * quantity AS line_total
    FROM line_items
    WHERE (price * quantity) > 1000
    SELECT CAST(revenue AS BIGINT) FROM orders
    SELECT CAST(event_time AS TIMESTAMP), CAST(user_id AS VARCHAR)
    FROM events
    SET useMultistageEngine = true;
    
    SELECT city FROM stores
    UNION ALL
    SELECT city FROM warehouses
    SET useMultistageEngine = true;
    
    SELECT customer_id FROM orders_2024
    INTERSECT
    SELECT customer_id FROM orders_2025
    function_name ( expression ) OVER (
      [ PARTITION BY partition_expression [, ...] ]
      [ ORDER BY order_expression [ ASC | DESC ] [, ...] ]
      [ frame_clause ]
    )
    { ROWS | RANGE } BETWEEN frame_start AND frame_end
    
    frame_start / frame_end:
      UNBOUNDED PRECEDING
      | offset PRECEDING
      | CURRENT ROW
      | offset FOLLOWING
      | UNBOUNDED FOLLOWING
    SET useMultistageEngine = true;
    
    SELECT
      city,
      order_date,
      amount,
      SUM(amount) OVER (PARTITION BY city ORDER BY order_date) AS running_total,
      ROW_NUMBER() OVER (PARTITION BY city ORDER BY amount DESC) AS rank
    FROM orders
    SELECT * FROM orders
    WHERE city = 'NYC'
    OPTION(timeoutMs=5000)
    SET timeoutMs = 5000;
    SET useMultistageEngine = true;
    SELECT * FROM orders WHERE city = 'NYC'
    SET enableNullHandling = true;
    SELECT * FROM orders WHERE discount IS NULL
    SELECT
      order_id,
      CASE
        WHEN amount > 1000 THEN 'high'
        WHEN amount > 100 THEN 'medium'
        ELSE 'low'
      END AS tier
    FROM orders
    SELECT
      SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue
    FROM orders
import datetime
import json
import random
import time
import uuid

while True:
    ts = int(datetime.datetime.now().timestamp() * 1000)
    # Shorten the UUID to a 20-character hex string, matching the sample output
    id = str(uuid.uuid4()).replace("-", "")[:20]
    count = random.randint(0, 1000)
    print(json.dumps({"ts": ts, "uuid": id, "count": count}))
    time.sleep(0.05)  # throttle so the stream is easy to follow
    
    {"ts": 1644586485807, "uuid": "93633f7c01d54453a144", "count": 807}
    {"ts": 1644586485836, "uuid": "87ebf97feead4e848a2e", "count": 41}
    {"ts": 1644586485866, "uuid": "960d4ffa201a4425bb18", "count": 146}
    python datagen.py | docker exec -i kafka /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic events;
    python datagen.py | bin/kafka-console-producer.sh --bootstrap-server localhost:9092  --topic events;
    docker exec -i kafka /opt/kafka/bin/kafka-get-offsets.sh --bootstrap-server localhost:9092 --topic events
    bin/kafka-get-offsets.sh --bootstrap-server localhost:9092 --topic events
    events:0:11940
    docker exec -i kafka /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events
    ...
    {"ts": 1644586485807, "uuid": "93633f7c01d54453a144", "count": 807}
    {"ts": 1644586485836, "uuid": "87ebf97feead4e848a2e", "count": 41}
    {"ts": 1644586485866, "uuid": "960d4ffa201a4425bb18", "count": 146}
    ...
    {
      "schemaName": "events",
      "dimensionFieldSpecs": [
        {
          "name": "uuid",
          "dataType": "STRING"
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "count",
          "dataType": "INT"
        }
      ],
      "dateTimeFieldSpecs": [{
        "name": "ts",
        "dataType": "TIMESTAMP",
        "format" : "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }]
    }
    {
      "tableName": "events",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "ts",
        "schemaName": "events",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.topic.name": "events",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.broker.list": "kafka:9092",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    docker run --rm -ti  --network=pinot-demo  -v /tmp/pinot:/tmp/pinot  apachepinot/pinot:1.0.0 AddTable  -schemaFile /tmp/pinot/schema-stream.json  -tableConfigFile /tmp/pinot/table-config-stream.json  -controllerHost pinot-controller  -controllerPort 9000 -exec
    bin/pinot-admin.sh AddTable -schemaFile /tmp/pinot/schema-stream.json -tableConfigFile /tmp/pinot/table-config-stream.json
      {
        "tableName": "transcript",
        "tableType": "REALTIME",
        "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
        },
        "tenants": {},
        "tableIndexConfig": {
          "loadMode": "MMAP",
          "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "transcript-topic",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
            "stream.kafka.broker.list": "localhost:9092",
            "schema.registry.url": "",
            "security.protocol": "SSL",
            "ssl.truststore.location": "",
            "ssl.keystore.location": "",
            "ssl.truststore.password": "",
            "ssl.keystore.password": "",
            "ssl.key.password": "",
            "stream.kafka.decoder.prop.schema.registry.rest.url": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.keystore.location": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.keystore.password": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.keystore.type": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.truststore.type": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.key.password": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.protocol": ""
          }
        },
        "metadata": {
          "customConfigs": {}
        }
      }
    {
      "tableName": "events",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "created_at",
        "timeType": "MILLISECONDS",
        "schemaName": "events",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.topic.name": "events",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.confluent.KafkaConfluentSchemaRegistryJsonMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.broker.list": "localhost:9092",
          "stream.kafka.schema.registry.url": "http://localhost:8081",
          "stream.kafka.decoder.prop.schema.registry.rest.url": "http://localhost:8081",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "50M",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    "stream.kafka.decoder.prop.basic.auth.credentials.source": "USER_INFO",
    "stream.kafka.decoder.prop.schema.registry.basic.auth.user.info": "<username>:<password>"
      {
        "tableName": "transcript",
        "tableType": "REALTIME",
        "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
        },
        "tenants": {},
        "tableIndexConfig": {
          "loadMode": "MMAP",
          "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "transcript-topic",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.isolation.level": "read_committed"
          }
        },
        "metadata": {
          "customConfigs": {}
        }
      }
    "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "mytopic",
            "stream.kafka.consumer.prop.auto.offset.reset": "largest",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.schema.registry.url": "https://xxx",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
            "stream.kafka.decoder.prop.schema.registry.rest.url": "https://xxx",
            "stream.kafka.decoder.prop.basic.auth.credentials.source": "USER_INFO",
            "stream.kafka.decoder.prop.schema.registry.basic.auth.user.info": "schema_registry_username:schema_registry_password",
            "sasl.mechanism": "PLAIN" ,
            "security.protocol": "SASL_SSL" ,
            "sasl.jaas.config":"org.apache.kafka.common.security.scram.ScramLoginModule required username=\"kafkausername\" password=\"kafkapassword\";",
            "realtime.segment.flush.threshold.rows": "0",
            "realtime.segment.flush.threshold.time": "24h",
            "realtime.segment.flush.autotune.initialRows": "3000000",
            "realtime.segment.flush.threshold.segment.size": "500M"
          },
      "dimensionFieldSpecs": [
        {
          "name": "__key",
          "dataType": "STRING"
        },
        {
          "name": "__metadata$offset",
          "dataType": "STRING"
        },
        {
          "name": "__metadata$partition",
          "dataType": "STRING"
        },
        ...
      ],
    ...
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
      "stream.kafka.broker.list": "",
      "stream.kafka.consumer.prop.auto.offset.reset": "largest"
      ...
    }
    "stream.kafka.partition.ids": "0,2,5"
    {
      "tableName": "events_part_0",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "ts",
        "schemaName": "events",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP"
      },
      "ingestionConfig": {
        "streamIngestionConfig": {
          "streamConfigMaps": [
            {
              "streamType": "kafka",
              "stream.kafka.topic.name": "events",
              "stream.kafka.partition.ids": "0",
              "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
              "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
              "stream.kafka.broker.list": "kafka:9092",
              "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
              "realtime.segment.flush.threshold.rows": "0",
              "realtime.segment.flush.threshold.time": "24h",
              "realtime.segment.flush.threshold.segment.size": "50M"
            }
          ]
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    {
      "tableName": "events_part_1",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "ts",
        "schemaName": "events",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP"
      },
      "ingestionConfig": {
        "streamIngestionConfig": {
          "streamConfigMaps": [
            {
              "streamType": "kafka",
              "stream.kafka.topic.name": "events",
              "stream.kafka.partition.ids": "1",
              "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
              "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
              "stream.kafka.broker.list": "kafka:9092",
              "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
              "realtime.segment.flush.threshold.rows": "0",
              "realtime.segment.flush.threshold.time": "24h",
              "realtime.segment.flush.threshold.segment.size": "50M"
            }
          ]
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "my-protobuf-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.protobuf.ProtoBufMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.decoder.prop.descriptorFile": "/path/to/my_message.desc",
      "stream.kafka.decoder.prop.protoClassName": "mypackage.MyMessage"
    }
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "my-protobuf-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.protobuf.ProtoBufCodeGenMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.decoder.prop.jarFile": "/path/to/my-protobuf-classes.jar",
      "stream.kafka.decoder.prop.protoClassName": "com.example.proto.MyMessage"
    }
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "my-protobuf-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.protobuf.KafkaConfluentSchemaRegistryProtoBufMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
    }
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "my-arrow-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.arrow.ArrowMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.decoder.prop.arrow.allocator.limit": "536870912"
    }
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "myTopic",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
      "stream.kafka.partition.ids": "0,2,5"
    }

    GapFill Function for Time-Series Dataset

    circle-info

The GapFill function is experimental and has limited support, validation, and error reporting.

    circle-info

    GapFill Function is only supported with the single-stage query engine (v1).

Many datasets are time series in nature, tracking the state changes of entities over time. The granularity of recorded data points might be sparse, or events could be missing due to network and other device issues in IoT environments. But analytics applications that track the state changes of these entities over time may query at a lower granularity than the metric interval.

Here is the sample data set tracking the status of parking lots in a parking space:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:01:00.000 | 1 |
| P2 | 2021-10-01 09:17:00.000 | 1 |
| P1 | 2021-10-01 09:33:00.000 | 0 |
| P1 | 2021-10-01 09:47:00.000 | 1 |
| P3 | 2021-10-01 10:05:00.000 | 1 |
| P2 | 2021-10-01 10:06:00.000 | 0 |
| P2 | 2021-10-01 10:16:00.000 | 1 |
| P2 | 2021-10-01 10:31:00.000 | 0 |
| P3 | 2021-10-01 11:17:00.000 | 0 |
| P1 | 2021-10-01 11:54:00.000 | 0 |

We want to find out the total number of parking lots that are occupied over a period of time, which is a common use case for a company that manages parking spaces.

Let us take a 30-minute time bucket as an example:

| timeBucket/lotId | P1 | P2 | P3 |
| --- | --- | --- | --- |
| 2021-10-01 09:00:00.000 | 1 | 1 | |
| 2021-10-01 09:30:00.000 | 0,1 | | |
| 2021-10-01 10:00:00.000 | | 0,1 | 1 |
| 2021-10-01 10:30:00.000 | | 0 | |
| 2021-10-01 11:00:00.000 | | | 0 |
| 2021-10-01 11:30:00.000 | 0 | | |

If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied parking lots per time bucket, we need to gap-fill the missing data.

    hashtag
    The Ways of Gap Filling the Data

    There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.

FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity (in this case, the parking lot) if a previous value exists; otherwise, it will be filled with the default value.

FILL_DEFAULT_VALUE means the missing data will be filled with the default value. For numeric columns, the default value is 0. For BOOLEAN columns, it is false. For TIMESTAMP, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON, and BYTES columns, it is the empty string. For array columns, it is the empty array.

We will use the following query to calculate the total number of occupied parking lots per time bucket.

    hashtag
    Aggregation/Gapfill/Aggregation

    hashtag
    Query Syntax

In the example above, the TIMESERIESON(column_name) element is obligatory, and column_name must refer to an actual table column; it cannot be a literal or an expression.

Moreover, if the innermost query contains a GROUP BY clause then (contrary to regular queries) it must contain an aggregate function; otherwise the error Select and Gapfill should be in the same sql statement is returned.

    hashtag
    Workflow

The innermost SQL will convert the raw event table to the following table:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |

The second-most-nested SQL will gap-fill the returned data as follows:

| timeBucket/lotId | P1 | P2 | P3 |
| --- | --- | --- | --- |
| 2021-10-01 09:00:00.000 | 1 | 1 | 0 |
| 2021-10-01 09:30:00.000 | 1 | 1 | 0 |
| 2021-10-01 10:00:00.000 | 1 | 1 | 1 |
| 2021-10-01 10:30:00.000 | 1 | 0 | 1 |
| 2021-10-01 11:00:00.000 | 1 | 0 | 0 |
| 2021-10-01 11:30:00.000 | 0 | 0 | 0 |

The outermost query will aggregate the gap-filled data as follows:

| timeBucket | totalNumOfOccuppiedSlots |
| --- | --- |
| 2021-10-01 09:00:00.000 | 2 |
| 2021-10-01 09:30:00.000 | 2 |
| 2021-10-01 10:00:00.000 | 3 |
| 2021-10-01 10:30:00.000 | 2 |
| 2021-10-01 11:00:00.000 | 1 |
| 2021-10-01 11:30:00.000 | 0 |

Note the assumption made here: the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation will not sort the data.

The above example shows the use case where all three steps happen:

    1. The raw data will be aggregated;

    2. The aggregated data will be gapfilled;

    3. The gapfilled data will be aggregated.

    There are three more scenarios we can support.

    hashtag
    Select/Gapfill

    If we want to gapfill the missing data per half an hour time bucket, here is the query:

    hashtag
    Query Syntax

    hashtag
    Workflow

At first the raw data will be transformed as follows:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |

Then it will be gap-filled as follows:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P3 | 2021-10-01 09:00:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P2 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 10:00:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P1 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 10:30:00.000 | 1 |
| P1 | 2021-10-01 11:00:00.000 | 1 |
| P2 | 2021-10-01 11:00:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| P2 | 2021-10-01 11:30:00.000 | 0 |
| P3 | 2021-10-01 11:30:00.000 | 0 |

    hashtag
    Aggregate/Gapfill

    hashtag
    Query Syntax

    hashtag
    Workflow

The nested SQL will convert the raw event table to the following table:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |

The outer SQL will gap-fill the returned data as follows:

| timeBucket/lotId | P1 | P2 | P3 |
| --- | --- | --- | --- |
| 2021-10-01 09:00:00.000 | 1 | 1 | 0 |
| 2021-10-01 09:30:00.000 | 1 | 1 | 0 |
| 2021-10-01 10:00:00.000 | 1 | 1 | 1 |
| 2021-10-01 10:30:00.000 | 1 | 0 | 1 |
| 2021-10-01 11:00:00.000 | 1 | 0 | 0 |
| 2021-10-01 11:30:00.000 | 0 | 0 | 0 |

    hashtag
    Gapfill/Aggregate

    hashtag
    Query Syntax

    hashtag
    Workflow

At first, the raw data will be transformed as follows:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |

The transformed data will be gap-filled as follows:

| lotId | event_time | is_occupied |
| --- | --- | --- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P3 | 2021-10-01 09:00:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P2 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 10:00:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P1 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 10:30:00.000 | 1 |
| P1 | 2021-10-01 11:00:00.000 | 1 |
| P2 | 2021-10-01 11:00:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| P2 | 2021-10-01 11:30:00.000 | 0 |
| P3 | 2021-10-01 11:30:00.000 | 0 |

The aggregation will generate the following table:

| timeBucket | totalNumOfOccuppiedSlots |
| --- | --- |
| 2021-10-01 09:00:00.000 | 2 |
| 2021-10-01 09:30:00.000 | 2 |
| 2021-10-01 10:00:00.000 | 3 |
| 2021-10-01 10:30:00.000 | 2 |
| 2021-10-01 11:00:00.000 | 1 |
| 2021-10-01 11:30:00.000 | 0 |


Aggregation/Gapfill/Aggregation query:

    SELECT time_col, SUM(status) AS occupied_slots_count
    FROM (
        SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
                       '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
                        TIMESERIESON(lotId)), lotId, status
        FROM (
            SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
                   '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
                   lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
            FROM parking_data
            WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
            GROUP BY 1, 2
            ORDER BY 1
            LIMIT 100)
        LIMIT 100)
    GROUP BY 1
    LIMIT 100
Select/Gapfill query:

    SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
                   '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
                   '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
                   '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
                   TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
    FROM parking_data
    WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
    ORDER BY 1
    LIMIT 100
Aggregate/Gapfill query:

    SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
                   '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
                   TIMESERIESON(lotId)), lotId, status
    FROM (
        SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
               lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
        FROM parking_data
        WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
        GROUP BY 1, 2
        ORDER BY 1
        LIMIT 100)
    LIMIT 100
Gapfill/Aggregate query:

    SELECT time_col, SUM(is_occupied) AS occupied_slots_count
    FROM (
        SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
        FROM parking_data
        WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
        ORDER BY 1
        LIMIT 100)
    GROUP BY 1
    LIMIT 100

    Stream Ingestion with Upsert

    Upsert support in Apache Pinot.

    Pinot provides native upsert support during ingestion. There are scenarios where records need modifications, such as correcting a ride fare or updating a delivery status.

    Partial upserts are convenient as you only need to specify the columns where values change, and you ignore the rest.

    hashtag
    Table type support

    Upsert is supported across REALTIME, OFFLINE, and HYBRID table types. The available modes depend on the table type:

    Table type
    FULL upsert
    PARTIAL upsert
    Notes

    For OFFLINE table upsert configuration details, see .

    hashtag
    Overview of upserts in Pinot

    See an overview of how upserts work in Pinot.

    hashtag
    Enable upserts in Pinot

    To enable upserts on a Pinot table, do the following:

    hashtag
    Define the primary key in the schema

    To update a record, you need a primary key to uniquely identify the record. To define a primary key, add the field primaryKeyColumns to the schema definition. For example, the schema definition of UpsertMeetupRSVP in the quick start example has this definition.

    Note this field expects a list of columns, as the primary key can be a composite.
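As a sketch, a schema with a primary key might look like the following (the field names here are illustrative, not copied from the quick start files):

```json
{
  "schemaName": "UpsertMeetupRSVP",
  "dimensionFieldSpecs": [
    { "name": "event_id", "dataType": "STRING" },
    { "name": "group_city", "dataType": "STRING" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "mtime",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": ["event_id"]
}
```

A composite key would simply list more columns, e.g. `"primaryKeyColumns": ["event_id", "group_city"]`.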

    When two records of the same primary key are ingested, the record with the greater comparison value (timeColumn by default) is used. When records have the same primary key and event time, then the order is not determined. In most cases, the later ingested record will be used, but this may not be true in cases where the table has a column to sort by.

    circle-exclamation

    Partition the input stream by the primary key

An important requirement for a Pinot upsert table is to partition the input stream by the primary key. For Kafka, this means the producer must set the record key when sending messages. If the original stream is not partitioned this way, a stream processing job (for example, one built with Flink) is needed to shuffle and repartition the input into a keyed stream for Pinot's ingestion.
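For example, with the Kafka console producer you can key each message so that all updates for the same primary key land on the same partition (the topic name and payload here are illustrative):

```shell
# parse.key splits each input line into key and value on key.separator;
# records sharing a key are routed to the same Kafka partition
printf 'order123:{"order_id": "order123", "status": "DELIVERED"}\n' | \
  bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
    --topic orders \
    --property parse.key=true \
    --property key.separator=:
```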

    Additionally if using

    hashtag
    Enable upsert in the table configurations

    To enable upsert, make the following configurations in the table configurations.

    hashtag
    Upsert modes

    Full upsert

The upsert mode defaults to FULL. FULL upsert means that a new record will replace the older record completely if they have the same primary key. Example config:
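A minimal sketch of the upsertConfig block in the table config:

```json
{
  "upsertConfig": {
    "mode": "FULL"
  }
}
```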

    Partial upserts

    Partial upsert lets you choose to update only specific columns and ignore the rest.

    To enable the partial upsert, set the mode to PARTIAL and specify partialUpsertStrategies for partial upsert columns. Since release-0.10.0, OVERWRITE is used as the default strategy for columns without a specified strategy. defaultPartialUpsertStrategy is also introduced to change the default strategy for all columns.

    circle-info

    Note that null handling must be enabled for partial upsert to work.

    For example:
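A sketch of a partial upsert config (the column names and chosen strategies below are illustrative):

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "rsvp_count": "INCREMENT",
      "group_name": "IGNORE"
    }
  }
}
```

Columns not listed under partialUpsertStrategies fall back to defaultPartialUpsertStrategy.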

Pinot supports the following partial upsert strategies:

| Strategy | Description |
| --- | --- |
| OVERWRITE | Overwrite the column with the value from the new record |
| INCREMENT | Add the new value to the existing value |
| APPEND | Add the new item to the multi-value column |
| UNION | Add the new item to the multi-value column if it is not already present |
| IGNORE | Ignore the new value and keep the existing value |
| MAX | Keep the maximum of the existing and new values |
| MIN | Keep the minimum of the existing and new values |

    With partial upsert, if the value is null in either the existing record or the new coming record, Pinot will ignore the upsert strategy and the null value:

    (null, newValue) -> newValue

    (oldValue, null) -> oldValue

    (null, null) -> null

    hashtag
    Post-Partial-Upsert Transforms (Derived Columns)

    When using partial upserts, you may have derived columns that need to be recomputed after the row is merged from the incoming record and the existing record. The postPartialUpsertTransformConfigs feature allows you to apply transformation functions to compute derived columns from the fully merged row.

    Use Case

    Consider an e-commerce table tracking orders:

    • order_id: Primary key

    • score: Points earned from the order

    • bonus: Bonus points awarded

    With partial upserts, incoming records may only contain updated values for score or bonus. The ingestion-time transforms only see the incoming record, so they cannot correctly compute total from a partially merged row. The postPartialUpsertTransformConfigs allows you to recompute total from the complete merged row after the partial upsert merge happens.

    Configuration

    To enable post-partial-upsert transforms, add the postPartialUpsertTransformConfigs configuration to your table's upsertConfig:
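A sketch of the configuration shape, using the score/bonus/total columns from the use case above (the columnName/transformFunction key names are assumed to mirror Pinot's ingestion transform configs):

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "postPartialUpsertTransformConfigs": [
      {
        "columnName": "total",
        "transformFunction": "plus(score, bonus)"
      }
    ]
  }
}
```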

    Evaluation Semantics

    • Post-partial-upsert transforms are evaluated after the partial upsert merge completes

    • They operate on the complete merged row, not just the incoming record

    • Both incoming and existing column values are available for the transform expression

    Interaction with Ingestion Transforms

    Ingestion-time transforms and post-partial-upsert transforms serve different purposes:

    Aspect
    Ingestion Transforms
    Post-Partial-Upsert Transforms

    Both can be used together:

    1. Ingestion transforms normalize the incoming record

    2. The normalized incoming record participates in partial upsert merge

    3. Post-partial-upsert transforms recompute derived columns from the complete merged row

    Example Workflow

    Given a partial upsert table with this configuration:
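A sketch of such a configuration, reusing the order columns from the use case above (the strategy choices are illustrative):

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "partialUpsertStrategies": {
      "score": "OVERWRITE",
      "bonus": "OVERWRITE"
    },
    "postPartialUpsertTransformConfigs": [
      {
        "columnName": "total",
        "transformFunction": "plus(score, bonus)"
      }
    ]
  }
}
```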

    Processing these records:

    1. Initial record (order_id=123):

      • Incoming: {order_id: 123, score: 100, bonus: 10}

      • Merge: (first record, no existing row)

    circle-info

    The derived columns computed by post-partial-upsert transforms can be queried like any other column. If you need to use these derived columns in further upsert strategies or transforms, ensure they are defined in your schema.

    None upserts

If the mode is set to NONE, upsert is disabled.

    hashtag
    Comparison column

    By default, Pinot uses the value in the time column (timeColumn in tableConfig) to determine the latest record. That means, for two records with the same primary key, the record with the larger value of the time column is picked as the latest update. However, there are cases when users need to use another column to determine the order. In such case, you can use option comparisonColumn to override the column used for comparison. For example,
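A sketch of overriding the comparison column (the column name is illustrative):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["lastUpdatedAt"]
  }
}
```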

    For partial upsert table, the out-of-order events won't be consumed and indexed. For example, for two records with the same primary key, if the record with the smaller value of the comparison column came later than the other record, it will be skipped.

    circle-info

NOTE: Use comparisonColumns even for a single comparison column; comparisonColumn is deprecated. You may see unrecognizedProperties warnings when using the old config, but it is converted to comparisonColumns automatically when the table is added.

    hashtag
    Multiple comparison columns

    In some cases, especially where partial upsert might be employed, there may be multiple producers of data each writing to a mutually exclusive set of columns, sharing only the primary key. In such a case, it may be helpful to use one comparison column per producer group so that each group can manage its own specific versioning semantics without the need to coordinate versioning across other producer groups.

    Documents written to Pinot are expected to have exactly 1 non-null value out of the set of comparisonColumns; if more than 1 of the columns contains a value, the document will be rejected. When new documents are written, whichever comparison column is non-null will be compared against only that same comparison column seen in prior documents with the same primary key. Consider the following examples, where the documents are assumed to arrive in the order specified in the array.
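A sketch of such a stream, reconstructed to be consistent with the outcomes listed below (the primary key column name event_id and the epoch values are assumptions); the table is assumed to use "comparisonColumns": ["secondsSinceEpoch", "otherComparisonColumn"]:

```json
[
  { "event_id": "aa", "orderReceived": 1, "secondsSinceEpoch": 1567205394 },
  { "event_id": "aa", "orderReceived": 2, "secondsSinceEpoch": 1567205397 },
  { "event_id": "aa", "orderReceived": 3, "secondsSinceEpoch": 1567205396 },
  { "event_id": "aa", "orderReceived": 4, "otherComparisonColumn": 1567205395 },
  { "event_id": "aa", "orderReceived": 5, "otherComparisonColumn": 1567205392 },
  { "event_id": "aa", "orderReceived": 6, "otherComparisonColumn": 1567205398 }
]
```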

    The following would occur:

1. orderReceived: 1

• Result: persisted

• Reason: first doc seen for primary key "aa"

2. orderReceived: 2

• Result: persisted (replacing orderReceived: 1)

• Reason: comparison column (secondsSinceEpoch) larger than that previously seen

3. orderReceived: 3

• Result: rejected

• Reason: comparison column (secondsSinceEpoch) smaller than that previously seen

4. orderReceived: 4

• Result: persisted (replacing orderReceived: 2)

• Reason: comparison column (otherComparisonColumn) larger than that previously seen (never seen previously), despite the value being smaller than that seen for secondsSinceEpoch

5. orderReceived: 5

• Result: rejected

• Reason: comparison column (otherComparisonColumn) smaller than that previously seen

6. orderReceived: 6

• Result: persisted (replacing orderReceived: 4)

• Reason: comparison column (otherComparisonColumn) larger than that previously seen

    hashtag
    Metadata time-to-live (TTL)

    In Pinot, the metadata map is stored in heap memory. To decrease in-memory data and improve performance, minimize the time primary key entries are stored in the metadata map (metadata time-to-live (TTL)). Limiting the TTL is especially useful for primary keys with high cardinality and frequent updates.

    Since the metadata TTL is applied on the first comparison column, the time unit of the upsert TTL is the same as that of the first comparison column.

    hashtag
    Configure how long primary keys are stored in metadata

    To configure how long primary keys are stored in metadata, specify the length of time in metadataTTL. For example:
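    A sketch of the relevant upsertConfig, mirroring the quick start example later on this page (86400 is in the unit of the first comparison column, seconds in this case):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "snapshot": "ENABLE",
    "preload": "ENABLE",
    "metadataTTL": 86400
  }
}
```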

    In this example, Pinot will retain primary keys in metadata for 1 day.

    Note that enabling the upsert snapshot is required when using metadata TTL, so that the in-memory validDocIds can be recovered.

    hashtag
    Delete column

    An upsert Pinot table can support soft-deletes of primary keys. This requires the incoming record to contain a dedicated single-value boolean column that serves as a delete marker for a primary key. Once the real-time engine encounters a record with the delete column set to true, the primary key is removed from the queryable set of documents. This means the primary key will not be visible in queries, unless explicitly requested via the query option skipUpsert=true.

    Note that the delete column has to be a single-value boolean column.

    circle-info

    Note that when deleteRecordColumn is added to an existing table, it will require a server restart to actually pick up the upsert config changes.

    A deleted primary key can be revived by ingesting a record with the same primary key, but with higher comparison column value(s).

    Note that when reviving a primary key in a partial upsert table, the revived record will be treated as the source of truth for all columns. This means any previous updates to the columns will be ignored and overwritten with the new record's values.

    hashtag
    Deleted Keys time-to-live (TTL)

    The above config deleteRecordColumn only soft-deletes the primary key. To decrease in-memory data and improve performance, minimize the time deleted-primary-key entries are stored in the metadata map (deletedKeys time-to-live (TTL)). Limiting the TTL is especially useful for deleted-primary-keys where there are no future updates foreseen.

    hashtag
    Configure how long deleted-primary-keys are stored in metadata

    To configure how long deleted primary keys are stored in metadata, specify the length of time in deletedKeysTTL. For example:
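    A sketch mirroring the example later on this page (<column_name> is a placeholder for your delete marker column; 86400 assumes a comparison column in seconds):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "<column_name>",
    "deletedKeysTTL": 86400
  }
}
```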

    In this example, Pinot will retain the deleted-primary-keys in metadata for 1 day.

    circle-info

    Note that the unit of deletedKeysTTL should be the same as the unit of the comparison column. If your comparison column holds values in seconds, this config should also be in seconds (see the example above). metadataTTL and deletedKeysTTL do not work with multiple comparison columns, and the comparison/time column must be of NUMERIC type.

    hashtag
    Data consistency with deletes and compaction together

    When using deletedKeysTTL together with UpsertCompactionTask, there can be a scenario where a segment containing a deleted record (where deleteRecordColumn = true was set for the primary key) gets compacted first, while an older record for the same key is not yet compacted. During a server restart, the old record is then added to the metadata manager map and treated as non-deleted. To prevent data inconsistencies in this scenario, set the config enableDeletedKeysCompactionConsistency to true; this ensures that deleted records are not compacted until all previous records for the deleted primary key, across all other segments, have been compacted.
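    A sketch mirroring the example later on this page (<column_name> is a placeholder for your delete marker column):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "<column_name>",
    "deletedKeysTTL": 86400,
    "enableDeletedKeysCompactionConsistency": true
  }
}
```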

    hashtag
    Data consistency when queries and upserts happen concurrently

    Upserts in Pinot enable real-time updates and ensure that queries always retrieve the latest version of a record, making them a powerful feature for managing mutable data efficiently. However, in applications with extremely high QPS and high ingestion rates, queries and upserts happening concurrently can sometimes lead to inconsistencies in query results.

    For example, consider a table with 1 million primary keys. A distinct count query should always return 1 million, regardless of how new records are ingested and older records are invalidated. However, at high ingestion and query rates, the query may occasionally return a count slightly above or below 1 million. This happens because queries determine valid records by acquiring validDocIds bitmaps from multiple segments, which indicate which documents are currently valid. Since acquiring these bitmaps is not atomic with respect to ongoing upserts, a query may capture an inconsistent view of the data, leading to overcounting or undercounting of valid records.

    This is a classic concurrency issue where reads and writes happen simultaneously, leading to temporary inconsistencies. Typically, such issues are resolved using locks or snapshots to maintain a stable view of the data during query execution. To address this, two new consistency modes - SYNC and SNAPSHOT - have been introduced for upsert enabled tables to ensure consistent query results even when queries and upserts occur concurrently and at very high throughput.

    By default, the consistency mode is NONE, meaning the system operates as before. The SYNC mode ensures consistency by blocking upserts while queries execute, guaranteeing that queries always see a stable upserted data view. However, this can introduce write latency. Alternatively, the SNAPSHOT mode creates a consistent snapshot of validDocIds bitmaps for queries to use. This allows upserts to continue without blocking queries, making it more suitable for workloads with both high query and write rates. These new consistency modes provide flexibility, allowing applications to balance consistency guarantees against performance trade-offs based on their specific requirements.

    For SNAPSHOT mode, you can configure how often the upsert view should be refreshed via an upsertConfig setting called upsertViewRefreshIntervalMs, which defaults to 3000ms. Both the write and query threads can refresh the upsert view when it becomes stale according to this config. Changing this config requires server restarts.
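    A sketch showing both settings together (SNAPSHOT is illustrative; SYNC and NONE are the other consistencyMode values, and 3000 ms is the default refresh interval):

```json
{
  "upsertConfig": {
    "consistencyMode": "SNAPSHOT",
    "upsertViewRefreshIntervalMs": 3000
  }
}
```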

    You can further adjust the view's freshness at query time, without restarting servers, via a query option called upsertViewFreshnessMs. By default, this query option matches the upsertConfig upsertViewRefreshIntervalMs, but if a query sets it to a smaller value, the upsert view may be refreshed sooner for that query; if set to 0, the query forces a refresh of the upsert view every time.

    For debugging purposes, there's a query option called skipUpsertView. If set to true, it bypasses the consistent upsert view maintained by SYNC or SNAPSHOT modes. This effectively executes the query as if it were in NONE mode.

    hashtag
    Use strictReplicaGroup for routing

    The upsert Pinot table can use only the low-level consumer for the input streams. As a result, it implicitly uses partitioned replica-group assignment for the segments. Moreover, upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, you must use strictReplicaGroup as the routing strategy. To do so, configure instanceSelectorType in Routing as follows:
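    The corresponding routing config, mirroring the example later on this page:

```json
{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```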

    circle-exclamation

    Using the implicit partitioned replica-group assignment from the low-level consumer won't persist the instance assignment (mapping from partition to servers) to ZooKeeper, and newly added servers will be automatically included without explicitly reassigning instances (usually through a rebalance). This can cause new segments of the same partition to be assigned to a different server, breaking the requirement of upsert.

    To prevent this, we recommend using explicit partitioned replica-group instance assignment to ensure the instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.

    hashtag
    Enable validDocIds snapshots for upsert metadata recovery

    Upsert snapshot support was added in release-0.12.0. To enable the snapshot, set snapshot to ENABLE. For example:
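    A sketch of the config, mirroring the example later on this page:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "snapshot": "ENABLE"
  }
}
```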

    Upsert maintains metadata in memory containing which docIds are valid in a particular segment (ValidDocIndexes). This metadata is lost during server restarts and needs to be recreated. ValidDocIndexes cannot be recovered easily after out-of-TTL primary keys get removed. Enabling snapshots addresses this problem by adding functions to store and recover the validDocIds snapshot for immutable segments.

    The snapshots are taken on every segment commit to ensure that they are consistent with the persisted data in case of abrupt shutdown. We recommend that you enable this feature so as to speed up server boot times during restarts.

    circle-info

    The lifecycle of validDocIds snapshots is as follows:

    1. If snapshot is enabled, snapshots for existing segments are taken or refreshed when the next consuming segment gets started.

    2. The snapshot files are kept on disk until the segments get removed, e.g. due to data retention or manual deletion.

    hashtag
    Enable preload for faster server restarts

    The upsert preload feature can make it faster to restore the upsert states when a server restarts. To enable preload, set preload to ENABLE. Snapshot must also be enabled. For example:
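    A sketch of the config, mirroring the example later on this page (remember to also set pinot.server.instance.max.segment.preload.threads in the server config, as described below):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "snapshot": "ENABLE",
    "preload": "ENABLE"
  }
}
```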

    Under the hood, it uses the validDocIds snapshots to identify the valid docs and restore their upsert metadata quickly instead of performing a whole upsert comparison flow. The flow is triggered before the server is marked as ready, after which the server starts to load the remaining segments without snapshots (hence the name preload).

    The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. It's 0 by default to disable the preloading feature.

    circle-exclamation

    A bug was introduced in v1.2.0: when the enablePreload and enableSnapshot flags are set to true but max.segment.preload.threads is left as 0, the preloading mechanism is still enabled but segments fail to load because there are no threads for preloading. This was fixed in newer versions, but on v1.2.0, if enablePreload and enableSnapshot are set to true, remember to also set max.segment.preload.threads to a positive value. A server restart is needed for the max.segment.preload.threads config change to take effect.

    hashtag
    Enable commit time compaction for storage optimization

    circle-exclamation

    If you are enabling commit time compaction for an existing table, it is recommended to first pause the ingestion for that table, enable this feature by updating the table-config, and then resume ingestion.

    Many Upsert use-cases have a lot of Update events within the segment commit window. For instance, if we had an Upsert table for order status of Uber Eats orders, we would expect a lot of update events for the same order within a 1 hour window. For such use-cases, the committed segments end up with a lot of dead tuples, and you have to wait for the Segment Compaction tasks to prune them, which can take hours.

    Commit time compaction is a performance optimization feature for upsert tables that removes invalid and obsolete records during the segment commit process itself. This not only reduces the storage bloat of the table immediately, but it can also bring down the segment commit time.

    To enable commit time compaction, set enableCommitTimeCompaction to true in the upsert configuration. For example:
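    A sketch of the config, mirroring the example later on this page:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "enableCommitTimeCompaction": true
  }
}
```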

    How it works

    During segment commit, commit time compaction:

    • Filters out invalid document IDs. Retains valid records and soft-deleted records.

    • Generates accurate column statistics for compacted segments

    • Maintains correct document order while removing obsolete data

    Configuration requirements

    • The feature is enabled per table by setting enableCommitTimeCompaction=true in the upsert configuration

    • Changes take effect after one segment commit cycle (the current consuming segment will be committed without compaction)

    • Compatible with all types of upsert tables

    hashtag
    Handle out-of-order events

    There are 2 configs added related to handling out-of-order events.

    hashtag
    dropOutOfOrderRecord

    To enable dropping of out-of-order records, set dropOutOfOrderRecord to true. For example:
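    A sketch of the config (mode FULL shown for illustration; the example later on this page elides the other upsertConfig fields):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "dropOutOfOrderRecord": true
  }
}
```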

    This feature doesn't persist any out-of-order event to the consuming segment. If not specified, the default value is false.

    • When false, the out-of-order record gets persisted to the consuming segment, but the MetadataManager mapping is not updated, so this record is not referenced in queries or in any future updates. You can still see these records when using the skipUpsert query option.

    • When true, the out-of-order record doesn't get persisted at all and the MetadataManager mapping is not updated, so this record is not referenced in queries or in any future updates. You cannot see these records even when using the skipUpsert query option.

    hashtag
    outOfOrderRecordColumn

    This config identifies out-of-order events programmatically. To enable it, add a boolean field to your table schema, say isOutOfOrder, and reference it via this config. For example:
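    A sketch of the config (mode FULL shown for illustration; isOutOfOrder is the boolean schema field described above):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "outOfOrderRecordColumn": "isOutOfOrder"
  }
}
```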

    This feature persists a true / false value to the isOutOfOrder field based on whether the event arrived in order. You can filter out out-of-order events while using skipUpsert to avoid confusion. For example:

    circle-info

    Note that dropOutOfOrderRecord and outOfOrderRecordColumn are only supported when no consistencyMode is set (i.e., consistencyMode = NONE). This is because, when a consistencyMode is enabled, rows are added before the valid documents are updated. As a result, out-of-order records cannot be dropped or marked in upsert tables, defeating the purpose of these options.

    hashtag
    Use custom metadata manager

    Pinot supports custom PartitionUpsertMetadataManager implementations that handle record and segment updates.

    hashtag
    Adding custom upsert managers

    You can add custom PartitionUpsertMetadataManager as follows:

    • Create a new java project. Make sure you keep the package name as org.apache.pinot.segment.local.upsert.xxx

    • In your java project include the dependency

    • Add your custom partition manager that implements PartitionUpsertMetadataManager interface

    • Add your custom TableUpsertMetadataManager that implements BaseTableUpsertMetadataManager interface

    • Place the compiled JAR in the /plugins directory in pinot. You will need to restart all Pinot instances if they are already running.

    • Now, you can use the custom upsert manager in table configs as follows:

    ⚠️ The upsert manager class name is case-insensitive.

    hashtag
    Immutable upsert configuration fields

    triangle-exclamation

    Certain upsert and schema configuration fields cannot be modified after table creation.

    Changing these fields on an existing upsert table can lead to data inconsistencies or data loss, particularly when servers restart and commit segments. Pinot validates and invalidates documents based on these configurations, so altering them after data has been ingested will cause the existing validDocId snapshots to become inconsistent with the new configuration.

    The following fields are immutable after table creation:

    hashtag
    Upsert table limitations

    There are some limitations for the upsert Pinot tables.

    • Partial upsert is supported for REALTIME tables only. OFFLINE tables support FULL upsert only. See Offline Table Upsert for details.

    • The star-tree index cannot be used for indexing, as the star-tree index performs pre-aggregation during the ingestion.

    • Unlike append-only tables, out-of-order events (with a comparison value in the incoming record less than the latest available value) won't be consumed and indexed by a Pinot partial upsert table; these late events will be skipped.

    hashtag
    Best practices

    Unlike other real-time tables, an upsert table takes up more memory resources as it needs to bookkeep the record locations in memory. As a result, it's important to plan the capacity beforehand and monitor the resource usage. Here are some recommended practices for using upsert tables.

    hashtag
    Create the topic/stream with more partitions.

    The number of partitions in the input streams determines the partition count of the Pinot table. The more partitions you have in the input topic/stream, the more Pinot servers you can distribute the Pinot table to, and therefore the more you can scale the table horizontally. Note that you can't increase the number of partitions later for upsert-enabled tables, so start with enough partitions (at least 2-3x the number of Pinot servers).

    hashtag
    Memory usage

    Upsert table maintains an in-memory map from the primary key to the record location, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. Beware when using a JSON column as the primary key: the same key-values in a different order are considered different primary keys. In addition, consider the hashFunction config in the upsert config, which can be UUID, MD5 or MURMUR3.

    If your primary key column is a valid UUID and you are running out of memory due to a high number of primary keys, the UUID hash function can lower memory requirements by up to 35% without bringing in any hash collision risks. If the primary key is not a valid UUID, this hash function stores the primary key as is and skips the UUID based compression.

    MD5 and MURMUR3 can also help lower memory requirements. They work for all types of primary key values, but bring in a small risk of hash collision. The generated hash from MD5 and MURMUR3 is a 128-bit hash, so this is beneficial when your primary key values are larger than 128-bits.
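    An illustrative sketch enabling a hash function for primary keys (MURMUR3 shown; UUID and MD5 are the other options discussed above):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "MURMUR3"
  }
}
```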

    hashtag
    Monitoring

    Set up a dashboard over the metric pinot.server.upsertPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking growth, which is proportional to the memory usage growth. The total memory usage by upsert is roughly (primaryKeysCount * (sizeOfKeyInBytes + 24)).

    hashtag
    Capacity planning

    It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the rate of the primary keys in the input stream per partition and extrapolate the data to a specific time period (based on table retention) to approximate the memory usage. A heap dump is also useful to check the memory usage so far on an upsert table instance.

    hashtag
    Example

    Putting these together, you can find the table configurations of the quick start examples as follows:

    circle-info

    Pinot server maintains a primary key to record location map across all the segments served in an upsert-enabled table. As a result, when updating the config for an existing upsert table (e.g. change the columns in the primary key, change the comparison column), servers need to be restarted in order to apply the changes and rebuild the map.

    hashtag
    Advanced Server Configuration

    hashtag
    Consuming Segment Consistency Mode

    For partial upsert tables, or tables with dropOutOfOrderRecord=true, configure how the server handles segment reloads and force commits via pinot.server.consuming.segment.consistency.mode in pinot-server.conf:

    Mode
    Description

    Note: This is a server-level property distinct from the table-level upsertConfig.consistencyMode setting.

    hashtag
    Migrating from deprecated config fields

    As of Pinot 1.4.0, the following upsert config fields have been renamed:

    Deprecated field
    New field
    Values

    The new fields use the Enablement enum (ENABLE, DISABLE, DEFAULT) instead of boolean values. DEFAULT defers to the server-level configuration, which allows table-level overrides when the feature is enabled at the instance level.

    The deprecated boolean fields still work but will be removed in a future release. Update your table configs to use the new field names.

    hashtag
    Quick Start

    To illustrate how full upsert works, the Pinot binary comes with a quick start example. Use the following command to create a real-time upsert table meetupRSVP.

    You can also run the partial upsert demo with the following command:

    As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.

    For partial upsert you can see only the value from configured column changed based on specified partial upsert strategy.

    An example for partial upsert is shown below: each event_id remains unique during ingestion, while the value of rsvp_count is incremented.

    To see the difference from the non-upsert table, you can use a query option skipUpsert to skip the upsert effect in the query result.

    hashtag
    FAQ

    Can I change configs like primary key columns and comparison columns in existing upsert table?

    Not recommended. Existing segments contain validDocId snapshots computed using the old configuration. Changing the configuration can lead to data inconsistencies, as existing snapshots wouldn't be cleaned up, especially if a server restarts with validDocId snapshots while replica servers do not.

    Avoid changing: primary key columns, comparison columns, partial upsert strategies, upsert mode, and hashFunction.

    If changes are unavoidable:

    Best option: Create a new table and reingest all data.

    Alternative: Disable SNAPSHOT, pause consumption and restart all the servers. This will work for new incoming keys only; consistency across existing data is not guaranteed.

    HYBRID: full upsert Yes, partial upsert No. Avoid overlapping time ranges between offline and realtime.

    If you are using segmentPartitionConfig to leverage Broker segment pruning, it's important to ensure that the partition function used matches on both the Kafka producer side and the Pinot side. In Kafka, the default for the Java client is the 32-bit murmur2 hash, and for all other languages such as Python it's CRC32 (Cyclic Redundancy Check 32-bit).

    IGNORE

    Ignore the new value, keep the existing value (v0.10.0+)

    MAX

    Keep the maximum value between the existing value and new value (v0.12.0+)

    MIN

    Keep the minimum value between the existing value and new value (v0.12.0+)

    total: Derived column that should equal score + bonus

  • The transforms use the same function syntax as ingestion-time transforms

  • Transform results are stored in the derived columns as part of the final record

  • Normalize/clean raw input data

    Recompute derived columns from merged state

    Applies to

    All table types (upsert and non-upsert)

    Partial upsert tables only

    Example

    Convert timestamp format

    total = plus(score, bonus) where score and bonus come from merged row

  • Initial record (order_id=123):

    • Incoming: {order_id: 123, score: 100, bonus: 10}

    • Post-transform: total = plus(100, 10) = 110

    • Final: {order_id: 123, score: 100, bonus: 10, total: 110}

  • Update record (order_id=123):

    • Incoming: {order_id: 123, score: 150} (only score updated)

    • Merge: {order_id: 123, score: 150, bonus: 10} (bonus preserved from existing row)

    • Post-transform: total = plus(150, 10) = 160

    • Final: {order_id: 123, score: 150, bonus: 10, total: 160}

  • Another update (order_id=123):

    • Incoming: {order_id: 123, bonus: 25} (only bonus updated)

    • Merge: {order_id: 123, score: 150, bonus: 25} (score preserved from existing row)

    • Post-transform: total = plus(150, 25) = 175

    • Final: {order_id: 123, score: 150, bonus: 25, total: 175}


    If snapshot is disabled, the existing snapshot for a segment is cleaned up when the segment gets loaded by the server, e.g. when the server restarts.

    Reduces segment size immediately without requiring minion tasks
    Schema fields:
    • primaryKeyColumns

    upsertConfig fields:

    • mode (FULL, PARTIAL, NONE)

    • hashFunction

    • comparisonColumns

    • timeColumnName (when used as the default comparison column)

    • partialUpsertStrategies (for PARTIAL mode)

    • defaultPartialUpsertStrategy (for PARTIAL mode)

    • dropOutOfOrderRecord

    • outOfOrderRecordColumn

    Attempting to update these fields will return an error:

    Recommended workaround: Create a new table with the desired configuration and reingest all data.

    Alternative (use with caution): If you must modify these fields without recreating the table, you can use the force=true query parameter on the table config update API. Before doing so, disable SNAPSHOT mode in upsertConfig, pause consumption, and restart all servers. Note that this approach only guarantees consistency for newly ingested keys; existing data may remain inconsistent.

    We cannot change the number of partitions in the source topic after the upsert/dedup table is created (start with a relatively high number of partitions as mentioned in best practices).

    REALTIME: full upsert Yes, partial upsert Yes. Stream-based ingestion with full upsert feature set.

    OFFLINE: full upsert Yes, partial upsert No.

    OVERWRITE

    Overwrite the column of the last record

    INCREMENT

    Add the new value to the existing values

    APPEND

    Add the new item to the Pinot unordered set

    UNION

    Execution timing

    Before ingestion into Pinot

    After partial upsert merge, during ingestion

    Input record

    Incoming source record

    Merged row (incoming + existing)

    RESTRICTED

    (Default for partial upsert tables with RF > 1) Disables segment reloads and force commits to prevent data inconsistency.

    PROTECTED

    Enables reloads/force commits with upsert metadata reversion during segment replacements. Requires ParallelSegmentConsumptionPolicy set to DISALLOW_ALWAYS or ALLOW_DURING_BUILD_ONLY.

    UNSAFE

    Allows reloads without metadata reversion. Use only if inconsistency is acceptable or handled externally.

    enableSnapshot

    snapshot

    ENABLE, DISABLE, or DEFAULT

    enablePreload

    preload

    ENABLE, DISABLE, or DEFAULT


    Batch ingestion; replaces full rows only

    Add the new item to the Pinot unordered set if not exists

    Use case

    upsert_meetupRsvp_schema.json
    {
        "primaryKeyColumns": ["event_id"]
    }
    {
      "upsertConfig": {
        "mode": "FULL"
      }
    }
    release-0.8.0
    {
      "upsertConfig": {
        "mode": "PARTIAL",
        "partialUpsertStrategies":{
          "rsvp_count": "INCREMENT",
          "group_name": "IGNORE",
          "venue_name": "OVERWRITE"
        }
      },
      "tableIndexConfig": {
        "nullHandlingEnabled": true
      }
    }
    release-0.10.0
    {
      "upsertConfig": {
        "mode": "PARTIAL",
        "defaultPartialUpsertStrategy": "OVERWRITE",
        "partialUpsertStrategies":{
          "rsvp_count": "INCREMENT",
          "group_name": "IGNORE"
        }
      },
      "tableIndexConfig": {
        "nullHandlingEnabled": true
      }
    }
    Table Config Example
    {
      "upsertConfig": {
        "mode": "PARTIAL",
        "defaultPartialUpsertStrategy": "OVERWRITE",
        "partialUpsertStrategies": {
          "score": "OVERWRITE",
          "bonus": "OVERWRITE"
        },
        "postPartialUpsertTransformConfigs": {
          "total": "plus(score, bonus)"
        }
      },
      "tableIndexConfig": {
        "nullHandlingEnabled": true
      }
    }
    {
      "upsertConfig": {
        "mode": "PARTIAL",
        "partialUpsertStrategies": {
          "score": "OVERWRITE",
          "bonus": "OVERWRITE"
        },
        "postPartialUpsertTransformConfigs": {
          "total": "plus(score, bonus)"
        }
      },
      "tableIndexConfig": {
        "nullHandlingEnabled": true
      }
    }
    {
      "upsertConfig": {
        "mode": "FULL",
        "comparisonColumn": "anotherTimeColumn"
      }
    }
    {
      "upsertConfig": {
        "mode": "PARTIAL",
        "defaultPartialUpsertStrategy": "OVERWRITE",
        "partialUpsertStrategies":{},
        "comparisonColumns": ["secondsSinceEpoch", "otherComparisonColumn"]
      }
    }
    [
      {
        "event_id": "aa",
        "orderReceived": 1,
        "description" : "first",
        "secondsSinceEpoch": 1567205394
      },
      {
        "event_id": "aa",
        "orderReceived": 2,
        "description" : "update",
        "secondsSinceEpoch": 1567205397
      },
      {
        "event_id": "aa",
        "orderReceived": 3,
        "description" : "update",
        "secondsSinceEpoch": 1567205396
      },
      {
        "event_id": "aa",
        "orderReceived": 4,
        "description" : "first arrival, other column",
        "otherComparisonColumn": 1567205395
      },
      {
        "event_id": "aa",
        "orderReceived": 5,
        "description" : "late arrival, other column",
        "otherComparisonColumn": 1567205392
      },
      {
        "event_id": "aa",
        "orderReceived": 6,
        "description" : "update, other column",
        "otherComparisonColumn": 1567205398
      }
    ]
    {
      "upsertConfig": {
        "mode": "FULL",
        "snapshot": "ENABLE",
        "preload": "ENABLE",
        "metadataTTL": 86400
      }
    }
    { 
        "upsertConfig": {  
            ... 
            "deleteRecordColumn": <column_name>
        } 
    }
    // In the Schema
    {
        ...
        {
          "name": "<delete_column_name>",
          "dataType": "BOOLEAN"
        },
        ...
    }
    {
      "upsertConfig": {
        "mode": "FULL",
        "deleteRecordColumn": <column_name>,
        "deletedKeysTTL": 86400
      }
    }
    {
      "upsertConfig": {
        "mode": "FULL",
        "deleteRecordColumn": <column_name>,
        "deletedKeysTTL": 86400,
        "enableDeletedKeysCompactionConsistency": true
      }
    }
    {
      "upsertConfig": {
        "consistencyMode": "SYNC", // or "SNAPSHOT", "NONE"
    ...
      }
    }
    {
      "routing": {
        "instanceSelectorType": "strictReplicaGroup"
      }
    }
    {
      "upsertConfig": {
        "mode": "FULL",
        "snapshot": "ENABLE"
      }
    }
    {
      "upsertConfig": {
        "mode": "FULL",
        "snapshot": "ENABLE",
        "preload": "ENABLE"
      }
    }
    {
      "upsertConfig": {
        "mode": "FULL",
        "enableCommitTimeCompaction": true
      }
    }
    {
      "upsertConfig": {
        ...,
        "dropOutOfOrderRecord": true
      }
    }
    {
      "upsertConfig": {
        ...,
        "outOfOrderRecordColumn": "isOutOfOrder"
      }
    }
    select key, val from tbl1 where isOutOfOrder = false option(skipUpsert=false)
    {
      "upsertConfig": {
        "metadataManagerClass": "org.apache.pinot.segment.local.upsert.CustomPartitionUpsertMetadataManager"
      }
    }
    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-segment-local</artifactId>
      <version>1.0.0</version>
    </dependency>
    implementation 'org.apache.pinot:pinot-segment-local:1.0.0'
    //Example custom partition manager
    
    class CustomPartitionUpsertMetadataManager implements PartitionUpsertMetadataManager {}
    //Example custom table upsert metadata manager
    
    public class CustomTableUpsertMetadataManager extends BaseTableUpsertMetadataManager {}
    {
      "upsertConfig": {
        "metadataManagerClass": "org.apache.pinot.segment.local.upsert.CustomPartitionUpsertMetadataManager"
      }
    }
    {
      "tableName": "upsertMeetupRsvp",
      "tableType": "REALTIME",
      "tenants": {},
      "segmentsConfig": {
        "timeColumnName": "mtime",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "1",
        "replication": "1"
      },
      "tableIndexConfig": {
        "segmentPartitionConfig": {
          "columnPartitionMap": {
            "event_id": {
              "functionName": "Hashcode",
              "numPartitions": 2
            }
          }
        }
      },
      "instanceAssignmentConfigMap": {
        "CONSUMING": {
          "tagPoolConfig": {
            "tag": "DefaultTenant_REALTIME"
          },
          "replicaGroupPartitionConfig": {
            "replicaGroupBased": true,
            "numReplicaGroups": 1,
            "partitionColumn": "event_id",
            "numPartitions": 2,
            "numInstancesPerPartition": 1
          }
        }
      },
      "routing": {
        "segmentPrunerTypes": [
          "partition"
        ],
        "instanceSelectorType": "strictReplicaGroup"
      },
      "ingestionConfig": {
        "streamIngestionConfig": {
          "streamConfigMaps": [
            {
              "streamType": "kafka",
              "stream.kafka.topic.name": "upsertMeetupRSVPEvents",
              "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
              "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
              "stream.kafka.broker.list": "localhost:19092"
            }
          ]
        }
      },
      "upsertConfig": {
        "mode": "FULL",
        "snapshot": "ENABLE",
        "preload": "ENABLE"
      },
      "fieldConfigList": [
        {
          "name": "location",
          "encodingType": "RAW",
          "indexType": "H3",
          "properties": {
            "resolutions": "5"
          }
        }
      ],
      "metadata": {
        "customConfigs": {}
      }
    }
    {
      "tableName": "upsertPartialMeetupRsvp",
      "tableType": "REALTIME",
      "tenants": {},
      "segmentsConfig": {
        "timeColumnName": "mtime",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "1",
        "replication": "1"
      },
      "tableIndexConfig": {
        "segmentPartitionConfig": {
          "columnPartitionMap": {
            "event_id": {
              "functionName": "Hashcode",
              "numPartitions": 2
            }
          }
        },
        "nullHandlingEnabled": true
      },
      "instanceAssignmentConfigMap": {
        "CONSUMING": {
          "tagPoolConfig": {
            "tag": "DefaultTenant_REALTIME"
          },
          "replicaGroupPartitionConfig": {
            "replicaGroupBased": true,
            "numReplicaGroups": 1,
            "partitionColumn": "event_id",
            "numPartitions": 2,
            "numInstancesPerPartition": 1
          }
        }
      },
      "routing": {
        "segmentPrunerTypes": [
          "partition"
        ],
        "instanceSelectorType": "strictReplicaGroup"
      },
      "ingestionConfig": {
        "streamIngestionConfig": {
          "streamConfigMaps": [
            {
              "streamType": "kafka",
              "stream.kafka.topic.name": "upsertPartialMeetupRSVPEvents",
              "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
              "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
              "stream.kafka.broker.list": "localhost:19092"
            }
          ]
        }
      },
      "upsertConfig": {
        "mode": "PARTIAL",
        "partialUpsertStrategies": {
          "rsvp_count": "INCREMENT",
          "group_name": "UNION",
          "venue_name": "APPEND"
        }
      },
      "fieldConfigList": [
        {
          "name": "location",
          "encodingType": "RAW",
          "indexType": "H3",
          "properties": {
            "resolutions": "5"
          }
        }
      ],
      "metadata": {
        "customConfigs": {}
      }
    }
    # stop previous quick start cluster, if any
    bin/quick-start-upsert-streaming.sh
    # stop previous quick start cluster, if any
    bin/quick-start-partial-upsert-streaming.sh
    Failed to update table '<tableName>': Cannot modify [<field>] as it may lead to data inconsistencies. Please create a new table instead.
    Apache Pinot 1.0 Upserts overview