Start here to learn Apache Pinot and go from zero to running your first query. Follow the guided onboarding path or jump to the section that fits your experience level.
Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics. It ingests data from streaming and batch sources and makes it queryable in under a second. This guide walks you through everything you need to go from first contact to a working Pinot deployment.
Onboarding path
Follow these pages in order for a complete introduction:
-- Understand what Pinot does and whether it fits your use case.
-- Launch a local cluster and run your first query in minutes.
-- Set up Pinot for local development, Docker, or Kubernetes.
-- Define a schema and create your first table.
-- Load data from a file into Pinot.
-- Connect Pinot to a streaming source for real-time data.
-- Write SQL queries against your Pinot tables.
Choose your path
Just exploring?
Start with the conceptual overview, then try the quickstart to see Pinot in action with zero setup:
Ready to build?
Jump straight to installation and follow the linear onboarding path from step 3 onward:
Next step
Introduction
Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics, and perfect for user-facing analytical workloads.
Apache Pinot™ is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch data sources (including Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage).
We'd love to hear from you! Reach out to ask questions, troubleshoot, and share feedback.
Apache Pinot includes the following:
10-Minute Quickstart
Run a complete Pinot cluster with sample data in under 10 minutes.
Outcome
By the end of this guide you will have a fully functional Apache Pinot cluster running locally with sample data loaded, ready to query.
What is Pinot?
Learn what Apache Pinot is, what problems it solves, and whether it is the right tool for your use case.
Outcome
By the end of this page you will understand what Apache Pinot is, what problems it solves, and whether it is the right tool for your use case.
You should see results returned within milliseconds.
Next step
This quickstart bundles everything in a single process for convenience. For a list of all available quickstart types (batch, streaming, hybrid, and more), see Quick Start Examples.
Ready for a production-style setup? Continue to Install / deploy.
SELECT playerName, sum(runs) AS totalRuns
FROM baseballStats
GROUP BY playerName
ORDER BY totalRuns DESC
LIMIT 10
Ultra low-latency analytics even at extremely high throughput.
Columnar data store with several smart indexing and pre-aggregation techniques.
Scaling up and out with no upper bound.
Consistent performance based on the size of your cluster and an expected query per second (QPS) threshold.
It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.
User-facing real-time analytics
User-facing analytics refers to the analytical tools exposed to the end users of your product. In a user-facing analytics application, all users receive personalized analytics on their devices, resulting in hundreds of thousands of queries per second. Query volume grows in proportion to the number of active users on the app, and the underlying event streams can reach millions of events per second. Data ingested into Pinot is available for analytics at latencies under one second.
User-facing real-time analytics requires the following:
Fresh data. The system needs to be able to ingest data in real time and make it available for querying, also in real time.
Support for high-velocity, highly dimensional event data from a wide range of actions and from multiple sources.
Low latency. Queries are triggered by end users interacting with apps, resulting in hundreds of thousands of queries per second with arbitrary patterns.
Reliability and high availability.
Scalability.
Low cost to serve.
Why Pinot?
Pinot is designed to execute OLAP queries with low latency. It works well where you need fast analytics, such as aggregations, on both mutable and immutable data.
Pinot can perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. Connect various business intelligence (BI) tools such as Superset, Tableau, or PowerBI to visualize data in Pinot.
Enterprise business intelligence
For analysts and data scientists, Pinot works well as a highly scalable data platform for business intelligence. Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.
Enterprise application development
For application developers, Pinot works well as an aggregate store that sources events from streaming data sources, such as Kafka, and makes them available for querying using SQL. You can also use Pinot to aggregate data across a microservice architecture into one easily queryable view of the domain.
Pinot tenants prevent any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent.
Get started
If you're new to Pinot, take a look at our Getting Started guide:
To start importing data into Pinot, see how to import batch and stream data:
To start querying data in Pinot, check out our Query guide:
Learn
For a conceptual overview that explains how Pinot works, check out the Concepts guide:
To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:
None. This is the starting point of the onboarding path.
What Apache Pinot does
Apache Pinot is a real-time distributed online analytical processing (OLAP) datastore. It ingests data from streaming sources (such as Apache Kafka and Amazon Kinesis) and batch sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage) and makes that data immediately available for analytic queries with sub-second latency.
Key capabilities
Ultra-low-latency analytics -- Queries return in milliseconds, even at hundreds of thousands of queries per second.
Columnar storage with smart indexing -- Purpose-built storage format with inverted, sorted, range, text, and other indexes to accelerate query patterns.
Horizontal scaling -- Scale out by adding nodes with no upper bound on cluster size.
Consistent performance -- Latency stays predictable as data volume and query load grow, based on cluster sizing and expected throughput.
Real-time ingestion -- Data is available for querying within seconds of arriving at the streaming source.
When to use Pinot
User-facing real-time analytics
Pinot was built at LinkedIn to power interactive analytics features such as Who Viewed Profile and Company Analytics. UberEats Restaurant Manager is another production example. These applications serve personalized analytics to every end user, generating hundreds of thousands of queries per second with strict latency requirements.
Real-time dashboards
Pinot supports slice-and-dice, drill-down, roll-up, and pivot operations on high-dimensional data. Connect business intelligence tools such as Apache Superset, Tableau, or PowerBI to Pinot to build live dashboards over streaming data.
Enterprise analytics
Pinot works well as a highly scalable platform for business intelligence. It converges the capabilities of a big data platform with the traditional role of a data warehouse, making it suitable for analysis and reporting at scale.
Aggregate store for microservices
Application developers can use Pinot as an aggregate store that consumes events from streaming sources and exposes them through SQL. This is useful for building a unified, queryable view across a microservice architecture. Query models are eventually consistent, as with all aggregate stores.
When NOT to use Pinot
Pinot is not a general-purpose transactional database. It does not support row-level updates, deletes, or transactions in the way that PostgreSQL or MySQL do. If your workload requires ACID transactions or frequent single-row mutations, a relational database is a better fit.
If your dataset is small enough to fit comfortably in a single PostgreSQL or MySQL instance (a few million rows or less) and you do not need sub-second query latency at high concurrency, a traditional database will be simpler to operate and sufficient for your needs.
Verify
You now know:
What Apache Pinot is and how it differs from transactional databases.
The four main categories of use cases where Pinot excels.
When a simpler tool would be a better choice.
Next step
Continue to the 10-minute quickstart to launch a local Pinot cluster and run your first query:
The segment threshold determines when a segment is committed in real-time tables.
When data is first ingested from a streaming provider like Kafka, Pinot stores the data in a consuming segment.
This segment is on the disk of the server(s) processing a particular partition from the streaming provider.
However, it's not until a segment is committed that the segment is written to the deep store. The segment threshold decides when that should happen.
Why is the segment threshold important?
The segment threshold is important because it ensures segments are a reasonable size.
When queries are processed, smaller segments may increase query latency due to increased per-segment overhead (more threads spawned, more metadata processing, and so on).
Larger segments may cause servers to run out of memory. In addition, when a server is restarted, its consuming segment must start consuming from the first row again; the larger the uncommitted segment, the longer the lag between Pinot and the streaming provider.
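A minimal sketch of the commit decision, assuming simplified row-count and time thresholds (the real controls, such as the realtime.segment.flush.threshold settings, live in the stream table config; this is not Pinot's actual implementation):

```python
def should_commit(rows_consumed, seconds_open, *,
                  threshold_rows=5_000_000, threshold_seconds=6 * 3600):
    """Return True when the consuming segment should be committed.

    Illustrative only: a segment commits once it crosses either a
    row-count threshold (bounding memory) or an elapsed-time threshold
    (bounding the lag behind the streaming provider).
    """
    return rows_consumed >= threshold_rows or seconds_open >= threshold_seconds

# A slow partition still commits once the time threshold passes,
# keeping replay-on-restart lag bounded.
assert should_commit(10_000, 7 * 3600)
# A fast partition hits the row threshold well before the time threshold.
assert should_commit(5_000_000, 600)
# Neither threshold crossed: keep consuming.
assert not should_commit(10_000, 600)
```

Balancing these two thresholds is exactly the size tradeoff described above: too many small segments add query overhead, while oversized consuming segments risk memory pressure and long restarts.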
Mark Needham explains the segment threshold
Query Syntax Overview
Query Pinot using supported syntax.
Concepts
Explore the fundamental concepts of Apache Pinot™ as a distributed OLAP database.
Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:
Storing data in columnar form to support high-performance scanning
Sharding of data to scale both storage and computation
A distributed architecture designed to scale capacity linearly
A tabular data model read by SQL queries
To learn about Pinot components, terminology, and gain a conceptual understanding of how data is stored in Pinot, review the following sections:
Stream Ingestion on Kubernetes
Load streaming data into Pinot on Kubernetes using Kafka
This guide walks you through loading streaming data into a Pinot cluster running in Kubernetes. Make sure you have completed first.
Load data into Pinot using Kafka
Logical Tables
Use logical tables when one query name should span multiple physical Pinot tables without exposing the partitioning scheme to users.
Logical tables are a naming and routing layer on top of physical tables. They let you split data by region, data age, or operating mode while keeping one user-facing table name.
Use a logical table when the split is an implementation detail, not part of the query contract. Keep the physical tables aligned on schema, and use a reference physical table only as a metadata anchor.
When they help
Logical tables are most useful when you need one of these patterns:
Time Boundary
Learn about time boundaries in hybrid tables.
A hybrid table is the combination of a real-time table and an offline table that share the same name.
When querying these tables, the Pinot broker decides which records to read from the offline table and which to read from the real-time table. It does this using the time boundary.
How is the time boundary determined?
The time boundary is determined by looking at the maximum end time of the offline segments and the segment ingestion frequency specified for the offline table.
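The rule above can be sketched as follows, assuming a simplified model in which an HOURLY ingestion frequency subtracts one hour from the latest offline end time and any other frequency subtracts one day (the real broker logic covers more cases):

```python
from datetime import datetime, timedelta

def time_boundary(max_offline_end_time, ingestion_frequency):
    """Derive the hybrid-table time boundary (simplified sketch).

    The boundary sits one ingestion period behind the newest offline
    segment, so late-arriving offline data is still served from the
    real-time table until the next push.
    """
    delta = (timedelta(hours=1) if ingestion_frequency.upper() == "HOURLY"
             else timedelta(days=1))
    return max_offline_end_time - delta

end = datetime(2024, 3, 25, 12, 0)  # newest offline segment end time
assert time_boundary(end, "HOURLY") == datetime(2024, 3, 25, 11, 0)
assert time_boundary(end, "DAILY") == datetime(2024, 3, 24, 12, 0)
```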
Components
Discover the core components of Apache Pinot that enable efficient data processing and high-performance, data-driven applications.
Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:
Storing data in columnar form to support high-performance scanning
Sharding of data to scale both storage and computation
Azure
Provision a managed Kubernetes cluster on Azure AKS ready for Pinot.
Outcome
Create an Azure Kubernetes Service cluster with the required tooling, ready to deploy Apache Pinot.
Transformations and Aggregations
Use ingest-time transformations and aggregations when Pinot should normalize or reduce data before it reaches query time.
Ingestion transformations clean up source records before they become Pinot rows. Ingestion aggregations reduce repeated values into fewer rows when a realtime table can safely store the summarized shape instead of the raw event stream.
Transformations
Use transformations to rename, reshape, extract, filter, or derive fields while ingesting. Keep the logic close to the table so the pipeline stays understandable.
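A simplified, illustrative pipeline showing those roles (field names are hypothetical, and real Pinot transformations are declared in the table config rather than written in Python):

```python
def transform(record):
    """Sketch of ingest-time cleanup: filter, rename, and derive fields.

    Returning None drops the record before it becomes a Pinot row.
    """
    # Filter: drop test traffic at ingest time.
    if record.get("env") == "test":
        return None
    out = dict(record)
    # Rename: the source field "ts" becomes the table's time column.
    out["eventTimeMillis"] = out.pop("ts")
    # Derive: extract a coarser field from an existing one.
    out["domain"] = out["email"].split("@", 1)[1]
    return out

assert transform({"env": "test", "ts": 1, "email": "a@b.com"}) is None
row = transform({"env": "prod", "ts": 1700000000000, "email": "a@b.com"})
assert row["eventTimeMillis"] == 1700000000000
assert row["domain"] == "b.com"
```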
The detailed examples still live in .
Upsert and Dedup
Use upsert or dedup when ingesting rows should collapse to one current record per key instead of preserving every event.
Upsert and dedup are for tables that ingest repeated keys. Use them when the current value matters more than the raw event history, or when duplicate events should not fan out into duplicate query results.
Choose the right behavior
Use upsert when newer rows should replace older rows for the same primary key.
Use dedup when repeated records should be filtered out and only the first or unique representation should remain.
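The two behaviors can be contrasted on the same event stream with a small in-memory sketch (real Pinot applies these during ingestion according to the table config; the keys and values here are invented):

```python
def apply_upsert(events):
    """Latest row wins per primary key (upsert semantics)."""
    state = {}
    for key, value in events:
        state[key] = value
    return state

def apply_dedup(events):
    """First row wins per primary key; later duplicates are dropped."""
    state = {}
    for key, value in events:
        state.setdefault(key, value)
    return state

events = [("order-1", "CREATED"), ("order-2", "CREATED"), ("order-1", "SHIPPED")]
# Upsert keeps the current value; dedup keeps the first occurrence.
assert apply_upsert(events) == {"order-1": "SHIPPED", "order-2": "CREATED"}
assert apply_dedup(events) == {"order-1": "CREATED", "order-2": "CREATED"}
```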
Install / Deploy
Choose the deployment method that matches your environment.
Outcome
Select the right installation method for your use case and deploy a Pinot cluster.
SQL Insert Into From Files
Insert a file into Pinot from Query Console
This feature is supported in Pinot 0.11.0 and later. Reference PR:
Prerequisite
Batch Ingestion
Choose batch ingestion when Pinot should load prebuilt data from files, warehouses, or distributed processing jobs.
Batch ingestion builds Pinot segments outside the cluster and pushes them into Pinot after the data is already shaped. Use it when the data changes in larger chunks, when you need deterministic backfills, or when the pipeline already produces files or segment artifacts.
The most important design choice is not the framework, but the output contract: what the schema looks like, what the table expects, and where the segments land.
Common batch paths
Spark-based ingestion.
Multi-Stage Query
Deep dive into the multi-stage engine (MSE) internals, execution model, and troubleshooting.
For an overview of when to use the multi-stage engine (MSE) versus the single-stage engine (SSE), see . This section provides a deep dive into MSE internals. Most of the concepts explained here are related to the engine's execution model and are not required for writing queries. However, understanding them can help you take advantage of MSE's capabilities and troubleshoot issues.
Bring up a Kafka cluster for real-time data ingestion
The Bitnami Kafka Helm chart deploys Kafka in KRaft mode (with a built-in controller quorum) by default, so a separate ZooKeeper deployment is not required for Kafka.
Check Kafka deployment status
Ensure the Kafka deployment is ready before executing the scripts in the following steps. Run the following command:
Below is an example output showing the deployment is ready:
Create Kafka topics
Run the scripts below to create two Kafka topics for data ingestion:
Load data into Kafka and create Pinot schema/tables
The script below does the following:
Ingests 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec
Ingests 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec
Uploads Pinot schema airlineStats
Creates Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime
Creates Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro
Query with the Pinot Data Explorer
Pinot Data Explorer
The following script (located at ./pinot/helm/pinot) performs local port forwarding and opens the Pinot query console in your default web browser.
You can force the hybrid table to use max(all offline segments' end time) as the time boundary by calling the API (Pinot 0.12.0+).
Note that this does not automatically update the time boundary as more segments are added to the offline table; you must call the API again each time a segment with a more recent end time is uploaded. You can revert to the derived time boundary by calling the API:
Querying
When a Pinot broker receives a query for a hybrid table, the broker sends a time boundary annotated version of the query to the offline and real-time tables.
For example, if we executed the following query:
The broker would send the following query to the offline table:
And the following query to the real-time table:
The results of the two queries are merged by the broker before being returned to the client.
timeBoundary = Maximum end time of offline segments - 1 hour
Prerequisites
An Azure account
The following CLI tools installed (see steps below)
SELECT count(*)
FROM events_OFFLINE
WHERE timeColumn <= $timeBoundary
SELECT count(*)
FROM events_REALTIME
WHERE timeColumn > $timeBoundary
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
brew update && brew install azure-cli
az login
AKS_RESOURCE_GROUP=pinot-demo
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} \
--location ${AKS_RESOURCE_GROUP_LOCATION}
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks create --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME} \
--node-count 3
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
kubectl get nodes
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
SET taskName = 'myTask-s3';
SET input.fs.className = 'org.apache.pinot.plugin.filesystem.S3PinotFS';
SET input.fs.prop.accessKey = 'my-key';
SET input.fs.prop.secretKey = 'my-secret';
SET input.fs.prop.region = 'us-west-2';
INSERT INTO "baseballStats"
FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/'
Different physical tables per region or business unit.
Separate offline and realtime tables that still answer one business question.
Time-sliced tables that should be queried together.
Design rules
Keep the underlying schemas aligned. Keep the logical name stable. Prefer this pattern only when the underlying split is operationally meaningful; do not use it to hide a modeling problem that should instead be solved with cleaner ingestion.
For hybrid-style layouts, make the time boundary explicit so Pinot does not double count overlapping data.
Example pattern
Learn more
The original logical-table walkthrough lives in Logical Table.
What this page covered
This page covered when to use logical tables and how they hide physical table splits from readers.
Next step
Read Schema Evolution if the schema needs to grow after the table is already in production.
Use ingestion aggregation when the use case only needs summarized realtime data. This can reduce storage and improve query performance, but it changes the data you keep, so use it only when raw rows are not needed later.
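As a sketch of what ingest-time aggregation buys (a simplified in-memory model with an invented event shape, not Pinot's actual aggregation config), raw events collapse into one stored row per dimension combination:

```python
from collections import defaultdict

def aggregate(events):
    """Collapse raw (minute, page, count) events into summed rows."""
    totals = defaultdict(int)
    for minute, page, count in events:
        totals[(minute, page)] += count
    return dict(totals)

raw = [(0, "/home", 1), (0, "/home", 1), (0, "/pricing", 1), (1, "/home", 1)]
rows = aggregate(raw)
# The summarized shape keeps the totals but stores fewer rows;
# the raw per-event history is no longer recoverable.
assert rows == {(0, "/home"): 2, (0, "/pricing"): 1, (1, "/home"): 1}
assert len(rows) < len(raw)
```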
These patterns need a careful schema, a stable primary key, and an ingestion flow that understands the table-level metadata Pinot uses to keep results consistent.
The strongest detail still lives in the original docs under Upsert and Dedup.
What this page covered
This page covered the difference between upsert and dedup and when each is the better fit.
Next step
Read Formats and Filesystems to decide how Pinot should read source data and store generated segments.
Dimension tables and other specialized offline loads.
What to decide early
Decide on the file format, the deep-storage target, and the segment push workflow before you optimize the job itself. Most batch ingestion problems come from mismatched assumptions at those boundaries.
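Those three boundary decisions can be captured in a tiny, hypothetical job outline (the field names are illustrative and are not Pinot's actual job-spec keys):

```python
# Illustrative outline of the output contract for a batch ingestion job.
job = {
    "input_format": "parquet",                      # file format the pipeline produces
    "deep_store_uri": "s3://example-bucket/segments/",  # where built segments land
    "push_mode": "metadata",                        # how segments are pushed to Pinot
}

def boundaries_defined(spec):
    """Check that all three boundary decisions are made before tuning the job."""
    return all(spec.get(k) for k in ("input_format", "deep_store_uri", "push_mode"))

assert boundaries_defined(job)
assert not boundaries_defined({"input_format": "parquet"})  # missing decisions
```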
Discover how Apache Pinot's broker component optimizes query processing, data retrieval, and enhances data-driven applications.
Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return results to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.
A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.
Broker interaction with other components
Pinot brokers are modeled as Helix spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.
The broker ensures that all the rows of the table are queried exactly once so as to return correct, consistent results for a query. The brokers may optimize to prune some of the segments as long as accuracy is not sacrificed.
Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.
In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.
Consider this example: we have real-time data for five days, March 23 to March 27, while offline data has been pushed up to March 25, two days behind real-time. The brokers maintain this time boundary.
Suppose we get a query to this table: select sum(metric) from table. The broker will split the query into two queries based on this time boundary, one for offline and one for real-time. This query becomes select sum(metric) from table_REALTIME where date >= Mar 25
and select sum(metric) from table_OFFLINE where date < Mar 25
The broker merges results from both these queries before returning the result to the client.
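The federation above can be sketched as follows. This is a simplified model of the split-and-merge step, not the broker's actual code, and the daily metric values are invented:

```python
def query_hybrid(offline_rows, realtime_rows, boundary):
    """Split a sum() over a hybrid table at the time boundary, then merge.

    Rows are (day, metric) pairs; the offline side answers days before
    the boundary, the real-time side answers the boundary day onward,
    so each day is counted exactly once despite the overlap.
    """
    offline_sum = sum(m for d, m in offline_rows if d < boundary)
    realtime_sum = sum(m for d, m in realtime_rows if d >= boundary)
    return offline_sum + realtime_sum

offline = [(23, 10), (24, 10), (25, 10)]                        # pushed to Mar 25
realtime = [(23, 10), (24, 10), (25, 10), (26, 10), (27, 10)]   # Mar 23 to 27
# Five days of metric=10 each: the merged total is 50 regardless of where
# the boundary falls inside the overlapping window.
assert query_hybrid(offline, realtime, boundary=25) == 50
assert query_hybrid(offline, realtime, boundary=24) == 50
```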
Starting a broker
Make sure you've . If you're using Docker, make sure to . To start a broker:
Local
Start a Pinot cluster on your local machine.
Outcome
Start a multi-component Pinot cluster directly on your machine without containers.
Prerequisites
JDK 11 or 21 (JDK 17 should work but is not officially supported)
Apache Maven 3.6+ (only if building from source)
Steps
1. Download or build Apache Pinot
See the page for the current stable release.
Extract and enter the directory:
Prerequisite: Install Apache Maven 3.6 or higher.
2. Start ZooKeeper
3. Start Pinot Controller
4. Start Pinot Broker
5. Start Pinot Server
6. Start Pinot Minion (optional)
7. Start Kafka (optional)
Only needed if you plan to ingest real-time streaming data.
Verify
Check that the Controller is healthy:
The response should return OK. You can also open the Pinot Query Console at .
Next step
Your cluster is running. Continue to to load data.
GCP
Provision a managed Kubernetes cluster on Google GKE ready for Pinot.
Outcome
Create a Google Kubernetes Engine cluster with the required tooling, ready to deploy Apache Pinot.
Prerequisites
A Google Cloud account and project
The following CLI tools installed (see steps below)
Steps
1. Install tooling
kubectl
Verify:
Helm
Verify:
Google Cloud SDK
Follow the or run:
2. Initialize Google Cloud
3. Create a GKE cluster
The following creates a 3-node cluster named pinot-quickstart in us-west1-b using n1-standard-2 machines:
Monitor cluster status:
Wait until the cluster status is RUNNING.
4. Connect to the cluster
Verify
You should see your worker nodes listed and in Ready status.
Cleaning up
To delete the cluster when you are done:
Next step
Your cluster is ready. Continue to to deploy Pinot.
Schema and Table Shape
Understand Pinot schema design, table shape, null handling, and the schema fields that drive query and ingestion behavior.
A Pinot schema defines the columns that exist in a table and how Pinot should treat them. The important part is not only the column list, but also the shape of the table: which fields are dimensions, metrics, and time fields, how nulls behave, and whether the table is built for offline, realtime, or hybrid ingestion.
Pinot stores schema and table metadata separately, but the two should be designed together. Keep the schema narrow enough to match the data you actually query; the dense table-config detail belongs in the reference pages rather than in this narrative overview.
What to design
The schema answers these practical questions:
What columns exist?
What data type does each column use?
Which columns are dimensions, metrics, or date-time fields?
Good defaults
Use column names that are stable and business-facing. Prefer simple types that match the source data. Add only the fields you need at query time, because schema changes are additive and should be deliberate.
For time columns, keep one primary time field in mind for retention and hybrid-table boundary behavior. For null handling, decide early whether the table needs column-based or table-based semantics.
Example schema
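Since the full JSON is not reproduced here, the following is a minimal, hypothetical schema (the table and column names are invented) in the general shape Pinot uses, expressed as a Python dict for illustration:

```python
# Illustrative schema: dimensions, one metric, and one primary time column.
schema = {
    "schemaName": "orders",
    "dimensionFieldSpecs": [
        {"name": "orderId", "dataType": "STRING"},
        {"name": "country", "dataType": "STRING"},
    ],
    "metricFieldSpecs": [
        {"name": "amount", "dataType": "DOUBLE"},
    ],
    "dateTimeFieldSpecs": [
        {
            "name": "orderTimeMillis",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}

# One primary time field, as recommended above for retention and
# hybrid-table boundary behavior.
assert len(schema["dateTimeFieldSpecs"]) == 1
```

Consult the schema reference for the exact JSON fields and validation rules before using a schema like this in a real table.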
When to use the reference pages
Use the when you need the exact JSON fields, validation rules, or date-time field formats. Use the when you need indexing, retention, or routing configuration.
What this page covered
This page covered the parts of Pinot schema design that shape ingestion and query behavior.
Next step
Read if one query name should route to multiple physical tables.
Related pages
Version Reference
Current Apache Pinot release version and how to pin versions in examples.
Outcome
Know which Pinot version to use and how to pin versions in examples.
All code samples in the Start Here guide use PINOT_VERSION=1.4.0. If you are using a different version, set the variable accordingly before running any commands.
This page is the single source of truth for version information across the Start Here guide and the wider documentation. When following tutorials or code samples, make sure the version you use matches your installed release.
Current stable release
Artifact
Version
Using PINOT_VERSION in examples
Most code samples in these docs set a PINOT_VERSION environment variable near the top of each snippet. Always verify that the value matches your installed version:
Once the variable is set, every command in the tutorial that references ${PINOT_VERSION} will use the correct value automatically.
Start Here pages never use the latest Docker tag. Always pin to a specific version for reproducibility. The latest tag can change without notice and may introduce breaking changes during a tutorial.
Compatibility notes
Requirement
Detail
If you are running JDK 8 and cannot upgrade, use Pinot 0.12.1. For all new deployments, JDK 11 or 21 is recommended.
Release links
You can find all published releases on the page, and all Docker tags on .
Older versions
Older Pinot binaries are archived at .
Row Expression Comparison
Row Expression
ROW()
Description:
ROW value expressions are supported in Pinot in comparison contexts, enabling efficient keyset pagination queries. ROW expressions allow users to write cleaner multi-column comparisons, such as WHERE (col1, col2, col3) > (val1, val2, val3), instead of verbose nested conditions. Row expressions are evaluated lexicographically for the comparison operators.
Syntax:
Pinot supports implicit ROW-style expressions in comparison predicates using a parenthesized list of expressions on both sides of the comparator:
WHERE (col1, col2, col3) > (val1, val2, val3)
Supported comparison operators:
Note: Explicit use of the ROW() keyword (e.g., WHERE ROW(col1, col2) = ROW(1, 2)) is not yet supported due to the current SQL parser configuration (SqlConformanceEnum.BABEL). Future improvements may enable explicit row value constructors.
Note:
ROW comparisons are lexicographic, not element-wise
Pinot does not materialize row types — it rewrites comparisons at planning time
Rewrite complexity grows linearly with the number of columns
Example usage:
Equality (=)
is rewritten to
Greater Than (>)
is rewritten to
Less Than (<)
is rewritten to
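The greater-than rewrite can be sketched as a small generator. This illustrates the lexicographic expansion the notes above describe; it is not Pinot's planner code:

```python
def rewrite_row_gt(cols, vals):
    """Rewrite (c1, ..., cn) > (v1, ..., vn) into AND/OR predicates.

    Lexicographic semantics: the row is greater when some prefix of
    columns is equal and the next column is strictly greater.
    """
    clauses = []
    for i in range(len(cols)):
        eq = [f"{c} = {v}" for c, v in zip(cols[:i], vals[:i])]
        clauses.append(" AND ".join(eq + [f"{cols[i]} > {vals[i]}"]))
    return " OR ".join(f"({c})" for c in clauses)

# (col1, col2) > (1, 2) expands to the nested condition users would
# otherwise write by hand; complexity grows linearly with column count.
assert rewrite_row_gt(["col1", "col2"], [1, 2]) == \
    "(col1 > 1) OR (col1 = 1 AND col2 > 2)"
```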
Stream Ingestion
Choose stream ingestion when Pinot should consume events continuously and expose new rows quickly.
Stream ingestion keeps Pinot close to the source of truth. Use it when rows should be queryable soon after they are emitted, and when the system needs a steady flow rather than periodic batch loads.
Core decisions
Pick the stream connector and partitioning strategy.
Choose how Pinot should flush, commit, and complete segments.
Decide whether the table should remain purely realtime or later become hybrid.
What matters most
The stream has to support the consumption mode you choose. The table config has to describe the partitioning, replicas, and segment lifecycle clearly enough that the servers can behave predictably under load.
Learn more
The existing walk-throughs in and still contain the detailed mechanics.
What this page covered
This page covered the stream-ingestion model and the main lifecycle choices behind it.
Next step
Read if the stream should collapse duplicate keys or keep only the latest row.
Related pages
Overview
Build applications and data workflows with Apache Pinot using task-oriented guidance.
Use this section when you are designing tables, ingesting data, querying Pinot, choosing indexes, or connecting Pinot to applications and tools. The goal here is to help you decide what to do next and then take you to the right detailed docs without forcing you through raw reference first.
If you already know the exact property, endpoint, or plugin you need, jump to the section. Build-focused pages in this section explain how pieces fit together. Reference pages stay dense on purpose.
What this page covered
This page introduced the task-oriented Build with Pinot structure and pointed to the main workflows for modeling, ingestion, querying, indexing, and integration.
Next step
Start with the workflow that matches your immediate task, such as or .
Related pages
Data modeling
Build Pinot tables by getting schema, table shape, logical-table, and schema-evolution decisions right before ingestion starts.
Pinot works best when the table shape is clear before data lands. Start here to understand the structure that every ingestion and query decision depends on: schema design, table composition, logical-table layout, and how schemas evolve without breaking existing pipelines.
If you need dense JSON config or controller endpoints, jump to the Reference section instead. This section stays narrative and decision-oriented.
Start Here
Related Existing Docs
What this page covered
This landing page defines the scope of Pinot data modeling and points to the core pages that matter first.
Next step
Read to lock in the table structure before designing ingestion.
Related pages
Querying & SQL
Learn how to query Apache Pinot, choose the right query engine, and find SQL and function guidance quickly.
Use this section to decide how to query Pinot, how much SQL support you need, which query engine to use, and where to look for execution controls such as quotas, cancellation, and cursors. Narrative guidance lives here. Dense syntax and endpoint detail is linked where needed.
For explain plans, joins, optimizer behavior, and operator details, continue into the multi-stage query docs and engine-specific material linked from and .
What this page covered
This page mapped the main query workflows in Pinot: learning the query path, understanding SQL behavior, finding functions, choosing between SSE and MSE, and tuning execution controls.
Next step
Read if you want the end-to-end query flow, or if you are deciding which engine to use.
Related pages
Formats and Filesystems
Match Pinot ingestion to the right input formats and deep-storage filesystems without overcomplicating the table design.
Pinot supports several source formats and deep-storage choices. Pick these early, because they affect how segments are produced, moved, and recovered.
Source formats
Use the original format docs when you need the exact supported file types or loader behavior. The main landing page is Supported Data Formats.
Filesystems and deep storage
Choose the deep-storage backend that matches your operational environment. The detailed filesystem docs still live under .
Keep it simple
Do not mix format decisions with schema design. The schema says what the data means; the filesystem says where segments survive after Pinot produces them.
What this page covered
This page covered how source formats and deep storage fit into the ingestion design.
Next step
Read if data needs cleanup or pre-aggregation before query time.
Related pages
Schema Evolution
Evolve Pinot schemas safely by adding columns, reloading segments, and deciding when a new table is the cleaner path.
Pinot schema evolution is intentionally narrow. The safe path is to add columns, reload the affected segments, and backfill only when the table type and data flow support it. If the change is more invasive than that, create a new table instead of forcing the old one to stretch.
What is safe
Additive schema changes are the normal path. New columns can be introduced without rewriting the whole table, as long as the ingestion flow and segment reload behavior are understood.
What is not safe
Renaming a column, dropping a column, or changing a column type is not a small schema tweak. Treat those as table redesign work.
Typical flow
Add the new column to the schema.
Update the table config or ingestion config if the new field needs transforms.
Reload the affected segments.
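As a sketch, an additive change might introduce a new dimension column with a default value so that old segments can serve it after a reload (the schema and column names here are hypothetical, not from the original docs):

```json
{
  "schemaName": "orders",
  "dimensionFieldSpecs": [
    { "name": "status", "dataType": "STRING" },
    { "name": "channel", "dataType": "STRING", "defaultNullValue": "unknown" }
  ]
}
```

After updating the schema, reload the affected segments so servers pick up the new column; rows in old segments return the default value until they are backfilled.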
Reference material
The detailed walkthrough still lives in .
What this page covered
This page covered the additive schema-evolution path and the cases where a new table is safer.
Next step
Read the ingestion pages to see how schema design affects batch and stream pipelines.
Related pages
Managed Kubernetes
Set up a Kubernetes cluster on your cloud provider.
Outcome
Provision a managed Kubernetes cluster on AWS, GCP, or Azure that is ready for a Pinot deployment.
Overview
These guides walk you through creating a managed Kubernetes cluster on your cloud provider. Once the cluster is running, you will use the page to deploy Pinot onto it.
Cloud providers
Provider
Service
Guide
Next step
Once your cluster is ready, follow the to deploy Pinot.
Segment Retention
In this Apache Pinot concepts guide, we'll learn how segment retention works.
Segments in Pinot tables have a retention time, after which the segments are deleted. Typically, offline tables retain segments for a longer period of time than real-time tables.
The removal of segments is done by the retention manager. By default, the retention manager runs once every 6 hours.
The retention manager purges two types of segments:
Expired segments: Segments whose end time has exceeded the retention period.
Replaced segments: Segments that have been replaced as part of the
There are a couple of scenarios where segments in offline tables won't be purged:
If the segment doesn't have an end time. This would happen if the segment doesn't contain a time column.
If the segment's table has a segmentIngestionType of REFRESH.
If the retention period isn't specified, segments aren't purged from tables.
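The purge rules above can be sketched as a small decision function. This is an illustration of the logic only, not Pinot's actual code; names like `should_purge` are made up:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    name: str
    end_time_ms: Optional[int]  # None when the segment has no time column

def should_purge(segment: Segment, retention_ms: Optional[int],
                 ingestion_type: str, now_ms: int) -> bool:
    # No retention period configured: segments are never purged.
    if retention_ms is None:
        return False
    # REFRESH tables keep their segments.
    if ingestion_type == "REFRESH":
        return False
    # Segments without an end time are skipped.
    if segment.end_time_ms is None:
        return False
    # Expired: the end time has exceeded the retention period.
    return now_ms - segment.end_time_ms > retention_ms
```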
The retention manager initially moves these segments into a Deleted Segments area, from where they will eventually be permanently removed. The duration that deleted segments are kept is controlled by the controller.deleted.segments.retentionInDays configuration (default: 7 days).
When deleting a table via the API, you can override this behavior by passing a retention query parameter. For example, DELETE /tables/{tableName}?retention=0d deletes all segments immediately without moving them to the deleted-segments area. See the for more details.
Server
Uncover the efficient data processing and storage capabilities of Apache Pinot's server component, optimizing performance for data-driven applications.
Pinot servers provide the primary storage for and perform the computation required to execute queries. A production Pinot cluster contains many servers. In general, the more servers, the more data the cluster can retain in tables, the lower latency the cluster can deliver on queries, and the more concurrent queries the cluster can process.
Servers are typically segregated into real-time and offline workloads, with "real-time" servers hosting only real-time tables and "offline" servers hosting only offline tables. This separation is a widely followed operational convention, not an explicit configuration in the server process itself. There are two types of servers:
Offline
Kubernetes
Deploy a Pinot cluster on Kubernetes using Helm.
Outcome
Deploy a production-ready Pinot cluster on Kubernetes with Helm charts.
The examples in this guide are sample configurations for reference. For production deployments, customize settings as needed -- especially security features like TLS and authentication.
Deep Store
Leverage Apache Pinot's deep store component for efficient large-scale data storage and management, enabling impactful data processing and analysis.
The deep store (or deep storage) is the permanent store for segment files.
It is used for backup and restore operations. New nodes in a cluster will pull down a copy of segment files from the deep store. If the local segment files on a server get damaged in some way (or are accidentally deleted), a new copy is pulled down from the deep store on server restart.
The deep store stores a compressed version of the segment files and it typically won't include any indexes. These compressed files can be stored on a local file system or on a variety of other file systems. For more details on supported file systems, see .
Note: Deep store by itself is not sufficient for restore operations. Pinot stores metadata such as table config, schema, segment metadata in Zookeeper. For restore operations, both Deep Store as well as Zookeeper metadata are required.
Understanding Stages
Learn more about multi-stage stages and how to extract stages from query plans.
Deep dive into stages
As explained in the reference documentation, the multi-stage query engine breaks down a query into multiple stages. Each stage corresponds to a subset of the query plan and is executed independently. Stages are connected in a tree-like structure where the output of one stage is the input to another stage. The stage that is at the root of the tree sends the final results to the client. The stages that are at the leaves of the tree read from the tables. The intermediate stages process the data and send it to the next stage.
When the broker receives a query, it generates a query plan. This is a tree-like structure where each node is an operator. The plan is then optimized, moving and changing nodes to generate a plan that is semantically equivalent (it returns the same rows) but more efficient. During this phase the broker colors the nodes of the plan, assigning them to a stage. The broker also assigns a parallelism to each stage and defines which servers are going to execute each stage. For example, if a stage has a parallelism of 10, then at most 10 servers will execute that stage in parallel. One single server can execute multiple stages in parallel and it can even execute multiple instances of the same stage in parallel.
Ingestion
Plan Pinot ingestion around batch, stream, upsert, dedup, formats, filesystems, and transformation choices.
Ingestion is where Pinot tables become real. Start here to choose the right path for batch or stream data, then refine the design with upsert, dedup, file format, filesystem, transform, and aggregation decisions.
The detailed controller and table-config material belongs in . This section stays focused on data flow and operational choices.
Start Here
Backfill Data
Batch ingestion of backfill data into Apache Pinot.
Introduction
Pinot batch ingestion involves two parts: a routine ingestion job (hourly or daily) and backfill. Here are some examples showing how routine batch ingestion works for a Pinot offline table:
File Systems
This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.
FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).
Pinot uses distributed file systems for the following purposes:
Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to DFS.
SQL syntax
A narrative guide to Pinot SQL syntax and the main constructs you use most often.
Pinot uses the Apache Calcite SQL parser with the MYSQL_ANSI dialect. This page is the practical overview: it explains the syntax patterns most people use every day and points to the deeper reference when you need the full operator list.
Organize raw data into buckets (e.g., /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (e.g., /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro)
Run a Pinot batch ingestion job that points to a specific date folder, such as /var/pinot/airlineStats/rawdata/2014/01/01. The segment generation job converts each such Avro file into a Pinot segment for that day and gives it a unique name.
Run a Pinot segment push job to upload those segments, with their unique names, via a Controller API.
IMPORTANT: The segment name uniquely identifies a segment in Pinot. If the controller receives an upload request for a segment with the same name, it replaces the existing segment with the new one.
This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.
How to backfill data in Pinot
Pinot supports data modification only at the segment level, which means you must replace entire segments to perform a backfill. The high-level idea is to repeat steps 2 (segment generation) and 3 (segment upload) above:
Backfill jobs must run at the same granularity as the daily job. E.g., if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g., /var/pinot/airlineStats/rawdata/2014/01/01)
The backfill job will then generate segments with the same names as the original job (with the new data).
When those segments are uploaded to Pinot, the controller replaces the old segments with the new ones (segment names act like primary keys within Pinot), one by one.
Edge case example
A backfill job expects the same number of (or more) data files in the backfill date bucket, so the segment generation job will create the same number of (or more) segments as the original run.
For example, assume table airlineStats has two segments (airlineStats_2014-01-01_2014-01-01_0 and airlineStats_2014-01-01_2014-01-01_1) for date 2014/01/01, and the backfill input directory contains only one input file. The segment generation job then creates just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only airlineStats_2014-01-01_2014-01-01_0 is replaced, and the stale data in airlineStats_2014-01-01_2014-01-01_1 is still there.
If the raw data is modified in such a way that the original time bucket has fewer input files than the first ingestion run, backfill will fail.
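The replace-by-name behavior and the stale-segment edge case above can be sketched as follows. This is illustrative only; the controller's real logic works on segment metadata, not a dict:

```python
def push_segments(table_segments, new_segments):
    # The controller replaces any existing segment that has the same name.
    for name, data in new_segments.items():
        table_segments[name] = data
    return table_segments

# First run: two segments for 2014/01/01.
segments = {
    "airlineStats_2014-01-01_2014-01-01_0": "original",
    "airlineStats_2014-01-01_2014-01-01_1": "original",
}

# Backfill run produced only one file, hence only one segment:
# _0 is replaced, while _1 silently keeps its stale data.
push_segments(segments, {"airlineStats_2014-01-01_2014-01-01_0": "backfilled"})
```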
Stages are identified by their stage ID, which is a unique identifier for each stage. In the current implementation the stage ID is a number and the root stage has a stage ID of 0, although this may change in the future.
The current implementation has some properties that are worth mentioning:
The leaf stages execute a slightly modified version of the single-stage query engine. Therefore these stages cannot execute joins or aggregations, which are always executed in the intermediate stages.
Intermediate stages execute operations using a new query execution engine that has been created for the multi-stage query engine. This is why some of the functions that are supported in the single-stage query engine are not supported in the multi-stage query engine and vice versa.
An intermediate stage can only have one join, one window function or one set operation. If a query has more than one of these operations, the broker will create multiple stages, each with one of these operations.
Extracting Stages from Query Plans
As explained in Explain Plan (Multi-Stage), you can use the EXPLAIN PLAN syntax to obtain the logical plan of a query. This logical plan can be used to extract the stages of the query.
For example, if the query is:
A possible output of the EXPLAIN PLAN command is:
As with all queries, the logical plan forms a tree-like structure. In this default explain format, the tree structure is represented with indentation. The root of the tree is the first line, which is the last operator to be executed and marks the root stage. The boundaries between stages are the PinotLogicalExchange operators. In the example above, there are four stages:
The root stage starts with the LogicalSort operator at the root of the operator tree and ends with the PinotLogicalSortExchange operator. This is the last stage to be executed and the only one executed on the broker, which sends the result directly to the client once it is computed.
The next stage starts at that PinotLogicalSortExchange operator and includes the LogicalSort operator, the LogicalProject operator, the LogicalJoin operator, and the two PinotLogicalExchange operators. This stage is clearly not the root stage, and it does not read data from the segments, so it is not a leaf stage either. Therefore it must be an intermediate stage.
The join has two children, the PinotLogicalExchange operators. In this specific case, both sides are very similar: each starts with a PinotLogicalExchange operator and ends with a LogicalTableScan operator. All stages that end with a LogicalTableScan operator are leaf stages.
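A sketch of how a plan tree can be cut into stages at Exchange operators. This is a simplification of the broker's real algorithm: class and function names here are made up, the Exchange node is assigned to its downstream stage for brevity, and Pinot's actual stage numbering may differ:

```python
class PlanNode:
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def assign_stages(root):
    """Color every plan node with a stage id, cutting at Exchange operators."""
    stages = {}
    counter = [0]

    def walk(node, stage_id):
        stages[id(node)] = stage_id
        for child in node.children:
            if "Exchange" in child.op:   # stage boundary
                counter[0] += 1
                walk(child, counter[0])
            else:
                walk(child, stage_id)

    walk(root, 0)  # the root stage has stage id 0
    return stages

# Tree mirroring the example: sort -> exchange -> join over two scans.
plan = PlanNode("LogicalSort", [
    PlanNode("PinotLogicalSortExchange", [
        PlanNode("LogicalSort", [
            PlanNode("LogicalProject", [
                PlanNode("LogicalJoin", [
                    PlanNode("PinotLogicalExchange", [PlanNode("LogicalTableScan")]),
                    PlanNode("PinotLogicalExchange", [PlanNode("LogicalTableScan")]),
                ])])])])])
```

Walking this tree yields four stages, matching the four stages described in the example above.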
Use double quotes for identifiers when a column name is reserved or contains special characters.
SET statements apply query options before the query runs.
EXPLAIN PLAN FOR shows how Pinot will execute a query without returning data.
Common query shapes
Pinot supports the usual SELECT, WHERE, GROUP BY, ORDER BY, and LIMIT patterns.
Typical query shapes include:
filtering a table and returning a small result set
grouping and aggregating by one or more dimensions
using ORDER BY to rank rows before a LIMIT
using CASE WHEN and scalar functions in select lists
Engine-aware syntax
Some SQL features depend on the engine:
single-stage execution is best for simple analytic queries
multi-stage execution is required for joins, subqueries, and several advanced distributed patterns
EXPLAIN PLAN FOR is the best way to see how Pinot interprets a statement
If you are working on a query and do not know whether a feature is supported, check the engine-specific guidance before you assume the syntax is invalid.
Where the details live
This page intentionally stays light. For the full statement-by-statement reference, use the detailed SQL syntax and operators reference. For query controls and diagnostics, use the pages under query-execution-controls/.
What this page covered
This page covered the main Pinot SQL rules, the most common statement patterns, and the difference between narrative guidance and the full SQL reference.
Next step
Read Querying Pinot for the broader query workflow, or jump to Query options if you want to control runtime behavior.
SET useMultistageEngine = true;
SELECT "date", city, COUNT(*)
FROM orders
WHERE status = 'shipped'
GROUP BY "date", city
ORDER BY "date" DESC
LIMIT 20;
Offline servers are responsible for downloading segments from the segment store and hosting them to serve queries. When a new segment is uploaded to the controller, the controller decides which servers (as many as the replication factor) will host the new segment and notifies them to download it from the segment store. On receiving this notification, the servers download the segment file and load the segment to serve queries from it.
Real-time
Real-time servers directly ingest from a real-time stream (such as Kafka or EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.
Pinot servers are modeled as Helix participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more Helix partitions of one or more Helix resources (i.e., one or more segments of one or more tables).
A managed cloud cluster -- see for AWS, GCP, and Azure setup guides
Steps
1. Add the Pinot Helm repository
2. Create a namespace
3. Install Pinot
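The three steps above typically look like this. The release name and namespace are illustrative; check the current Helm chart documentation for exact values:

```shell
helm repo add pinot https://raw.githubusercontent.com/apache/pinot/master/helm
kubectl create namespace pinot-quickstart
helm install pinot pinot/pinot -n pinot-quickstart
```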
StorageClass: Specify the StorageClass for your cloud vendor. Use block storage only -- do not mount blob stores (S3, GCS, AzureFile) as the data-serving file system.
AWS: gp2
GCP: pd-ssd or standard
Azure: AzureDisk
Docker Desktop: hostpath
Verify
Check the deployment status:
All pods should reach Running status. You can port-forward the Controller to access the UI:
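For example, assuming the default chart's service names:

```shell
kubectl get pods -n pinot-quickstart
kubectl port-forward service/pinot-controller 9000:9000 -n pinot-quickstart
```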
For stream ingestion on Kubernetes, see the Kubernetes stream ingestion guide. For batch data loading and table creation, continue with the onboarding path below.
There are several different ways that segments are persisted in the deep store.
For offline tables, the batch ingestion job writes the segment directly into the deep store, as shown in the diagram below:
Batch job writing a segment into the deep store
The ingestion job then sends a notification about the new segment to the controller, which in turn notifies the appropriate server to pull down that segment.
For real-time tables, by default, a segment is first built in memory by the server. It is then uploaded to the lead controller (as part of the Segment Completion Protocol sequence), which writes the segment into the deep store, as shown in the diagram below:
Server sends segment to Controller, which writes segments into the deep store
Having all segments go through the controller can become a system bottleneck under heavy load, in which case you can use the peer download policy, as described in Decoupling Controller from the Data Path.
When using this configuration, the server will directly write a completed segment to the deep store, as shown in the diagram below:
Server writing a segment into the deep store
Configuring the deep store
For hands-on examples of how to configure the deep store, see the following tutorials:
To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:
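For example, to load only the S3 and Parquet plugins (the directory path and plugin names are illustrative):

```
-Dplugins.dir=/opt/pinot/plugins
-Dplugins.include=pinot-s3,pinot-parquet
```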
You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.
You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:
This guide shows you how to import data from GCP (Google Cloud Platform).
Enable Google Cloud Storage support using the pinot-gcs plugin. In the controller or server, add the config:
By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-3.0.
The GCP file system provides the following options:
projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.
gcpKey - Location of the JSON file containing the GCP keys. Refer to the GCP documentation on creating and managing service account keys to download them.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:
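For example (the project id and key path are placeholders):

```
pinot.controller.storage.factory.class.gs.projectId=my-gcp-project
pinot.controller.storage.factory.class.gs.gcpKey=/path/to/gcp-keys.json
```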
Examples
Job spec
Controller config
Server config
Minion config
Offline Table Upsert
Use upsert semantics on batch-ingested offline tables.
Pinot supports upsert on OFFLINE tables in builds that include PR #17789.
Use it for batch corrections, replays, and late-arriving records.
For a full overview of upsert features (comparison columns, delete columns, TTL, metadata management), see the main Upsert page. This page covers the OFFLINE-specific configuration and differences.
How offline upsert works
Pinot keeps one row per primary key.
For duplicate keys, Pinot keeps the row with the greatest comparison value.
If you do not set comparisonColumns, Pinot uses the table time column.
Offline upsert replaces full rows.
It does not merge partial rows.
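The rules above can be sketched as follows. This is an illustration, not Pinot's implementation, and the tie-breaking shown here (later row wins on equal comparison values) is an arbitrary choice for the sketch:

```python
def offline_upsert(rows, key_col, cmp_col):
    """Keep exactly one full row per primary key: the row with the
    greatest comparison value. Rows are replaced whole, never merged."""
    best = {}
    for row in rows:
        key = row[key_col]
        if key not in best or row[cmp_col] >= best[key][cmp_col]:
            best[key] = row
    return best

rows = [
    {"orderId": "a", "updatedAt": 1, "status": "NEW"},
    {"orderId": "a", "updatedAt": 3, "status": "SHIPPED"},
    {"orderId": "b", "updatedAt": 2, "status": "NEW"},
]
merged = offline_upsert(rows, "orderId", "updatedAt")
```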
Configure offline upsert
1
Define a primary key
Add primaryKeyColumns to the schema.
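A minimal sketch (the table and column names are hypothetical):

```json
{
  "schemaName": "orders",
  "dimensionFieldSpecs": [
    { "name": "orderId", "dataType": "STRING" },
    { "name": "status", "dataType": "STRING" }
  ],
  "primaryKeyColumns": ["orderId"]
}
```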
2
When to use it
Use offline upsert when updates arrive in files.
Use it for daily corrections.
Use it for backfills.
Use it for replaying snapshots into offline segments.
Differences from real-time upsert
Offline upsert does not consume a stream.
It does not require low-level consumers.
It does not depend on stream partitioning.
It fits batch ingestion and segment replacement workflows.
For stream-based updates, use .
Operational notes
Changing the primary key needs a full rebuild.
Changing comparison columns also needs a full rebuild.
Reload alone is not enough for these changes.
If you use a hybrid table, avoid overlapping offline and realtime time ranges.
Related topics
AWS
Provision a managed Kubernetes cluster on Amazon EKS ready for Pinot.
Outcome
Create an Amazon EKS cluster with the required tooling, ready to deploy Apache Pinot.
Prerequisites
An AWS account
The following CLI tools installed (see steps below)
Steps
1. Install tooling
kubectl
Verify:
Helm
Verify:
AWS CLI
Follow the or run:
eksctl
2. Configure AWS credentials
Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY override credentials stored in ~/.aws/credentials.
3. Create an EKS cluster
The following creates a single-node cluster named pinot-quickstart in us-west-2 using t3.xlarge instances:
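For example, using the eksctl CLI (adjust the name, region, and instance sizes to your needs):

```shell
eksctl create cluster \
  --name pinot-quickstart \
  --region us-west-2 \
  --nodes 1 \
  --node-type t3.xlarge
```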
For Kubernetes 1.23+, enable the EBS CSI driver to allow persistent volume provisioning:
Monitor cluster status:
Wait until the cluster status is ACTIVE.
4. Connect to the cluster
Verify
You should see your worker nodes listed and in Ready status.
Cleaning up
To delete the cluster when you are done:
Next step
Your cluster is ready. Continue to to deploy Pinot.
First Query
Run your first SQL queries against Pinot using the Query Console and REST API.
Outcome
Run your first SQL queries against Pinot and understand the query interface.
Prerequisites
You have completed either or . The transcript table exists and contains data.
The Pinot cluster is running (Controller on port 9000, Broker on port 8099).
Steps
1. Open the Query Console
Navigate to in your browser. Click Query Console in the left sidebar. You should see the transcript table listed in the table explorer on the left.
2. Run a simple SELECT
Paste the following query into the query editor and click Run Query:
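For example, a simple SELECT over the transcript table:

```sql
SELECT * FROM transcript LIMIT 10;
```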
The results panel shows all columns in the transcript table -- studentID, firstName, lastName, gender, subject, score, and timestampInEpoch. The rows returned come from whichever data you loaded (batch, stream, or both). LIMIT 10 caps the result set so the response is fast.
3. Run an aggregation
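One way to write the aggregation described below:

```sql
SELECT subject, AVG(score) AS avgScore
FROM transcript
GROUP BY subject
ORDER BY avgScore DESC;
```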
This query calculates the average score per subject and sorts the results from highest to lowest. Pinot executes aggregations directly on each server's segment data and merges the results at the Broker, making GROUP BY queries fast even on large datasets.
4. Run a count
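For example:

```sql
SELECT COUNT(*) FROM transcript;
```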
This returns the total number of rows in the table. The exact count depends on which ingestion steps you completed:
Batch ingest only -- 4 rows
Stream ingest only -- the number of events you published (up to 12 in the tutorial)
Both -- the combined total
5. Run a filter
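For example:

```sql
SELECT studentID, firstName, lastName, subject, score
FROM transcript
WHERE score > 3.5;
```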
This filters rows to show only students with a score above 3.5. Pinot pushes filter predicates down to the servers so only matching rows are scanned and returned.
6. Try the REST API
The Query Console UI is convenient for exploration, but production applications query Pinot through its REST API. Open a terminal and run:
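For example, posting a query to the Broker's /query/sql endpoint:

```shell
curl -X POST http://localhost:8099/query/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT COUNT(*) FROM transcript"}'
```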
Port 8099 is the Broker, which handles all query requests. The Query Console UI uses the same API under the hood. The response is a JSON object containing the result rows, schema, and query execution metadata.
Verify
All five queries return results without errors. You have successfully completed the end-to-end onboarding flow: you set up a Pinot cluster, defined a schema and table, loaded data, and queried it through both the UI and the REST API.
What's next
You have finished the linear Start Here path. From here, explore the areas most relevant to your use case:
-- the full SQL reference for Pinot's query language
-- enable JOINs and complex queries across tables
-- understand how queries flow from Broker to Server and back
First Table + Schema
Create your first Pinot schema and table, ready for data ingestion.
Outcome
By the end of this page you will have a Pinot schema and an offline table called transcript registered in your cluster, ready to receive data.
Cluster
Learn to build and manage Apache Pinot clusters, uncovering key components for efficient data processing and optimized analysis.
A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see .
A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop:
Controller: Maintains cluster metadata and manages cluster resources.
Controller
Discover the controller component of Apache Pinot, enabling efficient data and query management.
The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of and ). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The controller exposes a for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
The Pinot controller is responsible for the following:
Kafka Connector Versions
Choose the right Apache Kafka connector version for your Pinot deployment.
Apache Pinot provides multiple Kafka connector versions to match different Kafka broker deployments. Choose the connector that matches your Kafka cluster version.
Available Connectors
Connector Plugin
Kafka Client Version
Notes
Azure Data Lake Storage
This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2)
Enable Azure Data Lake Storage support using the pinot-adls plugin. In the controller or server, add the config:
By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g., pinot-json, pinot-avro.
Usage: StartServer
-serverHost <String> : Host name for the server. (required=false)
-serverPort <int> : Port number to start the server at. (required=false)
-serverAdminPort <int> : Port number to serve the server admin API at. (required=false)
-dataDir <string> : Path to directory containing data. (required=false)
-segmentDir <string> : Path to directory containing segments. (required=false)
-zkAddress <http> : Http address of Zookeeper. (required=false)
-clusterName <String> : Pinot cluster name. (required=false)
-configFileName <Config File Name> : Server Starter Config file. (required=false)
-help : Print this message. (required=false)
#CONTROLLER
pinot.controller.storage.factory.class.[scheme]=className of the pinot file system
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
#SERVER
pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
A running Pinot cluster. See the install guides for Local or Docker.
For Docker users: the cluster must be on the pinot-demo network.
Confirm your Pinot version. See the Version reference page and set the PINOT_VERSION environment variable:
Steps
1. Understand schemas
A Pinot schema defines every column in your table and assigns each one a column type. There are three column types:
Column type
Description
Dimension
Used in filters and GROUP BY clauses for slicing and dicing data.
Metric
Used in aggregations; represents quantitative measurements.
DateTime
Represents the timestamp associated with each row.
Every table must have a schema before it can accept data. The schema tells Pinot how to interpret, index, and store each field.
2. Create the data directory
3. Save the sample CSV data
Create the file /tmp/pinot-quick-start/rawdata/transcript.csv with the following contents:
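A small sample consistent with the columns used in this guide (the exact values are illustrative):

```
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
```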
In this dataset, studentID, firstName, lastName, gender, and subject are dimensions, score is a metric, and timestampInEpoch is the datetime column.
4. Save the schema
Create the file /tmp/pinot-quick-start/transcript-schema.json:
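A schema matching the CSV columns described above (the dataType choices here are typical, not the only option):

```json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```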
5. Understand table configs
A table config tells Pinot how to manage the table at runtime -- which columns to index, how many replicas to keep, which tenants to assign, and whether the table is OFFLINE (batch) or REALTIME (streaming). You pair one table config with one schema.
6. Save the offline table config
Create the file /tmp/pinot-quick-start/transcript-table-offline.json:
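A minimal offline table config for this table. This is a reference sketch; production tables usually set replication, tenants, and indexes explicitly:

```json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "tenants": {},
  "metadata": {}
}
```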
7. Upload the schema and table config
Replace pinot-controller with the actual container name of your Pinot controller if you used a different name during setup.
If the table appears, the schema and table config were registered successfully.
Next step
You now have an empty table. Continue to First batch ingest to import the CSV data into your transcript table.
Maintaining global metadata (e.g., configs and schemas) of the system with the help of Zookeeper, which is used as the persistent metadata store.
Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions)
Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.
Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.
Serving endpoints for segment uploads, which are used in offline data pushes. Controllers are also responsible for initializing real-time consumption and coordinating the periodic persistence of real-time segments into the segment store.
Undertaking other management activities, such as managing segment retention and running validations.
For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or ADLS.
Running the periodic task manually
The controller runs several periodic tasks in the background to perform activities such as management and validation. Each periodic task has its own configuration defining its run frequency, with a sensible default. Each task runs on its own schedule, and can also be triggered manually if needed. The task runs on the lead controller for each table.
Use the GET /periodictask/names API to fetch the names of all the periodic tasks running on your Pinot cluster.
To manually run a named periodic task, use the GET /periodictask/run API:
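For example, assuming a controller listening on localhost:9000 (SegmentStatusChecker is one of the built-in periodic task names):

```bash
curl -X GET "http://localhost:9000/periodictask/run?taskname=SegmentStatusChecker&tableName=myTable&type=OFFLINE"
```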
The log request ID (api-09630c07) can be used to search the pinot-controller log file for entries related to the execution of the manually triggered periodic task.
If tableName (and its type OFFLINE or REALTIME) is not provided, the task will run against all tables.
pinot-kafka-3.0 -- Recommended for Kafka 3.x clusters. Requires a Scala dependency.
pinot-kafka-4.0 (client 4.1.x) -- Recommended for Kafka 4.x clusters (KRaft mode). Pure Java, with no Scala dependency.
The pinot-kafka-2.0 (kafka20) plugin has been removed. If your table config references org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory, you must migrate to either kafka30 or kafka40.
Kafka 4.0 Connector
The Kafka 4.0 connector (pinot-kafka-4.0) supports Apache Kafka 4.x brokers running in KRaft mode (ZooKeeper-free). It uses pure Java Kafka clients with no Scala dependency, resulting in a smaller deployment footprint.
When to use Kafka 4.0
Your Kafka cluster runs Kafka 4.0+ with KRaft mode
You want to eliminate the Scala transitive dependency
You are deploying new Pinot clusters against modern Kafka infrastructure
Configuration
The Kafka 4.0 connector uses the same configuration properties as the Kafka 3.0 connector. The only difference is the stream.kafka.consumer.factory.class.name:
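For example, in the table's streamConfigs (the kafka40 package name below is an assumption that follows the naming pattern of the other connector plugins; verify it against your Pinot release):

```json
"stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory"
```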
Migration from Kafka 2.0 or 3.0
To migrate from an older Kafka connector to Kafka 3.0 or 4.0, update the consumer factory class name in your table configuration:
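As a sketch, the change is a one-line edit in streamConfigs (the kafka30 package name follows the plugin naming pattern; verify it against your release, and use kafka40 for the Kafka 4.0 connector):

```diff
- "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory"
+ "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory"
```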
Ensure the pinot-kafka-4.0 plugin JAR is available in your Pinot plugin directory.
All other stream.kafka.* configuration properties remain the same.
The Kafka 4.0 connector is fully compatible with all existing Kafka consumer configuration properties including SSL/TLS, SASL authentication, isolation levels, and Schema Registry integration. See the main Kafka ingestion guide for detailed configuration examples.
Kafka 3.0 Connector
The Kafka 3.0 connector (pinot-kafka-3.0) supports Apache Kafka 3.x brokers. This is the most widely deployed connector version.
Configuration
Common Configuration Properties
All Kafka connector versions share the same configuration properties. See Ingest streaming data from Apache Kafka for the complete configuration reference, including:
You can pass any native Kafka consumer configuration property using the stream.kafka.consumer.prop. prefix:
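For example, to control the consumer's offset reset behavior and client ID (both are standard Kafka consumer properties):

```json
"stream.kafka.consumer.prop.auto.offset.reset": "smallest",
"stream.kafka.consumer.prop.client.id": "pinot-events-consumer"
```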
Azure Blob Storage provides the following options:
accountName: Name of the Azure account under which the storage is created.
accessKey: Access key required for authentication.
fileSystemName: Name of the file system to use, for example, the container name (similar to the bucket name in S3).
enableChecksum: Enable MD5 checksum for verification. Default is false.
Each of these properties should be prefixed by pinot.[node].storage.factory.adl2. where node is either controller or server depending on the config, like this:
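For example, to configure ADLS Gen2 for the controller (values are placeholders; note that the PinotFS implementation class itself is registered under the storage.factory.class.adl2 key, while the individual options use the storage.factory.adl2. property prefix):

```
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=mystorageaccount
pinot.controller.storage.factory.adl2.accessKey=<access-key>
pinot.controller.storage.factory.adl2.fileSystemName=pinot-data
pinot.controller.storage.factory.adl2.enableChecksum=true
```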
Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.
Broker: Accepts queries from client processes and forwards them to servers for processing.
Server: Provides storage for segment files and compute for query processing.
(Optional) Minion: Executes background tasks other than query processing, minimizing impact on query latency. Optimizes segments and builds additional indexes to maintain performance, even as data is deleted.
The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.
Helix is a cluster management solution that maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. Helix constantly monitors the cluster to ensure that the right hardware resources are allocated for the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.
Helix divides nodes into logical components based on their responsibilities:
Participant
Participants are the nodes that host distributed, partitioned resources.
Pinot servers are modeled as participants. For details about server nodes, see Server.
Spectator
Spectators are the nodes that observe the current state of each participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
Pinot brokers are modeled as spectators. For details about broker nodes, see Broker.
Controller
The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.
Pinot controllers are modeled as controllers. For details about controller nodes, see Controller.
Logical view
Another way to visualize the cluster is a logical view, where:
Explore the Schema component in Apache Pinot, vital for defining the structure and data types of Pinot tables, enabling efficient data processing and analysis.
Each table in Pinot is associated with a schema. A schema defines:
Fields in the table with their data types.
Whether the table uses column-based or table-based null handling. For more information, see Null value support.
The schema is stored in Zookeeper along with the table configuration.
Schema naming in Pinot follows typical database table naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.
Categories
A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:
Dimension: Used in filters and GROUP BY clauses for slicing and dicing data.
Metric: Used in aggregations; represents quantitative measurements.
DateTime: Represents the timestamp associated with each row.
Pinot does not enforce strict rules on which of these categories columns belong to; rather, the categories can be thought of as hints that Pinot uses for internal optimizations.
For example, metrics may be stored without a dictionary and can have a different default null value.
The categories are also relevant when doing segment merge and rollups. Pinot uses the dimension and time fields to identify records against which to apply merge/rollups.
Metric aggregation is another example: Pinot uses the dimension and time columns as the key, and automatically aggregates values for the metric columns.
For configuration details, see .
Date and time fields
Since Pinot doesn't have dedicated DATETIME datatype support, you need to input time in either STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format, such as an epoch timestamp, to do operations. Refer to the date-time field documentation for more details on supported formats.
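For example, a value like "2019-10-12 07:20:50" (UTC) can be converted to epoch milliseconds before ingestion; a quick Python sketch of the conversion:

```python
from datetime import datetime, timezone

def to_epoch_millis(ts: str) -> int:
    """Convert a 'YYYY-MM-DD HH:MM:SS' UTC timestamp to epoch milliseconds,
    suitable for a LONG date-time column with format 1:MILLISECONDS:EPOCH."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

millis = to_epoch_millis("2019-10-12 07:20:50")  # 1570864850000
```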
Creating a schema
First, make sure your Pinot cluster is up and running.
Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.
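A minimal sketch of such a schema (the field names here are illustrative, not a canonical flight dataset):

```json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {"name": "flightNumber", "dataType": "LONG"},
    {"name": "origin", "dataType": "STRING"},
    {"name": "destination", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "price", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "daysSinceEpoch", "dataType": "INT", "format": "1:DAYS:EPOCH", "granularity": "1:DAYS"}
  ]
}
```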
For more details on constructing a schema file, see the .
Then, we can upload the sample schema provided above using either a Bash command or REST API call.
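For example (the paths and host are assumptions for a local setup):

```bash
# Using the pinot-admin CLI:
bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec

# Or via the controller REST API:
curl -F schemaName=@flights-schema.json localhost:9000/schemas
```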
Check out the schema in the controller UI to make sure it was successfully uploaded.
Pinot Storage Model
Apache Pinot™ uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system, including:
Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. To achieve this, all data needs to be distributed across multiple nodes. Pinot does this by breaking data into smaller chunks known as segments (similar to shards/partitions in HA relational databases). Segments can also be seen as time-based partitions.
Table
Similar to traditional databases, Pinot has the concept of a table—a logical abstraction that refers to a collection of related data. As is the case with relational database management systems (RDBMS), a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a schema, which defines the columns in a table as well as their data types.
Unlike in an RDBMS, multiple Pinot tables (real-time or batch) can inherit a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, and replication.
Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Columns have the same name and data type, known as the table's schema.
Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
Pinot table types include:
real-time: Ingests data from a streaming source like Apache Kafka®
offline: Loads data from a batch source
hybrid: Loads data from both a batch source and a streaming source
Segment
Pinot tables are stored in one or more independent shards called segments. A small table may be contained in a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments. Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.
Tenant
To support multi-tenancy, Pinot has first-class support for tenants. A table is associated with a tenant. This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation gives applications and teams separate namespaces, so they don't share tables or schemas. Development teams building applications do not have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.
Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data from separate workloads from being stored or processed on the same physical hardware.
By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.
Cluster
A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see the physical architecture section below.
Physical architecture
A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop.
Controller: Maintains cluster metadata and manages cluster resources.
Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.
Broker: Accepts queries from client processes and forwards them to servers for processing.
The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.
Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.
Helix is a cluster management solution created by the authors of Pinot. Helix maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. It constantly monitors the cluster to ensure that the right hardware resources are allocated to implement the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.
Controller
A controller is the core orchestrator that drives the consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and have visibility of the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent responsible for observing and driving state changes that are subscribed to by other components.
The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time tables and offline tables). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The controller exposes a REST API for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
Server
Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can be either a real-time server or an offline server.
Real-time and offline servers have very different resource usage requirements: real-time servers continually consume new messages from external systems (such as Kafka topics), ingesting them into segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.
Broker
Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return them to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing each query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.
A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.
Pinot minion
Pinot minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation) compliance. Because Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a GDPR-compliant solution for this purpose while optimizing Pinot segments and building additional indexes to maintain performance even when data may be deleted. You can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.
First Batch Ingest
Import your first batch of data into Pinot and see it appear in the query console.
Outcome
By the end of this page you will have imported CSV data into your transcript offline table and confirmed the rows are queryable.
Prerequisites
Completed the previous step -- the transcript_OFFLINE table must already exist.
The sample CSV file at /tmp/pinot-quick-start/rawdata/transcript.csv from the previous step.
For Docker users: set the PINOT_VERSION environment variable.
Steps
1. Understand batch ingestion
Batch ingestion reads data from files (CSV, JSON, Avro, Parquet, and others), converts them into Pinot segments, and pushes those segments to the cluster. A job specification YAML file tells Pinot where to find the input data, what format it is in, and where to send the finished segments.
2. Create the ingestion job spec
Create the file /tmp/pinot-quick-start/batch-job-spec.yml:
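A standalone job spec for this quickstart might look as follows (the class names are the standard Pinot standalone ingestion runners; verify them against your release):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```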
When running inside Docker, the ingestion job container must reach the controller by its Docker network hostname, not localhost. Create the file /tmp/pinot-quick-start/batch-job-spec.yml:
Replace pinot-controller with the actual container name of your Pinot controller if you used a different name during setup.
3. Run the ingestion job
The job reads the CSV file, builds a segment, and pushes it to the controller. You should see log output ending with a success message.
Verify
Open the Pinot query console in your browser.
Run the following query:
You should see 4 rows returned, matching the CSV data you loaded:
The returned columns are studentID, firstName, lastName, gender, subject, score, and timestampInEpoch.
Next step
Continue to the next guide to learn how to set up real-time ingestion from Kafka.
Confluent Schema Registry Decoders
Decode Avro, JSON, and Protobuf messages from Kafka using Confluent Schema Registry.
Pinot supports decoding Kafka messages serialized with Confluent Schema Registry for Avro, JSON Schema, and Protocol Buffers formats. These decoders automatically fetch and cache schemas from the registry, ensuring data is deserialized according to the registered schema.
Available Decoders
Format
Decoder Class
Plugin
Common Configuration
All Confluent Schema Registry decoders share the same configuration properties:
Property
Required
Default
Description
SSL/TLS Configuration
To connect to a Schema Registry endpoint over SSL/TLS, add properties with the schema.registry. prefix:
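For example, inside streamConfigs (the paths and passwords are placeholders):

```json
"stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "/path/to/truststore.jks",
"stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "<truststore-password>",
"stream.kafka.decoder.prop.schema.registry.ssl.keystore.location": "/path/to/keystore.jks",
"stream.kafka.decoder.prop.schema.registry.ssl.keystore.password": "<keystore-password>"
```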
Property
Description
Confluent Avro Decoder
Decodes Avro-serialized Kafka messages with schema managed by Confluent Schema Registry.
Confluent JSON Schema Decoder
Decodes JSON messages serialized with Confluent's JSON Schema serializer. Messages include a schema ID header that the decoder uses to fetch the JSON Schema from the registry for validation.
The JSON Schema decoder validates incoming messages against the schema registered in Schema Registry. Messages that don't match the magic byte format (non-Confluent messages) are silently dropped.
Confluent Protobuf Decoder
Decodes Protocol Buffer messages serialized with Confluent's Protobuf serializer. The decoder fetches the .proto schema definition from the registry and deserializes the binary payload.
SSL/TLS Example
To connect to a secured Schema Registry:
How Schema Resolution Works
Each Confluent-serialized message starts with a magic byte (0x00) followed by a 4-byte schema ID
The decoder extracts the schema ID from the message header
The schema is fetched from Schema Registry and cached locally (up to cached.schema.map.capacity)
Messages without the Confluent magic byte prefix are dropped and logged as errors.
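The framing described above can be sketched in a few lines. This is not Pinot's implementation, just an illustration of the Confluent wire format (one magic byte, a 4-byte big-endian schema ID, then the serialized payload):

```python
import struct

def parse_confluent_header(message: bytes):
    """Split a Confluent-framed Kafka message into (schema_id, payload).

    Wire format: magic byte 0x00, 4-byte big-endian schema ID, payload.
    """
    if len(message) < 5 or message[0] != 0x00:
        raise ValueError("not a Confluent-framed message")
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]

# Frame a dummy payload with schema ID 42, then parse it back.
framed = b"\x00" + struct.pack(">I", 42) + b"\x02hi"
schema_id, payload = parse_confluent_header(framed)
```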
See Also
— General Kafka ingestion guide
— Full connector configuration reference
— All supported input formats
Configure Indexes
Learn how to apply indexes to a Pinot table. This guide assumes that you have followed the Ingest data from Apache Kafka guide.
Pinot supports a series of different indexes that can be used to optimize query performance. In this guide, we'll learn how to add indexes to the events table that we set up in the Ingest data from Apache Kafka guide.
Why do we need indexes?
If no indexes are applied to the columns in a Pinot segment, the query engine needs to scan through every document, checking whether that document meets the filter criteria provided in a query. This can be a slow process if there are a lot of documents to scan.
When indexes are applied, the query engine can more quickly work out which documents satisfy the filter criteria, reducing the time it takes to execute the query.
What indexes does Pinot support?
By default, Pinot creates a forward index for every column. The forward index generally stores documents in insertion order.
However, before flushing the segment, Pinot does a single pass over every column to see whether the data is sorted. If data is sorted, Pinot creates a sorted (forward) index for that column instead of the forward index.
For real-time tables you can also explicitly tell Pinot that one of the columns should be sorted. For more details, see the [Sorted Index Documentation](../../../../build-with-pinot/indexing/forward-index.md#real-time-tables).
For filtering documents within a segment, Pinot supports the following indexing techniques:
Inverted index: Used for exact lookups.
Range index: Used for range queries.
Text index: Used for phrase, term, boolean, prefix, or regex queries.
View events table
Let's see how we can apply these indexing techniques to our data. To recap, the events table has the following fields:
Date Time Fields
Dimensions Fields
Metric Fields
We might want to write queries that filter on the ts and uuid columns, so these are the columns on which we would want to configure indexes.
Since the data we're ingesting into the Kafka topic is implicitly ordered by timestamp, the ts column already has a sorted index, so any queries that filter on this column are already optimized.
So that leaves us with the uuid column.
Add an inverted index
We're going to add an inverted index to the uuid column so that queries that filter on that column will return quicker. We need to add the following line to the tableIndexConfig section:
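The line in question (invertedIndexColumns is the standard table-config key for this index type):

```json
"invertedIndexColumns": ["uuid"]
```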
Copy the following to the clipboard:
/tmp/pinot/table-config-stream.json
Navigate to the table in the Pinot UI, click Edit Table, paste the table config, and then click Save.
Once you've done that, you'll need to click Reload All Segments and then Yes to apply the indexing change to all segments.
Check the index has been applied
We can check that the index has been applied to all our segments by querying Pinot's REST API. You can find Swagger documentation at .
The following query will return the indexes defined on the uuid column:
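For example, against a local controller (the host and table name assume the setup from the Kafka guide):

```bash
curl -X GET "http://localhost:9000/segments/events/metadata?columns=uuid" \
  -H "accept: application/json"
```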
Output
We're using jq to extract the fields that we're interested in.
We can see from looking at the inverted-index property that the index has been applied.
Querying
You can now run some queries that filter on the uuid column, as shown below:
You'll need to change the actual uuid value to a value that exists in your database, because the UUIDs are generated randomly by our script.
Multistage Lite Mode
Introduces the Multistage Engine Lite Mode
MSE Lite Mode is included in Pinot 1.4 and is currently in Beta. This Beta label applies to Lite Mode specifically, not to the core multi-stage engine, which is generally available.
Multistage Engine (MSE) Lite Mode is an optional, guardrail-oriented execution mode for self-service and high-QPS tenants. Without additional bounds, queries can scan a large number of records or run expensive operations, which can impact the reliability of a shared tenant and create friction in onboarding new use-cases. Lite Mode addresses this by capping the rows returned from each leaf stage and applying tighter resource bounds automatically.
It is based on the observation that most users need access to advanced SQL features like Window Functions, Subqueries, etc., but aren't interested in scanning a lot of data or running fully Distributed Joins.
Overview
MSE Lite Mode has the following key characteristics:
Users can still use all MSE query features like Window Functions, Subqueries, Joins, etc.
However, the maximum number of rows returned by a Leaf Stage is capped at a user-configurable value; the default is 100,000.
Query execution follows a scatter-gather paradigm, similar to the Single-stage Engine. This is different from regular MSE that uses shuffles across Pinot Servers.
Leaf Stage in a Multistage Engine query usually refers to Table Scan, an optional Project, an optional Filter and an optional Aggregate Plan Node.
At present, all joins in MSE Lite Mode are run in the Broker. This may change with the next release, since Colocated Joins can theoretically be run in the Servers.
Example
To illustrate how MSE Lite Mode applies automatic resource bounds, consider the query below based on the colocated_join Quickstart. If this query were allowed in production with the regular MSE, it would scan all the rows of the userFactEvents table. With Lite Mode, the full scan will be prevented because Lite Mode will automatically add a Sort to the leaf stage with a configurable limit (aka "fetch") value.
The query plan for this query would be as follows. The window function, the filter in the filtered-events table, and the aggregation would be run in the Pinot Broker using a single thread. We assume that the Pinot Broker is configured with the lite mode limit value of 100k records:
Enabling Lite Mode
To use Lite Mode, you can use the following query options.
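As a sketch (the exact option key is an assumption -- check the release notes for your Pinot version for the supported Lite Mode query options):

```sql
SET liteMode = true;  -- hypothetical option name, shown for illustration
SELECT userId, COUNT(*)
FROM userFactEvents
GROUP BY userId
ORDER BY COUNT(*) DESC
LIMIT 10;
```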
Running Non-Leaf Stages in Pinot Servers
By default, Lite Mode runs the non-leaf stages in the Broker. If you want to run the non-leaf stages in Pinot Servers instead, set the following query option to false. In this case, a random server will be picked for the non-leaf stages.
Configuration
You can set the following configs in your Pinot Broker.
Configuration Key
Default
Description
FAQ
Q1: What is the Lite Mode intended for?
Lite Mode was contributed by Uber and is inspired by . Lite Mode is an optional execution mode with tighter scan and resource bounds, designed for use cases where users need advanced SQL features (window functions, subqueries, etc.) but do not need fully distributed execution of joins or CTEs. One can think of it as an advanced version of the Single-Stage Engine.
Q2: Why use a single thread in the broker for the non-leaf stages?
Using a single thread, or more importantly a single Operator Chain, means that the entire stage can be run without any Exchange. It also keeps the design simple and makes it easy to reason about performance and debugging.
Q3: Can Lite Mode be used in tandem with server/segment pruning for high QPS use-cases?
Yes, if you set up segmentPrunerTypes in your Table Config, then segments and servers will be pruned. You can use this to scale out read QPS.
Segment Compaction on Upserts
Use segment compaction on upsert-enabled real-time tables.
Overview of segment compaction
Compacting a segment replaces the completed segment with a compacted segment that only contains the latest version of records. For more information about how to use upserts on a real-time table in Pinot, see Stream Ingestion with Upsert.
The Pinot upsert feature stores all versions of the record ingested into immutable segments on disk. Even though the previous versions are not queried, they continue to add to the storage overhead. To remove older records (no longer used in query results) and reclaim storage space, we need to compact Pinot segments periodically. Segment compaction is done via a new minion task. To schedule Pinot tasks periodically, see the Minion documentation.
Compact segments on upserts in a real-time table
To compact segments on upserts, complete the following steps:
Ensure task scheduling is enabled and a minion is available.
Add the following to your table configuration. These configurations (except schedule) determine which segments to compact.
bufferTimePeriod: To compact segments as soon as they are complete, set this to "0d". To delay compaction (the example configuration delays it by 7 days, "7d"), specify the number of days to wait after a segment completes.
invalidRecordsThresholdPercent (Optional) Limits the older records allowed in the completed segment as a percentage of the total number of records in the segment. In the example above, the completed segment may be selected for compaction when 30% of the records in the segment are old.
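Putting these options together, the task block in the table config might look like this (values are illustrative; the schedule uses Quartz cron syntax):

```json
"task": {
  "taskTypeConfigsMap": {
    "UpsertCompactionTask": {
      "schedule": "0 */5 * ? * *",
      "bufferTimePeriod": "7d",
      "invalidRecordsThresholdPercent": "30",
      "invalidRecordsThresholdCount": "100000"
    }
  }
}
```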
When using the two in-memory types, if the server is restarted, the upsert view becomes consistent again once the server re-ingests the data it had ingested before the restart. The in-memory bitmaps are updated as the server ingests data into the consuming segment, even before the consuming segment is committed. So if the server is restarted while still consuming data, the upsert view becomes consistent again once it catches up with the previously ingested data. In contrast, bitmap snapshots are only taken after a segment is committed, so they can be more consistent across server restarts, but they are still only eventually consistent if the server is restarted while ingesting data.
Because segment compaction is an expensive operation, we do not recommend setting invalidRecordsThresholdPercent and invalidRecordsThresholdCount too low (close to 1). By default, all configurations above are 0, so no thresholds are applied.
Example
The following example includes a dataset with 24M records and 240K unique keys that have each been duplicated 100 times. After ingesting the data, there are 6 segments (5 completed segments and 1 consuming segment) with a total estimated size of 22.8MB.
Example dataset
Submitting the query “set skipUpsert=true; select count(*) from transcript_upsert” before compaction produces 24,000,000 results:
Results before segment compaction
After the compaction tasks are complete, the following is reported.
Minion compaction task completed
Segment compaction generates a task for each segment to compact. Five tasks were generated in this case because 90% of the records (3.6–4.5M records per segment) in the completed segments are old, exceeding the configured thresholds.
If a completed segment only contains old records, Pinot immediately deletes the segment (rather than creating a task to compact it).
Submitting the query again shows the count matches the set of 240K unique keys.
Results after segment compaction
Once segment compaction has completed, the total number of segments remains the same and the total estimated size drops to 2.77MB.
To further improve query latency, merge small segments into larger ones.
Tenant
Discover the tenant component of Apache Pinot, which facilitates efficient data isolation and resource management within Pinot clusters.
Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data in separate workloads from being stored or processed on the same physical hardware.
By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster. If the cluster will have multiple tenants, consider setting cluster.tenant.isolation.enable=false so that servers and brokers are not automatically tagged with DefaultTenant when added to the cluster.
To support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant, which controls the nodes used by the table as servers and brokers. Multi-tenancy lets Pinot group all tables belonging to a particular use case under a single tenant name.
The concept of tenants is important when multiple use cases share a Pinot cluster and there is a need to provide quotas or some form of isolation across tenants. For example, consider two tables, Table A and Table B, in the same Pinot cluster.
We can configure Table A with server tenant Tenant A and Table B with server tenant Tenant B. We can tag some of the server nodes for Tenant A and some for Tenant B. This ensures that segments of Table A only reside on servers tagged with Tenant A, and segments of Table B only reside on servers tagged with Tenant B. The same isolation can be achieved at the broker level by configuring broker tenants for the tables.
No need to create separate clusters for every table or use case!
Tenant configuration
The tenant is defined in the tenants section of the table config.
This section contains two main fields, broker and server, which determine the tenants used for the broker and server components of this table.
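For illustration, the tenants section can look like the following sketch (the tenant names are placeholders):

```json
"tenants": {
  "broker": "brokerTenantName",
  "server": "serverTenantName"
}
```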
In the above example:
The table will be served by brokers that have been tagged as brokerTenantName_BROKER in Helix.
If this were an offline table, the offline segments for the table would be hosted on Pinot servers tagged in Helix as serverTenantName_OFFLINE.
Create a tenant
Broker tenant
Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant by tagging three untagged broker nodes as sampleBrokerTenant_BROKER.
To create this tenant, use the following command. The creation will fail if the number of untagged broker nodes is less than numberOfInstances.
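A sketch of the tenant config and the corresponding admin command, assuming a locally running cluster (verify flag names against your Pinot version):

```json
{
  "tenantRole": "BROKER",
  "tenantName": "sampleBrokerTenant",
  "numberOfInstances": 3
}
```

```shell
bin/pinot-admin.sh AddTenant \
  -name sampleBrokerTenant \
  -role BROKER \
  -instanceCount 3 -exec
```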
Follow instructions in to get Pinot locally, and then
Check out the table config in the to make sure it was successfully uploaded.
Server tenant
Here's a sample server tenant config. This will create a server tenant sampleServerTenant by tagging 1 untagged server node as sampleServerTenant_OFFLINE and 1 untagged server node as sampleServerTenant_REALTIME.
To create this tenant, use the following command. The creation will fail if the number of untagged server nodes is less than offlineInstances + realtimeInstances.
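A sketch of the tenant config and the corresponding admin command, assuming a locally running cluster (verify flag names against your Pinot version):

```json
{
  "tenantRole": "SERVER",
  "tenantName": "sampleServerTenant",
  "offlineInstances": 1,
  "realtimeInstances": 1
}
```

```shell
bin/pinot-admin.sh AddTenant \
  -name sampleServerTenant \
  -role SERVER \
  -offlineInstanceCount 1 \
  -realtimeInstanceCount 1 -exec
```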
Follow instructions in to get Pinot locally, and then
Check out the table config in the to make sure it was successfully uploaded.
Hadoop
Batch ingestion of data into Apache Pinot using Apache Hadoop.
Segment Creation and Push
Pinot supports as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Hadoop code to process your files, convert them to segments, and upload them to Pinot.
You can follow the to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
Stream Ingestion with Dedup
Deduplication support in Apache Pinot.
Pinot provides native support for deduplication (dedup) during real-time ingestion (v0.11.0+).
Prerequisites for enabling dedup
To enable dedup on a Pinot table, make the following table configuration and schema changes:
Ingest from Amazon Kinesis
This guide shows you how to ingest a stream of records from an Amazon Kinesis topic into a Pinot table.
To ingest events from an Amazon Kinesis stream into Pinot, set the following configs into your table config:
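As a sketch, the streamConfigs block for a Kinesis stream could look like this (the stream name and region are placeholders; verify the plugin class names against your Pinot version):

```json
"streamConfigs": {
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "my-kinesis-stream",
  "region": "us-west-2",
  "shardIteratorType": "TRIM_HORIZON",
  "stream.kinesis.consumer.type": "lowlevel",
  "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
  "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
}
```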
where the Kinesis specific properties are:
Property
Description
JOINs
Pinot supports JOINs, including left, right, full, semi, anti, lateral, and equi JOINs. Use JOINs to connect two tables to generate a unified view, based on a related column between the tables.
This page explains the syntax used to write joins. For a more in-depth understanding of how joins work, it is recommended to read and also from Star Tree.
Important: To query using JOINs, you must use the multi-stage query engine.
Server: Provides storage for segment files and compute for query processing.
(Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).
Dimension columns are typically used in slice and dice operations for answering business queries. Some operations for which dimension columns are used:
- GROUP BY: group by one or more dimension columns along with aggregations on one or more metric columns
- Filter clauses such as WHERE
Metric
These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns. Some operations for which metric columns are used:
- Aggregation: SUM, MIN, MAX, COUNT, AVG, etc.
- Filter clauses such as WHERE
DateTime
This column represents time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config. The primary time column is used by Pinot to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH. Common operations that can be done on a time column:
- GROUP BY
- Filter clauses such as WHERE
Geospatial index - Based on H3, a hexagon-based hierarchical gridding. Used for finding points that exist within a certain distance from another point.
JSON index - Used for querying columns in JSON documents.
Star-Tree index - Pre-aggregates results across multiple columns.
invalidRecordsThresholdCount (Optional) Limits the older records allowed in the completed segment by record count. In the example above, if the segment contains more than 100K records, it may be selected for compaction.
tableMaxNumTasks (Optional) Limits the number of tasks allowed to be scheduled.
validDocIdsType (Optional) Specifies the source of validDocIds to fetch when running data compaction. The valid types are SNAPSHOT, IN_MEMORY, and IN_MEMORY_WITH_DELETE:
SNAPSHOT: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot in the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.
IN_MEMORY: This indicates that the validDocIds bitmap is loaded from the real-time server's in-memory state.
IN_MEMORY_WITH_DELETE: This indicates that the validDocIds bitmap is read from the real-time server's in-memory state. The valid document IDs here do take the deleted records into account. UpsertConfig's deleteRecordColumn must be provided for this type.
Only TRIM_HORIZON (consume from earliest) is supported. Support for LATEST, AT_SEQUENCE_NUMBER, and AFTER_SEQUENCE_NUMBER is in progress but not available at this point.
maxRecordsToFetch
Specifies the maximum number of records to retrieve in a single getRecords API call to Kinesis. This parameter controls the batch size for data retrieval. It can be set between 1 and 10,000 (the Kinesis API limit set by AWS). Larger values reduce the number of API calls needed but may increase latency and memory usage per batch. The default is the maximum, 10,000; only lower this when you have memory constraints.
requests_per_second_limit
Controls the maximum number of getRecords requests per second that the consumer will make to a Kinesis shard. This parameter is crucial for avoiding AWS Kinesis API throttling. Kinesis enforces a hard limit of 5 getRecords requests per second per shard; exceeding this limit results in ProvisionedThroughputExceededException. The default value of 1 is intentionally conservative to prevent throttling in replicated setups where multiple consumers might read from the same shard simultaneously. Only increase this if you are experiencing slow consumption rates and do not yet see ProvisionedThroughputExceededException in the logs.
Kinesis supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
Instance profile credentials delivered through the Amazon EC2 metadata service
You must provide all read-access-level permissions for Pinot to work with an AWS Kinesis data stream. See the AWS documentation for details.
Although you can also specify the accessKey and secretKey in the properties above, we don't recommend this insecure method; use it only for non-production proof-of-concept (POC) setups. You can also specify other AWS fields, such as AWS_SESSION_TOKEN, as environment variables or in the config, and they will be honored.
Resharding
In Kinesis, whenever you reshard a stream, it is done via split or merge operations on shards. If you split a shard, the shard closes and creates 2 new children shards. So if you started with shard0, and then split it, it would result in shard1 and shard2. Similarly, if you merge 2 shards, both those will close and create a child shard. So in the same example, if you merge shards 1 and 2, you'll end up with shard3 as the active shard, while shard0, shard1, shard2 will remain closed forever.
In Pinot, resharding of any stream is detected by the periodic task RealtimeValidationManager (docs), which runs hourly. If you reshard, your new shards will not be detected until:
Ingestion from the parent shards has completed.
After 1, the RealtimeValidationManager has run.
You will see a period where the ideal state will show all segments ONLINE, as parents have naturally completed ingesting, and we're waiting for RealtimeValidationManager to kickstart the ingestion from children.
If you need the ingestion to happen sooner, you can manually invoke the RealtimeValidationManager: docs
Limitations
ShardID is of the format "shardId-000000000001". We use the numeric part as the partitionId, and our partitionId variable is an integer. If shard IDs grow beyond Integer.MAX_VALUE, the partitionId space will overflow.
Segment-size-based thresholds for segment completion will not work, because the mechanism assumes that partition "0" always exists. However, once shard 0 is split or merged, there will no longer be a partition 0.
streamType
This should be set to "kinesis".
stream.kinesis.topic.name
Next, you need to change the execution config in the job spec to the following:
You can check out the sample job spec here.
Finally, execute the Hadoop job using the following command:
Ensure environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
Data Preprocessing before Segment Creation
We've seen requests that data be massaged (partitioned, sorted, resized) before creating and pushing segments to Pinot.
The MapReduce job called SegmentPreprocessingJob would be the best fit for this use case, regardless of whether the input data is of AVRO or ORC format.
Check the below example to see how to use SegmentPreprocessingJob.
In Hadoop properties, set the following to enable this job:
In table config, specify the operations in preprocessing.operations that you'd like to enable in the MR job, and then specify the exact configs regarding those operations:
preprocessing.num.reducers
Minimum number of reducers. Optional. Used when partitioning is disabled and resizing is enabled. This parameter avoids producing too many small input files for Pinot, which would leave the Pinot server holding too many small segments and spawning too many threads.
preprocessing.max.num.records.per.file
Maximum number of records per reducer. Optional. Unlike preprocessing.num.reducers, this parameter avoids producing too few large input files for Pinot, which would forgo the advantage of multi-threading when querying. When not set, each reducer generates one output file. When set (e.g., to M), the original output file is split into multiple files, each containing at most M records. This applies whether or not partitioning is enabled.
For more details on this MR job, refer to this document.
The inner join selects rows that have matching values in both tables.
Syntax
Example of inner join
Joins a table containing user transactions with a table containing promotions shown to the users, to show the spending for every userID.
LEFT JOIN
A left join returns all values from the left relation and the matched values from the right table, or appends NULL if there is no match. Also referred to as a left outer join.
Syntax:
RIGHT JOIN
A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.
Syntax:
FULL JOIN
A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.
Syntax:
CROSS JOIN
A cross join returns the Cartesian product of two relations. If no WHERE clause is used along with CROSS JOIN, this produces a result set that is the number of rows in the first table multiplied by the number of rows in the second table. If a WHERE clause is included with CROSS JOIN, it functions like an INNER JOIN.
Syntax:
SEMI JOIN
Semi-join returns rows from the first table where matches are found in the second table. Returns one copy of each row in the first table for which a match is found.
Syntax:
Some subqueries, like the following, are also implemented as a semi-join under the hood:
ANTI JOIN
Anti-join returns rows from the first table where no matches are found in the second table. Returns one copy of each row in the first table for which no match is found.
Syntax:
Some subqueries, like the following, are also implemented as an anti-join under the hood:
Equi join
An equi join uses an equality operator to match one or more column values between the respective tables.
Syntax:
ASOF JOIN
An ASOF JOIN selects rows from two tables based on a "closest match" algorithm.
Syntax:
The comparison operator in the MATCH_CONDITION can be one of <, >, <=, or >=. Similar to an inner join, an ASOF join first calculates the set of matching rows in the right table for each row in the left table based on the ON condition. But instead of returning all of these rows, only the closest match (if one exists) based on the match condition is returned. Note that the two columns in the MATCH_CONDITION should be of the same type.
The join condition in ON is mandatory and has to be a conjunction of equality comparisons (i.e., non-equi join conditions and clauses joined with OR aren't allowed). ON true can be used in case the join should only be performed using the MATCH_CONDITION.
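As a hedged illustration with hypothetical trades and quotes tables, the following selects, for each trade, the latest quote at or before the trade's timestamp:

```sql
-- hypothetical tables: trades(symbol, tradeTs, qty), quotes(symbol, quoteTs, price)
SELECT t.symbol, t.tradeTs, t.qty, q.price
FROM trades t
ASOF JOIN quotes q
MATCH_CONDITION(t.tradeTs >= q.quoteTs)
ON t.symbol = q.symbol;
```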
LEFT ASOF JOIN
A LEFT ASOF JOIN is similar to the ASOF JOIN, except that all rows from the left table are returned, even those without a match in the right table; the unmatched rows are padded with NULL values (similar to the difference between an INNER JOIN and a LEFT JOIN).
SELECT *
FROM events
WHERE uuid = 'f4a4f'
LIMIT 10
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;
SET useLiteMode = true;
EXPLAIN PLAN FOR WITH ordered_events AS (
SELECT
cityName,
tripAmount,
ROW_NUMBER() OVER (
ORDER BY ts DESC
) as row_num
FROM userFactEvents
),
filtered_events AS (
SELECT
*
FROM ordered_events
WHERE row_num < 1000
)
SELECT
cityName,
SUM(tripAmount) as cityTotal
FROM filtered_events
GROUP BY cityName
PhysicalAggregate(group=[{0}], agg#0=[$SUM0($1)], aggType=[DIRECT])
PhysicalFilter(condition=[<($3, 1000)])
PhysicalWindow(window#0=[window(order by [2 DESC] rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])])
PhysicalExchange(exchangeStrategy=[SINGLETON_EXCHANGE], collation=[[2 DESC]])
PhysicalSort(fetch=[100000], collation=[[2 DESC]]) <== added by Lite Mode
PhysicalProject(cityName=[$3], tripAmount=[$7], ts=[$9])
PhysicalTableScan(table=[[default, userFactEvents]])
SET useMultistageEngine=true;
SET usePhysicalOptimizer=true; -- enables the new Physical MSE Query Optimizer
SET useLiteMode=true; -- enables Lite Mode
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
# name: execution framework name
name: 'hadoop'
# segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
# segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentMetadataPushJobRunner'
# extraConfigs: extra configs for execution framework.
extraConfigs:
# stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
stagingDir: your/local/dir/staging
export PINOT_VERSION=1.4.0 #set to the Pinot version you have installed
export PINOT_DISTRIBUTION_DIR=${PINOT_ROOT_DIR}/build/
export HADOOP_CLIENT_OPTS="-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml"
hadoop jar \
${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
org.apache.pinot.tools.admin.PinotAdministrator \
LaunchDataIngestionJob \
-jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpec.yaml
SELECT myTable.column1, myTable.column2, myOtherTable.column1, ...
FROM myTable INNER JOIN myOtherTable
ON myTable.matching_column = myOtherTable.matching_column;
SELECT
p.userID, t.spending_val
FROM promotion AS p JOIN transaction AS t
ON p.userID = t.userID
WHERE
p.promotion_val > 10
AND t.transaction_type IN ('CASH', 'CREDIT')
AND t.transaction_epoch >= p.promotion_start_epoch
AND t.transaction_epoch < p.promotion_end_epoch
SELECT myTable.column1, myTable.column2, myOtherTable.column1, ...
FROM myTable LEFT JOIN myOtherTable
ON myTable.matching_column = myOtherTable.matching_column;
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;
SELECT *
FROM table1
CROSS JOIN table2;
SELECT myTable.column1
FROM myTable
WHERE EXISTS [ join_criteria ]
SELECT table1.strCol
FROM table1
WHERE table1.intCol IN (select table2.anotherIntCol from table2 where ...)
SELECT myTable.column1
FROM myTable
WHERE NOT EXISTS [ join_criteria ]
SELECT table1.strCol
FROM table1
WHERE table1.intCol NOT IN (select table2.anotherIntCol from table2 where ...)
SELECT *
FROM table1
JOIN table2
[ON (join_condition)]
OR
SELECT column_list
FROM table1, table2....
WHERE table1.column_name =
table2.column_name;
SELECT * FROM table1 ASOF JOIN table2
MATCH_CONDITION(table1.col1 <comparison_operator> table2.col1)
ON table1.col2 = table2.col2;
SELECT * FROM table1 LEFT ASOF JOIN table2
MATCH_CONDITION(table1.col1 <comparison_operator> table2.col1)
ON table1.col2 = table2.col2;
If this were a real-time table, the real-time segments (both consuming and completed ones) would be hosted on Pinot servers tagged in Helix as serverTenantName_REALTIME.
To be able to dedup records, a primary key is needed to uniquely identify a given record. To define a primary key, add the field primaryKeyColumns to the schema definition.
Note this field expects a list of columns, as the primary key can be composite.
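For example, a schema with a single-column primary key might declare the following (the schema and column names are illustrative):

```json
{
  "schemaName": "orders",
  "dimensionFieldSpecs": [
    { "name": "orderId", "dataType": "STRING" }
  ],
  "primaryKeyColumns": ["orderId"]
}
```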
While ingesting a record, if its primary key is found to be already present, the record will be dropped.
Partition the input stream by the primary key
An important requirement for the Pinot dedup table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the send API. If the original stream is not partitioned, then a streaming processing job (e.g. Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
Use strictReplicaGroup for routing
The dedup Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment for the segments. Moreover, dedup poses the additional requirement that all segments of the same partition must be served from the same server to ensure the data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use that, configure instanceSelectorType in Routing as the following:
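A minimal sketch of this routing section:

```json
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}
```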
This ensures the instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.
Other limitations
The incoming stream must be partitioned by the primary key such that all records with a given primary key are consumed by the same Pinot server instance.
Enable dedup in the table configurations
To enable dedup for a REALTIME table, add the following to the table config.
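A minimal dedup config sketch:

```json
"dedupConfig": {
  "dedupEnabled": true,
  "hashFunction": "NONE"
}
```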
Supported values for hashFunction are NONE, MD5 and MURMUR3, with the default being NONE.
Metadata TTL
The server stores the existing primary keys in a dedup metadata map kept on the JVM heap. As the dedup metadata grows, heap memory pressure increases, which may affect the performance of ingestion and queries. You can set a positive metadata TTL to enable the TTL mechanism and keep the metadata size bounded. By default, the table's time column is used as the dedup time column. The time unit of the TTL is the same as that of the dedup time column. Set the TTL long enough that new records can be deduplicated before their primary keys get removed. The time column must be of NUMERIC data type when metadataTTL is enabled.
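For example, to bound the metadata to roughly one day for a millisecond-epoch time column, the dedup config might add the following (the values and column name are illustrative):

```json
"dedupConfig": {
  "dedupEnabled": true,
  "metadataTTL": 86400000,
  "dedupTimeColumn": "eventTimeMillis"
}
```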
Enable preload for faster server restarts
When ingesting new records, the server has to read the metadata map to check for duplicates. But when a server restarts, the documents in existing segments are all unique, as ensured by the dedup logic during real-time ingestion, so the server can bootstrap the metadata map faster using write-only operations.
The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. It's 0 by default to disable the preloading feature. This preloading thread pool is shared with upsert table's preloading.
Immutable dedup configuration fields
Certain dedup and schema configuration fields cannot be modified after table creation.
Changing these fields on an existing dedup table can lead to data inconsistencies or data loss between replicas. Pinot uses these configurations to determine which records to keep or discard, so altering them after data has been ingested will cause existing metadata to become inconsistent with the new configuration.
The following fields are immutable after table creation:
Schema fields:
primaryKeyColumns
dedupConfig fields:
hashFunction
dedupTimeColumn
timeColumnName
Attempting to update these fields will return an error:
Recommended workaround: Create a new table with the desired configuration and reingest all data.
Alternative (use with caution): If you must modify these fields without recreating the table, you can use the force=true query parameter on the table config update API. Before doing so, pause consumption and restart all servers. Note that this approach only guarantees consistency for newly ingested keys; existing data may remain inconsistent.
Best practices
Unlike other real-time tables, Dedup table takes up more memory resources as it needs to bookkeep the primary key and its corresponding segment reference, in memory. As a result, it's important to plan the capacity beforehand, and monitor the resource usage. Here are some recommended practices of using Dedup table.
Create the Kafka topic with more partitions. The number of Kafka partitions determines the partitioning of the Pinot table. The more partitions you have in the Kafka topic, the more Pinot servers you can distribute the table across, and the more you can scale the table horizontally. Note that, like upsert tables, you can't increase the partition count later for dedup-enabled tables, so start with enough partitions (at least 2-3x the number of Pinot servers).
For Dedup tables, updating primary key columns or the dedupTimeColumn is not recommended, as it may lead to data loss and inconsistencies between replicas. If a change is unavoidable, ensure that consumption is paused and all servers are restarted for the change to take effect. Even then, consistency is not guaranteed.
Dedup table maintains an in-memory map from the primary key to the segment reference. So it's recommended to use a simple primary key type and avoid composite primary keys to save the memory cost. In addition, consider the hashFunction config in the Dedup config, which can be MD5 or MURMUR3, to store the 128-bit hashcode of the primary key instead. This is useful when your primary key takes more space. But keep in mind, this hash may introduce collisions, though the chance is very low.
Monitoring: Set up a dashboard over the metric pinot.server.dedupPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking its growth which is proportional to the memory usage growth.
Capacity planning: It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the amount of the primary keys in the Kafka throughput per partition and time the primary key space cost to approximate the memory usage. A heap dump is also useful to check the memory usage so far on an dedup table instance.
Pinot Data Explorer
Pinot Data Explorer is a user-friendly interface in Apache Pinot for interactive data exploration, querying, and visualization.
Once you have set up a cluster, you can start exploring the data and the APIs using the Pinot Data Explorer.
The first screen that you'll see when you open the Pinot Data Explorer is the Cluster Manager. The Cluster Manager provides a UI to operate and manage your cluster, giving you an overview of tenants, instances, tables, and their current status.
If you want to view the contents of a server, click on its instance name. You'll then see the following:
Table management
To view a table, click on its name from the tables list. From the table detail screen, you can edit or delete the table, edit or adjust its schema, and perform several other operations.
For example, if we want to add yearID to the list of inverted indexes, click on Edit Table, add the extra column, and click Save:
Pause and resume consumption
For real-time tables, the table detail screen includes a Pause/Resume Consumption button. This lets you pause ingestion on a real-time table directly from the UI without issuing REST API calls, and resume it when ready. This is useful during maintenance windows or when you need to temporarily halt data ingestion.
Consuming segments info
A Consuming Segments Info button is available on real-time tables, providing a quick view of all currently consuming segments. This shows details such as the partition, current offset, and consumption state, making it easier to monitor real-time ingestion health.
Reset segment
The UI now supports a Reset Segment operation, allowing you to reset a segment directly from the table detail screen. This is helpful when a segment is stuck in an error state and needs to be re-processed.
Segment state filter
A segment state filter has been added to the table detail screen. You can filter segments by their state (e.g., ONLINE, CONSUMING, ERROR) to quickly locate segments that need attention, which is especially valuable for tables with a large number of segments.
Table rebalance
The table detail screen also provides access to table rebalance operations. Several UI fixes and improvements have been made to improve the reliability and usability of the rebalance workflow, including better parameter validation and progress display.
Logical table management
Starting with Pinot 1.4, the Data Explorer includes a logical table management UI. Logical tables are collections of physical tables (REALTIME and OFFLINE) that can be queried as a single unified table.
The logical tables listing is accessible from the main Tables page, alongside physical tables and schemas. From there you can:
Browse all logical tables in the cluster with search support.
View details of a logical table, including its configuration, the list of physical tables it maps to, and metadata.
Edit a logical table's configuration.
For more information about logical tables, see the section in the 1.4.0 release notes.
Query Console
Navigate to to see the querying interface. The Query Console lets you run SQL queries against your Pinot cluster and view the results interactively.
We can see our baseballStats table listed on the left (you will see meetupRSVP or airlineStats if you used the streaming or the hybrid quickstart). Click the table name to display its column names and data types.
You can also execute a sample query select * from baseballStats limit 10 by typing it in the text box and clicking the Run Query button.
Cmd + Enter can also be used to run the query when focused on the console.
Here are some sample queries you can try:
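For instance, the two queries below use only columns that appear elsewhere in this guide (yearID, playerName, teamID); treat them as illustrative sketches and adjust them to your own schema:

```sql
-- Record counts per year, most recent first
SELECT yearID, COUNT(*) AS cnt
FROM baseballStats
GROUP BY yearID
ORDER BY yearID DESC
LIMIT 10;

-- Distinct players on a given team
SELECT DISTINCT playerName
FROM baseballStats
WHERE teamID = 'WS1'
LIMIT 10;
```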
Pinot uses SQL for querying. For the complete syntax reference, see the . For query options, examples, and engine details, see .
Time-series query execution
The Query Console also supports time-series query execution (), introduced as part of the Time Series Engine beta. This feature provides a dedicated interface for running and visualizing time-series queries using languages such as PromQL. It connects to a Prometheus-compatible /query_range endpoint () exposed by the Pinot controller, letting you explore time-series data and inspect query execution plans directly from the UI.
REST API
The REST API contains all the APIs you will need to operate and manage your cluster. It provides endpoints for Pinot cluster management, including health checks, instance management, schema and table management, and segment management.
Let's check out the tables in this cluster by going to , click Try it out, and then click Execute. We can see the baseballStats table listed here. We can also see the exact cURL call made to the controller API.
You can look at the configuration of this table by going to , click Try it out, type baseballStats in the table name, and then click Execute.
Let's check out the schemas in the cluster by going to , click Try it out, and then click Execute. We can see a schema called baseballStats in this list.
Take a look at the schema by going to , click Try it out, type baseballStats in the schema name, and then click Execute.
Finally, let's check out the data segments in the cluster by going to , click Try it out, type in baseballStats in the table name, and then click Execute. There's 1 segment for this table, called baseballStats_OFFLINE_0.
To learn how to upload your own data and schema, see or .
Filtering with IdSet
Learn how to write fast queries for looking up IDs in a list of values.
Filtering with IdSet is only supported with the single-stage query engine (v1).
A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but using IN doesn't perform well with large lists of IDs. For large lists of IDs, we recommend using an IdSet.
ID_SET
ID_SET(columnName)
This function returns a base64-encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:
INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
Other types - Bloom Filter
The following parameters are used to configure the Bloom Filter:
expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - False positive probability to use for the BloomFilter. Must be positive and less than 1.0.
Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
IN_ID_SET
IN_ID_SET(columnName, base64EncodedIdSet)
This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.
IN_SUBQUERY
IN_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.
IN_PARTITIONED_SUBQUERY
IN_PARTITIONED_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.
This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
The query passed to IN_SUBQUERY can be run on any table - it isn't restricted to the table used in the parent query.
The query passed to IN_PARTITIONED_SUBQUERY must be run on the same table as the parent query.
Examples
Create IdSet
You can create an IdSet of the values in the yearID column by running the following:
idset(yearID)
When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:
idset(playerName)
idset(playerName)
We can also configure the fpp parameter:
idset(playerName)
Filter by values in IdSet
We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:
Filter by values not in IdSet
To return rows for yearIDs not in the IdSet, run the following:
Filter on broker
To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:
Filter on server
To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:
Physical Optimizer
Describes the new Multistage Engine Physical Query Optimizer
The Physical Optimizer is an optional query optimizer for the multi-stage engine, included in Pinot 1.4 and currently in Beta. This Beta label applies to the Physical Optimizer specifically, not to the core multi-stage engine, which is generally available.
We have added a new query optimizer in the Multistage Engine that computes and tracks precise Data Distribution across the entire plan before running some critical optimizations like Sort Pushdown, Aggregate Split/Pushdown, etc.
One of the biggest features of this Optimizer is that it can eliminate Shuffles or simplify Exchanges, when applicable, for arbitrarily complex queries, without requiring any Query Hints.
To enable this Optimizer for your MSE query, you can use the following Query Options:
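For example, prefix your query with the two options used throughout the examples in this section:

```sql
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;
```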
Key Features
The examples below are based on the COLOCATED_JOIN Quickstart.
Automatic Colocated Joins and Shuffle Simplification
Consider the query below which consists of 3 Joins. With the new query optimizer, the entire query can run without any cross-server data exchange, since the data is partitioned by userUUID into a compatible number of partitions (see the "Setting Up Table Data Distribution" section below).
The query plan for this query is shown below. You can see that the entire query leverages IDENTITY_EXCHANGE, which is a 1:1 Exchange as defined in Exchange Types below.
Shuffle Simplification with Different Servers / Partition Count
The new optimizer can simplify shuffles even if:
The Servers used by either side of a Join are different
The Partition Count for the join inputs are different
In the example below, we have a Join performed across two tables: orange (left) and green (right).
The orange table has 4 partitions and the green table has 2 partitions. The servers selected for the Orange and Green tables are [S0, S1] and [S0, S2] respectively. The Join is performed in the servers [S0, S1], because Physical Optimizer by default uses the same Workers as the leftmost input operator.
If the hash function used for partitioning the two tables is the same, we can leverage an Identity Exchange and skip re-partitioning the data on either side of the join. This is because S0 will hold records from partitions 0 and 2 of the orange table, which together contain exactly the records that make up partition 0 of the green table under modulo 2.
Note that Identity Exchange does not imply that the servers in the sender and receiver will be the same. It only implies that there will be a 1:1 mapping from senders to receivers. In the example below, the data transfer from S2 to S1 will be over the network.
Automatically Skip Aggregate Exchange
To evaluate something like GROUP BY userUUID accurately you would need to distribute records based on the userUUID column. The old query optimizer would add a Partitioning Exchange under each Aggregate, unless one used the query hint is_partitioned_by_group_by_keys.
The Physical Optimizer can detect when data is already partitioned by the required column, and will automatically skip adding an Exchange. This has two advantages:
We avoid unnecessary Data Exchanges
We avoid splitting the Aggregate, since by default when an Aggregate exists on top of an Exchange, a copy of the Aggregate is added under the Exchange (unless is_skip_leaf_stage_group_by query hint is set)
This optimization can be seen in action in the query example shared above. Since data is already partitioned by userUUID, all aggregations are run in DIRECT mode, i.e. without splitting the aggregate into multiple aggregates.
Segment / Server Pruning
Similar to the Single Stage Engine, if you have enabled segmentPrunerTypes in your table's Routing config, the Physical Optimizer will prune segments and servers using time, partition or other pruner types for the Leaf Stage. e.g. the following query will only select segments which satisfy the following constraint:
If partitioning is done in a way that segments corresponding to a given partition are present on only 1 server, then the entire query above will run within a single server, simulating shard-local execution from other systems.
Solve Constant Queries in Pinot Broker
Apache Calcite is capable of detecting Filter Expressions that will always evaluate to False. In such cases, the query plan may not have any Table Scans at all. Physical Optimizer solves such queries within the Broker itself, without involving any servers.
Worker Assignment
At present, Worker Assignment follows these simple rules:
Leaf Stage will have workers assigned based on Table Scan and Filters, using the Routing configs set in the Table Config.
Other Stages will use the same workers as the left-most input stage.
Some Plan Nodes, such as Sort(fetch=..), may require data to be collected in a single Worker. In such a case, that stage will be run on a single Worker, which will be randomly selected from one of the input workers.
Limitations
Some features of the existing MSE query optimizer are not yet available in the Physical Optimizer. We aim to add support for most of these in Pinot 1.5:
Spools
Dynamic filters for semi-joins
First Stream Ingest
Set up real-time streaming ingestion from Kafka and watch data arrive in Pinot.
By the end of this page you will have a realtime Pinot table consuming data from a Kafka topic, with 12 rows visible in the query console.
Prerequisites
Completed -- the transcript schema must already exist in the cluster.
A running Pinot cluster. See the install guides for or .
For Docker users: set the PINOT_VERSION environment variable. See the page.
Steps
1. Understand streaming ingestion
Streaming ingestion lets Pinot consume data from a message queue in real time. As messages arrive in a Kafka topic, Pinot reads them and makes the rows queryable within seconds. The realtime table config specifies the Kafka broker, topic, and decoder so that Pinot knows how to connect and interpret incoming records.
2. Start Kafka
Start Kafka on port 9876 using the same ZooKeeper from the Pinot quick-start:
Kafka 4.0 runs in KRaft mode and does not require ZooKeeper:
3. Create a Kafka topic
Download if you have not already, then create the topic:
4. Save the realtime table config
Create the file /tmp/pinot-quick-start/transcript-table-realtime.json:
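A minimal realtime table config might look like the following sketch. The time column (timestampInEpoch), topic name (transcript-topic), and broker address are assumptions based on the quickstart conventions; adjust them to match your setup:

```json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "schemaName": "transcript",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.broker.list": "localhost:9876",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
    }
  },
  "tenants": {},
  "metadata": {}
}
```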
The Docker version uses kafka:9092 as the broker address because both the Kafka and Pinot containers are on the same pinot-demo Docker network.
5. Upload the realtime table config
As soon as the realtime table is created, Pinot begins consuming from the Kafka topic.
If the transcript schema was already uploaded during , you can omit the -schemaFile flag. Including it is safe -- Pinot will skip re-creating an identical schema.
6. Save the sample streaming data
Create the file /tmp/pinot-quick-start/rawdata/transcript.json:
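The streaming data is newline-delimited JSON, one record per line. A few illustrative rows are shown below; the field names and values are assumptions based on the transcript schema from the batch import guide:

```json
{"studentID": 200, "firstName": "Lucy", "lastName": "Smith", "gender": "Female", "subject": "Maths", "score": 3.8, "timestampInEpoch": 1570863600000}
{"studentID": 200, "firstName": "Lucy", "lastName": "Smith", "gender": "Female", "subject": "English", "score": 3.5, "timestampInEpoch": 1571036400000}
{"studentID": 201, "firstName": "Bob", "lastName": "King", "gender": "Male", "subject": "Maths", "score": 3.2, "timestampInEpoch": 1571900400000}
```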
7. Push data into the Kafka topic
Verify
Open the in your browser.
Run the following query:
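A simple scan is enough to verify ingestion (assuming the transcript table created above):

```sql
SELECT * FROM transcript
```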
You should see 12 rows of streaming data. Pinot ingests from Kafka in real time, so the rows appear within seconds of being pushed to the topic.
Next step
Continue to to learn how to write analytical queries against your Pinot tables.
Docker
Start a Pinot cluster using Docker containers.
Outcome
Start a multi-component Pinot cluster using Docker, suitable for local evaluation and CI environments.
Complex Type (Array, Map) Handling
Complex type handling in Apache Pinot.
Commonly, ingested data has a complex structure. For example, Avro schemas have and while JSON supports and .
Apache Pinot's data model supports primitive data types (including int, long, float, double, BigDecimal, string, bytes), and limited multi-value types, such as an array of primitive types. Simple data types allow Pinot to build fast indexing structures for good query performance, but does require some handling of the complex structures.
There are two options for complex type handling:
Convert the complex-type data into a JSON string and then build a JSON index.
Amazon S3
This guide shows you how to import data from files stored in Amazon S3.
Enable the file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:
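For example, on the controller the configuration might look like the sketch below; the region value is an assumption, so use the region where your bucket lives:

```properties
# Register the S3 filesystem implementation for the s3:// scheme
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
# Allow the controller to fetch segments over s3 in addition to file/http
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```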
S3A URI scheme support
Starting in Pinot 1.3.0, the pinot-s3 plugin supports both the s3:// and s3a:// URI schemes.
Spark
Batch ingestion of data into Apache Pinot using Apache Spark.
Pinot supports Apache Spark 3.x as a processor to create and push segment files to the database. Pinot distribution is bundled with the Spark code to process your files and convert and upload them to Pinot.
To set up Spark, do one of the following:
Use the Spark-Pinot Connector. For more information, see the .
Querying Pinot
A practical entry point for querying Pinot.
Pinot queries run through the broker and are written in SQL. This page is the wayfinding layer for people who want to query data, understand which engine to use, and know where to look when a query needs tuning.
You can follow the local install guide to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
If you build Pinot from source, consider opting into the build-shaded-jar Maven profile with -Pbuild-shaded-jar. While Pinot does not bundle Spark into its jar, it does bundle certain Hadoop libraries.
Next, you need to change the execution config in the job spec to the following:
To run Spark ingestion, you need the following jars in your classpath
pinot-batch-ingestion-spark plugin jar - available in plugins-external directory in the package
pinot-all jar - available in lib directory in the package
These jars can be specified using spark.driver.extraClassPath or any other option.
For loading any other plugins that you want to use, use:
The complete spark-submit command should look like this:
Ensure environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
Note: You should change the master to yarn and deploy-mode to cluster for production environments.
The spark-core dependency is not included in Pinot jars since the 0.10.0 release. If you run into runtime issues, make sure your Spark environment provides the dependency, or build from source with the matching Spark profile.
Running in Cluster Mode on YARN
If you want to run the Spark job in cluster mode on a YARN/EMR cluster, do the following:
Build Pinot from source with option -DuseProvidedHadoop
Copy Pinot binaries to S3, HDFS or any other distributed storage that is accessible from all nodes.
Copy Ingestion spec YAML file to S3, HDFS or any other distributed storage. Mention this path as part of --files argument in the command
Add --jars options that contain the s3/hdfs paths to all the required plugin and pinot-all jar
Point classPath to spark working directory. Generally, just specifying the jar names without any paths works. Same should be done for main jar as well as the spec YAML file
Example
FAQ
Q - I am getting the following exception - Class has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
Since the 0.8.0 release, Pinot binaries are compiled with JDK 11. If you are using Spark along with Hadoop 2.7+, you need the Java 8 version of Pinot. Currently, you need to build the JDK 8 version from source.
Q - I am not able to find pinot-batch-ingestion-spark jar.
Since Pinot 0.10.0, the Spark plugin is located in the plugins-external directory of the binary distribution (in older versions it was in plugins).
Q - Spark is not able to find the jars, leading to java.nio.file.NoSuchFileException
This means the classpath for the Spark job has not been configured properly. If you are running Spark in a distributed environment such as YARN or k8s, make sure both spark.driver.classpath and spark.executor.classpath are set. The jars in driver.classpath should also be added to the --jars argument in spark-submit so that Spark can distribute those jars to all the nodes in your cluster. You also need to provide the appropriate scheme with the file path when running the jar. In this doc we have used local://, but it can be different depending on your cluster setup.
Q - Spark job failing while pushing the segments.
It can be because of misconfigured controllerURI in job spec yaml file. If the controllerURI is correct, make sure it is accessible from all the nodes of your YARN or k8s cluster.
If already set to APPEND, this is likely due to a missing timeColumnName in your table config. If you can't provide a time column, use our segment name generation configs in ingestion spec. Generally using inputFile segment name generator should fix your issue.
Q - I am getting java.lang.RuntimeException: java.io.IOException: Failed to create directory: pinot-plugins-dir-0/plugins/*
Removing -Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins from spark.driver.extraJavaOptions should fix this. As long as plugins are mentioned in classpath and jars argument it should not be an issue.
Q - Getting Class not found: exception
Check if extraClassPath arguments contain all the plugin jars for both driver and executors. Also, all the plugin jars are mentioned in the --jars argument. If both of these are correct, check if the extraClassPath contains local filesystem classpaths and not s3 or hdfs or any other distributed file system classpaths.
SELECT ID_SET(yearID)
FROM baseballStats
WHERE teamID = 'WS1'
SELECT ID_SET(playerName, 'expectedInsertions=10')
FROM baseballStats
WHERE teamID = 'WS1'
SELECT ID_SET(playerName, 'expectedInsertions=100')
FROM baseballStats
WHERE teamID = 'WS1'
SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
FROM baseballStats
WHERE teamID = 'WS1'
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_ID_SET(
yearID,
'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
) = 1
GROUP BY yearID
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_ID_SET(
yearID,
'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
) = 0
GROUP BY yearID
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 1
GROUP BY yearID
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 0
GROUP BY yearID
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 1
GROUP BY yearID
SELECT yearID, count(*)
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
yearID,
'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
) = 0
GROUP BY yearID
SET useMultistageEngine=true;
SET usePhysicalOptimizer=true;
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;
WITH filtered_users AS (
SELECT
userUUID
FROM userAttributes
WHERE userUUID NOT IN (
SELECT
userUUID
FROM userGroups
WHERE groupUUID = 'group-1'
)
AND userUUID IN (
SELECT
userUUID
FROM userGroups
WHERE groupUUID = 'group-2'
)
)
SELECT
userUUID,
SUM(tripAmount)
FROM userFactEvents
WHERE
userUUID IN (
SELECT userUUID FROM filtered_users
)
GROUP BY userUUID
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;
WITH user_events AS (
SELECT
productCode, tripAmount
FROM
userFactEvents
WHERE
userUUID = 'user-1'
ORDER BY
ts
DESC
LIMIT 100
)
SELECT
productCode,
SUM(tripAmount)
FROM
user_events
GROUP BY productCode
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;
SELECT
COUNT(*)
FROM
userFactEvents
WHERE
userUUID = 'user-1' AND userUUID = 'user-2'
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
# name: execution framework name
name: 'spark'
# segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
#segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
# extraConfigs: extra configs for execution framework.
extraConfigs:
# stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
stagingDir: your/local/dir/staging
Create a file called docker-compose.yml with the following content:
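A minimal sketch of such a compose file is shown below. The image tags and port mappings are assumptions; pin apachepinot/pinot to the Pinot version you intend to run:

```yaml
services:
  zookeeper:
    image: zookeeper:3.9
    ports:
      - "2181:2181"
  pinot-controller:
    image: apachepinot/pinot:latest
    command: "StartController -zkAddress zookeeper:2181"
    ports:
      - "9000:9000"
    depends_on:
      - zookeeper
  pinot-broker:
    image: apachepinot/pinot:latest
    command: "StartBroker -zkAddress zookeeper:2181"
    ports:
      - "8099:8099"
    depends_on:
      - pinot-controller
  pinot-server:
    image: apachepinot/pinot:latest
    command: "StartServer -zkAddress zookeeper:2181"
    depends_on:
      - pinot-broker
```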
Launch the cluster:
To also start Kafka for real-time streaming:
Create a network
Start ZooKeeper
Start Pinot Controller
Start Pinot Broker
Start Pinot Server
Start Pinot Minion (optional)
Start Kafka (optional)
Kafka 4.0 runs in KRaft mode and does not require ZooKeeper:
Verify
Check that all containers are running:
You should see containers for ZooKeeper, Controller, Broker, Server, and Minion all in a healthy state. Open the Pinot Query Console at http://localhost:9000 to confirm the cluster is ready.
Use the built-in complex-type handling rules in the ingestion configuration.
On this page, we'll show how to handle these complex-type structures with each of these two approaches. We will process some example data, consisting of the field group from the Meetup events Quickstart example.
This object has two child fields and the child group is a nested array with elements of object type.
Example JSON data
JSON indexing
Apache Pinot provides a powerful JSON index to accelerate the value lookup and filtering for the column. To convert an object group with complex type to JSON, add the following to your table configuration.
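A sketch of the relevant table-config fragments is shown below; jsonFormat is the built-in transform function that serializes a complex field to a JSON string, and group_json is the destination column name used in this example:

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "group_json",
        "transformFunction": "jsonFormat(\"group\")"
      }
    ]
  },
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "group_json"
    ]
  }
}
```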
The config transformConfigs transforms the object group to a JSON string group_json, which then creates the JSON indexing with configuration jsonIndexColumns. To read the full spec, see meetupRsvpJson_realtime_table_config.json.
Also, note that group is a reserved keyword in SQL and therefore needs to be quoted in transformFunction.
The columnName can't use the same name as any of the fields in the source JSON data. For example, if our source data contains the field group and we want to transform the data in that field before persisting it, the destination column name must be something different, like group_json.
Note that you do not need to worry about the maxLength of the field group_json on the schema, because "JSON" data type does not have a maxLength and will not be truncated. This is true even though "JSON" is stored as a string internally.
With this, you can start to query the nested fields under group. For more details about the supported JSON functions, see the guide.
Ingestion configurations
Though JSON indexing is a handy way to process the complex types, there are some limitations:
It’s not performant to group by or order by a JSON field, because JSON_EXTRACT_SCALAR is needed to extract the values in the GROUP BY and ORDER BY clauses, which requires function evaluation at query time.
Alternatively, from Pinot 0.8, you can use the complex-type handling in ingestion configurations to flatten and unnest the complex structure and convert them into primitive types. Then you can reduce the complex-type data into a flattened Pinot table, and query it via SQL. With the built-in processing rules, you do not need to write ETL jobs in another compute framework such as Flink or Spark.
To process this complex type, you can add the configuration complexTypeConfig to the ingestionConfig. For example:
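A sketch of such a configuration, unnesting the group_topics array from the Meetup example (the exact field paths depend on your data):

```json
{
  "ingestionConfig": {
    "complexTypeConfig": {
      "fieldsToUnnest": [
        "group.group_topics"
      ],
      "delimiter": ".",
      "collectionNotUnnestedToJson": "NON_PRIMITIVE"
    }
  }
}
```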
With complexTypeConfig, all the map objects will be flattened to direct fields automatically. And with fieldsToUnnest, a record with a nested collection will unnest into multiple records. For instance, the example at the beginning will transform into two rows with this configuration example.
Flattened/unnested data
Note that:
The nested field group_id under group is flattened to group.group_id. The default delimiter is . (period); you can choose another delimiter by specifying the delimiter configuration under complexTypeConfig. This flattening rule also applies to maps in the collections to be unnested.
The nested array group_topics under group is unnested into the top-level, and converts the output to a collection of two rows. Note the handling of the nested field within group_topics, and the eventual top-level field of group.group_topics.urlkey. All the collections to unnest shall be included in the configuration fieldsToUnnest.
Collections not specified in fieldsToUnnest will be serialized into a JSON string, except for arrays of primitive values, which are ingested as a multi-value column by default. The behavior is defined by the collectionNotUnnestedToJson config, which takes the following values:
NON_PRIMITIVE - Converts the array to a multi-value column. (default)
You can find the full specifications of the table config here and the table schema here.
You can then query the table with primitive values using the following SQL query:
. is a reserved character in SQL, so you need to quote the flattened columns in the query.
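For example, a query over the flattened Meetup columns might look like the sketch below; the table and column names follow the example above and are assumptions:

```sql
SELECT "group.group_id", "group.group_topics.urlkey"
FROM meetupRsvp
LIMIT 10
```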
Infer the Pinot schema from the Avro schema and JSON data
When there are complex structures, it can be challenging and tedious to figure out the Pinot schema manually. To help with schema inference, Pinot provides utility tools to take the Avro schema or JSON data as input and output the inferred Pinot schema.
To infer the Pinot schema from Avro schema, you can use a command like this:
Note you can input configurations like fieldsToUnnest similar to the ones in complexTypeConfig. And this will simulate the complex-type handling rules on the Avro schema and output the Pinot schema in the file specified in outputDir.
Similarly, you can use the command like the following to infer the Pinot schema from a file of JSON objects.
You can check out an example of this run in this PR.
Both schemes use the same underlying AWS SDK v2 client and identical configuration; the only difference is the URI prefix. This allows Pinot to integrate with Hadoop-based ecosystems and tools that standardize on the s3a:// scheme.
To use the s3a:// scheme, specify it in your deep store paths and file system configuration:
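For example, the configuration below is a sketch that assumes the filesystem factory is registered under the s3a scheme key, mirroring the s3 scheme layout; the bucket name is a placeholder:

```properties
# Register the same S3 filesystem implementation for the s3a:// scheme
pinot.controller.storage.factory.class.s3a=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3a.region=us-west-2
# Deep store path using the s3a scheme
controller.data.dir=s3a://your-bucket/path/to/segments
```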
All configuration properties documented below work identically for both the s3 and s3a schemes.
By default Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-3.0.
You can configure the S3 file system using the following options:

| Configuration | Description |
| --- | --- |
| region | The AWS data center region in which the bucket is located |
| accessKey | (Optional) AWS access key for authentication. Use only for testing, since the key is stored in plain text in the configuration |
| secretKey | (Optional) AWS secret key for authentication. Use only for testing, since the key is stored in plain text in the configuration |

Each of these properties should be prefixed by pinot.[node].storage.factory.s3. where node is either controller or server depending on the component being configured, e.g. pinot.controller.storage.factory.s3.region.
S3 Filesystem supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order -
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
Instance profile credentials delivered through the Amazon EC2 metadata service
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.
Checksum validation
Checksum configuration is available starting in Pinot 1.4.
Starting with AWS SDK 2.30.0, the S3 client enables request and response checksum validation by default. Pinot exposes configuration properties to control this behavior.
Request and response checksums
By default, Pinot sets both requestChecksumCalculation and responseChecksumValidation to WHEN_SUPPORTED, which means the S3 client calculates checksums on uploads and validates them on downloads whenever the API supports it. This provides data integrity verification for segment files stored in your deep store.
If you want to disable automatic checksums and only use them when the S3 API strictly requires it, set both properties to WHEN_REQUIRED:
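For example, on the controller (mirror the same properties on any server that accesses the deep store):

```properties
pinot.controller.storage.factory.s3.requestChecksumCalculation=WHEN_REQUIRED
pinot.controller.storage.factory.s3.responseChecksumValidation=WHEN_REQUIRED
```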
| Value | Behavior |
| --- | --- |
| WHEN_SUPPORTED | Calculate/validate checksums whenever the API supports it (default) |
| WHEN_REQUIRED | Only calculate/validate checksums when the API requires it |
LegacyMd5Plugin for S3-compatible stores
Some S3-compatible object stores (e.g. MinIO, Ceph, or older AWS configurations) require the legacy Content-MD5 header on requests. After the AWS SDK 2.30.0 upgrade, these stores may return errors like:
To restore the pre-2.30.0 MD5 checksum behavior, enable the useLegacyMd5Plugin option:
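For example, on the controller:

```properties
pinot.controller.storage.factory.s3.useLegacyMd5Plugin=true
```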
This adds the LegacyMd5Plugin to the S3 client, which sends the Content-MD5 header that these stores expect.
Only enable useLegacyMd5Plugin if your S3-compatible store requires the legacy MD5 header. For standard AWS S3, the default checksum behavior is recommended.
Decide whether the single-stage engine is enough or whether you need multi-stage features such as joins and subqueries.
Use query options to control runtime behavior.
Inspect the plan or result shape when you need to debug performance.
What matters most
Pinot SQL uses the Apache Calcite parser with the MYSQL_ANSI dialect. In practice, that means you should pay attention to identifier quoting, literal quoting, and engine-specific capabilities.
If you are debugging a slow or surprising query, the most useful follow-up pages are:
Single-stage execution is the default path for straightforward filtering, aggregation, and top-K style queries.
Use multi-stage execution when you need features that are not available in single-stage mode, such as:
joins
subqueries
window functions
more complex distributed query shapes
As a rule of thumb: use SSE for simple filtering, aggregation, and top-K queries; use MSE when your query shape requires joins, subqueries, window functions, or other advanced relational operators. For a detailed comparison, see SSE vs MSE.
Double quotes (") are used to quote identifiers, e.g. column names.
Single quotes (') are used to enclose string literals. If the string literal itself contains a single quote, escape it with another single quote, e.g. '''Pinot''' to match the string literal 'Pinot'.
Misusing the two can cause unexpected query results, as in the following examples:
WHERE a='b' means the predicate on column a equals the string literal value 'b'
WHERE a="b" means the predicate on column a equals the value of column b
If your column names use reserved keywords (e.g. timestamp or date) or special characters, you will need to use double quotes when referring to them in queries.
Note: Define decimal literals within quotes to preserve precision.
Pinot supports queries on BYTES column using hex strings. The query response also uses hex strings to represent bytes values.
The query below fetches all the rows for a given UID:
Stream Ingestion with CLP
Support for encoding fields with CLP during ingestion.
This is an experimental feature. Configuration options and usage may change frequently until it is stabilized.
When performing stream ingestion of JSON records using Kafka, users can encode specific fields with CLP by using a CLP-specific StreamMessageDecoder.
CLP is a compressor designed to encode unstructured log messages in a way that makes them more compressible while retaining the ability to search them. It does this by decomposing the message into three fields:
the message's static text, called a log type;
repetitive variable values, called dictionary variables; and
non-repetitive variable values (called encoded variables since we encode them specially if possible).
Searches are similarly decomposed into queries on the individual fields.
Although CLP is designed for log messages, other unstructured text like file paths may also benefit from its encoding.
For example, consider this JSON record:
If the user specifies the fields message and logPath should be encoded with CLP, then the StreamMessageDecoder will output:
In the fields with the _logtype suffix, \x11 is a placeholder for an integer variable, \x12 is a placeholder for a dictionary variable, and \x13 is a placeholder for a float variable. In message_encodedVars, the float variable 0.335 is encoded as an integer using CLP's custom encoding.
All remaining fields are processed in the same way as they are in org.apache.pinot.plugin.inputformat.json.JSONRecordExtractor. Specifically, fields in the table's schema are extracted from each record and any remaining fields are dropped.
Configuration
Table Index
Assuming the user wants to encode message and logPath as in the example, they should change/add the following settings to their tableIndexConfig (we omit irrelevant settings for brevity):
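A minimal sketch of such a streamConfigs fragment (the decoder class name and stream settings are illustrative; verify them against your Pinot version):

```json
{
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder",
      "stream.kafka.decoder.prop.fieldsForClpEncoding": "message,logPath"
    }
  }
}
```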
stream.kafka.decoder.prop.fieldsForClpEncoding is a comma-separated list of names for fields that should be encoded with CLP.
We use for the logtype and dictionary variables since their length can vary significantly.
Schema
For the table's schema, users should configure the CLP-encoded fields as follows (we omit irrelevant settings for brevity):
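A hypothetical sketch for the message field (the logPath field follows the same pattern); the maxLength value is illustrative:

```json
{
  "dimensionFieldSpecs": [
    { "name": "message_logtype", "dataType": "STRING", "maxLength": 32766 },
    { "name": "message_dictionaryVars", "dataType": "STRING", "singleValueField": false, "maxLength": 32766 },
    { "name": "message_encodedVars", "dataType": "LONG", "singleValueField": false }
  ]
}
```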
We use the maximum possible length for the logtype and dictionary variable columns.
The dictionary and encoded variable columns are multi-valued columns.
Searching and decoding CLP-encoded fields
To decode CLP-encoded fields, use the CLPDECODE function.
To search CLP-encoded fields, you can combine CLPDECODE with LIKE. Note, this may decrease performance when querying a large number of rows.
We are working to integrate efficient searches on CLP-encoded columns as another UDF; development of this feature is tracked in the Apache Pinot project.
CLP Forward Index V2
Starting in Pinot 1.3.0, the CLP forward index was upgraded to V2 (CLPMutableForwardIndexV2), which is now the default for CLP-encoded columns during real-time ingestion. Key improvements include:
Dynamic encoding with cardinality monitoring
V2 monitors dictionary cardinality during ingestion and dynamically switches encoding modes:
CLP dictionary encoding: Used when log type and dictionary variable cardinality remains below a configurable threshold relative to the document count.
Raw string fallback: When cardinality exceeds the threshold (docs/cardinality ratio drops below 10), V2 automatically falls back to a raw string forward index to avoid the memory and I/O overhead of maintaining a large dictionary.
Improved compression
V2 uses fixed-byte encoding with Zstandard chunk compression instead of V1's uncompressed fixed-bit encoding. This significantly improves compression ratios for most real-world log data.
Compression codec options
You can select the compression codec for CLP-encoded columns using the compressionCodec in fieldConfig:
Codec
Description
Example field config:
Immutable CLP Forward Index
When mutable (real-time) segments are converted to immutable segments, V2 directly copies the mutable dictionary and index data without re-encoding, eliminating the serialization/deserialization overhead present in V1. The resulting immutable forward index is memory-mapped for efficient random access during queries.
Dimension Table
Batch ingestion of data into Apache Pinot using dimension tables.
Dimension tables are a special kind of offline table designed for join-like enrichment of fact data at query time. They are used together with the LOOKUP function (single-stage engine) or lookup joins (multi-stage engine) to decorate query results with reference data.
When to use dimension tables
Use a dimension table when you need to enrich a large fact table with attributes from a small, relatively static reference dataset at query time. Common examples include:
Ingest Records with Dynamic Schemas
Storing records with dynamic schemas in a table with a fixed schema.
Some domains (e.g., logging) generate records where each record can have a different set of keys, whereas Pinot tables have a relatively static schema. For records with varying keys, it's impractical to store each field in its own table column. However, most (if not all) fields may be important, so fields should not be dropped unnecessarily.
Additionally, search patterns on such a table can be complex and change frequently. Exact matches, range queries, prefix/suffix matches, wildcard searches, and aggregation functions may be applied to any existing or newly created keys or values.
SchemaConformingTransformer
Flink
Batch ingestion of data into Apache Pinot using Apache Flink.
Apache Pinot supports using Apache Flink as a processing framework to generate and upload segments. The Pinot distribution includes a PinotSinkFunction that can be integrated into Flink applications (streaming or batch) to directly write data as segments into Pinot tables.
The PinotSinkFunction supports offline tables, realtime tables, and upsert tables (full upsert only). Data is buffered in memory and flushed as segments when the configured threshold is reached, then uploaded to the Pinot cluster.
Maven Dependency
select playerName, max(hits)
from baseballStats
group by playerName
order by max(hits) desc
select sum(hits), sum(homeRuns), sum(numberOfGames)
from baseballStats
where yearID > 2010
SET useMultistageEngine = true;
SELECT city, COUNT(*)
FROM stores
GROUP BY city
LIMIT 10;
-- defaults to LIMIT 10
SELECT *
FROM myTable
SELECT *
FROM myTable
LIMIT 100
SELECT "date", "timestamp"
FROM myTable
SELECT COUNT(*), MAX(foo), SUM(bar)
FROM myTable
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz
FROM myTable
GROUP BY bar, baz
LIMIT 50
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz
FROM myTable
GROUP BY bar, baz
ORDER BY bar, MAX(foo) DESC
LIMIT 50
SELECT COUNT(*)
FROM myTable
WHERE foo = 'foo'
AND bar BETWEEN 1 AND 20
OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
SELECT COUNT(*)
FROM myTable
WHERE foo IS NOT NULL
AND foo = 'foo'
AND bar BETWEEN 1 AND 20
OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
SELECT *
FROM myTable
WHERE quux < 5
LIMIT 50
SELECT foo, bar
FROM myTable
WHERE baz > 20
ORDER BY bar DESC
LIMIT 100
SELECT foo, bar
FROM myTable
WHERE baz > 20
ORDER BY bar DESC
LIMIT 50, 100
SELECT COUNT(*)
FROM myTable
WHERE REGEXP_LIKE(airlineName, '^U.*')
GROUP BY airlineName LIMIT 10
SELECT
CASE
WHEN price > 30 THEN 3
WHEN price > 20 THEN 2
WHEN price > 10 THEN 1
ELSE 0
END AS price_category
FROM myTable
SELECT
SUM(
CASE
WHEN price > 30 THEN 30
WHEN price > 20 THEN 20
WHEN price > 10 THEN 10
ELSE 0
END) AS total_cost
FROM myTable
SELECT COUNT(*)
FROM myTable
GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')
SELECT *
FROM myTable
WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'
endpoint -- (Optional) Override endpoint for the S3 client.
disableAcl -- If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.
serverSideEncryption -- (Optional) The server-side encryption algorithm used when storing objects in Amazon S3 (currently supports aws:kms); set to null to disable SSE.
ssekmsKeyId -- (Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.
ssekmsEncryptionContext -- (Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.
requestChecksumCalculation -- (Optional) Controls whether checksums are calculated for request payloads. Default: WHEN_SUPPORTED. Options: WHEN_SUPPORTED, WHEN_REQUIRED.
responseChecksumValidation -- (Optional) Controls whether checksums are validated on response payloads. Default: WHEN_SUPPORTED. Options: WHEN_SUPPORTED, WHEN_REQUIRED.
useLegacyMd5Plugin -- (Optional) When set to true, uses the LegacyMd5Plugin to restore pre-2.30.0 MD5 checksum behavior. Default: false.
enableCrossRegionAccess -- (Optional) Enables copying objects between two buckets that are in different regions. Defaults to true if not configured.
ALL -- Converts the array of primitive values to a JSON string.
NONE -- Does not do any conversion.
Looking up a human-readable team name from a team ID.
Enriching clickstream events with product catalog attributes.
Decorating transaction records with customer or store metadata.
If any of the following apply, a regular offline or real-time table is a better fit:
The reference data is large (hundreds of millions of rows or multiple gigabytes).
The data changes frequently and requires real-time ingestion.
You need time-based partitioning, retention policies, or a hybrid table setup.
You need to query the reference data with complex aggregations independently.
How dimension tables work
When a table is marked as a dimension table, Pinot replicates all of its segments to every server in the tenant. On each server the data is loaded into an in-memory hash map keyed by the table's primary key, which enables constant-time lookups during query execution.
Because the data is fully replicated and held in memory, dimension tables must be small enough to fit comfortably in each server's heap. They are not intended for large datasets.
Memory loading modes
Pinot supports two loading modes controlled by the disablePreload setting in dimensionTableConfig:
Fast lookup (default) -- disablePreload: false. Higher memory usage, faster lookups. All rows are fully materialized into an in-memory hash map (Object[] -> Object[]); every column value is stored in the map for constant-time retrieval.
Choose the memory-optimized mode when the dimension table is relatively large and you want to reduce heap pressure, at the cost of slightly slower lookups.
Size limits and memory considerations
Cluster-level maximum size: The controller configuration property controller.dimTable.maxSize sets the maximum storage quota allowed for any single dimension table. The default is 200 MB. Table creation fails if the requested quota.storage exceeds this limit.
Heap impact: In fast-lookup mode, the entire table is materialized in Java heap on every server. A table that is 100 MB on disk may consume significantly more memory after deserialization. Monitor server heap usage when adding or growing dimension tables.
Replication overhead: Because every server in the tenant holds a full copy, adding a dimension table multiplies its memory footprint by the number of servers.
As a guideline, keep dimension tables under a few hundred thousand rows and well under the controller.dimTable.maxSize limit. Tables that approach or exceed available heap will cause out-of-memory errors on servers.
Configuration
Table configuration
Mark a table as a dimension table by setting the following properties in the table config:
isDimTable -- Required. Set to true to designate the table as a dimension table.
segmentIngestionType -- Required. Must be set to REFRESH. Dimension tables use segment replacement rather than append semantics so that the in-memory hash map is rebuilt with the latest data.
Schema configuration
Dimension table schemas use dimensionFieldSpecs instead of metricFieldSpecs. A primaryKeyColumns array is required -- it defines the key used for lookups.
Example table configuration
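A hedged sketch (table and schema names are hypothetical; the segmentIngestionType property follows the terminology used on this page -- check your Pinot version for the exact key):

```json
{
  "tableName": "dimTeams",
  "tableType": "OFFLINE",
  "isDimTable": true,
  "segmentsConfig": {
    "schemaName": "dimTeams",
    "segmentIngestionType": "REFRESH",
    "replication": "1"
  },
  "quota": {
    "storage": "100M"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "metadata": {}
}
```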
Example schema configuration
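A matching hypothetical schema; note the use of dimensionFieldSpecs and the required primaryKeyColumns array:

```json
{
  "schemaName": "dimTeams",
  "dimensionFieldSpecs": [
    { "name": "teamId", "dataType": "STRING" },
    { "name": "teamName", "dataType": "STRING" }
  ],
  "primaryKeyColumns": ["teamId"]
}
```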
Querying with the LOOKUP function
The primary way to use a dimension table is through the LOOKUP UDF in the single-stage query engine. This function performs a primary-key lookup against the dimension table and returns a column value.
Syntax
dimTable -- name of the dimension table (string literal).
dimColToLookUp -- column to retrieve from the dimension table (string literal).
dimJoinKey / factJoinKey -- pairs of join keys: the dimension table column name (string literal) and the corresponding fact table column expression.
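Putting the arguments above together, the general shape is:

```sql
LOOKUP('dimTableName', 'dimColToLookUp',
       'dimJoinKey1', factJoinKeyExpr1
       [, 'dimJoinKey2', factJoinKeyExpr2, ...])
```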
Single-key lookup
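A hypothetical example (table and column names are illustrative), decorating fact rows with a name from the dimension table:

```sql
SELECT teamId,
       LOOKUP('dimTeams', 'teamName', 'teamId', teamId) AS teamName
FROM factGames
LIMIT 10
```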
Composite-key lookup
When the dimension table has a composite primary key, provide multiple key pairs in the same order as primaryKeyColumns in the schema:
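An illustrative composite-key example, assuming the dimension table's primaryKeyColumns is ["region", "storeId"]:

```sql
SELECT region, storeId,
       LOOKUP('dimStores', 'storeName', 'region', region, 'storeId', storeId) AS storeName
FROM transactions
LIMIT 10
```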
Multi-stage engine
In the multi-stage engine (MSE), use a standard JOIN with the lookup join strategy hint instead of the LOOKUP UDF:
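A hedged sketch using the lookup join strategy hint (table and column names are hypothetical; verify the hint syntax against your Pinot version):

```sql
SET useMultistageEngine = true;
SELECT /*+ joinOptions(join_strategy = 'lookup') */
       f.storeId, d.storeName, SUM(f.amount) AS total
FROM transactions f
JOIN dimStores d
  ON f.storeId = d.storeId
GROUP BY f.storeId, d.storeName
```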
Because dimension tables use segmentIngestionType: REFRESH, uploading a new segment replaces the existing segment and triggers a full reload of the in-memory hash map on every server. There is no incremental update mechanism.
Typical refresh patterns:
Scheduled batch job: Run a periodic ingestion job (e.g., daily or hourly) that rebuilds the segment from the source of truth and uploads it to Pinot.
On-demand refresh: Trigger a segment upload through the Pinot REST API whenever the reference data changes.
During a refresh, the old hash map remains active for lookups until the new one is fully loaded. There is no query downtime during a refresh, but there is a brief period where the old data is served.
Handling duplicate primary keys
When multiple segments contain the same primary key, the default behavior is last-loaded-segment-wins (segments are ordered by creation time). Set errorOnDuplicatePrimaryKey: true in dimensionTableConfig to fail fast if duplicates are detected. With REFRESH ingestion, there is typically only one segment, so duplicates across segments are uncommon.
Performance best practices
Keep tables small. Dimension tables are loaded entirely into memory on every server. Target thousands to low hundreds of thousands of rows.
Use narrow schemas. Include only the columns needed for lookups to reduce memory consumption.
Choose the right loading mode. Use fast lookup (default) for the best query performance. Switch to memory-optimized mode (disablePreload: true) only if heap usage is a concern.
Set a storage quota. Always configure quota.storage to prevent accidentally uploading oversized data.
Minimize refresh frequency. Each refresh triggers a full reload of the hash map. Avoid refreshing more often than necessary.
Monitor server heap. After adding a dimension table, check server JVM heap metrics to confirm adequate headroom.
Limitations
Offline only. Dimension tables must be offline tables. They cannot be real-time or hybrid tables.
Full replication. All segments are replicated to every server in the tenant, so memory usage scales with the number of servers.
No incremental updates. The entire segment must be replaced on each refresh; row-level updates are not supported.
Primary key required. The schema must define primaryKeyColumns. Lookups without a primary key are not supported.
Single-stage LOOKUP UDF limitations. Dimension table column references in the LOOKUP function must be string literals, not column identifiers, because they reference a table that is not part of the query's FROM clause.
No time-based partitioning or retention. Dimension tables do not support segment retention policies or time-based partitioning.
The SchemaConformingTransformer is a RecordTransformer that can transform records with dynamic schemas so that they can be ingested into a table with a static schema. The transformer takes record fields that don't exist in the schema and stores them in a catchall field. It also builds a __mergedTextIndex field and leverages Lucene for text search.
For example, consider this record:
Let's say the table's schema contains the following fields:
arrayField
mapField
nestedFields
nestedFields.stringField
json_data
json_data_no_idx
__mergedTextIndex
Without this transformer, the stringField field and fields ending with _noIdx would be dropped, and storage of the mapField and nestedFields fields would have to rely on the global setup in complexTransformers without granular customization. With this transformer, however, the record is transformed into the following:
Notice that there are 3 reserved (and configurable) fields: json_data, json_data_no_idx, and __mergedTextIndex. The transformer does the following:
Flattens nested fields all the way to the leaf nodes and:
Conducts special treatment, if necessary, according to the config
If the key path matches the schema, puts the data into the dedicated field
Otherwise, puts it into json_data or json_data_no_idx depending on its key suffix
For keys in dedicated columns or json_data, puts them into __mergedTextIndex in the form "Begin Anchor + value + Separator + key + End Anchor" to power text matches.
Additional functionalities by configurations
Drop fields fieldPathsToDrop
Preserve the subtree without flattening fieldPathsToPreserveInput and fieldPathsToPreserveInputWithIndex
Table Configurations
SchemaConformingTransformer Configuration
To use the transformer, add the schemaConformingTransformerConfig option in the ingestionConfig section of your table configuration, as shown in the following example.
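A hedged sketch of the relevant ingestionConfig fragment; the option keys shown are illustrative and should be checked against your Pinot version, but the reserved field names match those described above:

```json
{
  "ingestionConfig": {
    "schemaConformingTransformerConfig": {
      "indexableExtrasField": "json_data",
      "unindexableExtrasField": "json_data_no_idx",
      "unindexableFieldSuffix": "_noIdx",
      "mergedTextIndexField": "__mergedTextIndex",
      "fieldPathsToDrop": [],
      "fieldPathsToPreserveInput": []
    }
  }
}
```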
Index configs for the 3 reserved columns can be set in the same way. Specifically, a customizable JSON index can be configured via the JSON index indexPaths option.
Power the text search
Schema Design
With the help of the SchemaConformingTransformer, all data can be kept even without specifying dedicated columns in the table schema. However, to optimize storage and various query patterns, dedicated columns should be created based on usage:
Fields with frequent exact match query, e.g. region, log_level, runtime_env
Fields with range query, e.g. timestamp
High frequency fields from messages
Reduce json index size
Optimize group by queries
Text Search
After putting each key/value pair into the __mergedTextIndex field, a luceneAnalyzerClass is needed to tokenize the document and a luceneQueryParserClass to query by tokens. Some common search patterns and their queries are:
Exact key/value match TEXT_MATCH(__mergedTextIndex, '"value:key"')
Wildcard value search in a key TEXT_MATCH(__mergedTextIndex, '/.* value .*:key/')
Global value exact match TEXT_MATCH(__mergedTextIndex, '/"value"/')
Global value wildcard match TEXT_MATCH(__mergedTextIndex, '/.* value .*/')
The luceneAnalyzerClass and luceneQueryParserClass usually need to use a similar delimiter set, and must also account for the anchor and separator characters shown below.
With given example, each key/value pair would be stored as "\u0002value\u001ekey\u0003". The prefix and suffix match on key or value need to be adjusted accordingly in the luceneQueryParserClass.
To use the Pinot Flink Connector in your Flink job, add the following dependency to your pom.xml:
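For example (the version shown is a placeholder, as noted below):

```xml
<dependency>
  <groupId>org.apache.pinot</groupId>
  <artifactId>pinot-flink-connector</artifactId>
  <version>1.5.0-SNAPSHOT</version>
</dependency>
```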
Replace 1.5.0-SNAPSHOT with the Pinot version you're using. For the latest stable version, check the Apache Pinot releases.
Note: The connector transitively includes dependencies for:
pinot-controller - For controller client APIs
pinot-segment-writer-file-based - For segment generation
flink-streaming-java and flink-java - Flink core dependencies
Offline Table Ingestion
Quick Start Example
Table Configuration
The PinotSinkFunction uses the TableConfig to determine batch ingestion settings for segment generation and upload. Here's an example table configuration:
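A hedged sketch showing where the two required settings live (table name, paths, and controller URL are placeholders):

```json
{
  "tableName": "myTable",
  "tableType": "OFFLINE",
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY",
      "batchConfigMaps": [
        {
          "outputDirURI": "file:///tmp/pinot/segments",
          "push.controllerUri": "http://localhost:9000"
        }
      ]
    }
  }
}
```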
Required configurations:
outputDirURI - Directory where segments are written before upload
push.controllerUri - Pinot controller URL for segment upload
For standard realtime tables without upsert, use the same approach as offline tables, but specify REALTIME as the table type:
Upsert Tables
Full Upsert Tables
Flink connector supports backfilling full upsert tables where each record contains all columns. The uploaded segments will correctly participate in upsert semantics based on the comparison column value.
Requirements:
Partitioning: Data must be partitioned using the same strategy as the upstream stream (e.g., Kafka)
Parallelism: Flink job parallelism must match the number of upstream stream/table partitions
Comparison Column: The values of the comparison column must have ordering consistent with the upstream stream. This ensures that Pinot can correctly resolve which record is the latest for a given key. See Pinot upsert comparison column docs for important considerations.
Example:
How Partitioning Works:
When uploading segments for upsert tables, Pinot uses a special segment naming convention UploadedRealtimeSegmentName that encodes the partition ID. The format is:
Example: flink__myTable__0__1724045187__1
Each Flink subtask generates segments for a specific partition based on its subtask index. The segments are then assigned to the same server instances that handle that partition for stream-consumed segments, ensuring correct upsert behavior across all segments.
Configuration Options:
You can customize segment generation using additional constructor parameters:
Partial Upsert Tables
WARNING: Flink-based upload is not recommended for partial upsert tables.
In partial upsert tables, uploaded segments contain only a subset of columns or an intermediate row for a primary key. If the uploaded row is not in its final state and subsequent updates arrive via the stream, the partial upsert merger may produce inconsistent results between replicas. This can lead to data inconsistency that is difficult to detect and resolve.
For partial upsert tables, prefer stream-based ingestion only or ensure uploaded data represents the final state for each primary key.
Advanced Configuration
Segment Flush Control
Control when segments are flushed and uploaded:
Segment Naming
Customize segment naming and upload time for better organization:
This guide shows you how to ingest a stream of records from an Apache Pulsar topic into a Pinot table.
Pinot supports consuming data from Apache Pulsar via the pinot-pulsar plugin. You need to enable this plugin so that Pulsar-specific libraries are present in the classpath.
Enable the Pulsar plugin with the following config at the time of Pinot setup: -Dplugins.include=pinot-pulsar
The pinot-pulsar plugin is included in the official binary distribution since Pinot 0.11.0. If you are running an older version, you can download the plugin separately and add it to the plugins directory.
Set up Pulsar table
Here is a sample Pulsar stream config. You can use the streamConfigs section from this sample and make changes for your corresponding table.
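A hedged sketch of such a streamConfigs section (topic, broker URL, and class names are illustrative; verify them against your Pinot version):

```json
{
  "streamConfigs": {
    "streamType": "pulsar",
    "stream.pulsar.topic.name": "my-topic",
    "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
    "stream.pulsar.consumer.type": "lowlevel",
    "stream.pulsar.consumer.prop.auto.offset.reset": "smallest",
    "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
    "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
  }
}
```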
Pulsar configuration options
You can change the following Pulsar-specific configurations for your tables.
Property
Description
Authentication
The Pinot-Pulsar connector supports authentication using security tokens. To generate a token, follow the instructions in the Pulsar documentation. Once generated, add the following property to streamConfigs to add an authentication token for each request:
OAuth2 Authentication
The Pinot-Pulsar connector supports authentication using OAuth2, for example, if connecting to a StreamNative Pulsar cluster. For more information, see the Pulsar OAuth2 documentation. Once configured, you can add the following properties to streamConfigs:
TLS support
The Pinot-Pulsar connector also supports TLS for encrypted connections. Follow the Pulsar TLS documentation to enable TLS on your Pulsar cluster, then enable TLS in the Pulsar connector by providing the location of the trust certificate file generated in the previous step.
Also, make sure to change the broker URL from pulsar://localhost:6650 to pulsar+ssl://localhost:6650 so that secure connections are used.
For other table and stream configurations, head over to the table configuration and ingestion reference pages.
Supported Pulsar versions
Pinot currently relies on Pulsar client version 4.0.x. Make sure the Pulsar broker is compatible with this client version.
Extract record headers as Pinot table columns
Pinot's Pulsar connector supports automatically extracting record headers and metadata into Pinot table columns. Pulsar supports a large amount of per-record metadata; refer to the Pulsar documentation for the meaning of the metadata fields.
The following table shows the mapping for record header/metadata to Pinot table column names:
Pulsar Message
Pinot table Column
Comments
Available By Default
In order to enable the metadata extraction in a Pulsar table, set the stream config metadata.populate to true. The fields eventTime, publishTime, brokerPublishTime, and key are populated by default. If you would like to extract additional fields from the Pulsar Message, populate the metadataFields config with a comma separated list of fields to populate. The fields are referenced by the field name in the Pulsar Message. For example, setting:
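For example, a streamConfigs fragment along these lines (matching the four fields listed below):

```json
"metadata.populate": "true",
"metadataFields": "messageId,messageBytes,eventTime,topicName"
```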
will make the __metadata$messageId, __metadata$messageBytes, __metadata$eventTime, and __metadata$topicName fields available for mapping to columns in the Pinot schema.
In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.
For example, if you want to add only the offset and key as dimension columns in your Pinot table, they can be listed in the schema as follows:
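A hypothetical sketch; note that Pulsar exposes a messageId rather than a Kafka-style offset, so the metadata column names here are illustrative:

```json
"dimensionFieldSpecs": [
  { "name": "__key", "dataType": "STRING" },
  { "name": "__metadata$messageId", "dataType": "STRING" }
]
```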
Once the schema is updated, these columns behave like any other Pinot column: you can apply ingestion transforms and/or define indexes on them.
Remember to follow the schema evolution guidelines when updating the schema of an existing table!
Ingestion Aggregations
Many data analytics use-cases only need aggregated data. For example, data used in charts can be aggregated down to one row per time bucket per dimension combination.
Doing this results in much less storage and better query performance. Configuring this for a table is done via the Aggregation Config in the table config.
Note that ingestion aggregation only works with real-time Pinot tables. Furthermore, aggregation is done at the segment level; cross-segment aggregation still requires query-time processing.
Aggregation Config
The aggregation config controls the aggregations that happen during real-time data ingestion. Offline aggregations must be handled separately.
Below is a description of the config, which is defined in the ingestion config of the table config.
Requirements
The following are required for ingestion aggregation to work:
Ingestion aggregation config is effective only for real-time tables. (There is no ingestion-time aggregation support for offline tables; aggregations must be handled by a separate offline job or pre-processed in the offline data flow using batch processing engines like Spark/MapReduce.)
type must be lowLevel.
All metrics must have aggregation configs.
Example Scenario
Here is an example of sales data, where only the daily sales aggregates per product are needed.
You can also find this example when running the RealtimeQuickStart, which includes a table called dailySales.
Example Input Data
Schema
Note that the schema only reflects the final table structure.
Table Config
From the below aggregation config example, note that price exists in the input data while total_sales exists in the Pinot Schema.
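A hedged sketch of the aggregationConfigs fragment for this scenario (aggregating price into total_sales and counting rows into sales_count):

```json
{
  "ingestionConfig": {
    "aggregationConfigs": [
      { "columnName": "total_sales", "aggregationFunction": "SUM(price)" },
      { "columnName": "sales_count", "aggregationFunction": "COUNT(*)" }
    ]
  }
}
```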
Example Final Table
product_name
sales_count
total_sales
daysSinceEpoch
Allowed Aggregation Functions
function name
notes
Frequently Asked Questions
Why not use a Startree?
Star-trees can only be added to real-time segments after a segment has sealed, and creating star-trees is CPU-intensive. Ingestion aggregation works for consuming segments and uses no additional CPU.
Startrees take additional memory to store, while ingestion aggregation stores less data than the original dataset.
When to not use ingestion aggregation?
If the original rows in non-aggregated form are needed, then ingestion-aggregation cannot be used.
I already use the aggregateMetrics setting?
The aggregateMetrics setting works the same as ingestion aggregation but only supports the SUM function.
The current changes are backward compatible, so no need to change your table config unless you need a different aggregation function.
Does this config work for offline data?
Ingestion Aggregation only works for real-time ingestion. For offline data, the offline process needs to generate the aggregates separately.
Why do all metrics need to be aggregated?
If a metric isn't aggregated then it will result in more than one row per unique set of dimensions.
Why does no data show up after I enable aggregationConfigs?
Check whether ingestion works normally without aggregationConfigs; this isolates the problem.
Check the Pinot Server log for any warning or error messages, especially those related to the class MutableSegmentImpl and the method aggregateMetrics.
For JSON data, ensure you don't double-quote numbers: they are parsed as strings internally and cannot participate in value-based aggregation (e.g., SUM). Using the example above, data ingestion does not work with this row:
Segment
Discover the segment component in Apache Pinot for efficient data storage and querying within Pinot clusters, enabling optimized data processing and analysis.
Pinot tables are stored in one or more independent shards called segments. A small table may be contained in a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ingestion). Segments are time-based partitions of table data, stored on Pinot servers that scale horizontally as needed for both storage and computation.
Pinot achieves this by breaking the data into smaller chunks known as segments (similar to shards/partitions in relational databases). Segments can be seen as time-based partitions.
A segment is a horizontal shard representing a chunk of table data with some number of rows. The segment stores data for all columns of the table. Each segment packs the data in a columnar fashion, along with the dictionaries and indices for the columns. The segment is laid out in a columnar format so that it can be directly mapped into memory for serving queries.
Columns can be single-valued or multi-valued, and the following types are supported: STRING, BOOLEAN, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, and BYTES. BIG_DECIMAL is also supported, but only as a single-valued type.
Columns may be declared as metric or dimension (or specifically as a time dimension) in the schema. Columns can have default null values; for example, the default null value of an integer column can be 0. The default value for a BYTES column must be hex-encoded before it's added to the schema.
Pinot uses dictionary encoding to store values as dictionary IDs. Columns may be configured as “no-dictionary” columns, in which case raw values are stored. Dictionary IDs are encoded using the minimum number of bits for efficient storage (e.g. a column with a cardinality of 3 will use only 2 bits for each dictionary ID).
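As a rough sketch of the bit-packing math (not Pinot's actual implementation), the bits per dictionary ID is the smallest width that can represent every ID from 0 to cardinality - 1:

```python
# Sketch: minimum bits needed to encode `cardinality` distinct dictionary IDs,
# i.e. the smallest b with 2**b >= cardinality.
def bits_per_dictionary_id(cardinality: int) -> int:
    # IDs range over 0..cardinality-1, so the widest ID is cardinality-1
    return max(1, (cardinality - 1).bit_length())

# A column with a cardinality of 3 needs only 2 bits per dictionary ID
print(bits_per_dictionary_id(3))
```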
A forward index is built for each column and compressed for efficient memory use. In addition, you can optionally configure inverted indices for any set of columns. Inverted indices take up more storage, but improve query performance. Specialized indexes like Star-Tree index are also supported. For more details, see .
Creating a segment
Once the table is configured, we can load some data. Loading data involves generating Pinot segments from raw data and pushing them to the Pinot cluster. Data can be loaded in batch mode or streaming mode. For more details, see the page.
Load data in batch
Prerequisites
Below are instructions to generate and push segments to Pinot via standalone scripts. For a production setup, you should use frameworks such as Hadoop or Spark. For more details on setting up data ingestion jobs, see
Job Spec YAML
To generate a segment, we first need to create a job spec YAML file. This file contains all the information about the data format, input data location, and Pinot cluster coordinates. Note that this assumes the controller is RUNNING so that the table config and schema can be fetched. If not, you will have to configure the spec to point at their location. For full configurations, see .
Create and push segment
To create and push the segment in one go, use the following:
Sample Console Output
Alternately, you can create and then push separately by changing the jobType to SegmentCreation or SegmentTarPush.
Templating Ingestion Job Spec
The ingestion job spec supports templating with Groovy syntax.
This is convenient if you want to generate one ingestion job template file and schedule it on a daily basis with extra parameters updated daily.
For example, you could set inputDirURI with parameters that indicate the date, so that the ingestion job only processes the data for a particular date. Below is an example that templates the date for the input and output directories.
You can pass in arguments containing values for ${year}, ${month}, and ${day} when kicking off the ingestion job: -values key1=value1 key2=value2 ...
This ingestion job only generates segments for the date 2014-01-03.
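A sketch of the templated directories (paths are illustrative):

```yaml
inputDirURI: 'file:///data/rawdata/${year}/${month}/${day}'
outputDirURI: 'file:///data/segments/${year}/${month}/${day}'
```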
Load data in streaming
Prerequisites
Below is an example of how to publish sample data to your stream. As soon as data is available to the real-time stream, it starts getting consumed by the real-time servers.
Kafka
Run the command below to stream JSON data into the Kafka topic flights-realtime:
Grouping Algorithm
This guide covers the heuristics Pinot's grouping algorithm uses to trim results when processing GROUP BY queries, to make sure that servers don't run out of memory.
SSE (Single-Stage Engine)

Group by results approximation at various stages of SSE query execution
Within segment
When grouping rows within a segment, Pinot keeps a maximum of numGroupsLimit groups per segment. This value is set to 100,000 by default and can be configured by the pinot.server.query.executor.num.groups.limit property.
If the number of groups of a segment reaches this value, the extra groups will be ignored and the results returned may not be completely accurate. The numGroupsLimitReached property will be set to true in the query response if the value is reached.
Trimming tail groups
After the inner segment groups have been computed, the Pinot query engine optionally trims tail groups. Tail groups are ones that have a lower rank based on the ORDER BY clause used in the query.
When segment group trim is enabled, the query engine trims the tail groups and keeps only max(minSegmentGroupTrimSize, 5 * LIMIT) groups,
where LIMIT is the maximum number of records returned by the query (usually set via the LIMIT clause). Pinot keeps at least 5 * LIMIT groups when trimming tail groups to ensure the accuracy of results. Trimming is performed only when both ordering and a limit are specified.
This value can be overridden on a query by query basis by passing the following option:
Cross segments
Once grouping has been done within each segment, Pinot merges the per-segment results, trims the tail groups, and keeps max(minServerGroupTrimSize, 5 * LIMIT) groups if it has more.
minServerGroupTrimSize is set to 5,000 by default and can be adjusted by configuring the pinot.server.query.executor.min.server.group.trim.size property. Cross segments trim can be disabled by setting the property to -1.
When cross-segment trim is enabled, the server trims the tail groups before sending the results back to the broker. To reduce memory usage while merging per-segment results, it also trims the tail groups when the number of groups reaches the trimThreshold.
trimThreshold is the upper bound of groups allowed in a server for each query to protect servers from running out of memory. To avoid too frequent trimming, the actual trim size is bounded to trimThreshold / 2. Combining this with the above equation, the actual trim size for a query is calculated as min(max(minServerGroupTrimSize, 5 * LIMIT), trimThreshold / 2).
This configuration is set to 1,000,000 by default and can be adjusted by configuring the pinot.server.query.executor.groupby.trim.threshold property.
A higher threshold reduces the amount of trimming done, but consumes more heap memory. If the threshold is set to more than 1,000,000,000, the server will only trim the groups once before returning the results to the broker.
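The combined bound described above can be sketched as follows (values below are the documented defaults, not read from a live config):

```python
# Sketch of the cross-segment trim bound:
# keep at least 5 * LIMIT groups (or minServerGroupTrimSize, whichever is
# larger), but never more than trimThreshold / 2.
def actual_trim_size(min_server_group_trim_size: int, limit: int, trim_threshold: int) -> int:
    return min(max(min_server_group_trim_size, 5 * limit), trim_threshold // 2)

# With the defaults (5,000 and 1,000,000) and LIMIT 10, 5,000 groups are kept
print(actual_trim_size(5_000, 10, 1_000_000))
```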
This value can be overridden on a query by query basis by passing the following option:
At Broker
When the broker performs the final merge of the groups returned by the various servers, another level of trimming takes place. The tail groups are trimmed and
max(minBrokerGroupTrimSize, 5 * LIMIT) groups are retained.
The default value of minBrokerGroupTrimSize is 5,000. This can be adjusted by configuring the pinot.broker.min.group.trim.size property.
GROUP BY behavior
Pinot sets a default LIMIT of 10 if one isn't defined and this applies to GROUP BY queries as well. Therefore, if no limit is specified, Pinot will return 10 groups.
Pinot will trim tail groups based on the ORDER BY clause to reduce the memory footprint and improve query performance. It keeps at least 5 * LIMIT groups so that the results give a good enough approximation in most cases. The configurable min trim size can be used to increase the number of groups kept and improve accuracy, at the cost of a larger memory footprint.
HAVING behavior
If the query has a HAVING clause, it is applied to the merged GROUP BY results that already have the tail groups trimmed. If the HAVING clause filters in the direction opposite to the ORDER BY, groups matching the condition might already have been trimmed and not returned. For example:
Increase min trim size to keep more groups in these cases.
Examples
For a simple keyed aggregation query such as:
a simplified execution plan, showing where trimming happens, looks like:
For the sake of brevity, the plan above doesn't show that the actual number of groups kept is
min(trim_value, 5 * LIMIT).
MSE (Multi-Stage Engine)
Compared to the SSE, the MSE uses a similar algorithm, but there are notable differences:
MSE doesn't implicitly limit the number of query results (to 10)
MSE doesn't limit the number of groups when aggregating cross-segment data
MSE doesn't trim results by default in any stage
The default MSE algorithm is shown on the following diagram:

Default MSE group by results approximation
Apart from limiting the number of groups at the segment level, a similar limit is applied at the intermediate stages. Since the multi-stage engine (MSE) allows subqueries, an execution plan can contain an arbitrary number of stages doing intermediate aggregation between the leaf (bottom-most) and top-most stages, and each stage can be implemented with many instances of AggregateOperator (shown as PinotLogicalAggregate in output).
The operator limits the number of distinct groups to 100,000 by default, which can be overridden with the numGroupsLimit option or the num_groups_limit aggregate hint. The limit applies to a single operator instance, meaning that the next stage could receive a total of num_instances * num_groups_limit groups.
It is possible to enable group limiting and trimming at other stages with:
is_enable_group_trim hint - enables trimming at all SSE/MSE levels and group limiting at the cross-segment level. The minSegmentGroupTrimSize value needs to be set separately.
Default value: false
mse_min_group_trim_size hint - triggers sorting and trimming of group-by results at the intermediate stages. Requires the is_enable_group_trim hint.
Default value: 5000
When the above hints are used, query processing looks as follows:

Group by results trimming at various stages of MSE query execution utilizing SSE in leaf stage
The actual processing depends on the query: it may not contain an SSE leaf-stage aggregate component and may rely on AggregateOperator at all levels. Moreover, since trimming relies on order and limit propagation, it may not happen in a subquery if the ORDER BY column(s) are not available.
Examples
If the hints are applied to the query mentioned in the SSE examples above, that is:
then the execution plan should be as follows:
In the plan above, trimming happens in three operators: GroupBy, CombineGroupBy, and AggregateOperator (which is the physical implementation of PinotLogicalAggregate).
Configuration Parameters
Parameter
Default
Query Override
Description
(*) SSQ - Single-Stage Query
(**) MSQ - Multi-Stage Query
Lookup UDF Join
For more information about using JOINs with the multi-stage query engine, see JOINs.
Lookup UDF Join is only supported with the single-stage query engine (v1). Lookup joins can also be executed in the multi-stage query engine. For more information about using JOINs with the multi-stage query engine, see .
The lookup UDF is used to fetch dimension data from a dimension table via its primary key, enabling decoration-join functionality. The lookup UDF can only be used with dimension tables in Pinot.
// Same setup as offline table example above...
// Fetch table config for realtime table
TableConfig tableConfig = PinotConnectionUtils.getTableConfig(client, "myTable", "REALTIME");
// Same sink configuration
srcRows.addSink(new PinotSinkFunction<>(
new FlinkRowGenericRowConverter(typeInfo),
tableConfig,
schema));
execEnv.execute();
// Set up Flink environment
StreamExecutionEnvironment execEnv = StreamExecutionEnvironment.getExecutionEnvironment();
execEnv.setParallelism(2); // MUST match number of partitions in stream/table
// Configure row type matching your upsert table schema
RowTypeInfo typeInfo = new RowTypeInfo(
new TypeInformation[]{Types.INT, Types.STRING, Types.STRING, Types.FLOAT, Types.LONG, Types.BOOLEAN},
new String[]{"playerId", "name", "game", "score", "timestampInEpoch", "deleted"});
DataStream<Row> srcRows = execEnv.addSource(new FlinkKafkaConsumer<Row>(...));
// Fetch schema and table config (same as offline table example)
// HttpClient httpClient = HttpClient.getInstance();
// ControllerRequestClient client = ...
Schema schema = PinotConnectionUtils.getSchema(client, "myUpsertTable");
TableConfig tableConfig = PinotConnectionUtils.getTableConfig(client, "myUpsertTable", "REALTIME");
// IMPORTANT: Partition data by primary key using the SAME logic as the stream
srcRows.partitionCustom(
(Partitioner<Integer>) (key, partitions) -> key % partitions,
r -> (Integer) r.getField("playerId")) // Primary key field
.addSink(new PinotSinkFunction<>(
new FlinkRowGenericRowConverter(typeInfo),
tableConfig,
schema));
execEnv.execute();
new PinotSinkFunction<>(
recordConverter,
tableConfig,
schema,
segmentFlushMaxNumRecords, // Default: 500,000, number of rows per segment
executorPoolSize, // Default: 5, number of threads to use to upload segment
segmentNamePrefix, // Default: "flink"
segmentUploadTimeMs // Default: current time, upload time value to encode in segment name
)
// Same setup as previous examples...
long segmentFlushMaxNumRecords = 1000000; // Flush after 1M records
int executorPoolSize = 10; // Thread pool size for async uploads
srcRows.addSink(new PinotSinkFunction<>(
new FlinkRowGenericRowConverter(typeInfo),
tableConfig,
schema,
segmentFlushMaxNumRecords,
executorPoolSize
));
// Same setup as previous examples...
String segmentNamePrefix = "flink_job_daily";
Long segmentUploadTimeMs = 1724045185000L; // Group segments by upload run time
srcRows.addSink(new PinotSinkFunction<>(
new FlinkRowGenericRowConverter(typeInfo),
tableConfig,
schema,
DEFAULT_SEGMENT_FLUSH_MAX_NUM_RECORDS,
DEFAULT_EXECUTOR_POOL_SIZE,
segmentNamePrefix,
segmentUploadTimeMs
));
Skip storing fields while still indexing them (message in the example): fieldPathsToSkipStorage
Skip indexing fields: unindexableFieldSuffix
Optimize case-insensitive search: optimizeCaseInsensitiveSearch
Map an input key path to a schema column name with customizations: columnNameToJsonKeyPathMap
Support anonymous dot, i.e. {'a.b': 'c'} vs {'a': {'b': 'c'}}: useAnonymousDotInFieldNames
Truncate values by length: mergedTextIndexDocumentMaxLength
Double-ingest fields to support schema evolution: fieldsToDoubleIngest
Memory-optimized
true
Lower
Slightly slower
Only the primary key and a segment/docId reference are stored in the hash map. Column values are read from the segment on each lookup. This trades lookup speed for lower heap usage.
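A conceptual sketch of the tradeoff (data structures and names are hypothetical, not Pinot's internals): in memory-optimized mode the map holds only a reference, and every lookup re-reads the column value from the segment.

```python
# Columnar segment data, keyed by segment name (hypothetical example data)
segments = {
    "seg_0": {"teamID": ["BOS", "ANA"], "teamName": ["Boston Red Sox", "Anaheim Angels"]},
}

# Fast-lookup mode: primary key -> full row (more heap, one hash hit per lookup)
full_rows = {
    "BOS": {"teamName": "Boston Red Sox"},
    "ANA": {"teamName": "Anaheim Angels"},
}

# Memory-optimized mode: primary key -> (segment, docId) reference only
refs = {"BOS": ("seg_0", 0), "ANA": ("seg_0", 1)}

def lookup_memory_optimized(key: str, column: str) -> str:
    segment, doc_id = refs[key]               # small hash map entry
    return segments[segment][column][doc_id]  # extra read from the segment

print(lookup_memory_optimized("ANA", "teamName"))
```

Both modes return the same value; the memory-optimized path trades an extra segment read per lookup for a much smaller heap footprint.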
quota.storage
Recommended
Storage quota for the table. Must not exceed the cluster-level controller.dimTable.maxSize (default 200 MB).
dimensionTableConfig.disablePreload
No
Set to true to use memory-optimized mode (store only primary key and segment reference instead of full rows). Defaults to false (fast lookup).
dimensionTableConfig.errorOnDuplicatePrimaryKey
No
Set to true to fail segment loading if duplicate primary keys are detected across segments. Defaults to false (last-loaded segment wins).
SELECT *
FROM ...
OPTION(minSegmentGroupTrimSize=value)
SELECT *
FROM ...
OPTION(groupTrimThreshold=value)
SELECT SUM(colA)
FROM myTable
GROUP BY colB
HAVING SUM(colA) < 100
ORDER BY SUM(colA) DESC
LIMIT 10
SELECT i, j, count(*) AS cnt
FROM tab
GROUP BY i, j
ORDER BY i ASC, j ASC
LIMIT 3;
BROKER_REDUCE(sort:[i, j],limit:10) <- sort and trim groups to minBrokerGroupTrimSize
COMBINE_GROUP_BY <- sort and trim groups to minServerGroupTrimSize
PLAN_START
GROUP_BY <- limit to numGroupsLimit, then sort and trim to minSegmentGroupTrimSize
PROJECT(i, j)
DOC_ID_SET
FILTER_MATCH_ENTIRE_SEGMENT
SELECT /*+ aggOptions(is_enable_group_trim='true', mse_min_group_trim_size='10') */
i, j, count(*) as cnt
FROM myTable
GROUP BY i, j
ORDER BY i ASC, j ASC
LIMIT 3
LogicalSort
PinotLogicalSortExchange(distribution=[hash])
LogicalSort
PinotLogicalAggregate <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
PinotLogicalExchange(distribution=[hash[0, 1]])
LeafStageCombineOperator(table=[mytable])
StreamingInstanceResponse
CombineGroupBy <- aggregate up to minSegmentGroupTrimSize groups
GroupBy <- aggregate up to numGroupsLimit groups, optionally sort and trim to minSegmentGroupTrimSize
Project
DocIdSet
FilterMatchEntireSegment
select /*+ aggOptions(is_enable_group_trim='true', mse_min_group_trim_size='3') */
t1.i, t1.j, count(*) as cnt
from tab t1
join tab t2 on 1=1
group by t1.i, t1.j
order by t1.i asc, t1.j asc
limit 5
LogicalSort
PinotLogicalSortExchange(distribution=[hash])
LogicalSort
PinotLogicalAggregate(aggType=[FINAL]) <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
PinotLogicalExchange(distribution=[hash[0, 1]])
PinotLogicalAggregate(aggType=[LEAF]) <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
LogicalJoin(condition=[true])
PinotLogicalExchange(distribution=[random])
LeafStageCombineOperator(table=[mytable])
...
FilterMatchEntireSegment
PinotLogicalExchange(distribution=[broadcast])
LeafStageCombineOperator(table=[mytable])
...
FilterMatchEntireSegment
All metrics must be noDictionaryColumns.
aggregatedFieldName must exist in the Pinot schema, and originalFieldName must not exist in the Pinot schema.
18193, truck, 1, 700.00
18199, car, 2, 3200.00
18200, truck, 1, 800.00
18202, car, 3, 3700.00
18202
DISTINCTCOUNTHLL
Specify as DISTINCTCOUNTHLL(field, log2m); the default log2m is 12. See for how to define log2m. It cannot be changed later; a new field must be used. The schema type for the output field should be BYTES.
DISTINCTCOUNTHLLPLUS
Specify as DISTINCTCOUNTHLLPLUS(field, s, p). See for how to define s and p; they cannot be changed later. The schema type for the output field should be BYTES.
SUMPRECISION
Specify as SUMPRECISION(field, precision); precision must be defined and is used to compute the maximum possible size of the field. It cannot be changed later; a new field must be used. The schema type for the output field should be BIG_DECIMAL.
dimTable: Name of the dimension table to perform the lookup on.
dimColToLookUp: The column of the dimension table to retrieve to decorate the result.
dimJoinKey: The column on which to perform the lookup, i.e. the join column of the dimension table.
factJoinKey: The column to look up against, i.e. the join column of the fact table.
Note that:
All dim-table-related expressions are passed as literal strings. This is a limitation of the LOOKUP UDF syntax: it cannot express a column identifier that doesn't exist in the query's main table, which is the fact table.
The syntax definition [ '''dimJoinKey''', factJoinKey ]* indicates that if the dimension table has multiple partition columns, multiple join key pairs must be expressed.
Examples
Here are some examples.
Single-partition-key-column Example
Consider the table baseballStats
Column
Type
playerID
STRING
yearID
INT
teamID
STRING
and dim table dimBaseballTeams
Column
Type
teamID
STRING
teamName
STRING
teamAddress
STRING
several acceptable queries are:
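Based on the columns above (and assuming a playerName column on baseballStats, as shown in the result table below), a decoration query might look like:

```sql
SELECT playerName,
       teamID,
       LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName,
       LOOKUP('dimBaseballTeams', 'teamAddress', 'teamID', teamID) AS teamAddress
FROM baseballStats
```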
Dim-Fact LOOKUP example
playerName
teamID
teamName
teamAddress
David Allan
BOS
Boston Red Caps/Beaneaters (from 1876–1900) or Boston Red Sox (since 1953)
4 Jersey Street, Boston, MA
David Allan
Self LOOKUP example
teamID
nameFromLocal
nameFromLookup
ANA
Anaheim Angels
Anaheim Angels
ARI
Arizona Diamondbacks
Arizona Diamondbacks
Complex-partition-key-columns Example
Consider a single dimension table with schema:
BILLING SCHEMA
Column
Type
customerId
INT
creditHistory
STRING
firstName
STRING
Self LOOKUP example
customerId
missedPayment
lookedupCity
341
Paid
Palo Alto
374
Paid
Mountain View
Usage FAQ
The return type of the UDF is the type of the dimColToLookUp column.
When multiple primary key columns are used for the dimension table (e.g. a composite primary key), ensure that the order of keys in the lookup() UDF matches the order defined in primaryKeyColumns in the dimension table schema.
Learn about Logical Tables in Apache Pinot, which provide a unified query interface over multiple physical tables for flexible data organization.
A logical table in Pinot provides a unified query interface over multiple physical tables. Instead of querying individual tables separately, users can query a single logical table that transparently routes the query to all underlying physical tables and aggregates the results.
Overview
Logical tables are useful for:
Geographic/Regional partitioning: Split data by region (e.g., ordersUS, ordersEU, ordersAPAC) while providing a unified orders table for queries
Table partitioning strategies: Organize data across multiple physical tables based on business logic
Time-based table splitting: Combine historical and recent data from different physical tables
Logical tables require that all underlying physical tables share the same schema structure. A schema with the same name as the logical table must be created before creating the logical table.
How It Works
When you query a logical table, Pinot:
Resolves the logical table name to its list of physical tables
Routes the query to all relevant physical tables (both offline and realtime)
Aggregates results from all physical tables
For hybrid logical tables (containing both offline and realtime physical tables), Pinot uses a configurable time boundary strategy to determine which segments to query from each table type, avoiding duplicate data.
Segment Pruning Optimization
Pinot performs automatic cross-table segment pruning when querying logical tables. Instead of pruning segments independently for each physical table, segment pruning operates once across all physical tables collectively. This optimization is particularly beneficial for queries using ORDER BY with LIMIT, where the SelectionQuerySegmentPruner can now prune segments across the entire logical table.
For example, with a logical table spanning three physical tables (US, EU, APAC), a query like:
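For instance, an ORDER BY with LIMIT query over the unified orders table (column names are illustrative) might be:

```sql
SELECT orderId, amount, orderTime
FROM orders
ORDER BY orderTime DESC
LIMIT 10
```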
Previously, the pruner would prune segments within each physical table independently, potentially returning more segments than necessary. Now, pruning happens across all physical tables together, allowing the pruner to identify and return only the minimum set of segments needed to satisfy the query requirements.
Key benefits:
Improved query performance by reducing segments processed
Automatic optimization with no configuration changes required
Particularly effective for ORDER BY + LIMIT queries across logical tables
Logical Table Configuration
A logical table configuration defines the mapping between the logical table and its physical tables.
Configuration Properties
Property
Description
Required
Example Configuration
Hybrid Logical Table Configuration
For logical tables that combine both offline and realtime physical tables:
Creating a Logical Table
Step 1: Create the Schema
Create a schema that matches the structure of your physical tables:
Upload the schema:
Step 2: Create the Logical Table
Managing Logical Tables
List Logical Tables
Get Logical Table Configuration
Update Logical Table
Delete Logical Table
Deleting a logical table only removes the logical table configuration. The underlying physical tables and their data are not affected.
Querying Logical Tables
Query a logical table just like any other Pinot table:
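For example, assuming the orders logical table from the quickstart:

```sql
SELECT COUNT(*)
FROM orders
```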
Logical tables work with both the single-stage and multi-stage query engines.
Time Boundary Configuration
For hybrid logical tables that contain both offline and realtime physical tables, you must configure a time boundary strategy to avoid querying duplicate data.
Available Strategies
Strategy
Description
Configuration Example
The includedTables parameter specifies which physical tables should be considered when computing the time boundary.
Query Configuration
Logical tables support query-level configurations:
Property
Description
Quota Configuration
Apply rate limiting to logical tables:
Storage quota (quota.storage) is not supported for logical tables since they don't store data directly.
Managing Logical Tables via the Controller UI
The Pinot Controller UI provides full CRUD management for logical tables, accessible directly from the main Tables page.
Accessing Logical Tables
Open the Controller UI (default: http://<controller-host>:9000).
Navigate to Tables in the left sidebar.
The Tables page displays physical tables and logical tables in separate sections.
Supported Operations
Operation
Description
All operations are also available via the REST API at /logicalTables/{tableName} using GET, PUT, and DELETE.
Quick Start Example
Try the logical table quickstart to see the feature in action:
This quickstart:
Creates three physical tables: ordersUS_OFFLINE, ordersEU_OFFLINE, and ordersAPAC_OFFLINE
Creates a logical table orders that unifies all three
Validation Rules
When creating or updating a logical table, Pinot validates:
Table name does not end with _OFFLINE or _REALTIME
All physical tables exist (unless marked as multiCluster)
Physical tables are in the same database as the logical table
Limitations
All physical tables must have compatible schemas
Storage quota is not supported
Physical tables in the same logical table should ideally have consistent indexing for optimal query performance
Pluggable LogicalTableConfig Serialization
By default, LogicalTableConfig is serialized to and deserialized from ZooKeeper using a built-in JSON format. For advanced use cases requiring a custom storage format, implement LogicalTableConfigSerDe and register it via LogicalTableConfigSerDeProvider.
When to Use This
You need a compact binary format for deployments with a very large number of logical tables
Your ZooKeeper schema requires a specific non-default encoding
You are integrating Pinot with an external metadata system with its own serialization requirements
Implementation
Step 1: Implement the LogicalTableConfigSerDe interface:
Step 2: Implement LogicalTableConfigSerDeProvider to return your custom SerDe.
Step 3: Register the provider using the Java Service Provider Interface (SPI) by creating the file:
containing the fully-qualified class name of your provider implementation.
This is an advanced extension point for specialized deployments. Most users should rely on the default JSON-based serialization.
See Also
Upload Pinot Segment Using CLI
Upload existing Pinot segments to a controller.
This guide explains how to upload already-built Pinot segments to a Pinot controller, which REST endpoint to call, and when to use tar push, URI push, or metadata push.
Use this flow when your segment .tar.gz files already exist outside Pinot, for example when migrating from an old cluster, backfilling from another system, or re-registering segments that already live in deep storage.
Create the target table, or confirm one exists that matches the segment you want to upload.
If needed, upload the schema and table configs.
Make sure the controller can read the segment source:
For tar push, the client must be able to stream the segment tar file to the controller.
For URI push and metadata push, the controller must be able to access the URI scheme you use. For PinotFS-backed schemes such as HDFS, S3, GCS, and ADLS, configure the matching . For custom schemes, implement a .
Controller upload endpoints
The controller exposes three upload endpoints:
Endpoint
Use case
Content type
Notes
/v2/segments is the default endpoint to document and use. The legacy /segments endpoint is still present for backward compatibility, but its JSON-based URI push path keeps the original DOWNLOAD_URI instead of moving the segment into a Pinot-chosen final location, so new integrations should use /v2/segments.
Common request options
Query parameters
All three upload modes use the same query parameters:
Query parameter
Required
Default
Description
Example:
Headers
Header
Required
Applies to
Description
Push modes
Tar push
Tar push is the original and default upload mode. Use it when the client can stream the full segment tar file to the controller.
Request shape
Endpoint: POST /v2/segments
Content type: multipart/form-data
Headers: UPLOAD_TYPE omitted or set to SEGMENT
What the controller does
Stores the uploaded segment in the controller's segment directory or deep store.
Extracts segment metadata.
Adds or refreshes the segment in the target table.
Example:
If you prefer the Pinot CLI, pinot-admin.sh UploadSegment uses tar push for local segment directories:
URI push
URI push is best when the segment tar file already exists in deep storage or another controller-readable remote system.
Request shape
Endpoint: POST /v2/segments
Content type: application/json
Headers:
What the controller does
Downloads the segment tar from DOWNLOAD_URI.
Stores it in the controller's segment directory or deep store.
Extracts metadata.
Example:
Use URI push only when the controller can resolve the URI scheme. If the source is on HDFS, S3, GCS, ADLS, or a custom system, configure Pinot with the appropriate or .
Metadata push
Metadata push is the most controller-efficient option when the segment tar already exists in a reachable storage system.
Instead of uploading the full segment tar, the client uploads segment metadata and tells the controller where the tar already lives.
Request shape
Endpoint: POST /v2/segments
Content type: multipart/form-data
Headers:
The metadata tarball contains the segment metadata files, typically creation.meta and metadata.properties.
What the controller does
Reads the uploaded metadata bundle.
Uses DOWNLOAD_URI as the segment download location.
Adds or refreshes the segment in the table without downloading the full tar just to inspect metadata.
If you set COPY_SEGMENT_TO_DEEP_STORE: true, the controller copies the segment from DOWNLOAD_URI into Pinot deep store and stores the final deep-store URI in segment metadata. This is useful when the ingestion job writes to a staging location instead of the final deep-store path.
Example:
COPY_SEGMENT_TO_DEEP_STORE is only useful for metadata push. The staging URI and Pinot deep store should use the same storage scheme because the copy happens through PinotFS.
Batch metadata push
If you need to metadata-push many segments in one call, use POST /segments/batchUpload.
Request shape
Endpoint: POST /segments/batchUpload
Content type: multipart/form-data
Query parameters: tableName and tableType
This endpoint is only for metadata push.
Job types and Pinot Admin mapping
If you are pushing from a batch ingestion job, the jobType maps to controller upload mode like this:
Job type
Push mode
Controller endpoint
For ingestion jobs, define the push behavior in the . Example:
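A minimal, illustrative fragment of the relevant job spec fields (verify the exact keys against your Pinot version) might look like:

```yaml
jobType: SegmentCreationAndMetadataPush
pushJobSpec:
  # copy the segment from the staging URI into Pinot deep store
  copyToDeepStoreForMetadataPush: true
```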
Then launch it with:
Choosing the right mode
Mode
Use it when
Tradeoff
For production clusters with deep store configured, SegmentCreationAndMetadataPush is generally the preferred ingestion-job mode.
Explain Plan
Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The EXPLAIN PLAN FOR syntax can be used to obtain the execution plan of a query, which can be useful to further optimize them.
The explain plan output format is still under development and may change in future releases. This under-development label applies to the output format specifically, not to the core multi-stage engine, which is generally available. Pinot explain plans are human-readable and intended for debugging and optimization purposes. Keep this in mind before relying on them in automated scripts or tools: explain plans, even those returned as tables or JSON, are not guaranteed to be stable across releases.
Pinot supports different types of explain plans depending on the query engine and the level of detail desired.
Different plans for different segments
Segments are the basic unit of data storage and processing in Pinot. When a query is executed, it runs on each segment and the results are merged together. Not all segments have the same data distribution, indexes, etc. Therefore the query engine may decide to execute the query differently on different segments. This includes:
Segments that were not refreshed since indexes were added or removed on the table config.
Realtime segments that are being ingested, where some indexes (like range indexes) cannot be used.
Data distribution, especially the min and max values of columns, which can affect the query plan.
Given that a Pinot query can touch thousands of segments, Pinot tries to minimize the number of different plans shown when explaining a query. By default, Pinot analyzes the plan for each segment and returns a simplified plan. How this simplification is done depends on the query engine; you can read more about that below.
There is a verbose mode that shows the plan for each segment. Activate it by setting the explainPlanVerbose query option to true, for example by prefixing SET explainPlanVerbose=true; to the explain plan statement.
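For example, assuming a hypothetical table userAttributes (the table and column names here are illustrative), a verbose explain could be requested like this:

```sql
-- Show one plan per segment instead of the simplified, aggregated plan.
-- userAttributes and its columns are illustrative names.
SET explainPlanVerbose=true;
EXPLAIN PLAN FOR
SELECT deviceOS, COUNT(*)
FROM userAttributes
GROUP BY deviceOS
```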
Explain on multi-stage query engine
Reflecting the more complex nature of the multi-stage query engine, its explain plan can be customized to focus on different aspects of query execution.
There are 3 different types of explain plans for the multi-stage query engine:
Mode
Syntax by default
Syntax if segment plan is enabled
Description
The syntax used to select each explain plan mode is confusing and it may be changed in the future.
Segment plan
The segment plan is a detailed representation of the query execution plan that includes segment-specific information, like data distribution and indexes.
This mode was introduced in Pinot 1.3.0 and is planned to become the default in future releases. Meanwhile, enable it by setting the explainAskingServers query option to true, prefixing SET explainAskingServers=true; to the explain plan statement. Alternatively, enable this mode by default by setting the broker configuration pinot.query.multistage.explain.include.segment.plan to true.
Independently of how it is activated, once this mode is enabled, EXPLAIN PLAN FOR syntax will include segment information.
Verbose and brief mode
As explained in Different plans for different segments, by default Pinot tries to minimize the number of different plans shown when explaining a query. In multi-stage, the brief mode includes all distinct plans, but equivalent plans are aggregated. For example, if the same plan is executed on 100 segments, the brief mode shows it only once, and stats like the number of docs are summed.
In verbose mode, one plan is shown per segment, including the segment name and all segment-specific information. This can be useful to identify which segments are not using indexes, or which segments have a different data distribution.
Example
Returns
Logical Plan
The logical plan is a high-level representation of the query execution plan. This plan is calculated on the broker without asking the servers for their segment specific plans. This means that the logical plan does not include the segment specific information, like data distribution, indexes, etc.
In Pinot 1.3.0, the logical plan is enabled by default and can be obtained by using EXPLAIN PLAN FOR syntax. Optionally, the segment plan can be enabled by default, in which case the logical plan can be obtained by using EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR syntax.
The recommended way to request the logical plan is to use EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR, given that this syntax is available in all versions of Pinot, independently of the configuration.
Example:
Returns:
Workers plan
There has been some discussion about how to name this explain mode, and it may change in future versions. The term worker leaks an implementation detail that is not explained anywhere else in the user documentation.
The workers plan is a detailed representation of the query execution plan that includes information on how the query is distributed among different servers and the workers inside them. This plan does not include segment-specific information, like data distribution or indexes, and it is probably the least useful of the plans for normal use cases.
Its main use case is reducing data shuffling between workers by verifying that, for example, a join is executed in a colocated fashion.
Example
Returns:
Interpreting multi-stage explain plans
Multi-stage plans are more complex than single-stage plans. This section explains how to interpret them.
You can use the EXPLAIN PLAN syntax to obtain the logical plan of a query. There are different formats for the output, but all of them represent the logical plan of the query.
The query
Can produce the following output:
Each node in the tree represents an operation, and each operator has attributes. For example the LogicalJoin operator has a condition attribute that specifies the join condition and a joinType.
Understanding indexed references
Expressions like $2 are indexed references into the input row for each operator. To understand them, look at the operator's children to see which attributes are being referenced, usually starting from the leaf operators.
For example, LogicalTableScan always returns the whole row of the table, so its attributes are the columns of the table:
The LogicalProject operator selects columns o_custkey and o_shippriority (at positions $5 and $10 in the table row) and generates a row with two columns. The PinotLogicalExchange distributes rows using hash[0], meaning the hash of the first column from LogicalProject — which is o_custkey.
Virtual rows in joins
The LogicalJoin operator receives rows from two upstream stages. The virtual row seen by the join is the concatenation of the left-hand side plus the right-hand side.
In the example above, the left stage sends [c_address, c_custkey] and the right stage sends [o_custkey, o_shippriority]. The join sees a row with columns [c_address, c_custkey, o_custkey, o_shippriority]. The condition =($1, $2) joins on c_custkey and o_custkey. The join passes through all columns unchanged, so its downstream LogicalProject selecting $0 and $3 produces [c_address, o_shippriority].
LogicalSort without ORDER BY
A LogicalSort operator can appear even when the SQL query has no ORDER BY. In relational algebra, a sort node is used to express LIMIT. When no sort condition is specified, no actual sorting is performed — only the row limit is applied.
Explain on single stage query engine
The explain plan for the single-stage query engine is described in depth in its dedicated documentation page.
The explain plan for the single-stage query engine is simpler and less customizable, but returns the information in a tabular format. For example, consider the query EXPLAIN PLAN FOR SELECT playerID, playerName FROM baseballStats.
Returns the following table:
The Operator column describes the operator that Pinot will run, while the Operator_Id and Parent_Id columns show the parent-child relationship between operators, which forms the execution tree. For example, the plan above should be understood as:
2024/11/04 00:24:27.760 ERROR [RealtimeSegmentDataManager_dailySales__0__0__20241104T0824Z] [dailySales__0__0__20241104T0824Z] Caught exception while indexing the record at offset: 9 , row: {
"fieldToValueMap" : {
"price" : "1000.00",
"daysSinceEpoch" : 18202,
"sales_count" : 0,
"total_sales" : 0.0,
"product_name" : "car",
"timestamp" : 1572678000000
},
"nullValueFields" : [ "sales_count", "total_sales" ]
}
java.lang.ClassCastException: class java.lang.String cannot be cast to class java.lang.Number (java.lang.String and java.lang.Number are in module java.base of loader 'bootstrap')
at org.apache.pinot.segment.local.aggregator.SumValueAggregator.applyRawValue(SumValueAggregator.java:25) ~[classes/:?]
at org.apache.pinot.segment.local.indexsegment.mutable.MutableSegmentImpl.aggregateMetrics(MutableSegmentImpl.java:855) ~[classes/:?]
at org.apache.pinot.segment.local.indexsegment.mutable.MutableSegmentImpl.index(MutableSegmentImpl.java:577) ~[classes/:?]
at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.processStreamEvents(RealtimeSegmentDataManager.java:641) ~[classes/:?]
at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.consumeLoop(RealtimeSegmentDataManager.java:477) ~[classes/:?]
at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager$PartitionConsumer.run(RealtimeSegmentDataManager.java:734) ~[classes/:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
-- Query the logical table
SELECT COUNT(*) FROM orders
-- Filter by region
SELECT orderId, customerId, region, status
FROM orders
WHERE region = 'us'
LIMIT 10
-- Aggregate across all regions
SELECT region, COUNT(*) as orderCount
FROM orders
GROUP BY region
ORDER BY region
For historical reasons, null support is disabled in Apache Pinot by default. This is expected to be changed in future versions.
For historical reasons, null support is disabled by default in Apache Pinot. When null support is disabled, all columns are treated as not null: predicates like IS NOT NULL evaluate to true, IS NULL evaluates to false, and aggregation functions like COUNT, SUM, AVG, and MODE treat all columns as not null.
For example, the predicate in the query below matches all records.
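A minimal sketch of such a query, assuming a hypothetical table myTable with a nullable column salary:

```sql
-- With null support disabled, this matches every row in the table,
-- including rows where salary was null at ingestion time.
SELECT COUNT(*) FROM myTable WHERE salary IS NOT NULL
```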
To handle null values in your data, you must:
Configure Pinot to store null values before ingesting the data. See Store nulls at ingestion time.
Use one of the null handling modes at query time. By default Pinot uses basic null support, where only IS NULL and IS NOT NULL predicates are supported, but advanced null handling can be enabled.
The following table summarizes the behavior of null handling support in Pinot:
disabled (default)
basic (enabled at ingestion time)
advanced (enabled at query time)
How Pinot stores null values
Pinot always stores column values in a forward index. The forward index never stores null values but has to store a value for each row. Therefore, independent of the null handling configuration, Pinot always stores a default value for null rows in the forward index. The default value used for a column can be specified by setting the defaultNullValue field spec. The defaultNullValue depends on the type of data.
Remember that in the JSON schema definition, defaultNullValue must always be a String. If the column type is not String, Pinot converts that value to the column type automatically.
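For example, a schema fragment with explicit default null values (column names and values are illustrative):

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "product_name",
      "dataType": "STRING",
      "defaultNullValue": "unknown"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "sales_count",
      "dataType": "INT",
      "defaultNullValue": "0"
    }
  ]
}
```

Note that "0" is written as a String even though sales_count is an INT; Pinot converts it automatically.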
Disabled null handling
By default, Pinot does not store null values at all: whenever a null value is ingested, Pinot stores the default null value (defined above) instead.
To store null values, the table has to be configured to do so, as explained below.
Store nulls at ingestion time
When null storing is enabled, Pinot creates a new index called the null index or null vector index. This index stores the document IDs of the rows that have null values for the column.
Although null storing can be enabled after data has been ingested, data ingested before this mode is enabled will not have the null index and will therefore be treated as not null.
Null support is configured per table. You can configure one table to store nulls, and configure another table to not store nulls. There are two ways to define null storing support in Pinot:
Column based null storing, where each column in a table is configured as nullable or not nullable. We recommend enabling null storing by column; this is the only way to support null handling in the multi-stage query engine.
Table based null storing, where all columns in the table are considered nullable. This is how null values were handled before Pinot 1.1.0 and is now deprecated.
Remember that column based null storing has priority over table based null storing: if both modes are enabled, column based null storing is used.
Column based null storing
We recommend configuring column based null storing, which lets you specify null handling per column and supports null handling in the multi-stage query engine.
To enable column based null handling:
Set enableColumnBasedNullHandling to true in the schema configuration before ingesting data.
Then specify which columns are not nullable using the notNull field spec, which defaults to false.
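Putting both steps together, a sketch of such a schema (the schema and column names are illustrative):

```json
{
  "schemaName": "transcript",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "LONG", "notNull": true },
    { "name": "firstName", "dataType": "STRING", "notNull": false }
  ]
}
```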
Table based null storing
Table based null storing was the only way to enable null storing before Pinot 1.1.0, and it has been deprecated since then. It is more expensive in terms of disk space and query performance than column based null storing, and it cannot support null handling in the multi-stage query engine.
When table based null storing is enabled, all columns will be considered nullable. To enable this mode you need to:
Enable the nullHandlingEnabled configuration in the table index config.
Disable enableColumnBasedNullHandling in the schema.
Remember that the nullHandlingEnabled table configuration enables table based null handling, while enableNullHandling is the query option that enables advanced null handling at query time. See Advanced null handling support for more information.
As an example:
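A minimal sketch of a table config with table based null storing enabled (only the relevant fields are shown; the table name is illustrative):

```json
{
  "tableName": "myTable",
  "tableType": "OFFLINE",
  "tableIndexConfig": {
    "nullHandlingEnabled": true
  }
}
```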
Null handling at query time
To enable basic null handling at query time, configure Pinot to store nulls at ingestion time. Advanced null handling support can optionally be enabled.
The multi-stage query engine requires column based null storing. Tables with table based null storing are considered not nullable.
If you are converting from null support for the single-stage query engine, you can modify your schema to set enableColumnBasedNullHandling. There is no need to remove nullHandlingEnabled from your table config or set it to false; in fact, we recommend keeping it as true to make it clear that the table may contain nulls. Also, when converting:
No reingestion is needed.
If the columns are changed from nullable to not nullable and there is a value that was previously null, the default value will be used instead.
Basic null support
Basic null support is automatically enabled when null values are stored on a segment (see Store nulls at ingestion time).
In this mode, Pinot can handle simple predicates like IS NULL and IS NOT NULL. Other transformation functions (like CASE, COALESCE, +, etc.) and aggregation functions (like COUNT, SUM, AVG, etc.) use the default value specified in the schema in place of null values.
For example, in the following table:
rowId
col1
If the default value for col1 is 1, the following query:
Will return the following result:
rowId
col1
While
Will return the following:
rowId
col1
And queries like
Will return
rowId
col1
Also
count
mode
This is because neither the count nor the mode function ignores null values as expected; instead they read the default value (in this case 1) stored in the forward index.
Advanced null handling support
Advanced null handling has two requirements:
Segments must store null values (see Store nulls at ingestion time).
The query must enable null handling by setting the enableNullHandling query option to true.
The latter can be done in one of the following ways:
Prefix the query with SET enableNullHandling=true;.
If using JDBC, set the connection option enableNullHandling=true (either in the URL or as a property).
Alternatively, if you want to enable advanced null handling for all queries by default, the broker configuration pinot.broker.query.enable.null.handling can be set to true. Individual queries can override this to false using the enableNullHandling query option if required.
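For example, enabling advanced null handling for a single query (table and column names are illustrative):

```sql
-- Standard SQL null semantics for this query only:
-- COUNT and AVG will ignore null salaries.
SET enableNullHandling=true;
SELECT COUNT(salary), AVG(salary)
FROM myTable
```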
Even though they have similar names, the nullHandlingEnabled table configuration and the enableNullHandling query option are different. Remember that the nullHandlingEnabled table configuration modifies how segments are stored and the enableNullHandling query option modifies how queries are executed.
When the enableNullHandling option is set to true, the Pinot query engine uses a different execution path that interprets nulls in the standard SQL way. IS NULL and IS NOT NULL predicates evaluate to true or false according to whether a null is detected (as in basic null support mode), and aggregation functions like COUNT, SUM, AVG, and MODE deal with null values as expected (usually ignoring them).
In this mode, some indexes may not be usable, and queries may be significantly more expensive. The performance degradation affects all columns in the table, including columns in the query that do not contain null values, and happens even when the table uses column based null storing.
Example queries
Select Query
Filter Query
Aggregate Query
Aggregate Filter Query
Group By Query
Order By Query
Transform Query
Appendix: Workarounds to handle null values without storing nulls
If you cannot generate the null index for your use case, you can filter for null values using a default value specified in your schema or a specific value included in your query.
The following example queries work when the chosen null value never appears as a real value in the dataset. Unexpected results may be returned if the specified null value is a valid value in the dataset.
Filter for default null value(s) specified in your schema
Specify a default null value (defaultNullValue) in your schema for dimension fields (dimensionFieldSpecs), metric fields (metricFieldSpecs), and date time fields (dateTimeFieldSpecs).
Ingest the data.
Filter for a specific value in your query
Filter for a specific value in your query that will not appear in the dataset. For example, to calculate the average age, use -1 to indicate that the value of Age is null.
Rewrite the following query:
To cover null values as follows:
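A sketch of both queries, assuming a hypothetical table where -1 is the default null value for Age:

```sql
-- Original query: the placeholder -1 is treated as a real age,
-- skewing the average.
SELECT AVG(Age) AS avg_age FROM myTable;

-- Rewritten query: excludes the placeholder so rows with a null
-- age do not contribute to the average.
SELECT AVG(Age) AS avg_age FROM myTable WHERE Age <> -1;
```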
Table
Explore the table component in Apache Pinot, a fundamental building block for organizing and managing data in Pinot clusters, enabling effective data processing and analysis.
Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table, with rows and columns. Each column has a name and a data type, defined in the table's schema.
Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
Pinot table types include:
real-time: Ingests data from a streaming source like Apache Kafka®
HDFS
This guide shows you how to configure HDFS for use with Pinot, including data import and deep storage.
Enable the Hadoop distributed file system (HDFS) using the pinot-hdfs plugin. In the controller or server, add the config:
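A sketch of the relevant startup options, assuming the default plugin directory layout (paths are illustrative):

```
-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs
```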
By default Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro.
-- SET explainAskingServers=true is required if
-- pinot.query.multistage.explain.include.segment.plan is false,
-- optional otherwise
SET explainAskingServers=true;
EXPLAIN PLAN FOR
SELECT DISTINCT deviceOS, groupUUID
FROM userAttributes AS a
JOIN userGroups AS g
ON a.userUUID = g.userUUID
WHERE g.groupUUID = 'group-1'
LIMIT 100
-- The WITHOUT IMPLEMENTATION qualifier can be used to ensure the logical plan is returned.
-- It can be used in any version of Pinot, even when the segment plan is enabled by default
EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR
SELECT DISTINCT deviceOS, groupUUID
FROM userAttributes AS a
JOIN userGroups AS g
ON a.userUUID = g.userUUID
WHERE g.groupUUID = 'group-1'
LIMIT 100
EXPLAIN IMPLEMENTATION PLAN FOR
SELECT DISTINCT deviceOS, groupUUID
FROM userAttributes AS a
JOIN userGroups AS g
ON a.userUUID = g.userUUID
WHERE g.groupUUID = 'group-1'
LIMIT 100
hybrid: Loads data from both a batch source and a streaming source
Pinot breaks a table into multiple segments and stores these segments in a deep-store such as Hadoop Distributed File System (HDFS) as well as Pinot servers.
In the Pinot cluster, a table is modeled as a Helix resource and each segment of a table is modeled as a Helix Partition.
Table naming in Pinot follows typical naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.
Pinot supports the following types of tables:
Type
Description
Offline
Offline tables ingest pre-built Pinot segments from external data stores and are generally used for batch ingestion.
Real-time
Real-time tables ingest data from streams (such as Kafka) and build segments from the consumed data.
Hybrid
Hybrid Pinot tables have both real-time as well as offline tables under the hood.
Users querying the database do not need to know the type of a table; they only specify the table name in the query.
For example, regardless of whether you have an offline table myTable_OFFLINE, a real-time table myTable_REALTIME, or a hybrid table containing both, the query is:
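That is, the query is simply:

```sql
-- The broker transparently routes this to the offline and/or
-- real-time parts of the table.
SELECT COUNT(*) FROM myTable
```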
Table configuration is used to define the table properties, such as name, type, indexing, routing, and retention. It is written in JSON format and is stored in Zookeeper, along with the table schema.
Use the following properties to make your tables faster or leaner:
Segment
Indexing
Tenants
Segments
A table is comprised of small chunks of data known as segments. Learn more about how Pinot creates and manages segments here.
For offline tables, segments are built outside of Pinot and uploaded using a distributed executor such as Spark or Hadoop. For details, see Batch Ingestion.
For real-time tables, segments are built at a specific interval inside Pinot. You can tune the following for real-time segments.
Flush
The Pinot real-time consumer ingests the data, creates the segment, and then flushes the in-memory segment to disk. Pinot allows you to configure when to flush the segment in the following ways:
Number of consumed rows: After consuming the specified number of rows from the stream, Pinot will persist the segment to disk.
Number of rows per segment: Pinot learns and then estimates the number of rows that need to be consumed. The learning phase starts with the row count set to 100,000 (this value can be changed) and adjusts it to reach the appropriate segment size. Because Pinot corrects the estimate as it goes along, the segment size might go significantly over the target during the learning phase. Set this value to optimize query performance.
Max time duration to wait: Pinot consumers wait for the configured time duration after which segments are persisted to the disk.
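These thresholds are typically set in the table's streamConfigs. A hedged sketch with illustrative values, using the commonly documented Pinot stream configuration keys:

```json
{
  "streamConfigs": {
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.threshold.segment.size": "150M"
  }
}
```

Setting the row threshold to 0 lets Pinot estimate the rows per segment from the desired segment size instead.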
Replicas
A segment can have multiple replicas to provide higher availability. You can configure the number of replicas for a table segment using the CLI.
Completion Mode
By default, if the in-memory segment in a non-winner server is equivalent to the committed segment, the non-winner server builds and replaces its segment. If the available segment is not equivalent to the committed segment, the server simply downloads the committed segment from the controller.
However, in certain scenarios, the segment build can get very memory-intensive. In these cases, you might want to enforce the non-committer servers to just download the segment from the controller instead of building it again. You can do this by setting completionMode: "DOWNLOAD" in the table configuration.
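A minimal table config sketch, showing only the relevant fields:

```json
{
  "completionConfig": {
    "completionMode": "DOWNLOAD"
  }
}
```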
A Pinot server might fail to download segments from the deep store, such as HDFS, after its completion. However, you can configure servers to download these segments from peer servers instead of the deep store. Currently, only HTTP and HTTPS download schemes are supported. More methods, such as gRPC/Thrift, are planned to be added in the future.
Dictionary-encoded forward index with bit compression
Raw value forward index
Sorted forward index with run-length encoding
Bitmap inverted index
Sorted inverted index
For more details on each indexing mechanism and corresponding configurations, see Indexing.
Set up Bloom filters on columns to make queries faster. You can also keep segments in off-heap instead of on-heap memory for faster queries.
Pre-aggregation
Aggregate the real-time stream data as it is consumed to reduce segment sizes. The metric column values of all rows that have the same values for all dimension and time columns are added together, producing a single row in the segment. This feature is only available on REALTIME tables.
The only supported aggregation is SUM. The columns to pre-aggregate need to satisfy the following requirements:
All metrics should be listed in noDictionaryColumns.
No multi-value dimensions
All dimension columns are treated as dictionary-encoded, even if they appear in noDictionaryColumns in the config.
The following table config snippet shows an example of enabling pre-aggregation during real-time ingestion:
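A sketch of that snippet (column names are illustrative; aggregateMetrics is the flag enabling the feature):

```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["metricColA", "metricColB"],
    "aggregateMetrics": true
  }
}
```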
Tenants
Each table is associated with a tenant. A segment resides on the server, which has the same tenant as itself. For details, see Tenant.
Optionally, override whether a table should move to a server with a different tenant based on segment status. The example below adds a tagOverrideConfig under the tenants section for real-time tables to override tags for consuming and completed segments.
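A sketch of such a config (tenant names are illustrative):

```json
{
  "tenants": {
    "broker": "brokerTenantName",
    "server": "serverTenantName",
    "tagOverrideConfig": {
      "realtimeConsuming": "serverTenantName_REALTIME",
      "realtimeCompleted": "serverTenantName_OFFLINE"
    }
  }
}
```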
In the above example, the consuming segments will still be assigned to serverTenantName_REALTIME hosts, but once they are completed, the segments will be moved to serverTenantName_OFFLINE.
You can specify the full name of any tag in this section. For example, you could decide that completed segments for this table should be on Pinot servers tagged as allTables_COMPLETED. To learn more, see the Moving Completed Segments section.
Hybrid table
A hybrid table is a table composed of two tables, one offline and one real-time, that share the same name. In a hybrid table, offline segments can be pushed periodically. The retention on the offline table can be set to a high value because segments are coming in on a periodic basis, whereas the retention on the real-time part can be small.
Once an offline segment is pushed to cover a recent time period, the brokers automatically switch to using the offline table for segments for that time period and use the real-time table only for data not available in the offline table.
To learn how time boundaries work for hybrid tables, see Broker.
A typical use case for hybrid tables is pushing deduplicated, cleaned-up data into an offline table every day while consuming real-time data as it arrives. Data can remain in offline tables for as long as a few years, while the real-time data would be cleaned every few days.
Examples
Create a table config for your data, or see examples for all possible batch/streaming tables.
Check out the table config in the Rest API to make sure it was successfully uploaded.
Streaming table creation
Start Kafka
Create a Kafka topic
Create a streaming table
Sample output
Start Kafka-Zookeeper
Start Kafka
Create stream table
Check out the table config in the Rest API to make sure it was successfully uploaded.
Logical table
A logical table provides a unified query interface over multiple physical tables. This is useful for geographic partitioning, table sharding strategies, or creating abstraction layers over complex table hierarchies.
HDFS implementation provides the following options:
hadoop.conf.path: Absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml, core-site.xml .
hadoop.write.checksum: Create checksum while pushing an object. Default is false
hadoop.kerberos.principle
hadoop.kerberos.keytab
Each of these properties should be prefixed by pinot.[node].storage.factory.class.hdfs. where node is either controller or server depending on the config
The kerberos configs should be used only if your Hadoop installation is secured with Kerberos. Refer to the Hadoop in secure mode documentation for information on how to secure Hadoop using Kerberos.
You must provide proper Hadoop dependencies jars from your Hadoop installation to your Pinot startup scripts.
Push HDFS segment to Pinot Controller
To push HDFS segment files to the Pinot controller, send the HDFS path of your newly created segment files to the controller; the controller will download the files.
The following curl example tells the controller to download segment files to the proper table:
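A hedged sketch of such a request (host, table name, and HDFS path are illustrative):

```
curl -X POST -H "UPLOAD_TYPE:URI" \
  -H "DOWNLOAD_URI:hdfs://path/to/segment/file" \
  -H "content-type:application/json" -d '' \
  "localhost:9000/v2/segments?tableName=myTable"
```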
Examples
Job spec
Standalone Job:
Hadoop Job:
Controller config
Server config
Minion config
HDFS as deep storage
To use HDFS as deep storage, configure each Pinot component with the HDFS plugin and the appropriate storage factory and segment fetcher properties. The sections below provide complete configuration and startup examples for each component.
Server setup
Configuration
Executable
Controller setup
Configuration
Executable
Broker setup
Configuration
Executable
Kerberos authentication
When using HDFS with Kerberos security enabled, Pinot provides two ways to authenticate:
1. Automatic authentication (recommended)
By configuring the storage.factory Kerberos properties shown above, Pinot will automatically handle Kerberos authentication using the specified keytab and principal. This eliminates the need for manual kinit commands and ensures continuous authentication even after ticket expiration.
Why these properties are required
The storage.factory Kerberos properties serve a critical purpose in Pinot's HDFS integration:
For Controller:
The controller uses controller.data.dir to store segment metadata and other data in HDFS
When controller.data.dir points to an HDFS path (e.g., hdfs://namenode:8020/pinot/data), the HadoopPinotFS plugin needs Kerberos credentials to access it
Without storage.factory Kerberos properties, the controller would fail to read/write to HDFS, causing segment upload and metadata operations to fail
These properties enable the HadoopPinotFS plugin to programmatically authenticate using the keytab file
For Server:
The server uses HadoopPinotFS for various HDFS operations including segment downloads and deep storage access
When servers need to access segments stored in HDFS deep storage, they require valid Kerberos credentials
The storage.factory properties provide persistent authentication that survives across server restarts and ticket expirations
Understanding the two sets of Kerberos properties
You may notice two sets of Kerberos properties in the configuration:
Purpose: These properties configure Kerberos authentication for the HadoopPinotFS storage factory, which handles controller and server deep storage operations and general HDFS filesystem operations through the storage factory.
Why needed: The storage factory is initialized at startup and used throughout the component's lifecycle for HDFS access. Without these properties, any HDFS operation through the storage factory would fail with authentication errors.
segment.fetcher properties (legacy, for backward compatibility):
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle (note: typo "principle" instead of "principal" maintained for compatibility)
Eliminates the need to run kinit commands manually, reducing operational overhead and human error
Kerberos tickets typically expire after 24 hours (configurable); with keytab-based authentication, Pinot automatically renews tickets internally, preventing service disruptions
Keytab files provide secure, long-term credentials without storing passwords in scripts or configuration
2. Manual authentication (legacy)
Alternatively, you can manually authenticate using kinit before starting Pinot components:
Limitations of manual authentication:
Ticket expiration: Kerberos tickets typically expire after 24 hours, requiring re-authentication
Service interruption: If tickets expire while Pinot is running, HDFS operations will fail until re-authentication
Operational burden: Requires monitoring and manual intervention, especially problematic for 24/7 production systems
Automation challenges: Difficult to integrate into automated deployment pipelines
Manual authentication is not recommended for production environments. Always use the storage.factory Kerberos properties for production deployments.
Troubleshooting
HDFS FileSystem issues
If you receive an error that says No FileSystem for scheme "hdfs", the problem is likely a class loading issue.
To fix, try adding the following property to core-site.xml:
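A commonly used property that pins the hdfs scheme to the Hadoop implementation class:

```xml
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
```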
Then add /opt/pinot/lib/hadoop-common-<release-version>.jar to the classpath.
Kerberos authentication issues
Error: "Failed to authenticate with Kerberos"
Possible causes:
Incorrect keytab path: Ensure the keytab file path is absolute and accessible by the Pinot process
Wrong principal name: Verify the principal name matches the one in the keytab file
Keytab file permissions: The keytab file must be readable by the user running Pinot (typically chmod 400 or chmod 600)
Solution:
Error: "GSSException: No valid credentials provided"
Cause: This typically occurs when the storage.factory Kerberos properties are not set, the keytab file path is incorrect or the file doesn't exist, or the Kerberos configuration (krb5.conf) is not properly configured.
Solution:
Verify all storage.factory Kerberos properties are correctly set in the configuration
Ensure the keytab file exists and has correct permissions
Check that /etc/krb5.conf (or $JAVA_HOME/jre/lib/security/krb5.conf) is properly configured with your Kerberos realm settings
Error: "Unable to obtain Kerberos password" or "Clock skew too great"
Cause: Time synchronization issue between Pinot server and Kerberos KDC.
Solution:
Kerberos requires clock synchronization within 5 minutes (default) between client and KDC.
Error: "HDFS operation fails after running for several hours"
Cause: This typically indicates that manual kinit was used instead of storage.factory properties, and Kerberos tickets have expired (default 24 hours).
Solution:
Configure storage.factory Kerberos properties to enable automatic ticket renewal
Remove any manual kinit from startup scripts
Restart Pinot components to apply the configuration
Verifying Kerberos configuration
To verify your Kerberos setup is working correctly:
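A command sequence of this shape (paths, principal, and realm are placeholders) is one way to validate the keytab and HDFS access outside of Pinot:

```
# Inspect the principals stored in the keytab
klist -kt /path/to/pinot.keytab

# Obtain a ticket using the keytab, then confirm one was granted
kinit -kt /path/to/pinot.keytab pinot/host.example.com@EXAMPLE.COM
klist

# Confirm HDFS access works with the obtained credentials
hadoop fs -ls /
```

If these commands succeed but Pinot still fails to authenticate, the problem is most likely in the storage.factory configuration rather than the Kerberos setup itself.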
Best practices
Use absolute paths for keytab files in configuration
Secure keytab files with appropriate permissions (400 or 600)
Use service principals (e.g., pinot/hostname@REALM) rather than user principals for production
Monitor Kerberos ticket expiration in logs to ensure automatic renewal is working
Keep keytab files backed up in secure locations
Test configuration in a non-production environment first
Understand how the components of Apache Pinot™ work together to create a scalable OLAP database that can deliver low-latency, high-concurrency queries at scale.
Apache Pinot™ is a distributed OLAP database designed to serve real-time, user-facing use cases, which means handling large volumes of data and many concurrent queries with very low query latencies. Pinot supports the following requirements:
Ultra low-latency queries (as low as 10ms P95)
High query concurrency (as many as 100,000 queries per second)
High data freshness (streaming data available for query immediately upon ingestion)
Large data volume (up to petabytes)
Distributed design principles
To accommodate large data volumes with stringent latency and concurrency requirements, Pinot is designed as a distributed database that supports the following requirements:
Highly available: Pinot has no single point of failure. When tables are configured for replication, and a node goes down, the cluster is able to continue processing queries.
Horizontally scalable: Operators can scale a Pinot cluster by adding new nodes when the workload increases. There are even two node types (brokers and servers) to scale query volume, query complexity, and data size independently.
Immutable data
Core components
As described in the Pinot , Pinot has four node types:
Apache Helix and ZooKeeper
Distributed systems do not maintain themselves, and in fact require sophisticated scheduling and resource management to function. Pinot uses Apache Helix for this purpose. Helix exists as an independent project, but it was designed by the original creators of Pinot for Pinot's own cluster management purposes, so the architectures of the two systems are well-aligned. Helix takes the form of a process on the controller, plus embedded agents on the brokers and servers. It uses Apache ZooKeeper as a fault-tolerant, strongly consistent, durable state store.
Helix maintains a picture of the intended state of the cluster, including the number of servers and brokers, the configuration and schema of all tables, connections to streaming ingest sources, currently executing batch ingestion jobs, the assignment of table segments to the servers in the cluster, and more. All of these configuration items are potentially mutable quantities, since operators routinely change table schemas, add or remove streaming ingest sources, begin new batch ingestion jobs, and so on. Additionally, physical cluster state may change as servers and brokers fail or suffer network partition. Helix works constantly to drive the actual state of the cluster to match the intended state, pushing configuration changes to brokers and servers as needed.
There are three physical node types in a Helix cluster:
Participant: These nodes do things, like store data or perform computation. Participants host resources, which are Helix's fundamental storage abstraction. Because Pinot servers store segment data, they are participants.
Spectator: These nodes see things, observing the evolving state of the participants through events pushed to the spectator. Because Pinot brokers need to know which servers host which segments, they are spectators.
In addition, Helix defines two logical components to express its storage abstraction:
Partition: A unit of data storage that lives on at least one participant. Partitions may be replicated across multiple participants. A Pinot segment is a partition.
Resource: A logical collection of partitions, providing a single view over a potentially large set of data stored across a distributed system. A Pinot table is a resource.
In summary, the Pinot architecture maps onto Helix components as follows:
Pinot Component
Helix Component
Helix uses ZooKeeper to maintain cluster state. ZooKeeper sends Helix spectators notifications of changes in cluster state (which correspond to changes in ZNodes). Zookeeper stores the following information about the cluster:
Resource
Stored Properties
Because ZooKeeper is a first-class citizen of a Pinot cluster, operators may use its well-known ZNode structure for operations and troubleshooting purposes. Be advised that this structure can change in future Pinot releases.
Controller
The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of and ). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
Fault tolerance
Only one controller can be active at a time, so when multiple controllers are present in a cluster, they elect a leader. When that controller instance becomes unavailable, the remaining instances automatically elect a new leader. Leader election is achieved using Apache Helix. A Pinot cluster can serve queries without an active controller, but it can't perform any metadata-modifying operations, like adding a table or consuming a new segment.
Controller REST interface
The controller provides a REST interface that allows read and write access to all logical storage resources (e.g., servers, brokers, tables, and segments). See for more information on the web-based admin tool.
Broker
The broker's responsibility is to route queries to the appropriate server instances, or in the case of multi-stage queries, to compute a complete query plan and distribute it to the servers required to execute it. The broker collects and merges the responses from all servers into a final result, then sends the result back to the requesting client. The broker exposes an HTTP endpoint that accepts SQL queries in JSON format and returns the response in JSON.
Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured strategy for a table. The default strategy is to balance the query load across all available servers.
Advanced routing strategies are available, such as replica-aware routing, partition-based routing, and minimal server selection routing.
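The routing table described above can be sketched as a toy model (server and segment names are hypothetical, and this is not Pinot's implementation; it only illustrates mapping segments to replica servers and balancing query load across them):

```python
# Toy model of a broker routing table: each segment maps to the list of
# servers holding a replica, and the broker spreads work across replicas.
from collections import defaultdict

routing_table = {
    "segment_1": ["server_a", "server_b"],  # replication factor 2
    "segment_2": ["server_b", "server_c"],
    "segment_3": ["server_a", "server_c"],
}

def plan_query(routing_table, rotation=0):
    """Pick one replica per segment, rotating choices to spread load."""
    plan = defaultdict(list)
    for i, (segment, replicas) in enumerate(sorted(routing_table.items())):
        server = replicas[(i + rotation) % len(replicas)]
        plan[server].append(segment)
    return dict(plan)

# Every segment is queried on exactly one of its replica servers.
print(plan_query(routing_table))
```

Varying the rotation parameter between queries mimics the default strategy of balancing load across all available servers.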
Query processing
Every query processed by a broker uses the single-stage engine or the multi-stage engine. For single-stage queries, the broker does the following:
Computes query routes based on the routing strategy defined in the configuration.
Computes the list of segments to query on each server. (See for further details on this process.)
Sends the query to each of those servers for local execution against their segments.
For multi-stage queries, the broker performs the following:
Computes a query plan that runs on multiple sets of servers. The servers selected for the first stage are selected based on the segments required to execute the query, which are determined in a process similar to single-stage queries.
Sends the relevant portions of the query plan to one or more servers in the cluster for each stage of the query plan.
The servers that received query plans each execute their part of the query. For more details on this process, read about the .
Server
Servers host segments on locally attached storage and process queries on those segments. By convention, operators speak of "real-time" and "offline" servers, although there is no difference in the server process itself or even its configuration that distinguishes between the two. This is merely a convention reflected in the assignment strategy to confine the two different kinds of workloads to two groups of physical instances, since the performance-limiting factors differ between the two kinds of workloads. For example, offline servers might optimize for larger storage capacity, whereas real-time servers might optimize for memory and CPU cores.
Offline servers
Offline servers host segments created by ingesting batch data. The controller writes these segments to the offline server according to the table's replication factor and segment assignment strategy. Typically, the controller writes new segments to the deep store, and affected servers download the segment from deep store. The controller then notifies brokers that a new segment exists and is available to participate in queries.
Because offline tables tend to have long retention periods, offline servers tend to scale based on the size of the data they store.
Real-time servers
Real-time servers ingest data from streaming sources, like Apache Kafka®, Apache Pulsar®, or AWS Kinesis. Streaming data ends up in conventional segment files just like batch data, but is first accumulated in an in-memory data structure known as a consuming segment. Each message consumed from a streaming source is written immediately to the relevant consuming segment, and is available for query processing from the consuming segment immediately, since consuming segments participate in query processing as first-class citizens. Consuming segments get flushed to disk periodically based on a completion threshold, which can be calculated by row count, ingestion time, or segment size. A flushed segment on a real-time table is called a completed segment, and is functionally equivalent to a segment created during offline ingest.
Real-time servers tend to be scaled based on the rate at which they ingest streaming data.
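The consuming-segment lifecycle described above can be sketched as follows (the class and threshold names are hypothetical; Pinot's actual flush thresholds are set in the table configuration):

```python
# Hypothetical sketch of a consuming segment's completion check:
# the segment accumulates messages in memory and is flushed once any
# threshold (row count, ingestion time, or segment size) is crossed.
from dataclasses import dataclass, field
import time

@dataclass
class ConsumingSegment:
    max_rows: int
    max_age_seconds: float
    max_size_bytes: int
    created_at: float = field(default_factory=time.time)
    rows: int = 0
    size_bytes: int = 0

    def consume(self, message: bytes) -> None:
        # Each message is written to the in-memory segment immediately,
        # so it is queryable before the segment is ever flushed to disk.
        self.rows += 1
        self.size_bytes += len(message)

    def is_complete(self) -> bool:
        # Flush (commit to a completed segment) when any threshold is hit.
        return (
            self.rows >= self.max_rows
            or self.size_bytes >= self.max_size_bytes
            or time.time() - self.created_at >= self.max_age_seconds
        )

seg = ConsumingSegment(max_rows=3, max_age_seconds=3600, max_size_bytes=1 << 20)
for msg in (b"a", b"bb", b"ccc"):
    seg.consume(msg)
print(seg.is_complete())  # True: the row-count threshold was reached
```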
Minion
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function without minions, they are typically present to support routine tasks like ingesting batch data.
Data ingestion overview
Pinot tables exist in two varieties: offline (or batch) and real-time. Offline tables contain data from batch sources like CSV, Avro, or Parquet files, and real-time tables contain data from streaming sources like Apache Kafka®, Apache Pulsar®, or AWS Kinesis.
Offline (batch) ingest
Pinot ingests batch data using an ingestion job, which follows a process like this:
The job transforms a raw data source (such as a CSV file) into segments. This is a potentially complex process resulting in a file that is typically several hundred megabytes in size.
The job then transfers the segment file to the cluster's deep store and notifies the controller that a new segment exists.
The controller (in its capacity as a Helix controller) updates the ideal state of the cluster in its cluster metadata map.
Real-time ingest
Ingestion is established at the time a real-time table is created, and continues as long as the table exists. When the controller receives the metadata update to create a new real-time table, the table configuration specifies the source of the streaming input data—often a topic in a Kafka cluster. This kicks off a process like this:
The controller picks one or more servers to act as direct consumers of the streaming input source.
The controller creates consuming segments for the new table. It does this by creating an entry in the global metadata map for a new consuming segment for each of the real-time servers selected in step 1.
Through Helix functionality on the controller and the relevant servers, the servers proceed to create consuming segments in memory and establish a connection to the streaming input source. When this input source is Kafka, each server acts as a Kafka consumer directly, with no other components involved in the integration.
Quick Start Examples
This section describes quick start commands that launch all Pinot components in a single process.
Pinot ships with QuickStart commands that launch Pinot components in a single process and import pre-built datasets. These quick start examples are a good place to start if you're new to Pinot. The examples begin with the Batch Processing example, after the following notes:
Prerequisites
You must have either Docker or a local Pinot installation. The examples are available for both options and work the same. Which to choose depends on your installation preference and how you generally like to work. If you don't know which to choose, using Docker will make your cleanup easier after you are done with the examples.
Ingestion Transformations
Raw source data often needs to undergo some transformations before it is pushed to Pinot.
Transformations include extracting records from nested objects, applying simple transform functions on certain columns, filtering out unwanted columns, as well as more advanced operations like joining between datasets.
A preprocessing job is usually needed to perform these operations. In streaming data sources, you might write a Samza job and create an intermediate topic to store the transformed data.
For simple transformations, this can result in inconsistencies in the batch/stream data source and increase maintenance and operator overhead.
To make things easier, Pinot supports transformations that can be applied via the table configuration.
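For example, a table's ingestion configuration can declare a transform using a built-in function (the column names here are hypothetical; transformConfigs is the relevant section of the table config):

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "hoursSinceEpoch",
      "transformFunction": "toEpochHours(millisSinceEpoch)"
    }
  ]
}
```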
Supported Data Formats
This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.
Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent doing serialization-deserialization and speed up the ingestion.
Configuring input formats
To change the input format, adjust the recordReaderSpec config in the ingestion job specification.
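For instance, a recordReaderSpec for CSV input might look like this in the ingestion job spec (the class names shown are the standard Pinot CSV input-format plugin classes; treat this as a sketch to adapt):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
```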
Batch Ingestion Guide
Batch ingestion of data into Apache Pinot.
With batch ingestion you create a table using data already present in a file system such as S3. This is particularly useful when you want to use Pinot to query across large data with minimal latency or to test out new features using a simple data file.
Choosing a Batch Ingestion Mode
Pinot provides several batch ingestion modes. Use the table below to pick the one that fits your environment and data scale.
Complex Type Examples (Unnest)
Additional examples that demonstrate handling of complex types.
Unnest Root Level Collection
In this example, we look at unnesting JSON records that are batched together under a single key at the root level. We make use of the configs to persist the individual student records as separate rows in Pinot.
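A complexTypeConfig for this scenario might look like the following (the field name "students" is hypothetical, standing in for the root-level key holding the batched records):

```json
"ingestionConfig": {
  "complexTypeConfig": {
    "fieldsToUnnest": ["students"]
  }
}
```

Each element of the unnested collection then becomes its own row in the Pinot table.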
select count(*) from my_table where column IS NOT NULL
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
# For server, instructing the HadoopPinotFS plugin to use the specified keytab and principal when accessing HDFS paths
pinot.server.storage.factory.hdfs.hadoop.kerberos.principle=<hdfs-principle>
pinot.server.storage.factory.hdfs.hadoop.kerberos.keytab=<hdfs-keytab>
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
pinot.set.instance.id.to.hostname=true
pinot.server.instance.dataDir=/path/in/local/filesystem/for/pinot/data/server/index
pinot.server.instance.segmentTarDir=/path/in/local/filesystem/for/pinot/data/server/segment
pinot.server.grpc.enable=true
pinot.server.grpc.port=8090
controller.data.dir=hdfs://path/in/hdfs/for/controller/segment
controller.local.temp.dir=/tmp/pinot/
controller.zk.str=<ZOOKEEPER_HOST:ZOOKEEPER_PORT>
controller.enable.split.commit=true
controller.access.protocols.http.port=9000
controller.helix.cluster.name=PinotCluster
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
# For controller, instructing the HadoopPinotFS plugin to use the specified keytab and principal when accessing the HDFS path defined in controller.data.dir
pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle=<hdfs-principle>
pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab=<hdfs-keytab>
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
controller.vip.port=9000
controller.port=9000
pinot.set.instance.id.to.hostname=true
pinot.server.grpc.enable=true
Purpose: These configure Kerberos for the segment fetcher component specifically.
Why both are needed: While there is some functional overlap, having both ensures complete coverage of all HDFS access patterns, backward compatibility with existing deployments, and independent operation of the segment fetcher.
Immutable data: Pinot assumes all stored data is immutable, which helps simplify the parts of the system that handle data storage and replication. However, Pinot still supports upserts on streaming entity data and background purges of data to comply with data privacy regulations.
Dynamic configuration changes: Operations like adding new tables, expanding a cluster, ingesting data, modifying an existing table, and adding indexes do not impact query availability or performance.
Controller: This node observes and manages the state of participant nodes. The controller is responsible for coordinating all state transitions in the cluster and ensures that state constraints are satisfied while maintaining cluster stability.
Broker
A Helix Spectator that observes the cluster for changes in the state of segments and servers. To support multi-tenancy, brokers are also modeled as Helix Participants.
Minion
Helix Participant that performs computation rather than storing data
Receives the results from each server and merges them.
Sends the query result to the client.
The broker receives a complete result set from the final stage of the query, which is always a single server.
The broker sends the query result to the client.
The controller then assigns the segment to one or more "offline" servers (depending on replication factor) and notifies them that new segments are available.
The servers then download the newly created segments directly from the deep store.
The cluster's brokers, which watch for state changes as Helix spectators, detect the new segments and update their segment routing tables accordingly. The cluster is now able to query the new offline segments.
Through Helix functionality on the controller and all of the cluster's brokers, the brokers become aware of the consuming segments, and begin including them in query routing immediately.
The consuming servers simultaneously begin consuming messages from the streaming input source, storing them in the consuming segment.
When a server decides its consuming segment is complete, it commits the in-memory consuming segment to a conventional segment file, uploads it to the deep store, and notifies the controller.
The controller and the server create a new consuming segment to continue real-time ingestion.
The controller marks the newly committed segment as online. Brokers then discover the new segment through the Helix notification mechanism, allowing them to route queries to it in the usual fashion.
Segment
Helix Partition
Table
Helix Resource
Controller
Helix Controller or Helix agent that drives the overall state of the cluster
Server
Controller
- Controller that is assigned as the current leader
Servers and Brokers
- List of servers and brokers - Configuration of all current servers and brokers - Health status of all current servers and brokers
Tables
- List of tables - Table configurations - Table schema - List of the table's segments
- Exact server locations of a segment - State of each segment (online/offline/error/consuming) - Metadata about each segment
Pinot versions in examples
The Docker-based examples on this page use pinot:latest, which instructs Docker to pull and use the most recent release of Apache Pinot. If you prefer to use a specific release instead, you can designate it by replacing latest with the release number, like this: pinot:0.12.1.
The local install-based examples that are run using the launcher scripts will use the Apache Pinot version you installed.
Stopping a running example
To stop a running example, enter Ctrl+C in the same terminal where you ran the docker run command to start the example.
macOS Monterey Users
By default, the AirPlay receiver server runs on port 7000, which is also the port used by the Pinot Server in the Quick Start. You may see the following error when running these examples:
If you disable the Airplay receiver server and try again, you shouldn't see this error message anymore.
Command Options
All QuickStart commands support the following optional parameters in addition to -type:
Option
Aliases
Description
-type
The quickstart type to run (see sections below).
-tmpDir
-quickstartDir, -dataDir
Directory to store quickstart data. Use this to persist data across restarts so that tables and segments are reloaded from disk instead of being regenerated.
Example: Persist data across restarts
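Using the launcher scripts, this might look like the following (the directory path is an example; -tmpDir is the option from the table above):

```shell
./bin/pinot-admin.sh QuickStart -type batch -tmpDir /tmp/pinot-quickstart
```

Running the same command again with the same -tmpDir reloads the tables and segments from disk instead of regenerating them.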
Example: Use an external ZooKeeper and custom config
Example: Load custom tables into an empty cluster
Batch Processing
This example demonstrates how to do batch processing with Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the baseballStats table
Launches a standalone data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
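With Docker, the batch quickstart can be launched with a command of this shape (port 9000 exposes the controller UI; replace latest with a specific release if preferred):

```shell
docker run -p 9000:9000 apachepinot/pinot:latest QuickStart -type batch
```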
Batch JSON
This example demonstrates how to import and query JSON documents in Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the githubEvents table
Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Batch with complex data types
This example demonstrates how to do batch processing in Pinot where the data items have complex fields that need to be unnested. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the githubEvents table
Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Streaming
This example demonstrates how to do stream processing with Pinot. The command:
Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot
Issues sample queries to Pinot
Streaming with minion cleanup
This example demonstrates how to do stream processing in Pinot with RealtimeToOfflineSegmentsTask and MergeRollupTask minion tasks continuously optimizing segments as data gets ingested. The command:
Publishes data to a Kafka topic githubEvents that is subscribed to by Pinot.
Issues sample queries to Pinot
Streaming with complex data types
This example demonstrates how to do stream processing in Pinot where the stream contains items that have complex fields that need to be unnested. The command:
Launches a standalone data ingestion job that builds segments under a given directory of Avro files for the airlineStats table and pushes the segments to the Pinot Controller.
Launches a stream of flights stats
Publishes data to a Kafka topic airlineStatsEvents that is subscribed to by Pinot.
Issues sample queries to Pinot
Join
This example demonstrates how to do joins in Pinot using the Lookup UDF. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server in the same container.
Creates the baseballStats table
Launches a data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.
Creates the dimBaseballTeams table
Launches a data ingestion job that builds one segment for a given CSV data file for the dimBaseballTeams table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
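A lookup query against these tables might look like this (column names are taken to match the baseballStats quickstart dataset and should be treated as illustrative):

```sql
SELECT
  playerName,
  teamID,
  LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName
FROM baseballStats
LIMIT 10
```

The lookup UDF joins each fact-table row against the dimension table by the given key, returning the requested dimension column.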
Logical Table
This example demonstrates how to use logical tables in Pinot, which provide a unified query interface over multiple physical tables. The command:
Creates three physical tables (ordersUS_OFFLINE, ordersEU_OFFLINE, ordersAPAC_OFFLINE) representing regional order data
Creates a logical table (orders) that provides a unified view over all regional tables
Issues sample queries to both physical and logical tables
For more details on logical tables, see Logical Table.
Empty
This example starts a bare Pinot cluster with no tables or data loaded. Use this when you want to set up your own tables and schemas from scratch. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
No tables or data are created
Multi-Stage Query Engine
This example demonstrates the multi-stage query engine with self-joins, dimension table joins, and vector distance queries. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the baseballStats table and a fine food reviews table
Launches data ingestion jobs to build segments and push them to the Pinot Controller.
Issues sample multi-stage queries including joins and vector distance queries
Partial Upsert
This example demonstrates how to do stream processing with partial upsert in Pinot, where individual fields can be updated independently while preserving other column values. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates a table configured for partial upsert
Publishes data to a Kafka topic that is subscribed to by Pinot
Issues sample queries demonstrating partial upsert behavior
Null Handling
This example demonstrates null value handling features in Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates a table containing null values
Launches a data ingestion job and pushes segments to the Pinot Controller.
Issues sample queries demonstrating IS NULL, IS NOT NULL, and aggregate behavior with nulls
TPC-H
This example loads the 8 TPC-H benchmark tables (customer, lineitem, nation, orders, part, partsupp, region, supplier) for multi-stage query testing. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates all 8 TPC-H tables
Launches data ingestion jobs to build segments for each table and pushes them to the Pinot Controller.
Issues sample TPC-H benchmark queries using the multi-stage query engine
Colocated Join
This example demonstrates colocated join operations using the multi-stage query engine with various partition configurations and parallelism hints. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates tables with matching partition configurations for colocated joins
Launches data ingestion jobs and pushes segments to the Pinot Controller.
Issues sample colocated join queries
Lookup Join
This example demonstrates the lookup join strategy using dimension tables with the multi-stage query engine. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates fact and dimension tables
Launches data ingestion jobs and pushes segments to the Pinot Controller.
Issues sample lookup join queries
Auth
This example demonstrates how to run Pinot with basic authentication enabled. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server with basic auth configured.
Creates tables and loads data with authentication enabled
Issues sample authenticated queries to Pinot
Sorted Column
This example demonstrates sorted column indexing in Pinot with a generated dataset containing sorted columns. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates a table with sorted column configuration
Generates a 100,000-row dataset and ingests it into Pinot
Issues sample queries demonstrating sorted index performance
Timestamp Index
This example demonstrates timestamp index functionality, showing timestamp extraction at different granularities and dateTrunc bucketing. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the airlineStats table with timestamp indexes
Launches a data ingestion job and pushes segments to the Pinot Controller.
Issues sample queries demonstrating timestamp extraction and bucketing
GitHub Events
This example sets up a streaming demo using GitHub events data. The command:
Publishes GitHub event data to a Kafka topic that is subscribed to by Pinot
Issues sample analytical queries on the GitHub event data
Multi-Cluster
This example demonstrates cross-cluster querying via logical tables by initializing two independent Pinot clusters. The command:
Starts two independent Pinot clusters, each with their own Zookeeper, Controller, Broker, and Server.
Creates physical tables in each cluster
Creates a logical table that spans both clusters
Issues sample cross-cluster queries
Batch with Multi-Directory (Tiered Storage)
This example demonstrates multi-directory (tiered storage) support with hot and cold tiers. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server with tiered storage configured.
Creates the airlineStats table with hot and cold storage tiers
Launches a data ingestion job and pushes segments to the Pinot Controller.
Issues sample queries that run across storage tiers
Time Series
For production use, you should ideally implement your own Time Series Language Plugin. The one included in the Pinot distribution is only for demonstration purposes.
This example demonstrates Pinot's Time Series Engine, which supports running pluggable Time Series Query Languages via a Language Plugin architecture. The default Pinot binary includes a toy Time Series Query Language using the same name as Uber's language "m3ql". You can try the following query as an example:
If a new column is added to your table or schema configuration during ingestion, incorrect data may appear in the consuming segment(s). To ensure accurate values are reloaded, see how to add a new column during ingestion.
Transformation functions
Pinot supports the following functions:
Groovy functions
Built-in functions
A transformation function cannot mix Groovy and built-in functions; only use one type of function at a time.
Groovy functions
Groovy functions can be defined using the syntax:
Any valid Groovy expression can be used.
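As a sketch of the Groovy({expression}, arguments...) syntax inside a transform config (the column names fullName, firstName, and lastName are illustrative):

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "fullName",
        "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
      }
    ]
  }
}
```

The first argument is the Groovy expression in braces; the remaining arguments name the source fields bound into the expression.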
⚠️Enabling Groovy
Allowing executable Groovy in ingestion transformation can be a security vulnerability. To enable Groovy for ingestion, set the following controller configuration:
controller.disable.ingestion.groovy=false
If not set, Groovy for ingestion transformation is disabled by default.
Built-in Pinot functions
All the functions defined in this directory annotated with @ScalarFunction (for example, toEpochSeconds) are supported ingestion transformation functions.
Below are some commonly used built-in Pinot functions for ingestion transformations.
DateTime functions
These functions enable time transformations.
toEpochXXX
Converts from epoch milliseconds to a higher granularity.
Function name
Description
toEpochSeconds
Converts epoch millis to epoch seconds. Usage: "toEpochSeconds(millis)"
toEpochMinutes
Converts epoch millis to epoch minutes. Usage: "toEpochMinutes(millis)"
toEpochHours
Converts epoch millis to epoch hours. Usage: "toEpochHours(millis)"
toEpochXXXRounded
Converts from epoch milliseconds to another granularity, rounding down to the nearest rounding bucket. For example, 1588469352000 (2020-05-03 01:29:12) is 26474489 minutesSinceEpoch, and toEpochMinutesRounded(1588469352000, 10) = 26474480 (2020-05-03 01:20:00).
Function Name
Description
toEpochSecondsRounded
Converts epoch millis to epoch seconds, rounding to the nearest rounding bucket. Usage: "toEpochSecondsRounded(millis, 30)"
toEpochMinutesRounded
Converts epoch millis to epoch minutes, rounding to the nearest rounding bucket. Usage: "toEpochMinutesRounded(millis, 10)"
toEpochHoursRounded
Converts epoch millis to epoch hours, rounding to the nearest rounding bucket. Usage: "toEpochHoursRounded(millis, 6)"
fromEpochXXX
Converts from an epoch granularity to milliseconds.
Function Name
Description
fromEpochSeconds
Converts from epoch seconds to milliseconds. Usage: "fromEpochSeconds(secondsSinceEpoch)"
fromEpochMinutes
Converts from epoch minutes to milliseconds. Usage: "fromEpochMinutes(minutesSinceEpoch)"
fromEpochHours
Converts from epoch hours to milliseconds. Usage: "fromEpochHours(hoursSinceEpoch)"
Simple date format
Converts simple date format strings to milliseconds and vice versa, per the provided pattern string.
Function name
Description
toDateTime
Converts from milliseconds to a formatted date time string, as per the provided pattern. Usage: "toDateTime(millis, 'yyyy-MM-dd')"
fromDateTime
Converts a formatted date time string to milliseconds, as per the provided pattern. Usage: "fromDateTime(dateTimeStr, 'EEE MMM dd HH:mm:ss ZZZ yyyy')"
json_format
Converts a JSON/Avro complex object to a string. The resulting JSON string can then be queried with JSON extraction functions. Usage: "json_format(jsonMapField)"
Types of transformation
Filtering
Records can be filtered as they are ingested. A filter function can be specified in the filterConfig section of the ingestionConfig in the table config.
If the expression evaluates to true, the record will be filtered out. The expressions can use any of the transform functions described in the previous section.
Consider a table that has a column timestamp. If you want to filter out records that are older than timestamp 1589007600000, you could apply the following function:
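For instance, the filter could be expressed with a Groovy function like this (a sketch; records for which the expression evaluates to true are dropped):

```json
{
  "ingestionConfig": {
    "filterConfig": {
      "filterFunction": "Groovy({timestamp < 1589007600000}, timestamp)"
    }
  }
}
```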
Consider a table that has a string column campaign and a multi-value double column prices. If you want to filter out records where campaign = 'X' or 'Y' and the sum of all elements in prices is less than 100, you could apply the following function:
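A sketch of the corresponding Groovy filter (parenthesized to make the intended precedence explicit):

```json
{
  "ingestionConfig": {
    "filterConfig": {
      "filterFunction": "Groovy({(campaign == 'X' || campaign == 'Y') && prices.sum() < 100}, prices, campaign)"
    }
  }
}
```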
Filter config also supports SQL-like expressions of built-in scalar functions for filtering records (since release 0.11.0). Example:
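A sketch using the built-in strcmp scalar function (drops records where campaign equals 'X'):

```json
{
  "ingestionConfig": {
    "filterConfig": {
      "filterFunction": "strcmp(campaign, 'X') = 0"
    }
  }
}
```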
Column transformation
Transform functions can be defined on columns in the ingestion config of the table config.
For example, imagine that our source data contains the prices and timestamp fields. We want to extract the maximum price and store that in the maxPrices field and convert the timestamp into the number of hours since the epoch and store it in the hoursSinceEpoch field. You can do this by applying the following transformation:
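Under the stated assumptions (source fields prices and timestamp as described), the transformConfigs could look like:

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "maxPrices", "transformFunction": "Groovy({prices.max()}, prices)" },
      { "columnName": "hoursSinceEpoch", "transformFunction": "toEpochHours(timestamp)" }
    ]
  }
}
```

Note that each transformFunction uses only one kind of function (Groovy or built-in), per the restriction above.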
Below are some examples of commonly used functions.
String concatenation
Concat firstName and lastName to get fullName
Find an element in an array
Find max value in array bids
Time transformation
Convert timestamp from MILLISECONDS to HOURS
Column name change
Change name of the column from user_id to userId
Rename fields from a Kafka JSON message
Kafka JSON payloads often use keys that aren’t great Pinot column names. Common examples are keys containing -, such as event-id.
Map the source key to a schema-friendly column using transformConfigs. Reference the source key with a quoted identifier.
Add the destination columns (for example, event_id) to your Pinot schema.
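A sketch, using the event-id example from above (the source key is referenced as a quoted identifier):

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "event_id", "transformFunction": "\"event-id\"" }
    ]
  }
}
```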
Extract value from a column containing space
Pinot doesn't support column names that contain spaces, so if a source data column has a space, we need to store that value in a column with a supported name. To extract the value from "first Name" into the column firstName, use the following:
Ternary operation
If eventType is IMPRESSION, set impression to 1; similarly for CLICK.
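Sketched with Groovy ternaries (column names taken from the example):

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "impression", "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1 : 0}, eventType)" },
      { "columnName": "click", "transformFunction": "Groovy({eventType == 'CLICK' ? 1 : 0}, eventType)" }
    ]
  }
}
```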
AVRO Map
Store an Avro Map in Pinot as two multi-value columns. Sort the keys to maintain the mapping:
1) The keys of the map as map_keys
2) The values of the map as map_values
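A sketch, assuming the source Avro map field is named map_field (illustrative):

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "map_keys", "transformFunction": "Groovy({map_field.sort()*.key}, map_field)" },
      { "columnName": "map_values", "transformFunction": "Groovy({map_field.sort()*.value}, map_field)" }
    ]
  }
}
```

Sorting in both transforms keeps the i-th entry of map_keys aligned with the i-th entry of map_values.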
Chaining transformations
Transformations can be chained. This means that you can use a field created by a transformation in another transformation function.
For example, we might have the following JSON document in the data field of our source data:
We can apply one transformation to extract the userId and then another one to pull out the numerical part of the identifier:
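A sketch of the chained transforms; the identifier format (a numeric suffix after an underscore, e.g. user_123) is an assumption for illustration:

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      { "columnName": "userId", "transformFunction": "jsonPathString(data, '$.userId')" },
      { "columnName": "userIdInt", "transformFunction": "Groovy({Long.parseLong(userId.split('_')[1])}, userId)" }
    ]
  }
}
```

The second transform consumes the userId column produced by the first, illustrating chaining.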
Flattening
There are 2 kinds of flattening:
One record into many
This is not yet natively supported. You can write a custom Decoder/RecordReader if you want to use this. Once the Decoder generates multiple GenericRows from the provided input record, a List<GenericRow> should be set into the destination GenericRow with the key $MULTIPLE_RECORDS_KEY$. The segment generation drivers will treat this as a special case and handle the multiple-records case.
Extract attributes from complex objects
Feature TBD
Add a new column during ingestion
If a new column is added to table or schema configuration during ingestion, incorrect data may appear in the consuming segment(s).
To ensure accurate values are reloaded, do the following:
Pause consumption (and wait for pause status success):
$ curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption
className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.
configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configs.
configs: Key-value pair for format-specific configurations. This field is optional.
Supported input formats
Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.
CSV
CSV Record Reader supports the following configs:
fileFormat: default, rfc4180, excel, tdf, mysql
header: Header of the file. The column names should be separated by the delimiter mentioned in the configuration.
delimiter: The character separating the columns.
multiValueDelimiter: The character separating multiple values in a single column. This can be used to split a column into a list.
skipHeader: Skip header record in the file. Boolean.
ignoreEmptyLines: Ignore empty lines (instead of filling them with default values). Boolean.
ignoreSurroundingSpaces: Ignore spaces around column names and values. Boolean.
quoteCharacter: Single character used for quotes in CSV files.
recordSeparator: Character used to separate records in the input file. Default is \n or \r\n depending on the platform.
nullStringValue: String value that represents null in CSV files. Default is empty string.
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config.
multiValueDelimiter: ''
Avro
The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, the Avro record reader only supports primitive types. To enable support for the rest of the Avro data types, set enableLogicalTypes to true.
We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in org.apache.avro.Conversions.
Avro Data Type
Pinot Data Type
Comment
INT
INT
LONG
LONG
JSON
Thrift
Thrift requires the class generated from the .thrift file to parse the data. The .class file should be available in Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.
Parquet
Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the Parquet file footer and, if present, uses the Avro reader.
You can change the record reader manually in case of a misconfiguration.
For support of DECIMAL and other Parquet native data types, always use ParquetNativeRecordReader.
INT96
LONG
The Parquet INT96 type (nanoseconds) is converted to the Pinot LONG type in milliseconds.
INT64
LONG
INT32
INT
For ParquetAvroRecordReader, refer to the Avro section above for the type conversions.
ORC
ORC record reader supports the following data types -
ORC Data Type
Java Data Type
BOOLEAN
String
SHORT
Integer
INT
Integer
In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.
Protocol Buffers
The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using the command -
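For example, with protoc (file names are illustrative; --include_imports bundles imported .proto definitions into the descriptor):

```bash
protoc --include_imports --descriptor_set_out=sample.desc sample.proto
```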
Apache Arrow
The Arrow input format plugin supports reading data in Apache Arrow IPC streaming format. This is useful for ingesting data from systems that produce Arrow-formatted output.
The pinot-arrow plugin is included in the standard Pinot binary distribution (tarball and Docker image). The ArrowMessageDecoder is available out of the box, and no additional installation steps are required to use Apache Arrow format for data ingestion.
For stream ingestion, the Arrow decoder converts Arrow columnar batches to Pinot rows:
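A hedged sketch of the decoder wiring in streamConfigs; the fully qualified decoder class path and the decoder property prefix are assumptions, while the ArrowMessageDecoder name and the arrow.allocator.limit property come from this page:

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "arrow-events",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.arrow.ArrowMessageDecoder",
  "stream.kafka.decoder.prop.arrow.allocator.limit": "268435456"
}
```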
Configuration properties:
Property
Default
Description
arrow.allocator.limit
268435456 (256 MB)
Memory limit for Arrow's off-heap allocator in bytes
The decoder handles Arrow type conversions automatically: Text → String, LocalDateTime → Timestamp, Arrow Maps → flattened Map<String, Object>, and Arrow Lists → List<Object>. Dictionary-encoded columns are also supported.
Decision Guide
Mode
Best For
Infrastructure
Data Scale
Status
Standalone
Dev/test, small jobs, scripted pipelines
None (single JVM)
Up to a few GB
Recommended for dev
When to Use Each Mode
Standalone is the simplest option and requires no distributed computing framework. It runs segment generation in a single JVM process, making it ideal for development, testing, and small production jobs where data volumes are modest (up to a few GB). It is also well suited for scripted CI/CD pipelines.
Spark 3 is the recommended choice for production batch ingestion at scale. It distributes segment generation across a Spark 3.x cluster, enabling you to process datasets ranging from gigabytes to terabytes and beyond. If you are setting up a new Spark-based pipeline, use this mode.
Hadoop uses MapReduce to generate segments on a Hadoop cluster. It is considered legacy and is primarily useful if you have existing MapReduce infrastructure and pipelines that you cannot migrate away from.
Flink is a good fit for organizations that already run Apache Flink. It supports both batch and streaming modes and is especially useful for backfilling offline tables or bootstrapping upsert tables, since the Flink connector can write partitioned segments that participate correctly in upsert semantics.
LaunchDataIngestionJob is a CLI convenience wrapper that invokes the Standalone runner under the hood. Use it when you want to trigger ingestion from a shell command or cron job without writing custom code.
Maven Artifact Coordinates
All artifacts use the group ID org.apache.pinot. Replace ${pinot.version} with your Pinot release version.
Mode
Artifact ID
Notes
Standalone
pinot-batch-ingestion-standalone
Included in the Pinot binary distribution
Spark 3
pinot-batch-ingestion-spark-3
Located in plugins-external/pinot-batch-ingestion/
Example Maven dependency for Spark 3:
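Based on the coordinates above, the dependency would be:

```xml
<dependency>
  <groupId>org.apache.pinot</groupId>
  <artifactId>pinot-batch-ingestion-spark-3</artifactId>
  <version>${pinot.version}</version>
</dependency>
```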
Getting Started
To ingest data from a filesystem, perform the following steps, which are described in more detail in this page:
Create schema configuration
Create table configuration
Upload schema and table configs
Upload data
Batch ingestion currently supports the following mechanisms to upload the data:
Here's an example using standalone local processing.
First, create a table using the following CSV data.
Create schema configuration
In our data, the only column on which aggregations can be performed is score, and timestampInEpoch is the only timestamp column. So in our schema, we keep score as a metric and timestampInEpoch as the timestamp column.
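A sketch of the schema (the dimension columns are illustrative; score and timestampInEpoch follow the description above):

```json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "timestampInEpoch", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
  ]
}
```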
Here, we have also defined two extra fields: format and granularity. The format specifies the formatting of our timestamp column in the data source. Currently, it's in milliseconds, so we've specified 1:MILLISECONDS:EPOCH.
Create table configuration
We define a table transcript and map the schema created in the previous step to the table. For batch data, we keep the tableType as OFFLINE.
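A minimal sketch of the offline table config (the segment settings are illustrative):

```json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "metadata": {}
}
```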
Upload schema and table configs
Now that we have both the configs, upload them and create a table by running the following command:
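For example, with the pinot-admin CLI (file paths and controller address are illustrative):

```bash
bin/pinot-admin.sh AddTable \
  -schemaFile /path/to/transcript-schema.json \
  -tableConfigFile /path/to/transcript-table-offline.json \
  -controllerHost localhost \
  -controllerPort 9000 \
  -exec
```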
Check out the table config and schema in the Rest API to make sure it was successfully uploaded.
Upload data
We now have an empty table in Pinot. Next, upload the CSV file to this empty table.
A table is composed of multiple segments. The segments can be created in the following three ways:
There are 2 controller APIs that can be used for a quick ingestion test using a small file.
When these APIs are invoked, the controller has to download the file and build the segment locally.
Hence, these APIs are NOT meant for production environments or for large input files.
/ingestFromFile
This API creates a segment using the given file and pushes it to Pinot. All steps happen on the controller.
Example usage:
To upload a JSON file data.json to a table called foo_OFFLINE, use the command below.
Note that query params need to be URL-encoded. For example, {"inputFormat":"json"} in the command below needs to be converted to %7B%22inputFormat%22%3A%22json%22%7D.
The batchConfigMapStr can be used to pass in additional properties needed for decoding the file. For example, in the case of CSV, you may need to provide the delimiter.
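A sketch of the call against a local controller (host, port, and file name are illustrative; the URL-encoded batchConfigMapStr is the JSON shown above):

```bash
curl -X POST -F file=@data.json \
  -H "Content-Type: multipart/form-data" \
  "http://localhost:9000/ingestFromFile?tableNameWithType=foo_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"
```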
/ingestFromURI
This API creates a segment using file at the given URI and pushes it to Pinot. Properties to access the FS need to be provided in the batchConfigMap. All steps happen on the controller.
Example usage:
Ingestion jobs
Segments can be created and uploaded using tasks known as DataIngestionJobs. A job also needs a config of its own. We call this config the JobSpec.
For our CSV file and table, the JobSpec should look like this:
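A sketch of a standalone JobSpec (the directory paths are illustrative; the runner and reader class names follow the standalone batch ingestion and CSV plugins):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```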
Now that we have the job spec for our table transcript, we can trigger the job using the following command:
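For example (the job spec path is illustrative):

```bash
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/transcript-job-spec.yml
```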
Once the job successfully finishes, head over to the query console and start playing with the data.
Segment push job type
There are 3 ways to upload a Pinot segment:
Segment tar push
Segment URI push
Segment metadata push
Segment tar push
This is the original and default push mechanism.
Tar push requires the segment to be stored locally or to be openable as an InputStream on PinotFS, so the entire segment tar file can be streamed to the controller.
The push job will:
Upload the entire segment tar file to the Pinot controller.
Pinot controller will:
Save the segment into the controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
Segment URI push
This push mechanism requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
URI push is lightweight on the client side, while the controller side requires work equivalent to the tar push.
The push job will:
POST this segment tar URI to the Pinot controller.
Pinot controller will:
Download segment from the URI and save it to controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
Segment metadata push
This push mechanism also requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
Metadata push is lightweight on the controller side; no deep store download is involved on the controller side.
The push job will:
Download the segment based on URI.
Extract metadata.
Upload metadata to the Pinot Controller.
Pinot Controller will:
Add the segment to the table based on the metadata.
Segment metadata push with copyToDeepStore
This extends the original segment metadata push for cases where segments are pushed to a location that is not used as the deep store. The ingestion job can still do a metadata push but asks the Pinot controller to copy the segments into the deep store. These use cases usually arise when the ingestion jobs don't have direct access to the deep store but still want the efficiency of metadata push, so they use a staging location to keep the segments temporarily.
NOTE: the staging location and the deep store have to use the same storage scheme, for example both on S3. This is because the copy is done via the PinotFS.copyDir interface, which assumes so; it also means the copy happens on the storage system side, so segments don't need to go through the Pinot controller at all.
To make this work, grant Pinot controllers access to the staging location. For example on AWS, this may require adding an access policy like this example for the controller EC2 instances:
Then use metadata push to add one extra config like this one:
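A hedged sketch of the extra setting in the job spec (the exact property name may differ by release; the staging outputDirURI is illustrative):

```yaml
outputDirURI: 's3://my-staging-bucket/segments/'   # staging location, not the deep store
pushJobSpec:
  copyToDeepStoreForMetadataPush: true             # ask the controller to copy segments into deep store
```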
Consistent data push and rollback
Pinot supports atomic updates at the segment level. However, when data consisting of multiple segments is pushed to a table, segments are replaced one at a time, so queries to the broker during this upload phase may produce inconsistent results due to interleaving of old and new data.
When Pinot segment files are created in external systems (Hadoop, Spark, etc.), there are several ways to push the data to the Pinot controller and server:
Push segment to shared NFS and let Pinot pull segment files from the location of that NFS. See Segment URI Push.
Push segment to a Web server and let Pinot pull segment files from the Web server with an HTTP/HTTPS link. See Segment URI Push.
Push segment to PinotFS (for example, HDFS or S3) and let Pinot pull segment files from there. See Segment URI Push.
Push segment to other systems and implement your own segment fetcher to pull data from those systems.
The first three options are supported out of the box within the Pinot package. As long as your remote jobs send the Pinot controller the corresponding URI to the files, it will pick up the files and allocate them to the proper Pinot servers and brokers. To enable Pinot support for PinotFS, you'll need to provide the PinotFS configuration and proper Hadoop dependencies.
Persistence
By default, Pinot does not come with a storage layer, so if the system crashes, the ingested data will not be retained. In order to persistently store the generated segments, you will need to change the controller and server configs to add deep storage. Check out File systems for all the info and related configs.
Tuning
Standalone
Since Pinot is written in Java, you can set the following basic Java configurations to tune the segment runner job -
Log4j2 file location with -Dlog4j2.configurationFile
Plugin directory location with -Dplugins.dir=/opt/pinot/plugins
JVM props, like -Xmx8g -Xms4G
If you are using Docker, you can set the following under the JAVA_OPTS variable.
Hadoop
You can set -D mapreduce.map.memory.mb=8192 to set the mapper memory size when submitting the Hadoop job.
Spark
You can add config spark.executor.memory to tune the memory usage for segment creation when submitting the Spark job.
Sample JSON record
Pinot Schema
The Pinot schema for this example would look as follows.
Pinot Table Configuration
The Pinot table configuration for this schema would look as follows.
Data in Pinot
Post ingestion, the student records would appear as separate records in Pinot. Note that the nested field scores is captured as a JSON field.
Unnested Student Records
Unnest sibling collections
In this example, we would look at un-nesting the sibling collections "student" and "teacher".
Sample JSON Record
Pinot Schema
Pinot Table configuration
Data in Pinot
Unnested student records
Unnest nested collection
In this example, we would look at un-nesting the nested collection "students.grades".
Sample JSON Record
Pinot Schema
Pinot Table configuration
Data in Pinot
Unnest Nested Collection
Unnest Multi Level Array
In this example, we would look at un-nesting the array "finalExam" which is located within the array "students".
Sample JSON Record
Pinot Schema
Pinot Table configuration
Data in Pinot
Unnested Multi Level Array
Convert inner collections
In this example, the inner collection "grades" is converted into a multi value string column.
Sample JSON Record
Pinot Schema
Pinot Table configuration
Data in Pinot
Converted Inner Collection
Primitive Array Converted to JSON String
In this example, the array of primitives "extra_curricular" is converted to a Json string.
Sample JSON Record
Pinot Schema
Pinot Table configuration
Data in Pinot
Primitives Converted to JSON
Unnest JsonArrayString collections
In this example, the data is of STRING type and the content is a string-encoded JSON array.
In this case, unnesting won't happen automatically on a STRING field.
Users need to first convert the STRING field to an ARRAY or MAP field, then perform the unnest.
Here are the steps:
Use enrichmentConfigs to create the intermediate column recordArray with the function jsonStringToListOrMap(data_for_unnesting).
Configure complexTypeConfig to unnest the intermediate field recordArray to generate the field recordArray||name.
Sample Record
Pinot Schema
Note the field to ingest is recordArray||name not data_for_unnesting||name
This guide shows you how to ingest a stream of records into a Pinot table.
Apache Pinot lets users consume data from streams and push it directly into the database. This process is called stream ingestion. Stream ingestion makes it possible to query data within seconds of publication.
Stream ingestion provides support for checkpoints to prevent data loss.
To set up Stream ingestion, perform the following steps, which are described in more detail in this page:
Create schema configuration
Create table configuration
Create ingestion configuration
Upload table and schema spec
Here's an example where we assume the data to be ingested is in the following format:
Create schema configuration
The schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamp. For more details on schema configuration, see .
For our sample data, the schema configuration looks like this:
Create table configuration with ingestion configuration
The next step is to create a table where all the ingested data will flow and can be queried. For details about each table component, see the reference.
The table configuration contains an ingestion configuration (ingestionConfig), which specifies how to ingest streaming data into Pinot. For details, see the reference.
Example table config with ingestionConfig
For our sample data and schema, the table config will look like this:
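A sketch of a REALTIME table config with an embedded Kafka streamConfigs block (the broker address, topic name, and flush thresholds are illustrative):

```json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.rows": "50000",
      "realtime.segment.flush.threshold.time": "3600000"
    }
  },
  "metadata": {}
}
```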
Example ingestionConfig for multi-topics ingestion
Pinot supports ingesting data from multiple topics into the same table. (This feature is currently in beta and only supports multiple Kafka topics; other stream types will be supported in the near future.) For our sample data and schema, assume that we duplicate it to 2 topics, transcript-topic1 and transcript-topic2. If we want to ingest from both topics, then the table config will look like this:
With multi-topic ingestion:
All transform functions apply to both topics' ingestion.
Existing instance assignment strategies work as usual.
would still be handled in the same way.
Upload schema and table config
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.
Tune the stream config
Throttle stream consumption
There are some scenarios where the message rate in the input stream can come in bursts which can lead to long GC pauses on the Pinot servers or affect the ingestion rate of other real-time tables on the same server. If this happens to you, throttle the consumption rate during stream ingestion to better manage overall performance.
There are two independent throttling mechanisms available:
Stream consumption throttling can be tuned using the stream config topic.consumption.rate.limit which indicates the upper bound on the message rate for the entire topic.
Here is the sample configuration on how to configure the consumption throttling:
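A sketch showing where the limit goes (other streamConfigs entries omitted; the topic name is illustrative):

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "transcript-topic",
  "topic.consumption.rate.limit": "1000"
}
```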
Some things to keep in mind while tuning this config are:
Since this configuration applies to the entire topic, the rate is internally divided by the number of partitions in the topic and applied to each partition's consumer. This doesn't take the replication factor into account.
Example
topic.consumption.rate.limit - 1000
num partitions in Kafka topic - 4
replication factor in table - 3
Pinot will impose a fixed limit of 1000 / 4 = 250 records per second on each partition.
In a multi-tenant deployment (more than one table on the same server instance), make sure that the rate limit on one table doesn't starve the rate limiting of another table. So when there is more than one table on the same server (which is likely), you may need to re-tune the throttling threshold for all the streaming tables.
Once throttling is enabled for a table, you can verify by searching for a log that looks similar to:
In addition, you can monitor the consumption rate utilization with the metric CONSUMPTION_QUOTA_UTILIZATION.
Note that any configuration change for topic.consumption.rate.limit in the stream config will NOT take effect immediately. The new configuration will be picked up from the next consuming segment. In order to enforce the new configuration, you need to trigger the forceCommit API. Refer to the Pause stream ingestion section below for more details.
Byte-rate–based throttling (server level)
In addition to message-rate throttling, Pinot supports byte-based stream consumption throttling at the server level.
This throttling mechanism limits the total number of bytes consumed per second by a Pinot server, across all real-time tables and partitions hosted on that server.
When to use byte-based throttling
Byte-based throttling is especially useful when:
Message sizes vary significantly
Ingestion pressure is driven by payload size rather than record count
You want to cap network, direct memory, or disk IO usage at the server level
Configuration
Byte-based throttling is configured via cluster config, not via table or stream configs.
Config key
pinot.server.consumption.rate.limit.bytes
The value is specified in bytes per second.
Updating the configuration
The configuration can be updated dynamically using the Cluster Config API.
This limits each Pinot server to consume at most 3,000,000 bytes/sec (~3 MB/sec) across all real-time tables.
Example using curl
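A sketch of the update against a local controller (the host is illustrative; /cluster/configs is the controller's cluster configuration endpoint):

```bash
curl -X POST "http://localhost:9000/cluster/configs" \
  -H "Content-Type: application/json" \
  -d '{"pinot.server.consumption.rate.limit.bytes": "3000000"}'
```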
How byte-based throttling works
The byte rate limit is enforced per server
The limit applies collectively to all consuming partitions and tables hosted on that server
This throttling is independent of table-level message-rate throttling
Interaction with message-rate throttling
If both throttles are enabled:
Table-level topic.consumption.rate.limit controls records/sec per table
Server-level pinot.server.consumption.rate.limit.bytes controls bytes/sec per server
Pinot enforces both limits
This allows precise control when both message count and payload size matter.
Dynamic updates and propagation
Byte-based throttling is updated dynamically via the Cluster Config Change Listener
No server restart is required
Changes take effect automatically as servers receive the updated cluster config
Verifying throttling
Once enabled, Pinot logs messages indicating that a server-level byte consumption limiter has been applied.
You can also monitor throttling behavior using the metric:
This metric reflects how close the server is to its configured consumption quota.
Custom ingestion support
You can also write an ingestion plugin if the platform you are using is not supported out of the box. For a walkthrough, see .
Pause stream ingestion
There are some scenarios in which you may want to pause real-time ingestion while your table remains available for queries. For example, if there is a problem with the stream ingestion and, while you troubleshoot the issue, you still want queries to be executed on the already ingested data. For these scenarios, you can first issue a Pause request to a controller host. After troubleshooting the stream is done, you can issue another request to the controller to resume consumption.
When a Pause request is issued, the controller instructs the real-time servers hosting your table to commit their consuming segments immediately. However, the commit process may take some time to complete. Note that Pause and Resume requests are async. An OK response means that instructions for pausing or resuming have been successfully sent to the real-time servers. If you want to know whether consumption has actually stopped or resumed, issue a pause status request.
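As a sketch, the corresponding controller calls (placeholders as in the pause example earlier on this page):

```bash
# Pause consumption for a table
curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption
# Resume consumption after troubleshooting
curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption
# Check whether consumption has actually stopped or resumed
curl -X GET {controllerHost}/tables/{tableName}/pauseStatus
```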
It's worth noting that consuming segments on real-time servers are stored in volatile memory, and their resources are allocated when the consuming segments are first created. These resources cannot be altered if consumption parameters are changed midway through consumption. It may take hours before these changes take effect. Furthermore, if the parameters are changed in an incompatible way (for example, changing the underlying stream with a completely new set of offsets, or changing the stream endpoint from which to consume messages), it will result in the table getting into an error state.
The pause and resume feature is helpful in these instances. When a pause request is issued by the operator, consuming segments are committed without starting new mutable segments. Instead, new mutable segments are started only when the resume request is issued. This mechanism provides the operators as well as developers with more flexibility. It also enables Pinot to be more resilient to the operational and functional constraints imposed by underlying streams.
There is another feature called Force Commit which utilizes the primitives of the pause and resume feature. When the operator issues a force commit request, the current mutable segments will be committed and new ones started right away. Operators can now use this feature for all compatible table config parameter changes to take effect immediately.
(v 0.12.0+) Once submitted, the forceCommit API returns a jobId that can be used to get the current progress of the forceCommit operation. A sample response and status API call:
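A hedged sketch of the calls; the status endpoint path is an assumption based on the controller REST API:

```bash
# Trigger a force commit; the response contains a jobId
curl -X POST {controllerHost}/tables/{tableName}/forceCommit
# Check progress using the returned jobId (endpoint path assumed)
curl -X GET {controllerHost}/tables/forceCommitStatus/{jobId}
```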
The forceCommit request just triggers a regular commit before the consuming segments reach their end criteria, so it follows the same mechanism as a regular commit. It is a one-shot request and is not retried automatically upon failure, but it is idempotent, so you may keep issuing it until success if needed.
This API is asynchronous: it does not wait for the segment commit to complete. However, a status entry is written to ZooKeeper to track when the request was issued and which consuming segments it included. The consuming segments tracked in the status entry are compared with the latest IdealState to indicate the progress of the forceCommit. This status is not updated or deleted upon commit success or failure, so it can become stale. Currently, the most recent 100 status entries are kept in ZooKeeper, and the oldest entries are deleted only when the total number is about to exceed 100.
For incompatible parameter changes, an option is added to the resume request to handle the case of a completely new set of offsets. Operators can now follow a three-step process: First, issue a pause request. Second, change the consumption parameters. Finally, issue the resume request with the appropriate option. These steps will preserve the old data and allow the new data to be consumed immediately. All through the operation, queries will continue to be served.
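The three-step process above can be sketched with the controller REST API. This is a hedged example: the pauseConsumption endpoint name and the placeholder host and table values are assumptions to verify against your Pinot version.

```shell
# Sketch of the three-step process; {controllerHost} and {tableName}
# are placeholders, and the pauseConsumption endpoint name is an
# assumption to verify against your Pinot version.

# 1. Pause consumption; consuming segments are committed.
curl -X POST "{controllerHost}/tables/{tableName}/pauseConsumption"

# 2. Update the table's stream configs (e.g. point at the new stream).

# 3. Resume, consuming from the smallest available offset in the new stream.
curl -X POST "{controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=smallest"
```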
Handle partition changes in streams
If a Pinot table is configured to consume using a (partition-based) stream type, then it is possible that the partitions of the table change over time. In Kafka, for example, the number of partitions may increase. In Kinesis, the number of partitions may increase or decrease -- some partitions could be merged to create a new one, or existing partitions split to create new ones.
Pinot runs a periodic task called RealtimeSegmentValidationManager that monitors such changes and starts consumption on new partitions (or stops consumption from old ones) as necessary. Since this is a periodic task run on the controller, it may take some time for Pinot to recognize new partitions and start consuming from them. This may delay data in new partitions appearing in the results that Pinot returns.
If you want to recognize the new partitions sooner, manually trigger the RealtimeSegmentValidationManager periodic task so that the new data is recognized immediately.
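For example, the validation task can be triggered on demand through the controller's periodic task API. This is a hedged sketch: the endpoint and parameter names are assumptions to verify against your controller's Swagger UI.

```shell
# Manually trigger RealtimeSegmentValidationManager for one table
# (endpoint and parameter names are assumptions; check your
# controller's Swagger UI). {controllerHost} and {tableName} are
# placeholders.
curl -X GET "{controllerHost}/periodictask/run?taskname=RealtimeSegmentValidationManager&tableName={tableName}_REALTIME"
```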
Infer ingestion status of real-time tables
Often, it is important to understand the rate of ingestion of data into your real-time table. This is commonly done by looking at the consumption lag of the consumer. The lag itself can be observed in many dimensions. Pinot supports observing consumption lag along the offset dimension and time dimension, whenever applicable (as it depends on the specifics of the connector).
The ingestion status of a connector can be observed by querying either the /consumingSegmentsInfo API or the table's /debug API, as shown below:
A sample response from a Kafka-based real-time table is shown below. The ingestion status is displayed for each of the CONSUMING segments in the table.
Term
Description
Monitor real-time ingestion
Real-time ingestion includes 3 stages of message processing: Decode, Transform, and Index.
In each of these stages, a failure can happen which may or may not result in an ingestion failure. The following metrics are available to investigate ingestion issues:
Decode stage -> an error here is recorded as INVALID_REALTIME_ROWS_DROPPED
Transform stage -> possible errors here are:
When a message gets dropped due to the transform, it is recorded as REALTIME_ROWS_FILTERED
There is yet another metric called ROWS_WITH_ERROR which is the sum of all error counts in the 3 stages above.
Furthermore, the metric REALTIME_CONSUMPTION_EXCEPTIONS gets incremented whenever there is a transient/permanent stream exception seen during consumption.
These metrics can be used to understand why ingestion failed for a particular table partition before diving into the server logs.
//This is an example ZNode config for EXTERNAL VIEW in Helix
{
"id" : "baseballStats_OFFLINE",
"simpleFields" : {
...
},
"mapFields" : {
"baseballStats_OFFLINE_0" : {
"Server_10.1.10.82_7000" : "ONLINE"
}
},
...
}
Failed to start a Pinot [SERVER]
java.lang.RuntimeException: java.net.BindException: Address already in use
at org.apache.pinot.core.transport.QueryServer.start(QueryServer.java:103) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
at org.apache.pinot.server.starter.ServerInstance.start(ServerInstance.java:158) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:110) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da2113
# First run: quickstart generates data in the specified directory
./bin/pinot-admin.sh QuickStart -type batch -dataDir /tmp/pinot-quick-start
# Subsequent runs: quickstart reloads existing data from disk
./bin/pinot-admin.sh QuickStart -type batch -dataDir /tmp/pinot-quick-start
A list of directories, each containing a table schema, table config, and raw data. Use this with -type EMPTY or -type GENERIC to load your own tables into the quickstart cluster.
-configFile
-configFilePath
Path to a properties file that overrides default Pinot configuration values (controller, broker, server, etc.).
-zkAddress
-zkUrl, -zkExternalAddress
URL for an external ZooKeeper instance (e.g. localhost:2181) instead of using the default embedded instance.
-kafkaBrokerList
Kafka broker list for streaming quickstarts (e.g. localhost:9092). Use this to connect to an external Kafka cluster instead of the embedded one.
Underlying ingestion still works in LOWLEVEL mode, where
transcript-topic1 segments would be named like transcript__0__0__20250101T0000Z
transcript-topic2 segments would be named like transcript__10000__0__20250101T0000Z
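As a quick aside, the pieces of a LOWLEVEL segment name can be pulled apart with standard shell tools; the layout assumed here is {tableName}__{partitionGroupId}__{sequenceNumber}__{creationTime}:

```shell
# Split a LOWLEVEL segment name into its components.
seg="transcript__10000__0__20250101T0000Z"
echo "$seg" | awk -F'__' \
  '{printf "table=%s partition=%s sequence=%s created=%s\n", $1, $2, $3, $4}'
# prints: table=transcript partition=10000 sequence=0 created=20250101T0000Z
```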
The pinot.server.consumption.rate.limit setting must be configured in the server's instance configuration, not in the table configuration. This setting establishes a maximum consumption rate that applies collectively to all table partitions hosted on a single server. When both this server-level setting and the topic.consumption.rate.limit setting are specified, the server-level configuration has lower priority.
Multiple real-time tables coexist on the same server
Consumption is throttled as soon as either limit is reached
When the transform pipeline sets the $INCOMPLETE_RECORD_KEY$ key in the message, it is recorded as INCOMPLETE_REALTIME_ROWS_CONSUMED, but only when the continueOnError configuration is enabled. If continueOnError is not enabled, ingestion fails.
Index stage -> When there is failure at this stage, the ingestion typically stops and marks the partition as ERROR.
currentOffsetsMap
Current consuming offset position per partition
latestUpstreamOffsetMap
(Whenever applicable) Latest offset found in the upstream topic partition
recordsLagMap
(Whenever applicable) Defines how far behind the current record's offset / pointer is from upstream latest record. This is calculated as the difference between the latestUpstreamOffset and currentOffset for the partition when the lag computation request is made.
recordsAvailabilityLagMap
(Whenever applicable) Defines how soon after upstream ingestion a record was consumed by Pinot. This is calculated as the difference between the time the record was consumed and the time at which the record was ingested upstream.
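Both lag figures are simple differences. A toy calculation with hypothetical offset values:

```shell
# Hypothetical offsets for a single partition
currentOffset=6
latestUpstreamOffset=6
# recordsLag = latestUpstreamOffset - currentOffset
echo "recordsLag=$((latestUpstreamOffset - currentOffset))"
# prints: recordsLag=0
```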
A consumption rate limiter is set up for topic <topic_name> in table <tableName> with rate limit: <rate_limit> (topic rate limit: <topic_rate_limit>, partition count: <partition_count>)
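As the log message suggests, the topic-level limit is spread across the topic's partitions. A hypothetical example of the resulting per-partition rate (the even split is an assumption based on the log format above):

```shell
# Hypothetical: a topic rate limit of 1000 rows/sec over 4 partitions
topic_rate_limit=1000
partition_count=4
echo "per-partition limit: $((topic_rate_limit / partition_count)) rows/sec"
# prints: per-partition limit: 250 rows/sec
```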
$ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=smallest
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=largest
# GET /tables/{tableName}/consumingSegmentsInfo
curl -X GET "http://<controller_url:controller_admin_port>/tables/meetupRsvp/consumingSegmentsInfo" -H "accept: application/json"
# GET /debug/tables/{tableName}
curl -X GET "http://localhost:9000/debug/tables/meetupRsvp?type=REALTIME&verbosity=1" -H "accept: application/json"
{
"_segmentToConsumingInfoMap": {
"meetupRsvp__0__0__20221019T0639Z": [
{
"serverName": "Server_192.168.0.103_7000",
"consumerState": "CONSUMING",
"lastConsumedTimestamp": 1666161593904,
"partitionToOffsetMap": { // <<-- Deprecated. See currentOffsetsMap for same info
"0": "6"
},
"partitionOffsetInfo": {
"currentOffsetsMap": {
"0": "6" // <-- Current consumer position
},
"latestUpstreamOffsetMap": {
"0": "6" // <-- Upstream latest position
},
"recordsLagMap": {
"0": "0" // <-- Lag, in terms of #records behind latest
},
"recordsAvailabilityLagMap": {
"0": "2" // <-- Lag, in terms of time
}
}
}
],
Minion
Explore the minion component in Apache Pinot, empowering efficient data movement and segment generation within Pinot clusters.
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.
Starting a minion
Make sure you've set up ZooKeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a minion:
Interfaces
Pinot task generator
The Pinot task generator interface defines the APIs for the controller to generate tasks for minions to execute.
PinotTaskExecutorFactory
Factory for PinotTaskExecutor which defines the APIs for Minion to execute the tasks.
MinionEventObserverFactory
Factory for MinionEventObserver which defines the APIs for task event callbacks on minion.
Built-in tasks
Pinot ships with the following built-in Minion tasks:
Task
Purpose
Table Types
SegmentGenerationAndPushTask
The SegmentGenerationAndPushTask can fetch files from an input folder (e.g. from an S3 bucket) and convert them into segments. It converts one file into one segment and keeps the file name in segment metadata to avoid duplicate ingestion.
See for full configuration details.
Below is an example task config to put in the TableConfig to enable this task. The task is scheduled every 10 minutes to keep ingesting remaining files, with at most 10 parallel tasks and 1 file per task.
NOTE: You may want to simply omit "tableMaxNumTasks" due to this caveat: the task generates one segment per file and derives the segment name from the file's time column. If two files happen to have the same time range and are ingested by tasks from different schedules, there may be a segment name conflict. To overcome this issue for now, you can omit "tableMaxNumTasks"; by default it's Integer.MAX_VALUE, meaning as many tasks as possible are scheduled to ingest all input files in a single batch. Within one batch, a sequence number suffix ensures there are no segment name conflicts. Because the sequence number suffix is scoped to one batch, tasks from different batches may still encounter the segment name conflict described above.
When performing ingestion at scale, remember that Pinot lists all of the files contained in `inputDirURI` every time a `SegmentGenerationAndPushTask` job is scheduled. This can become a bottleneck when fetching files from a cloud bucket like GCS. To prevent this, make `inputDirURI` point to as few files as possible.
RealtimeToOfflineSegmentsTask
See for details.
MergeRollupTask
See for details.
PurgeTask
See for details.
RefreshSegmentTask
See for details.
UpsertCompactionTask
See for details.
UpsertCompactMergeTask
See for details.
Enable tasks
Tasks are enabled on a per-table basis. To enable a certain task type (e.g. myTask) on a table, update the table config to include the task type:
Under each enable task type, custom properties can be configured for the task type.
There are also two task configs to be set as part of the cluster configs, as shown below. One controls the task's overall timeout (1 hour by default) and the other controls how many tasks run on a single minion worker (1 by default).
Schedule tasks
Auto-schedule
There are 2 ways to enable task scheduling:
Controller level schedule for all minion tasks
Tasks can be scheduled periodically for all task types on all enabled tables. Enable auto task scheduling by configuring the schedule frequency in the controller config with the key controller.task.frequencyPeriod. This takes period strings as values, e.g. 2h, 30m, 1d.
Per table and task level schedule
Tasks can also be scheduled based on cron expressions. The cron expression is set in the schedule config for each task type separately. The controller config controller.task.scheduler.enabled must be set to true to enable cron scheduling.
As shown below, the RealtimeToOfflineSegmentsTask will be scheduled at the first second of every minute (following Quartz cron syntax).
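A hedged sketch of what the schedule config might look like inside the table config's task block (the Quartz expression "0 * * * * ?" fires at second 0 of every minute; verify the field placement against your Pinot version):

```shell
# Fragment of a table config; the "schedule" key takes a Quartz cron expression.
cat <<'EOF'
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "schedule": "0 * * * * ?"
    }
  }
}
EOF
```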
Manual schedule
Tasks can be manually scheduled using the following controller rest APIs:
Rest API
Description
Schedule task on specific instances
Tasks can be scheduled on specific instances using the following config at task level:
By default, the value is minion_untagged for backward compatibility. This allows users to schedule tasks on specific nodes and isolate tasks among tables and task types.
Rest API
Description
Task level advanced configs
allowDownloadFromServer
When a task is executed on a segment, the minion node fetches the segment from the deep store. If the deep store is not accessible, the minion node can instead download the segment from a server node. This is controlled by the allowDownloadFromServer config in the task config, which defaults to false.
The same behavior can also be set at the minion instance level with pinot.minion.task.allow.download.from.server (default false). The instance-level config helps enforce this behavior across the board when the number of tables or tasks is high. Note: the task-level config overrides the instance-level value.
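A hedged sketch of the task-level setting inside a table config (the task type shown is just an arbitrary example):

```shell
# Fragment of a table config enabling segment download from servers
# for one task type (MergeRollupTask is an arbitrary example here).
cat <<'EOF'
"task": {
  "taskTypeConfigsMap": {
    "MergeRollupTask": {
      "allowDownloadFromServer": "true"
    }
  }
}
EOF
```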
Plug-in custom tasks
To plug in a custom task, implement PinotTaskGenerator, PinotTaskExecutorFactory and MinionEventObserverFactory (optional) for the task type (all of them should return the same string for getTaskType()), and annotate them with the following annotations:
Implementation
Annotation
After annotating the classes, put them under the package of name org.apache.pinot.*.plugin.minion.tasks.*, then they will be auto-registered by the controller and minion.
Example
See where the TestTask is plugged-in.
Task Manager UI
In the Pinot UI, there is a Minion Task Manager tab under the Cluster Manager page. From this tab, you can find a lot of task-related info for troubleshooting. This info is mainly collected from the Pinot controller that schedules tasks, or from Helix, which tracks task runtime status. There are also buttons to schedule tasks in an ad hoc way. Below are brief introductions to some pages under the Minion Task Manager tab.
This page shows which minion task types have been used, that is, which task types have created their task queues in Helix.
Clicking into a task type, one can see the tables using that task, along with buttons to stop the task queue, clean up ended tasks, and so on.
Then, clicking into any table in this list, one can see how the task is configured for that table, as well as the task metadata if there is any in ZK. For example, MergeRollupTask tracks a watermark in ZK. If the task is cron scheduled, the current and next schedules are also shown on this page.
At the bottom of this page is a list of tasks generated for this table for this specific task type. Like here, one MergeRollup task has been generated and completed.
Clicking into a task from that list, we can see its start/end time and the subtasks generated for it (as context, one minion task can have multiple subtasks to process data in parallel). In this example, there happened to be one subtask, and the page shows when it started and stopped and which minion worker it ran on.
Clicking into this subtask, one can see more details about it like the input task configs and error info if the task failed.
Task-related metrics
There is a controller job that runs every 5 minutes by default and emits metrics about Minion tasks scheduled in Pinot. The following metrics are emitted for each task type:
NumMinionTasksInProgress: Number of running tasks
NumMinionSubtasksRunning: Number of running sub-tasks
NumMinionSubtasksWaiting: Number of waiting sub-tasks (unassigned to a minion as yet)
The controller also emits metrics about how tasks are cron scheduled:
cronSchedulerJobScheduled: Number of cron schedules currently registered to be triggered regularly according to their cron expressions. It's a Gauge.
cronSchedulerJobTrigger: Number of cron schedules triggered, as a Meter.
cronSchedulerJobSkipped: Number of late cron schedules skipped, as a Meter.
For each task, the minion will emit these metrics:
TASK_QUEUEING: Task queueing time (task_dequeue_time - task_inqueue_time), assuming the time drift between the Helix controller and the Pinot minion is minor; otherwise the value may be negative
TASK_EXECUTION: Task execution time, which is the time spent on executing the task
NUMBER_OF_TASKS: Number of tasks in progress on that minion. Whenever a minion starts a task, the Gauge is increased by 1; whenever a minion completes (either succeeds or fails) a task, it is decreased by 1
MergeRollupTask: Merges small segments into larger ones and optionally rolls up data at coarser granularity. Table types: OFFLINE, REALTIME (without upsert/dedup)
PurgeTask: Removes or modifies records for data retention and compliance (e.g., GDPR). Table types: OFFLINE, REALTIME
RefreshSegmentTask: Reprocesses segments after table config or schema changes (new indexes, columns, data types). Table types: OFFLINE, REALTIME
UpsertCompactionTask: Compacts individual upsert segments by removing invalidated records. Table types: REALTIME (upsert only)
UpsertCompactMergeTask: Merges multiple small upsert segments into larger ones to reduce segment count. Table types: REALTIME (upsert only)
NumMinionSubtasksError: Number of error sub-tasks (completed with an error/exception)
PercentMinionSubtasksInQueue: Percent of sub-tasks in waiting or running states
PercentMinionSubtasksInError: Percent of sub-tasks in error
cronSchedulerJobExecutionTimeMs: Time used to complete task generation, as a Timer.
NUMBER_TASKS_EXECUTED: Number of tasks executed, as a Meter.
NUMBER_TASKS_COMPLETED: Number of tasks completed, as a Meter.
NUMBER_TASKS_CANCELLED: Number of tasks cancelled, as a Meter.
NUMBER_TASKS_FAILED: Number of tasks failed, as a Meter. Unlike a fatal failure, the task encountered an error that cannot be recovered from in this run, but it may still succeed if the task is retried.
NUMBER_TASKS_FATAL_FAILED: Number of tasks fatally failed, as a Meter. Unlike a failure, the task encountered an error that is not recoverable even if the task is retried.
Schedule tasks for the given task type on the given table
Usage: StartMinion
-help : Print this message. (required=false)
-minionHost <String> : Host name for minion. (required=false)
-minionPort <int> : Port number to start the minion at. (required=false)
-zkAddress <http> : HTTP address of Zookeeper. (required=false)
-clusterName <String> : Pinot cluster name. (required=false)
-configFileName <Config File Name> : Minion Starter Config file. (required=false)
public interface PinotTaskGenerator {
/**
* Initializes the task generator.
*/
void init(ClusterInfoAccessor clusterInfoAccessor);
/**
* Returns the task type of the generator.
*/
String getTaskType();
/**
* Generates a list of tasks to schedule based on the given table configs.
*/
List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs);
/**
* Returns the timeout in milliseconds for each task, 3600000 (1 hour) by default.
*/
default long getTaskTimeoutMs() {
return JobConfig.DEFAULT_TIMEOUT_PER_TASK;
}
/**
* Returns the maximum number of concurrent tasks allowed per instance, 1 by default.
*/
default int getNumConcurrentTasksPerInstance() {
return JobConfig.DEFAULT_NUM_CONCURRENT_TASKS_PER_INSTANCE;
}
/**
* Performs necessary cleanups (e.g. remove metrics) when the controller leadership changes.
*/
default void nonLeaderCleanUp() {
}
}
public interface PinotTaskExecutorFactory {
/**
* Initializes the task executor factory.
*/
void init(MinionTaskZkMetadataManager zkMetadataManager);
/**
* Returns the task type of the executor.
*/
String getTaskType();
/**
* Creates a new task executor.
*/
PinotTaskExecutor create();
}
public interface PinotTaskExecutor {
/**
* Executes the task based on the given task config and returns the execution result.
*/
Object executeTask(PinotTaskConfig pinotTaskConfig)
throws Exception;
/**
* Tries to cancel the task.
*/
void cancel();
}
public interface MinionEventObserverFactory {
/**
* Initializes the task executor factory.
*/
void init(MinionTaskZkMetadataManager zkMetadataManager);
/**
* Returns the task type of the event observer.
*/
String getTaskType();
/**
* Creates a new task event observer.
*/
MinionEventObserver create();
}
public interface MinionEventObserver {
/**
* Invoked when a minion task starts.
*
* @param pinotTaskConfig Pinot task config
*/
void notifyTaskStart(PinotTaskConfig pinotTaskConfig);
/**
* Invoked when a minion task succeeds.
*
* @param pinotTaskConfig Pinot task config
* @param executionResult Execution result
*/
void notifyTaskSuccess(PinotTaskConfig pinotTaskConfig, @Nullable Object executionResult);
/**
* Invoked when a minion task gets cancelled.
*
* @param pinotTaskConfig Pinot task config
*/
void notifyTaskCancelled(PinotTaskConfig pinotTaskConfig);
/**
* Invoked when a minion task encounters exception.
*
* @param pinotTaskConfig Pinot task config
* @param exception Exception encountered during execution
*/
void notifyTaskError(PinotTaskConfig pinotTaskConfig, Exception exception);
}
Using the "POST /cluster/configs" API on the CLUSTER tab in Swagger, with this payload:
{
"RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
"RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
}
Complete reference for SQL syntax, operators, and clauses supported by Apache Pinot's single-stage engine (SSE) and multi-stage engine (MSE).
Pinot uses the Apache Calcite SQL parser with the MYSQL_ANSI dialect. This page documents every SQL statement, clause, and operator that Pinot supports, and notes where behavior differs between the single-stage engine (SSE) and the multi-stage engine (MSE).
To use MSE-only features such as JOINs, subqueries, window functions, and set operations, enable the multi-stage engine with SET useMultistageEngine = true; before your query. See Use the multi-stage query engine for details.
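For instance, assuming a local controller at localhost:9000, an MSE query could be submitted over HTTP like this (the endpoint, payload shape, and table/column names are assumptions to verify against your deployment):

```shell
# Hypothetical: run a JOIN on the multi-stage engine via the controller's
# /sql endpoint; host, port, and table/column names are placeholders.
curl -X POST "http://localhost:9000/sql" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SET useMultistageEngine = true; SELECT a.city, COUNT(*) FROM tableA AS a JOIN tableB AS b ON a.id = b.id GROUP BY a.city"}'
```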
Supported Statements
Pinot supports the following top-level statement types:
Statement
Description
SELECT Syntax
The full syntax for a SELECT statement in Pinot is:
Column Expressions
A select_expression can be any of the following:
* -- all columns
A column name: city
A qualified column name: myTable.city
Aliases
Use AS to assign an alias to any select expression:
DISTINCT
Use SELECT DISTINCT to return unique combinations of column values:
In the SSE, DISTINCT is implemented as an aggregation function. DISTINCT * is not supported; you must list specific columns. DISTINCT with GROUP BY is also not supported.
FROM Clause
Table References
The simplest FROM clause references a single table:
Subqueries (MSE Only)
With the multi-stage engine, you can use a subquery as a data source:
JOINs (MSE Only)
The multi-stage engine supports the following join types:
Join Type
Description
For detailed join syntax and examples, see .
WHERE Clause
The WHERE clause filters rows using predicates. Multiple predicates can be combined with the logical operators AND, OR, and NOT.
Comparison Operators
Operator
Description
Example
BETWEEN
Tests whether a value falls within an inclusive range:
NOT BETWEEN is also supported:
IN
Tests whether a value matches any value in a list:
NOT IN is also supported:
For large value lists, consider using for better performance.
LIKE
Pattern matching with wildcards. % matches any sequence of characters; _ matches any single character:
NOT LIKE is also supported.
IS NULL / IS NOT NULL
Tests whether a value is null:
See for details on how nulls work in Pinot.
REGEXP_LIKE
Filters rows using regular expression matching:
REGEXP_LIKE supports case-insensitive matching via a third parameter: REGEXP_LIKE(col, pattern, 'i').
TEXT_MATCH
Full-text search on columns with a text index:
JSON_MATCH
Predicate matching on columns with a JSON index:
VECTOR_SIMILARITY
Approximate nearest-neighbor search on vector-indexed columns:
GROUP BY
Groups rows that share values in the specified columns, typically used with aggregation functions:
Rules:
Every non-aggregated column in the SELECT list must appear in the GROUP BY clause.
Aggregation functions and non-aggregation columns cannot be mixed in the SELECT list without a GROUP BY.
HAVING
Filters groups after aggregation. Use HAVING instead of WHERE when filtering on aggregated values:
ORDER BY
Sorts the result set by one or more expressions:
Ordering Direction
ASC -- ascending order (default)
DESC -- descending order
NULL Ordering
NULLS FIRST -- null values appear first
NULLS LAST -- null values appear last
LIMIT / OFFSET
LIMIT
Restricts the number of rows returned:
If no LIMIT is specified, Pinot defaults to returning 10 rows for selection queries.
OFFSET
Skips a number of rows before returning results. Requires ORDER BY for consistent pagination:
Pinot also supports the legacy LIMIT offset, count syntax:
Logical Operators
Operator
Description
Precedence
From highest to lowest:
NOT
AND
OR
Use parentheses to override default precedence:
Arithmetic Operators
Arithmetic expressions can be used in SELECT expressions, WHERE clauses, and other contexts:
Operator
Description
Example
Type Casting
Use CAST to convert a value from one type to another:
Supported Target Types
Type
Description
Set Operations (MSE Only)
The multi-stage engine supports combining results from multiple queries:
Operation
Description
Window Functions (MSE Only)
Window functions compute a value across a set of rows related to the current row, without collapsing them into a single output row.
Syntax
Frame Clause
Example
For the full list of supported window functions and detailed syntax, see .
OPTION Clause
The OPTION clause provides Pinot-specific query hints. These are not standard SQL but allow you to control engine behavior:
The preferred approach is to use SET statements before the query:
Common query options include:
Option
Description
For the complete list of query options, see .
NULL Semantics
Default Behavior
By default, Pinot treats null values as the default value for the column type (0 for numeric types, empty string for strings, etc.). This avoids the overhead of null tracking and maintains backward compatibility.
Nullable Columns
To enable full null handling:
Mark columns as nullable in the schema (do not set notNull: true).
Enable null handling at query time:
Three-Valued Logic
When null handling is enabled, Pinot follows standard SQL three-valued logic:
Key behaviors with null handling enabled:
Comparisons with NULL (e.g., col = NULL) return NULL (not TRUE or FALSE). Use IS NULL / IS NOT NULL instead.
NULL IN (...) returns NULL, not FALSE.
For more details, see .
Identifier and Literal Rules
Double quotes (") delimit identifiers (column names, table names). Use double quotes for reserved keywords or special characters: SELECT "timestamp", "date" FROM myTable.
Single quotes (') delimit string literals: WHERE city = 'NYC'. Escape an embedded single quote by doubling it: 'it''s'.
CASE WHEN
Pinot supports CASE WHEN expressions for conditional logic:
CASE WHEN can be used inside aggregation functions:
Aggregation functions inside the ELSE clause are not supported.
Engine Compatibility Matrix
The following table summarizes feature support across the single-stage engine (SSE) and multi-stage engine (MSE):
Feature
SSE
MSE
Ingest from Apache Kafka
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.
Learn how to ingest data from Kafka, a stream processing platform. You should have a local cluster up and running, following the instructions in Set up a cluster.
This guide uses the Kafka 3.0 connector (kafka30). Pinot also supports a Kafka 4.0 connector for KRaft-mode Kafka clusters. See Kafka Connector Versions for details on choosing the right connector.
Install and Launch Kafka
Let's start by downloading Kafka to our local machine.
To pull down the latest Docker image, run the following command:
Download Kafka from and then extract it:
Next we'll spin up a Kafka broker. Kafka 4.0 uses KRaft mode by default and does not require ZooKeeper:
Note: The --network pinot-demo flag is optional and assumes that you have a Docker network named pinot-demo that you want to connect the Kafka container to.
Kafka 4.0 uses KRaft mode by default. Generate a cluster ID and format the storage directory, then start the broker:
Start Kafka Broker (KRaft mode)
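The KRaft bootstrap typically looks like the following; the exact paths depend on where you extracted Kafka (a sketch assuming Kafka 4.0's default KRaft server.properties):

```shell
# Generate a cluster ID, format the storage directory, then start the broker.
# Paths assume the Kafka 4.0 distribution root as the working directory.
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/server.properties
bin/kafka-server-start.sh config/server.properties
```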
Data Source
We're going to generate some JSON messages from the terminal using the following script:
datagen.py
If you run this script (python datagen.py), you'll see the following output:
Ingesting Data into Kafka
Let's now pipe that stream of messages into Kafka, by running the following command:
We can check how many messages have been ingested by running the following command:
Output
And we can print out the messages themselves by running the following command:
Output
Schema
A schema defines what fields are present in the table along with their data types in JSON format.
Create a file called /tmp/pinot/schema-stream.json and add the following content to it.
Table Config
A table is a logical abstraction that represents a collection of related data. It is composed of columns and rows (known as documents in Pinot). The table config defines the table's properties in JSON format.
Create a file called /tmp/pinot/table-config-stream.json and add the following content to it.
Create schema and table
Create the table and schema by running the appropriate command below:
Querying
Navigate to and click on the events table to run a query that shows the first 10 rows in this table.
_Querying the events table_
Kafka ingestion guidelines
Kafka connector modules in Pinot
Pinot ships two Kafka connector modules:
pinot-kafka-3.0 -- Uses Kafka client library 3.x (currently 3.9.2). This is the default connector included in Pinot distributions. Consumer factory class: org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory.
pinot-kafka-4.0 -- Uses Kafka client library 4.x (currently 4.1.1). This connector drops the ZooKeeper-based Scala dependency and uses the pure-Java Kafka client, suitable for KRaft-mode Kafka clusters. Consumer factory class: org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory.
The legacy kafka-0.9 and kafka-2.x connector modules have been removed. If you are upgrading from an older Pinot release that used org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory, update your table configs to use one of the current connector classes listed above.
Pinot does not support using high-level Kafka consumers (HLC). Pinot uses low-level consumers to ensure accurate results, keep operational complexity and scalability manageable, and minimize storage overhead.
Migrating from the kafka-2.x connector
If your existing table configs reference the removed kafka-2.x connector, update the stream.kafka.consumer.factory.class.name property:
To (Kafka 3.x): org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory
To (Kafka 4.x): org.apache.pinot.plugin.stream.kafka40.KafkaConsumerFactory
No other stream config changes are required. The Kafka 3.x connector is compatible with Kafka brokers 2.x and above. The Kafka 4.x connector requires Kafka brokers 4.0 or above.
Kafka configurations in Pinot
Use Kafka partition (low) level consumer with SSL
Here is an example config that uses SSL-based authentication to talk to Kafka and the schema registry. Notice there are two sets of SSL options: the ones starting with ssl. are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
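A streamConfigMaps sketch illustrating the two sets of SSL options described above. Paths, passwords, hostnames, and the topic name are placeholder assumptions:

```json
{
  "streamType": "kafka",
  "stream.kafka.topic.name": "secure-events",
  "stream.kafka.broker.list": "kafka:9093",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "security.protocol": "SSL",
  "ssl.truststore.location": "/path/to/kafka.truststore.jks",
  "ssl.truststore.password": "changeit",
  "ssl.keystore.location": "/path/to/kafka.keystore.jks",
  "ssl.keystore.password": "changeit",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "https://schema-registry:8081",
  "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "/path/to/sr.truststore.jks",
  "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "changeit"
}
```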
Use Confluent Schema Registry with JSON encoded messages
If your Kafka messages are JSON-encoded and registered with Confluent Schema Registry, use the KafkaConfluentSchemaRegistryJsonMessageDecoder. This decoder uses the Confluent KafkaJsonSchemaDeserializer to decode messages whose JSON schemas are managed by the registry.
When to use this decoder
Your Kafka producer serializes messages using the Confluent JSON Schema serializer.
Your JSON schemas are registered in Confluent Schema Registry.
You want schema validation and evolution support for JSON messages.
If your messages are Avro-encoded and registered with Schema Registry, use KafkaConfluentSchemaRegistryAvroMessageDecoder instead (shown in the SSL example above). If your messages are plain JSON without a schema registry, use JSONMessageDecoder.
Example table config
The key configuration properties for this decoder are:
stream.kafka.decoder.class.name -- Set to org.apache.pinot.plugin.inputformat.json.confluent.KafkaConfluentSchemaRegistryJsonMessageDecoder.
stream.kafka.decoder.prop.schema.registry.rest.url -- The URL of the Confluent Schema Registry.
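A streamConfigMaps sketch combining these properties. The topic, broker, and registry URL are illustrative assumptions:

```json
{
  "streamType": "kafka",
  "stream.kafka.topic.name": "events",
  "stream.kafka.broker.list": "kafka:9092",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.confluent.KafkaConfluentSchemaRegistryJsonMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
}
```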
Authentication
This decoder supports the same authentication options as the Avro schema registry decoder. You can configure SSL or SASL_SSL authentication for both the Kafka consumer and the Schema Registry client using the stream.kafka.decoder.prop.schema.registry.* properties. See the SSL and SASL_SSL consumer examples in this guide for details.
For Schema Registry basic authentication, add the following properties:
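A sketch of the basic-auth properties, following the commonly documented Confluent Schema Registry client settings (verify the exact property names against your Pinot version); the credentials are placeholders:

```json
{
  "stream.kafka.decoder.prop.basic.auth.credentials.source": "USER_INFO",
  "stream.kafka.decoder.prop.schema.registry.basic.auth.user.info": "registry-user:registry-password"
}
```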
This decoder was added in Pinot 1.4. Make sure your Pinot deployment is running version 1.4 or later.
Consume transactionally-committed messages
The Kafka 3.x and 4.x connectors support Kafka transactions. The transaction support is controlled by config kafka.isolation.level in Kafka stream config, which can be read_committed or read_uncommitted (default). Setting it to read_committed will ingest transactionally committed messages in Kafka stream only.
For example,
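A streamConfigMaps fragment enabling read-committed consumption:

```json
{
  "stream.kafka.isolation.level": "read_committed"
}
```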
Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.
Use Kafka partition (low) level consumer with SASL_SSL
Here is an example config that uses SASL_SSL-based authentication to talk to Kafka and the schema registry. Notice there are two sets of options: the Kafka consumer security options (security.protocol, sasl.*, ssl.*), and the ones starting with stream.kafka.decoder.prop.schema.registry. for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
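A streamConfigMaps sketch for SASL_SSL. The mechanism, credentials, paths, and hostnames are placeholder assumptions; the Kafka client property names (security.protocol, sasl.mechanism, sasl.jaas.config) are standard Kafka consumer settings:

```json
{
  "streamType": "kafka",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
  "security.protocol": "SASL_SSL",
  "sasl.mechanism": "PLAIN",
  "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"alice\" password=\"secret\";",
  "ssl.truststore.location": "/path/to/kafka.truststore.jks",
  "ssl.truststore.password": "changeit",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "https://schema-registry:8081"
}
```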
Extract record headers as Pinot table columns
Pinot's Kafka connector supports automatically extracting record headers and metadata into the Pinot table columns. The following table shows the mapping for record header/metadata to Pinot table column names:
Kafka Record
Pinot Table Column
Description
To enable metadata extraction in a Kafka table, set the stream config metadata.populate to true.
In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.
For example, if you want to add only the offset and key as dimension columns in your Pinot table, list them in the schema as follows:
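A schema fragment sketch; both columns are STRING, matching the metadata column mapping described in this section:

```json
"dimensionFieldSpecs": [
  {"name": "__key", "dataType": "STRING"},
  {"name": "__metadata$offset", "dataType": "STRING"}
]
```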
Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.
Remember to follow the schema evolution guidelines when updating the schema of an existing table!
Tell Pinot where to find an Avro schema
There is a standalone utility to generate the schema from an Avro file; see the relevant documentation for details.
To avoid errors like The Avro schema must be provided, designate the location of the schema in your streamConfigs section. For example, if your current section contains the following:
Then add the key "stream.kafka.decoder.prop.schema" followed by a value that denotes the location of your schema.
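A sketch of the resulting entry; the path is a placeholder:

```json
{
  "stream.kafka.decoder.prop.schema": "file:///path/to/schema.avsc"
}
```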
Subset partition ingestion
By default, a Pinot REALTIME table consumes from all partitions of the configured Kafka topic. In some scenarios you may want a table to consume only a subset of the topic's partitions. The stream.kafka.partition.ids setting lets you specify exactly which Kafka partitions a table should consume.
When to use subset partition ingestion
Split-topic ingestion -- Multiple Pinot tables share the same Kafka topic, and each table is responsible for a different set of partitions. This is useful when the same topic contains logically distinct data partitioned by key, and you want separate tables (or indexes) for each partition group.
Multi-table partition assignment -- You want to distribute the partitions of a high-throughput topic across several Pinot tables for workload isolation, independent scaling, or different retention policies.
Selective consumption -- You only need data from specific partitions of a topic (for example, partitions that correspond to a particular region or tenant).
Configuration
Add stream.kafka.partition.ids to the streamConfigMaps entry in your table config. The value is a comma-separated list of Kafka partition IDs (zero-based integers):
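For example, to consume only partitions 0, 2, and 5:

```json
{
  "stream.kafka.partition.ids": "0,2,5"
}
```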
When this setting is present, Pinot will consume only from the listed partitions. When it is absent or blank, Pinot consumes from all partitions of the topic (the default behavior).
Example: splitting a topic across two tables
Suppose you have a Kafka topic called events with two partitions (0 and 1). You can create two Pinot tables, each consuming from one partition:
Table events_part_0:
Table events_part_1:
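The relevant streamConfigMaps fragments for the two tables can be sketched side by side (table names are illustrative):

```json
[
  {
    "tableName": "events_part_0",
    "stream.kafka.topic.name": "events",
    "stream.kafka.partition.ids": "0"
  },
  {
    "tableName": "events_part_1",
    "stream.kafka.topic.name": "events",
    "stream.kafka.partition.ids": "1"
  }
]
```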
Validation rules and limitations
Partition IDs must be non-negative integers. Negative values will cause a validation error.
Non-integer values (e.g. "abc") will cause a validation error.
Duplicate IDs are silently deduplicated. For example, "0,2,0,5" is treated as "0,2,5".
Use Protocol Buffers (Protobuf) format
Pinot supports decoding Protocol Buffer messages from Kafka using several decoder options depending on your setup.
ProtoBufMessageDecoder (descriptor file based)
Use ProtoBufMessageDecoder when you have a pre-compiled .desc (descriptor) file for your Protobuf schema. This decoder uses dynamic message parsing and does not require compiled Java classes.
Required stream config properties:
Property
Description
Example streamConfigs:
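A sketch of the decoder-related entries; the descriptor path and message name are placeholders:

```json
{
  "streamType": "kafka",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.protobuf.ProtoBufMessageDecoder",
  "stream.kafka.decoder.prop.descriptorFile": "file:///opt/pinot/schemas/sample.desc",
  "stream.kafka.decoder.prop.protoClassName": "SampleRecord"
}
```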
ProtoBufCodeGenMessageDecoder (compiled JAR based)
Use ProtoBufCodeGenMessageDecoder when you have a compiled JAR containing your generated Protobuf Java classes. This decoder uses runtime code generation for improved decoding performance.
KafkaConfluentSchemaRegistryProtoBufMessageDecoder (Schema Registry based)
Use KafkaConfluentSchemaRegistryProtoBufMessageDecoder when your Protobuf schemas are managed by Confluent Schema Registry. This decoder automatically resolves schemas from the registry at runtime.
Required stream config properties:
Property
Description
Optional properties:
Property
Description
Example streamConfigs:
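A sketch of the decoder-related entries; the registry URL is a placeholder:

```json
{
  "streamType": "kafka",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.protobuf.KafkaConfluentSchemaRegistryProtoBufMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://schema-registry:8081"
}
```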
Use Apache Arrow format
Pinot supports decoding Apache Arrow IPC streaming format messages from Kafka using ArrowMessageDecoder. This is useful when upstream systems produce data serialized in Arrow format.
Optional stream config properties:
Property
Description
Example streamConfigs:
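A minimal sketch; the decoder class name shown here is an assumption based on Pinot's plugin naming convention, so verify it against your Pinot distribution:

```json
{
  "streamType": "kafka",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.arrow.ArrowMessageDecoder"
}
```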
The Arrow decoder expects each Kafka message to contain a complete Arrow IPC stream (schema + record batch). Ensure your producer serializes Arrow data in the IPC streaming format.
Consuming a Subset of Kafka Partitions
By default, a Pinot realtime table consumes all partitions of a Kafka topic. You can restrict ingestion to a specific subset of partitions using the stream.kafka.partition.ids property. This is useful when:
Splitting a single Kafka topic across multiple Pinot tables for independent scaling
Multi-tenant scenarios where different tables own different partition ranges
Configuration
Add stream.kafka.partition.ids to your streamConfigs with a comma-separated list of partition IDs:
Notes
Partition IDs are validated against actual Kafka topic metadata at startup.
Duplicate IDs in the list are automatically deduplicated.
The total partition count reported to the broker reflects the full Kafka topic size, ensuring correct query routing across tables sharing the same topic.
A select expression can be a simple column reference, or one of the following:
- An expression: price * quantity
- A function call: UPPER(city)
- An aggregation function: COUNT(*), SUM(revenue)
- A CASE WHEN expression
| Join type | Description |
|---|---|
| CROSS JOIN | Cartesian product of both tables |
| SEMI JOIN | Rows from the left table that have a match in the right table |
| ANTI JOIN | Rows from the left table that have no match in the right table |
| ASOF JOIN | Rows matched by closest value (e.g., closest timestamp) |
| LEFT ASOF JOIN | Like ASOF JOIN but keeps all left rows |
| Operator | Description | Example |
|---|---|---|
| < | Less than | WHERE price < 100 |
| > | Greater than | WHERE price > 50 |
| <= | Less than or equal to | WHERE quantity <= 10 |
| >= | Greater than or equal to | WHERE rating >= 4.0 |
Aggregate expressions are not allowed inside the GROUP BY clause.
| Operator | Description | Example |
|---|---|---|
| * | Multiplication | price * quantity |
| / | Division | total / count |
| % | Modulo (remainder) | id % 10 |
| Type | Description |
|---|---|
| BOOLEAN | Boolean value |
| TIMESTAMP | Timestamp value |
| VARCHAR / STRING | Variable-length string |
| BYTES | Byte array |
| JSON | JSON value |
| Option | Description |
|---|---|
| useStarTree | Enable or disable star-tree index usage |
| skipUpsert | Query all records in an upsert table, ignoring deletes |
TRUE
FALSE
FALSE
TRUE
FALSE
TRUE
NULL
NULL
TRUE
NULL
FALSE
FALSE
FALSE
FALSE
TRUE
FALSE
NULL
FALSE
NULL
TRUE
NULL
NULL
NULL
NULL
NULL
NULL NOT IN (...) returns NULL, not TRUE.
Aggregate functions like SUM, AVG, MIN, MAX ignore NULL values.
COUNT(*) counts all rows; COUNT(col) counts only non-null values.
Decimal literals should be enclosed in single quotes to preserve precision.
| Feature | Single-stage engine (v1) | Multi-stage engine (v2) |
|---|---|---|
| Aggregation functions | Yes | Yes |
| CASE WHEN | Yes | Yes |
| BETWEEN, IN, LIKE, IS NULL | Yes | Yes |
| Arithmetic operators (+, -, *, /, %) | Yes | Yes |
| CAST | Yes | Yes |
| OPTION / SET query hints | Yes | Yes |
| EXPLAIN PLAN | Yes | Yes |
| OFFSET | Yes | Yes |
| JOINs (INNER, LEFT, RIGHT, FULL, CROSS) | No | Yes |
| Semi / Anti joins | No | Yes |
| ASOF / LEFT ASOF joins | No | Yes |
| Subqueries | No | Yes |
| Set operations (UNION, INTERSECT, EXCEPT) | No | Yes |
| Window functions (OVER, PARTITION BY) | No | Yes |
| Correlated subqueries | No | No |
| INSERT INTO (from file) | No | Yes |
| CREATE TABLE / DROP TABLE DDL | No | No |
| DISTINCT with * | No | No |
| DISTINCT with GROUP BY | No | No |
| Statement | Description |
|---|---|
| SELECT | Query data from one or more tables |
| SET | Set query options for the session (e.g., SET useMultistageEngine = true) |
| EXPLAIN PLAN FOR | Display the query execution plan without running the query |
| Join type | Description |
|---|---|
| [INNER] JOIN | Rows that match in both tables |
| LEFT [OUTER] JOIN | All rows from the left table, matching rows from the right |
| RIGHT [OUTER] JOIN | All rows from the right table, matching rows from the left |
| FULL [OUTER] JOIN | All rows from both tables, with NULLs where there is no match |
| Operator | Description | Example |
|---|---|---|
| = | Equal to | WHERE city = 'NYC' |
| <> or != | Not equal to | WHERE status <> 'canceled' |
| Operator | Description |
|---|---|
| AND | True if both conditions are true |
| OR | True if either condition is true |
| NOT | Negates a condition |
| Operator | Description | Example |
|---|---|---|
| + | Addition | price + tax |
| - | Subtraction | total - discount |
| Type | Description |
|---|---|
| INT / INTEGER | 32-bit signed integer |
| BIGINT / LONG | 64-bit signed integer |
| FLOAT | 32-bit floating point |
| DOUBLE | 64-bit floating point |
| Operation | Description |
|---|---|
| UNION ALL | Combine all rows from both queries (including duplicates) |
| UNION | Combine rows from both queries, removing duplicates |
| EXCEPT | Return rows from the first query that do not appear in the second |
The record header/metadata mapping table (referenced in "Extract record headers as Pinot table columns" above):

| Kafka Record | Pinot Table Column | Description |
|---|---|---|
| Record key: any type | __key : String | For simplicity of design, we assume that the record key is always a UTF-8 encoded String |
| Record Headers: Map<String, String> | __header$HeaderKeyName : String (one column per header key) | For simplicity of design, we directly map the string headers from kafka record to pinot table column |
| Record metadata - offset : long | __metadata$offset : String | |
| Record metadata - partition : int | __metadata$partition : String | |
| Record metadata - recordTimestamp : long | __metadata$recordTimestamp : String | |

Additional notes on stream.kafka.partition.ids (see "Subset partition ingestion" above):
- The partition IDs are sorted internally for stable ordering, regardless of the order specified in the config.
- The configured partition IDs are validated against the actual Kafka topic metadata at table creation time. If a specified partition ID does not exist in the topic, an error is raised.
- When using subset partition ingestion with multiple tables consuming from the same topic, ensure that the partition assignments do not overlap if you want each record to be consumed by exactly one table. Pinot does not enforce non-overlapping partition assignments across tables.
- Whitespace around partition IDs and commas is trimmed (e.g., " 0 , 2 , 5 " is valid).
- When splitting a topic between two tables, configure one with even-numbered IDs and another with odd-numbered IDs (for example, "0,2" and "1,3" for a 4-partition topic).

ProtoBufMessageDecoder stream config properties:

| Property | Description |
|---|---|
| stream.kafka.decoder.prop.descriptorFile | Path or URI to the .desc descriptor file. Supports local file paths, HDFS, and other Pinot-supported file systems. |
| stream.kafka.decoder.prop.protoClassName | (Optional) Fully qualified Protobuf message name within the descriptor. If omitted, the first message type in the descriptor is used. |

ProtoBufCodeGenMessageDecoder stream config properties:

| Property | Description |
|---|---|
| stream.kafka.decoder.prop.jarFile | Path or URI to the JAR file containing compiled Protobuf classes. |
| stream.kafka.decoder.prop.protoClassName | Fully qualified Java class name of the Protobuf message (required). |
-- Set a query option, then run a query
SET useMultistageEngine = true;
SELECT COUNT(*) FROM myTable WHERE city = 'San Francisco';
-- View the execution plan
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM myTable GROUP BY city;
SELECT [ DISTINCT ] select_expression [, select_expression ]*
FROM table_reference
[ WHERE filter_condition ]
[ GROUP BY group_expression [, group_expression ]* ]
[ HAVING having_condition ]
[ ORDER BY order_expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, ...] ]
[ LIMIT count ]
[ OFFSET offset ]
[ OPTION ( key = value [, key = value ]* ) ]
SELECT city AS metro_area, COUNT(*) AS total_orders
FROM orders
GROUP BY city
SELECT DISTINCT city, state
FROM stores
LIMIT 100
SELECT * FROM myTable
SET useMultistageEngine = true;
SELECT city, avg_revenue
FROM (
SELECT city, AVG(revenue) AS avg_revenue
FROM orders
GROUP BY city
) AS sub
WHERE avg_revenue > 1000
SET useMultistageEngine = true;
SELECT o.order_id, c.name
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.id
WHERE o.amount > 100
SELECT * FROM orders
WHERE amount BETWEEN 100 AND 500
SELECT * FROM orders
WHERE amount NOT BETWEEN 100 AND 500
SELECT * FROM orders
WHERE city IN ('NYC', 'LA', 'Chicago')
SELECT * FROM orders
WHERE status NOT IN ('canceled', 'refunded')
SELECT * FROM customers
WHERE name LIKE 'John%'
SELECT * FROM orders
WHERE discount IS NOT NULL
SELECT * FROM airlines
WHERE REGEXP_LIKE(airlineName, '^U.*')
SELECT * FROM logs
WHERE TEXT_MATCH(message, 'error AND timeout')
SELECT * FROM events
WHERE JSON_MATCH(payload, '"$.type" = ''click''')
SELECT * FROM embeddings
WHERE VECTOR_SIMILARITY(vector_col, ARRAY[0.1, 0.2, 0.3], 10)
SELECT city, COUNT(*) AS order_count, SUM(amount) AS total
FROM orders
GROUP BY city
SELECT city, COUNT(*) AS order_count
FROM orders
GROUP BY city
HAVING COUNT(*) > 100
SELECT city, SUM(amount) AS total
FROM orders
GROUP BY city
ORDER BY total DESC
SELECT city, revenue
FROM stores
ORDER BY revenue DESC NULLS LAST
SELECT * FROM orders LIMIT 50
SELECT * FROM orders
ORDER BY created_at DESC
LIMIT 20 OFFSET 40
SELECT * FROM orders
ORDER BY created_at DESC
LIMIT 40, 20
SELECT * FROM orders
WHERE (status = 'completed' OR status = 'shipped')
AND amount > 100
SELECT order_id, price * quantity AS line_total
FROM line_items
WHERE (price * quantity) > 1000
SELECT CAST(revenue AS BIGINT) FROM orders
SELECT CAST(event_time AS TIMESTAMP), CAST(user_id AS VARCHAR)
FROM events
SET useMultistageEngine = true;
SELECT city FROM stores
UNION ALL
SELECT city FROM warehouses
SET useMultistageEngine = true;
SELECT customer_id FROM orders_2024
INTERSECT
SELECT customer_id FROM orders_2025
function_name ( expression ) OVER (
[ PARTITION BY partition_expression [, ...] ]
[ ORDER BY order_expression [ ASC | DESC ] [, ...] ]
[ frame_clause ]
)
{ ROWS | RANGE } BETWEEN frame_start AND frame_end
frame_start / frame_end:
UNBOUNDED PRECEDING
| offset PRECEDING
| CURRENT ROW
| offset FOLLOWING
| UNBOUNDED FOLLOWING
SET useMultistageEngine = true;
SELECT
city,
order_date,
amount,
SUM(amount) OVER (PARTITION BY city ORDER BY order_date) AS running_total,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY amount DESC) AS rank
FROM orders
SELECT * FROM orders
WHERE city = 'NYC'
OPTION(timeoutMs=5000)
SET timeoutMs = 5000;
SET useMultistageEngine = true;
SELECT * FROM orders WHERE city = 'NYC'
SET enableNullHandling = true;
SELECT * FROM orders WHERE discount IS NULL
SELECT
order_id,
CASE
WHEN amount > 1000 THEN 'high'
WHEN amount > 100 THEN 'medium'
ELSE 'low'
END AS tier
FROM orders
SELECT
SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue
FROM orders
import datetime
import json
import random
import uuid

while True:
    ts = int(datetime.datetime.now().timestamp() * 1000)
    event_id = str(uuid.uuid4())  # avoid shadowing the built-in id()
    count = random.randint(0, 1000)
    print(json.dumps({"ts": ts, "uuid": event_id, "count": count}))
GapFill function is experimental, and has limited support, validation and error reporting.
GapFill Function is only supported with the single-stage query engine (v1).
Many datasets are time series in nature, tracking the state changes of entities over time. The recorded data points might be sparse, or events might be missing due to network and device issues in an IoT environment. However, analytics applications tracking the state changes of these entities over time may query for values at a lower granularity than the metric interval.
Here is a sample data set tracking the status of parking lots in a parking space, with the following columns:
lotId | event_time | is_occupied
We want to find out the total number of parking lots that are occupied over a period of time which would be a common use case for a company that manages parking spaces.
Let us take 30 minutes' time bucket as an example:
timeBucket/lotId | P1 | P2 | P3
If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied parking lots per time bucket, we need to gap-fill the missing data.
The Ways of Gap Filling the Data
There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.
FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity, in this case, park lot, if the previous value exists. Otherwise, it will be filled with the default value.
FILL_DEFAULT_VALUE means that the missing data will be filled with the default value. For numeric columns, the default value is 0. For BOOLEAN columns, the default value is false. For TIMESTAMP, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON, and BYTES, it is the empty string. For array columns, it is an empty array.
We will use the following query to calculate the total occupied parking lots per time bucket.
Aggregation/Gapfill/Aggregation
Query Syntax
In the example above, the TIMESERIESON(column_name) element is obligatory, and column_name must point to an actual table column; it can't be a literal or an expression.
Moreover, if the innermost query contains a GROUP BY clause then (contrary to regular queries) it must contain an aggregate function, otherwise the error Select and Gapfill should be in the same sql statement is returned.
Workflow
The innermost SQL converts the raw event table into a table with the following columns:
lotId | event_time | is_occupied
The second-level SQL gap-fills the returned data into a bucketed view:
timeBucket/lotId | P1 | P2 | P3
The outermost query will aggregate the gapfilled data as follows:
timeBucket | totalNumOfOccuppiedSlots
Note that we assume the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation will not sort the data.
The above example just shows the use case where the three steps happen:
The raw data will be aggregated;
The aggregated data will be gapfilled;
The gapfilled data will be aggregated.
There are three more scenarios we can support.
Select/Gapfill
If we want to gapfill the missing data per half an hour time bucket, here is the query:
Query Syntax
Workflow
At first the raw data will be transformed as follows:
lotId | event_time | is_occupied
Then it will be gapfilled as follows:
lotId | event_time | is_occupied
Aggregate/Gapfill
Query Syntax
Workflow
The nested sql will convert the raw event table to the following table.
lotId | event_time | is_occupied
The outer SQL gap-fills the returned data into a bucketed view:
timeBucket/lotId | P1 | P2 | P3
Gapfill/Aggregate
Query Syntax
Workflow
First, the raw data is transformed into a table with the following columns:
lotId | event_time | is_occupied
The transformed data will be gap filled as follows:
lotId | event_time | is_occupied
The aggregation will generate the following table:
timeBucket | totalNumOfOccuppiedSlots
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:33:00.000 | 0 |
| P1 | 2021-10-01 09:47:00.000 | 1 |
| P3 | 2021-10-01 10:05:00.000 | 1 |
| P2 | 2021-10-01 10:06:00.000 | 0 |
| P2 | 2021-10-01 10:16:00.000 | 1 |
| P2 | 2021-10-01 10:31:00.000 | 0 |
| P3 | 2021-10-01 11:17:00.000 | 0 |
| P1 | 2021-10-01 11:54:00.000 | 0 |
2021-10-01 10:00:00.000
0,1
1
2021-10-01 10:30:00.000
0
2021-10-01 11:00:00.000
0
2021-10-01 11:30:00.000
0
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
1
0
2021-10-01 10:00:00.000
1
1
1
2021-10-01 10:30:00.000
1
0
1
2021-10-01 11:00:00.000
1
0
0
2021-10-01 11:30:00.000
0
0
0
| timeBucket | totalNumOfOccuppiedSlots |
|---|---|
| 2021-10-01 10:30:00.000 | 2 |
| 2021-10-01 11:00:00.000 | 1 |
| 2021-10-01 11:30:00.000 | 0 |
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| lotId | event_time | is_occupied |
|---|---|---|
| P3 | 2021-10-01 09:00:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P2 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 10:00:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P1 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 10:30:00.000 | 1 |
| P1 | 2021-10-01 11:00:00.000 | 1 |
| P2 | 2021-10-01 11:00:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| P2 | 2021-10-01 11:30:00.000 | 0 |
| P3 | 2021-10-01 11:30:00.000 | 0 |
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
1
0
2021-10-01 10:00:00.000
1
1
1
2021-10-01 10:30:00.000
1
0
1
2021-10-01 11:00:00.000
1
0
0
2021-10-01 11:30:00.000
0
0
0
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| lotId | event_time | is_occupied |
|---|---|---|
| P3 | 2021-10-01 09:00:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P2 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 10:00:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P1 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P1 | 2021-10-01 11:00:00.000 | 1 |
| P2 | 2021-10-01 11:00:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| P2 | 2021-10-01 11:30:00.000 | 0 |
| P3 | 2021-10-01 11:30:00.000 | 0 |
| timeBucket | totalNumOfOccuppiedSlots |
|---|---|
| 2021-10-01 10:30:00.000 | 2 |
| 2021-10-01 11:00:00.000 | 1 |
| 2021-10-01 11:30:00.000 | 0 |
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:01:00.000 | 1 |
| P2 | 2021-10-01 09:17:00.000 | 1 |
2021-10-01 09:00:00.000
1
1
2021-10-01 09:30:00.000
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
2021-10-01 09:00:00.000
1
1
0
2021-10-01 09:30:00.000
| timeBucket | totalNumOfOccuppiedSlots |
|---|---|
| 2021-10-01 09:00:00.000 | 2 |
| 2021-10-01 09:30:00.000 | 2 |
| 2021-10-01 10:00:00.000 | 3 |
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
2021-10-01 09:00:00.000
1
1
0
2021-10-01 09:30:00.000
| lotId | event_time | is_occupied |
|---|---|---|
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| timeBucket | totalNumOfOccuppiedSlots |
|---|---|
| 2021-10-01 09:00:00.000 | 2 |
| 2021-10-01 09:30:00.000 | 2 |
| 2021-10-01 10:00:00.000 | 3 |
0,1
1
1
-- Aggregation/Gapfill/Aggregation
SELECT time_col, SUM(status) AS occupied_slots_count
FROM (
SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
'2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
TIMESERIESON(lotId)), lotId, status
FROM (
SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
FROM parking_data
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
GROUP BY 1, 2
ORDER BY 1
LIMIT 100)
LIMIT 100)
GROUP BY 1
LIMIT 100
-- Select/Gapfill
SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
'2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
FROM parking_data
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
ORDER BY 1
LIMIT 100
-- Aggregate/Gapfill
SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
'2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
TIMESERIESON(lotId)), lotId, status
FROM (
SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
FROM parking_data
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
GROUP BY 1, 2
ORDER BY 1
LIMIT 100)
LIMIT 100
-- Gapfill/Aggregate
SELECT time_col, SUM(is_occupied) AS occupied_slots_count
FROM (
SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
'2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
FROM parking_data
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
ORDER BY 1
LIMIT 100)
GROUP BY 1
LIMIT 100
Stream Ingestion with Upsert
Upsert support in Apache Pinot.
Pinot provides native upsert support during ingestion. There are scenarios where records need modifications, such as correcting a ride fare or updating a delivery status.
Partial upserts are convenient as you only need to specify the columns where values change, and you ignore the rest.
Table type support
Upsert is supported across REALTIME, OFFLINE, and HYBRID table types. The available modes depend on the table type:
Table type
FULL upsert
PARTIAL upsert
Notes
For OFFLINE table upsert configuration details, see the dedicated documentation.
Overview of upserts in Pinot
See an overview of how upserts work in Pinot.
Enable upserts in Pinot
To enable upserts on a Pinot table, do the following:
Define the primary key in the schema
To update a record, you need a primary key to uniquely identify the record. To define a primary key, add the field primaryKeyColumns to the schema definition. For example, the schema definition of UpsertMeetupRSVP in the quick start example has this definition.
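A schema fragment sketch showing the field; the column name event_id is a hypothetical example, not the actual quickstart schema:

```json
{
  "schemaName": "UpsertMeetupRSVP",
  "primaryKeyColumns": ["event_id"]
}
```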
Note this field expects a list of columns, as the primary key can be a composite.
When two records of the same primary key are ingested, the record with the greater comparison value (timeColumn by default) is used. When records have the same primary key and event time, then the order is not determined. In most cases, the later ingested record will be used, but this may not be true in cases where the table has a column to sort by.
Partition the input stream by the primary key
An important requirement for Pinot upsert tables is that the input stream be partitioned by the primary key. For Kafka messages, this means the producer must set the record key when publishing. If the original stream is not partitioned this way, a stream processing job (such as one built with Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
Additionally if using
Enable upsert in the table configurations
To enable upsert, make the following configurations in the table configurations.
Upsert modes
Full upsert
The upsert mode defaults to FULL. FULL upsert means that a new record will replace the older record completely if they have the same primary key. Example config:
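A minimal table config fragment enabling full upsert:

```json
"upsertConfig": {
  "mode": "FULL"
}
```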
Partial upserts
Partial upsert lets you choose to update only specific columns and ignore the rest.
To enable the partial upsert, set the mode to PARTIAL and specify partialUpsertStrategies for partial upsert columns. Since release-0.10.0, OVERWRITE is used as the default strategy for columns without a specified strategy. defaultPartialUpsertStrategy is also introduced to change the default strategy for all columns.
Note that null handling must be enabled for partial upsert to work.
For example:
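A sketch of a partial upsert config; the column names are illustrative. Remember that null handling must also be enabled (nullHandlingEnabled in tableIndexConfig) for partial upsert to work:

```json
"upsertConfig": {
  "mode": "PARTIAL",
  "defaultPartialUpsertStrategy": "OVERWRITE",
  "partialUpsertStrategies": {
    "rsvp_count": "INCREMENT",
    "venue_name": "OVERWRITE"
  }
}
```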
Pinot supports the following partial upsert strategies:

| Strategy | Description |
|---|---|
| OVERWRITE | Overwrite the existing value with the new value |
| INCREMENT | Add the new value to the existing value |
| APPEND | Append the new item to the existing multi-value column |
| UNION | Union the new item with the existing multi-value column, deduplicating values |
| IGNORE | Ignore the new value and keep the existing value |
With partial upsert, if the value is null in either the existing record or the new coming record, Pinot will ignore the upsert strategy and the null value:
(null, newValue) -> newValue
(oldValue, null) -> oldValue
(null, null) -> null
Post-Partial-Upsert Transforms (Derived Columns)
When using partial upserts, you may have derived columns that need to be recomputed after the row is merged from the incoming record and the existing record. The postPartialUpsertTransformConfigs feature allows you to apply transformation functions to compute derived columns from the fully merged row.
Use Case
Consider an e-commerce table tracking orders:
order_id: Primary key
score: Points earned from the order
bonus: Bonus points awarded
With partial upserts, incoming records may only contain updated values for score or bonus. The ingestion-time transforms only see the incoming record, so they cannot correctly compute total from a partially merged row. The postPartialUpsertTransformConfigs allows you to recompute total from the complete merged row after the partial upsert merge happens.
Configuration
To enable post-partial-upsert transforms, add the postPartialUpsertTransformConfigs configuration to your table's upsertConfig:
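A sketch using the order-tracking columns from the use case above. The entry shape (columnName/transformFunction) is assumed to mirror ingestion transformConfigs; verify it against your Pinot version:

```json
"upsertConfig": {
  "mode": "PARTIAL",
  "partialUpsertStrategies": {
    "score": "OVERWRITE",
    "bonus": "OVERWRITE"
  },
  "postPartialUpsertTransformConfigs": [
    {"columnName": "total", "transformFunction": "plus(score, bonus)"}
  ]
}
```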
Evaluation Semantics
Post-partial-upsert transforms are evaluated after the partial upsert merge completes
They operate on the complete merged row, not just the incoming record
Both incoming and existing column values are available for the transform expression
Interaction with Ingestion Transforms
Ingestion-time transforms and post-partial-upsert transforms serve different purposes:
Aspect
Ingestion Transforms
Post-Partial-Upsert Transforms
Both can be used together:
Ingestion transforms normalize the incoming record
The normalized incoming record participates in partial upsert merge
Post-partial-upsert transforms recompute derived columns from the complete merged row
Example Workflow
Given a partial upsert table with this configuration:
Processing these records:
Initial record (order_id=123):
Incoming: {order_id: 123, score: 100, bonus: 10}
Merge: (first record, no existing row)
The derived columns computed by post-partial-upsert transforms can be queried like any other column. If you need to use these derived columns in further upsert strategies or transforms, ensure they are defined in your schema.
None upserts
If the mode is set to NONE, upsert is disabled.
Comparison column
By default, Pinot uses the value in the time column (timeColumn in tableConfig) to determine the latest record. That means, for two records with the same primary key, the record with the larger value of the time column is picked as the latest update. However, there are cases when users need to use another column to determine the order. In such case, you can use option comparisonColumn to override the column used for comparison. For example,
For partial upsert tables, out-of-order events are not consumed or indexed. For example, for two records with the same primary key, if the record with the smaller comparison column value arrives later than the other record, it is skipped.
NOTE: Use comparisonColumns even for a single comparison column; comparisonColumn is deprecated. You may see unrecognizedProperties warnings when using the old config, but it is converted to comparisonColumns automatically when the table is added.
Multiple comparison columns
In some cases, especially where partial upsert might be employed, there may be multiple producers of data each writing to a mutually exclusive set of columns, sharing only the primary key. In such a case, it may be helpful to use one comparison column per producer group so that each group can manage its own specific versioning semantics without the need to coordinate versioning across other producer groups.
Documents written to Pinot are expected to have exactly 1 non-null value out of the set of comparisonColumns; if more than 1 of the columns contains a value, the document will be rejected. When new documents are written, whichever comparison column is non-null will be compared against only that same comparison column seen in prior documents with the same primary key. Consider the following examples, where the documents are assumed to arrive in the order specified in the array.
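For the scenario above, a config sketch declaring the two comparison columns named in the example:

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "comparisonColumns": ["secondsSinceEpoch", "otherComparisonColumn"]
  }
}
```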
The following would occur:
orderReceived: 1
Result: persisted
Reason: first doc seen for primary key "aa"
orderReceived: 2
Result: persisted (replacing orderReceived: 1)
Reason: comparison column (secondsSinceEpoch) larger than that previously seen
orderReceived: 3
Result: rejected
Reason: comparison column (secondsSinceEpoch) smaller than that previously seen
orderReceived: 4
Result: persisted (replacing orderReceived: 2)
Reason: comparison column (otherComparisonColumn) had not been seen previously for this primary key, so the record is persisted despite its value being smaller than the value seen for secondsSinceEpoch
orderReceived: 5
Result: rejected
Reason: comparison column (otherComparisonColumn) smaller than that previously seen
orderReceived: 6
Result: persisted (replacing orderReceived: 4)
Reason: comparison column (otherComparisonColumn) larger than that previously seen
Metadata time-to-live (TTL)
In Pinot, the metadata map is stored in heap memory. To decrease in-memory data and improve performance, minimize the time primary key entries are stored in the metadata map (metadata time-to-live (TTL)). Limiting the TTL is especially useful for primary keys with high cardinality and frequent updates.
The metadata TTL is applied to the first comparison column, so the time unit of the upsert TTL is the same as that of the first comparison column.
Configure how long primary keys are stored in metadata
To configure how long primary keys are stored in metadata, specify the length of time in metadataTTL. For example:
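A sketch, assuming the first comparison column holds seconds, so a metadataTTL of 86400 keeps primary keys for 1 day; snapshot is enabled because metadata TTL requires it for validDocIds recovery:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["secondsSinceEpoch"],
    "metadataTTL": 86400,
    "snapshot": "ENABLE"
  }
}
```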
In this example, Pinot will retain primary keys in metadata for 1 day.
Note that enabling upsert snapshot is required for metadata TTL for in-memory validDocsIDs recovery.
Delete column
An upsert-enabled Pinot table supports soft deletes of primary keys. This requires the incoming record to contain a dedicated single-value boolean column that serves as a delete marker for a primary key. Once the real-time engine encounters a record with the delete column set to true, the primary key is no longer part of the queryable set of documents. This means the primary key is not visible in queries unless explicitly requested via the query option skipUpsert=true.
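A minimal config sketch; the boolean column name deleted is illustrative:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted"
  }
}
```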
Note that the delete column has to be a single-value boolean column.
Note that when deleteRecordColumn is added to an existing table, it will require a server restart to actually pick up the upsert config changes.
A deleted primary key can be revived by ingesting a record with the same primary key but a higher comparison column value(s).
Note that when reviving a primary key in a partial upsert table, the revived record will be treated as the source of truth for all columns. This means any previous updates to the columns will be ignored and overwritten with the new record's values.
Deleted Keys time-to-live (TTL)
The above config deleteRecordColumn only soft-deletes the primary key. To decrease in-memory data and improve performance, minimize the time deleted-primary-key entries are stored in the metadata map (deletedKeys time-to-live (TTL)). Limiting the TTL is especially useful for deleted-primary-keys where there are no future updates foreseen.
Configure how long deleted-primary-keys are stored in metadata
To configure how long deleted primary keys are stored in metadata, specify the length of time in deletedKeysTTL. For example:
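A sketch assuming a comparison column in seconds, so 86400 retains deleted keys for 1 day (the deleted column name is illustrative):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted",
    "deletedKeysTTL": 86400
  }
}
```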
In this example, Pinot will retain the deleted-primary-keys in metadata for 1 day.
Note that the unit of deletedKeysTTL should match the unit of the comparison column. If your comparison column values are in seconds, this config should also be in seconds (see the example above). metadataTTL and deletedKeysTTL do not work with multiple comparison columns, and the comparison/time column must be of NUMERIC type.
Data consistency with deletes and compaction together
When using deletedKeysTTL together with UpsertCompactionTask, a segment containing a deleted record (where deleteRecordColumn = true was set for the primary key) may get compacted before an older record for the same key is compacted. During a server restart, the old record is then added to the metadata manager map and treated as non-deleted. To prevent data inconsistencies in this scenario, set the config enableDeletedKeysCompactionConsistency to true: deleted records are then not compacted until all previous records for that primary key, across all other segments, have been compacted.
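A config sketch combining deletes, TTL, and the consistency flag (the deleted column name is illustrative):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted",
    "deletedKeysTTL": 86400,
    "enableDeletedKeysCompactionConsistency": true
  }
}
```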
Data consistency when queries and upserts happen concurrently
Upserts in Pinot enable real-time updates and ensure that queries always retrieve the latest version of a record, making them a powerful feature for managing mutable data efficiently. However, in applications with extremely high QPS and high ingestion rates, queries and upserts happening concurrently can sometimes lead to inconsistencies in query results.
For example, consider a table with 1 million primary keys. A distinct count query should always return 1 million, regardless of how new records are ingested and older records are invalidated. However, at high ingestion and query rates, the query may occasionally return a count slightly above or below 1 million. This happens because queries determine valid records by acquiring validDocIds bitmaps from multiple segments, which indicate which documents are currently valid. Since acquiring these bitmaps is not atomic with respect to ongoing upserts, a query may capture an inconsistent view of the data, leading to overcounting or undercounting of valid records.
This is a classic concurrency issue where reads and writes happen simultaneously, leading to temporary inconsistencies. Typically, such issues are resolved using locks or snapshots to maintain a stable view of the data during query execution. To address this, two new consistency modes - SYNC and SNAPSHOT - have been introduced for upsert enabled tables to ensure consistent query results even when queries and upserts occur concurrently and at very high throughput.
By default, the consistency mode is NONE, meaning the system operates as before. The SYNC mode ensures consistency by blocking upserts while queries execute, guaranteeing that queries always see a stable upserted data view. However, this can introduce write latency. Alternatively, the SNAPSHOT mode creates a consistent snapshot of validDocIds bitmaps for queries to use. This allows upserts to continue without blocking queries, making it more suitable for workloads with both high query and write rates.
These new consistency modes provide flexibility, allowing applications to balance consistency guarantees against performance trade-offs based on their specific requirements.
For SNAPSHOT mode, one can configure how often the upsert view should be refreshed via a upsertConfig called upsertViewRefreshIntervalMs, which is 3000ms by default. Both the write and query threads can refresh the upsert view when it gets stale according to this config. Changing this config requires server restarts.
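A config sketch enabling SNAPSHOT mode with the default refresh interval spelled out:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "consistencyMode": "SNAPSHOT",
    "upsertViewRefreshIntervalMs": 3000
  }
}
```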
You can further adjust the view's freshness at query time, without restarting servers, via a query option called upsertViewFreshnessMs. By default, this query option matches the upsertConfig upsertViewRefreshIntervalMs, but if a query sets it to a smaller value, the upsert view may be refreshed sooner for that query; if set to 0, the query forces a refresh of the upsert view every time.
For debugging purposes, there's a query option called skipUpsertView. If set to true, it bypasses the consistent upsert view maintained by SYNC or SNAPSHOT modes. This effectively executes the query as if it were in NONE mode.
Use strictReplicaGroup for routing
An upsert Pinot table can use only the low-level consumer for the input streams. As a result, it implicitly uses partitioned replica-group assignment for the segments. Moreover, upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use that, configure instanceSelectorType in Routing as follows:
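The routing section of the table config would look like:

```json
{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```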
Using the implicit partitioned replica-group assignment from the low-level consumer does not persist the instance assignment (the mapping from partition to servers) to ZooKeeper, and newly added servers are automatically included without explicitly reassigning instances (usually through a rebalance). This can cause new segments of the same partition to be assigned to a different server, breaking the upsert requirement.
To prevent this, we recommend using explicit partitioned replica-group instance assignment to ensure the instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.
Enable validDocIds snapshots for upsert metadata recovery
Upsert snapshot support is also added in release-0.12.0. To enable the snapshot, set snapshot to ENABLE. For example:
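A minimal config sketch:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "snapshot": "ENABLE"
  }
}
```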
Upsert maintains metadata in memory containing which docIds are valid in a particular segment (ValidDocIndexes). This metadata gets lost during server restarts and needs to be recreated again.
ValidDocIndexes cannot be easily recovered after out-of-TTL primary keys are removed. Enabling snapshots addresses this problem by adding functions to store and recover the validDocIds snapshot for immutable segments.
The snapshots are taken on every segment commit to ensure that they are consistent with the persisted data in case of abrupt shutdown.
We recommend that you enable this feature so as to speed up server boot times during restarts.
The lifecycle of validDocIds snapshots is as follows:
If snapshot is enabled, snapshots for existing segments are taken or refreshed when the next consuming segment gets started.
The snapshot files are kept on disk until the segments get removed, e.g. due to data retention or manual deletion.
Enable preload for faster server restarts
Upsert preload feature can make it faster to restore the upsert states when server restarts. To enable the preload feature, set preload to ENABLE. Snapshot must also be enabled. For example:
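A config sketch with both snapshot and preload enabled:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "snapshot": "ENABLE",
    "preload": "ENABLE"
  }
}
```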
Under the hood, it uses the validDocIds snapshots to identify the valid docs and restore their upsert metadata quickly instead of performing a whole upsert comparison flow. The flow is triggered before the server is marked as ready, after which the server starts to load the remaining segments without snapshots (hence the name preload).
The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. It's 0 by default to disable the preloading feature.
A bug was introduced in v1.2.0: when the enablePreload and enableSnapshot flags are set to true but max.segment.preload.threads is left at 0, the preloading mechanism is still enabled but segments fail to load because no threads are available for preloading. This is fixed in newer versions, but for v1.2.0, if enablePreload and enableSnapshot are set to true, remember to also set max.segment.preload.threads to a positive value. A server restart is needed for the max.segment.preload.threads config change to take effect.
Enable commit time compaction for storage optimization
If you are enabling commit time compaction for an existing table, it is recommended to first pause the ingestion for that table, enable this feature by updating the table-config, and then resume ingestion.
Many Upsert use-cases have a lot of Update events within the segment commit window. For instance, if we had an Upsert table for order status of Uber Eats orders, we would expect a lot of update events for the same order within a 1 hour window. For such use-cases, the committed segments end up with a lot of dead tuples, and you have to wait for the Segment Compaction tasks to prune them, which can take hours.
Commit time compaction is a performance optimization feature for upsert tables that removes invalid and obsolete records during the segment commit process itself. This not only reduces the storage bloat of the table immediately, but it can also bring down the segment commit time.
To enable commit time compaction, set the enableCommitTimeCompaction to true in the upsert configuration. For example:
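A minimal config sketch:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "enableCommitTimeCompaction": true
  }
}
```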
How it works
During segment commit, commit time compaction:
Filters out invalid document IDs. Retains valid records and soft-deleted records.
Generates accurate column statistics for compacted segments
Maintains correct document order while removing obsolete data
Configuration requirements
The feature is enabled per table by setting enableCommitTimeCompaction=true in the upsert configuration
Changes take effect after one segment commit cycle (the current consuming segment will be committed without compaction)
Compatible with all types of upsert tables
Handle out-of-order events
There are two configs related to handling out-of-order events.
dropOutOfOrderRecord
To enable dropping of out-of-order record, set the dropOutOfOrderRecord to true. For example:
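A minimal config sketch:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "dropOutOfOrderRecord": true
  }
}
```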
This feature doesn't persist any out-of-order event to the consuming segment. If not specified, the default value is false.
When false, the out-of-order record is persisted to the consuming segment, but the MetadataManager mapping is not updated, so the record is not referenced in queries or in any future updates. You can still see such records when using the skipUpsert query option.
When true, the out-of-order record is not persisted at all and the MetadataManager mapping is not updated, so the record is not referenced in queries or in any future updates. You cannot see such records even when using the skipUpsert query option.
outOfOrderRecordColumn
This config identifies out-of-order events programmatically. To enable it, add a boolean field to your table schema, say isOutOfOrder, and reference it via this config. For example:
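A config sketch using the isOutOfOrder column from the text:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "outOfOrderRecordColumn": "isOutOfOrder"
  }
}
```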
This feature persists a true/false value to the isOutOfOrder field based on the ordering of the event. You can filter out out-of-order events while using skipUpsert to avoid any confusion. For example:
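A query sketch; the table name mytable is illustrative, and SET is one way to pass query options:

```sql
SET skipUpsert=true;
SELECT * FROM mytable WHERE isOutOfOrder = false LIMIT 10
```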
Note that dropOutOfOrderRecord and outOfOrderRecordColumn are only supported when no consistencyMode is set (i.e., consistencyMode = NONE). This is because, when a consistencyMode is enabled, rows are added before the valid documents are updated. As a result, out-of-order records cannot be dropped or marked in upsert tables, defeating the purpose of these options.
Use custom metadata manager
Pinot supports custom PartitionUpsertMetadataManager that handle records and segments updates.
Adding custom upsert managers
You can add custom PartitionUpsertMetadataManager as follows:
Create a new java project. Make sure you keep the package name as org.apache.pinot.segment.local.upsert.xxx
In your java project include the dependency
Add your custom partition manager that implements PartitionUpsertMetadataManager interface
Add your custom TableUpsertMetadataManager that implements BaseTableUpsertMetadataManager interface
Place the compiled JAR in the /plugins directory in Pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the custom upsert manager in table configs as follows:
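A sketch of what this could look like, assuming a metadataManagerClass field in upsertConfig; the class name is a hypothetical implementation following the package naming requirement above:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "metadataManagerClass": "org.apache.pinot.segment.local.upsert.myplugin.MyTableUpsertMetadataManager"
  }
}
```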
⚠️ The upsert manager class name is case-insensitive as well.
Immutable upsert configuration fields
Certain upsert and schema configuration fields cannot be modified after table creation.
Changing these fields on an existing upsert table can lead to data inconsistencies or data loss, particularly when servers restart and commit segments. Pinot validates and invalidates documents based on these configurations, so altering them after data has been ingested will cause the existing validDocId snapshots to become inconsistent with the new configuration.
The following fields are immutable after table creation:
Upsert table limitations
There are some limitations for the upsert Pinot tables.
Partial upsert is supported for REALTIME tables only; OFFLINE tables support FULL upsert only.
The star-tree index cannot be used for indexing, as the star-tree index performs pre-aggregation during the ingestion.
Unlike append-only tables, out-of-order events (where the comparison value in the incoming record is less than the latest available value) are not consumed and indexed by a Pinot partial upsert table; these late events are skipped.
Best practices
Unlike other real-time tables, an upsert table takes up more memory because it needs to track record locations in memory. As a result, it is important to plan capacity beforehand and monitor resource usage. Here are some recommended practices for using upsert tables.
Create the topic/stream with more partitions.
The number of partitions in the input stream determines the partitioning of the Pinot table. The more partitions the input topic/stream has, the more Pinot servers you can distribute the table across, and therefore the more you can scale the table horizontally. Note that you cannot increase the partition count later for upsert-enabled tables, so start with enough partitions (at least 2-3x the number of Pinot servers).
Memory usage
An upsert table maintains an in-memory map from primary key to record location, so it is recommended to use a simple primary key type and avoid composite primary keys to save memory. Beware when using a JSON column as the primary key: the same key-values in a different order are considered different primary keys. In addition, consider the hashFunction config in the upsert config, which can be UUID, MD5, or MURMUR3.
If your primary key column is a valid UUID and you are running out of memory due to a high number of primary keys, the UUID hash function can lower memory requirements by up to 35% without bringing in any hash collision risks.
If the primary key is not a valid UUID, this hash function stores the primary key as is and skips the UUID based compression.
MD5 and MURMUR3 can also help lower memory requirements. They work for all types of primary key values but bring a small risk of hash collision. The generated hash from MD5 and MURMUR3 is a 128-bit hash, so this is beneficial when your primary key values are larger than 128 bits.
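Putting the hash function choice into config, a sketch:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "MURMUR3"
  }
}
```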
Monitoring
Set up a dashboard over the metric pinot.server.upsertPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It is useful for tracking growth, which is proportional to memory usage growth. The total memory usage by upsert is roughly (primaryKeysCount * (sizeOfKeyInBytes + 24)).
Capacity planning
It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the rate of the primary keys in the input stream per partition and extrapolate the data to a specific time period (based on table retention) to approximate the memory usage. A heap dump is also useful to check the memory usage so far on an upsert table instance.
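As a back-of-envelope illustration of the memory formula above (the key count and key size are hypothetical):

```python
# Rough upsert metadata memory estimate: primaryKeysCount * (sizeOfKeyInBytes + 24)
def upsert_metadata_bytes(primary_keys_count: int, key_size_bytes: int) -> int:
    """Approximate heap bytes used by the upsert primary-key map."""
    return primary_keys_count * (key_size_bytes + 24)

# 100 million 16-byte keys -> roughly 4 GB of heap
estimate = upsert_metadata_bytes(100_000_000, 16)
print(estimate)  # 4000000000
```

Extrapolate the primary-key rate per partition over the table's retention to pick the key count to plug in.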
Example
Putting these together, you can find the table configurations of the quick start examples as the following:
Pinot server maintains a primary key to record location map across all the segments served in an upsert-enabled table. As a result, when updating the config for an existing upsert table (e.g. change the columns in the primary key, change the comparison column), servers need to be restarted in order to apply the changes and rebuild the map.
Advanced Server Configuration
Consuming Segment Consistency Mode
For partial upsert tables or tables with dropOutOfOrder=true, configure how the server handles segment reloads and force commits via pinot.server.consuming.segment.consistency.mode in pinot-server.conf:
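A sketch of the property in pinot-server.conf (the chosen mode is illustrative):

```properties
pinot.server.consuming.segment.consistency.mode=RESTRICTED
```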
Mode
Description
Note: This is a server-level property distinct from the table-level upsertConfig.consistencyMode setting.
Migrating from deprecated config fields
As of Pinot 1.4.0, the following upsert config fields have been renamed:
Deprecated field
New field
Values
The new fields use the Enablement enum (ENABLE, DISABLE, DEFAULT) instead of boolean values. DEFAULT defers to the server-level configuration, which allows table-level overrides when the feature is enabled at the instance level.
The deprecated boolean fields still work but will be removed in a future release. Update your table configs to use the new field names.
Quick Start
To illustrate how full upsert works, the Pinot binary comes with a quick start example. Use the following command to create a real-time upsert table meetupRSVP.
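A sketch of the command, assuming you run it from the root of a local Apache Pinot distribution and that the quick start type is named upsert:

```shell
./bin/pinot-admin.sh QuickStart -type upsert
```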
You can also run partial upsert demo with the following command
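A sketch under the same assumptions, with the partial-upsert quick start type:

```shell
./bin/pinot-admin.sh QuickStart -type partial_upsert
```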
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.
For partial upsert you can see only the value from configured column changed based on specified partial upsert strategy.
An example for partial upsert is shown below: each event_id remains unique during ingestion, while the value of rsvp_count is incremented.
To see the difference from the non-upsert table, you can use a query option skipUpsert to skip the upsert effect in the query result.
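A query sketch against the quick start table; SET is one way to pass the query option:

```sql
SET skipUpsert=true;
SELECT * FROM meetupRSVP LIMIT 10
```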
FAQ
Can I change configs like primary key columns and comparison columns in existing upsert table?
Not recommended. Existing segments contain validDocId snapshots computed using the old configuration. Changing the configuration can lead to data inconsistencies, as existing snapshots would not be cleaned up, especially if one server restarts with validDocId snapshots while replica servers do not.
Best option: Create a new table and reingest all data.
Alternative: Disable SNAPSHOT, pause consumption and restart all the servers. This will work for new incoming keys only; consistency across existing data is not guaranteed.
| Table type | Full upsert | Partial upsert | Notes |
| --- | --- | --- | --- |
| HYBRID | Yes | No | Avoid overlapping time ranges between offline and realtime |
If you use segmentPartitionConfig to leverage broker segment pruning, it is important to ensure that the partition function matches on both the Kafka producer side and in Pinot. In Kafka, the default for the Java client is the 32-bit murmur2 hash, while for all other languages such as Python it is CRC32 (Cyclic Redundancy Check, 32-bit).
| Strategy | Description |
| --- | --- |
| IGNORE | Ignore the new value, keep the existing value (v0.10.0+) |
| MAX | Keep the maximum value between the existing value and new value (v0.12.0+) |
| MIN | Keep the minimum value between the existing value and new value (v0.12.0+) |
total: Derived column that should equal score + bonus
If snapshot is disabled, the existing snapshot for a segment is cleaned up when the segment gets loaded by the server, e.g. when the server restarts.
Reduces segment size immediately without requiring minion tasks
Schema fields:
primaryKeyColumns
upsertConfig fields:
mode (FULL, PARTIAL, NONE)
hashFunction
comparisonColumns
timeColumnName (when used as the default comparison column)
partialUpsertStrategies (for PARTIAL mode)
defaultPartialUpsertStrategy (for PARTIAL mode)
dropOutOfOrderRecord
outOfOrderRecordColumn
Attempting to update these fields will return an error:
Recommended workaround: Create a new table with the desired configuration and reingest all data.
Alternative (use with caution): If you must modify these fields without recreating the table, you can use the force=true query parameter on the table config update API. Before doing so, disable SNAPSHOT mode in upsertConfig, pause consumption, and restart all servers. Note that this approach only guarantees consistency for newly ingested keys; existing data may remain inconsistent.
We cannot change the number of partitions in the source topic after the upsert/dedup table is created (start with a relatively high number of partitions as mentioned in best practices).
| Table type | Full upsert | Partial upsert | Notes |
| --- | --- | --- | --- |
| REALTIME | Yes | Yes | Stream-based ingestion with full upsert feature set |
| OFFLINE | Yes | No | |
| Strategy | Description |
| --- | --- |
| OVERWRITE | Overwrite the column of the last record |
| INCREMENT | Add the new value to the existing values |
| APPEND | Add the new item to the Pinot unordered set |
| UNION | Union the new items with the existing items (set union) |
| Aspect | Ingestion Transforms | Post-Partial-Upsert Transforms |
| --- | --- | --- |
| Execution timing | Before ingestion into Pinot | After partial upsert merge, during ingestion |
| Input record | Incoming source record | Merged row (incoming + existing) |
| Mode | Description |
| --- | --- |
| RESTRICTED | (Default for partial upsert tables with RF > 1) Disables segment reloads and force commits to prevent data inconsistency. |
| PROTECTED | Enables reloads/force commits with upsert metadata reversion during segment replacements. Requires ParallelSegmentConsumptionPolicy set to DISALLOW_ALWAYS or ALLOW_DURING_BUILD_ONLY. |
| UNSAFE | Allows reloads without metadata reversion. Use only if inconsistency is acceptable or handled externally. |