Explore the fundamental concepts of Apache Pinot™ as a distributed OLAP database.
Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:
Storing data in columnar form to support high-performance scanning
Sharding of data to scale both storage and computation
A distributed architecture designed to scale capacity linearly
A tabular data model read by SQL queries
To learn about Pinot components, terminology, and gain a conceptual understanding of how data is stored in Pinot, review the following sections:
The segment threshold determines when a segment is committed in real-time tables.
When data is first ingested from a streaming provider like Kafka, Pinot stores the data in a consuming segment.
This segment is on the disk of the server(s) processing a particular partition from the streaming provider.
However, it's not until a segment is committed that the segment is written to the deep store. The segment threshold decides when that should happen.
Why is the segment threshold important?
The segment threshold is important because it ensures segments are a reasonable size.
When queries are processed, smaller segments may increase query latency due to more overhead (number of threads spawned, metadata processing, and so on).
Larger segments may cause servers to run out of memory. And when a server is restarted, its consuming segment must start consuming from the first row again, so an oversized consuming segment also causes a longer lag between Pinot and the streaming provider.
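As a rough sketch (the keys below are the commonly documented stream-level flush settings and should be verified against your Pinot version), the segment threshold is controlled in the real-time table's streamConfigs:
"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.segment.size": "150M"
}
Setting the row threshold to 0 lets Pinot size segments by the target segment size instead of a fixed row count.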
Mark Needham explains the segment threshold
Frequently Asked Questions (FAQs)
This page lists the frequently asked questions pages, with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics, and perfect for user-facing analytical workloads.
Apache Pinot™ is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch data sources (including Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage).
Ultra low-latency analytics even at extremely high throughput.
Columnar data store with several smart indexing and pre-aggregation techniques.
Scaling up and out with no upper bound.
Consistent performance based on the size of your cluster and an expected query per second (QPS) threshold.
It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.
User-facing real-time analytics
User-facing analytics refers to the analytical tools exposed to the end users of your product. In a user-facing analytics application, all users receive personalized analytics on their devices, resulting in hundreds of thousands of queries per second. Query volume grows quickly in proportion to the number of active users on the app, and the underlying event streams can reach millions of events per second. Data ingested into Pinot is immediately available for analytics, with latencies under one second.
User-facing real-time analytics requires the following:
Fresh data. The system needs to be able to ingest data in real time and make it available for querying, also in real time.
Support for high-velocity, highly dimensional event data from a wide range of actions and from multiple sources.
Low latency. Queries are triggered by end users interacting with apps, resulting in hundreds of thousands of queries per second with arbitrary patterns.
Why Pinot?
Pinot is designed to execute OLAP queries with low latency. It works well where you need fast analytics, such as aggregations, on both mutable and immutable data.
User-facing, real-time analytics
Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications, such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a user-facing analytics app built with Pinot.
Real-time dashboards for business metrics
Pinot can perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. Connect various business intelligence (BI) tools such as Superset, Tableau, or PowerBI to visualize data in Pinot.
Enterprise business intelligence
For analysts and data scientists, Pinot works well as a highly-scalable data platform for business intelligence. Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.
Enterprise application development
For application developers, Pinot works well as an aggregate store that sources events from streaming data sources, such as Kafka, and makes them available for querying with SQL. You can also use Pinot to aggregate data across a microservice architecture into one easily queryable view of the domain.
Pinot prevents any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent.
Get started
If you're new to Pinot, take a look at our Getting Started guide:
To start importing data into Pinot, see how to import batch and stream data:
To start querying data in Pinot, check out our Query guide:
Learn
For a conceptual overview that explains how Pinot works, check out the Concepts guide:
To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:
Pinot storage model
Apache Pinot™ uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system, including:
Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. To achieve this, all data needs to be distributed across multiple nodes. Pinot achieves this by breaking data into smaller chunks known as segments (similar to shards/partitions in HA relational databases). Segments can also be seen as time-based partitions.
Table
Similar to traditional databases, Pinot has the concept of a table—a logical abstraction that refers to a collection of related data. As is the case with relational database management systems (RDBMS), a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a schema, which defines the columns in a table as well as their data types.
As opposed to RDBMS schemas, multiple tables can be created in Pinot (real-time or batch) that inherit a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, and replication.
Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Columns have a name and a data type, and these definitions together are known as the table's schema.
Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
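For illustration only (the schema and column names here are hypothetical), a minimal schema file looks like this:
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [{"name": "studentID", "dataType": "STRING"}],
  "metricFieldSpecs": [{"name": "score", "dataType": "FLOAT"}],
  "dateTimeFieldSpecs": [{"name": "timestampInEpoch", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}]
}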
Pinot table types include:
real-time: Ingests data from a streaming source like Apache Kafka®
offline: Loads data from a batch source
hybrid: Loads data from both a batch source and a streaming source
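As a sketch (table and schema names are hypothetical, and a real config typically carries more settings), the type is declared with the tableType field in the table config:
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tableIndexConfig": {},
  "tenants": {},
  "metadata": {}
}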
Segment
Pinot tables are stored in one or more independent shards called segments. A small table may be contained by a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see the ingestion guides). Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.
Tenant
To support multi-tenancy, Pinot has first-class support for tenants. A table is associated with a tenant. This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation between tenants provides different namespaces for applications and teams to prevent sharing tables or schemas. Development teams building applications do not have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.
Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data from separate workloads from being stored or processed on the same physical hardware.
By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.
Cluster
A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.
Physical architecture
A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop.
Controller: Maintains cluster metadata and manages cluster resources.
Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.
Broker: Accepts queries from client processes and forwards them to servers for processing.
Server: Provides storage for segment files and compute for query processing.
(Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).
The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.
Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.
Helix is a cluster management solution created by the authors of Pinot. Helix maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. It constantly monitors the cluster to ensure that the right hardware resources are allocated to implement the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.
Controller
A controller is the core orchestrator that drives the consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and have visibility into the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent that is a participant responsible for observing and driving state changes that are subscribed to by other components.
The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time and offline tables). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
Server
Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can be either a real-time server or an offline server.
Real-time and offline servers have very different resource usage requirements: real-time servers continually consume new messages from external systems (such as Kafka topics), which are ingested and allocated to segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.
Broker
Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return them to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.
A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.
Pinot minion
Pinot minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation) compliance. Because Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a GDPR-compliant solution for this purpose while optimizing Pinot segments and building additional indexes that guarantee performance even when data may be deleted. You can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (the minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.
Getting Started
This section contains quick start guides to help you get up and running with Pinot.
Running Pinot
To simplify the getting started experience, Apache Pinot™ ships with quick start guides that launch Pinot components in a single process and import pre-built datasets.
For a full list of these guides, see .
Running on public clouds
This page links to multiple quick start guides for deploying Pinot to different public cloud providers.
These quickstart guides show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.
General
This page has a collection of frequently asked questions of a general nature with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
How does Apache Pinot use deep storage?
Segment retention
In this Apache Pinot concepts guide, we'll learn how segment retention works.
Segments in Pinot tables have a retention time, after which the segments are deleted. Typically, offline tables retain segments for a longer period of time than real-time tables.
The removal of segments is done by the retention manager. By default, the retention manager runs once every 6 hours.
The retention manager purges two types of segments:
Expired segments: Segments whose end time has exceeded the retention period.
0.9.1
Summary
This release fixes a major issue and a pinot-admin exit code bug ().
The release is based on the release 0.9.0 with the following cherry-picks:
0.9.3
Summary
This is a bug-fixing release that contains:
Update Log4j to 2.17.0 to address ()
0.12.1
Summary
This is a bug-fixing release that contains:
use legacy case-when format ()
Replaced segments: Segments that have been replaced as part of the merge rollup task.
There are a couple of scenarios where segments in offline tables won't be purged:
If the segment doesn't have an end time. This would happen if the segment doesn't contain a time column.
If the segment's table has a segmentIngestionType of REFRESH.
If the retention period isn't specified, segments aren't purged from tables.
The retention manager initially moves these segments into a Deleted Segments area, from where they will eventually be permanently removed.
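A hedged sketch of how retention is usually declared in the table's segmentsConfig (verify the keys against your Pinot version):
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30"
}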
Getting data into Pinot is easy. Take a look at these two quick start guides which will help you get up and running with sample data for offline and real-time tables.
This page describes configuring the range index for Apache Pinot
Range indexing allows you to get better performance for queries that involve filtering over a range.
It would be useful for a query like the following:
SELECT COUNT(*)
FROM baseballStats
WHERE hits > 11
A range index is a variant of an inverted index, where instead of creating a mapping from values to document IDs, we create a mapping from ranges of values to document IDs. You can use the range index by setting the following config in the table configuration.
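A minimal sketch, reusing the hits column from the query above (verify the exact key against your Pinot version):
"tableIndexConfig": {
  "rangeIndexColumns": ["hits"]
}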
Range index is supported for dictionary encoded columns of any type as well as raw encoded columns of a numeric type. Note that the range index can also be used on a dictionary encoded time column using STRING type, since Pinot only supports datetime formats that are in lexicographical order.
A good rule of thumb is to use a range index when you want to apply range predicates on metric columns that have a very large number of unique values. Using an inverted index for such columns would create a very large index that is inefficient in terms of storage and performance.
When data is pushed to Apache Pinot, Pinot makes a backup copy of the data and stores it on the configured deep storage (S3/GCP/ADLS/NFS/etc.). This copy is stored as tar.gz Pinot segments. Note that Pinot servers also keep an untarred copy of the segments on their local disk, for performance reasons.
How does Pinot use Zookeeper?
Pinot uses Apache Helix for cluster management, which in turn is built on top of Zookeeper. Helix uses Zookeeper to store the cluster state, including Ideal State, External View, Participants, and so on. Pinot also uses Zookeeper to store information such as Table configurations, schemas, Segment Metadata, and so on.
Why am I getting "Could not find or load class" error when running Quickstart using 0.8.0 release?
Check the JDK version you are using. You may be getting this error if you are using an older version than the current Pinot binary release was built on. If so, you have two options: switch to the same JDK release as Pinot was built with or download the source code for the Pinot release and build it locally.
How to change TimeZone when running Pinot?
Pinot uses the local timezone by default. To change the timezone, set the pinot.timezone value in the .conf config file. It is set once for all Pinot components (Controller, Broker, Server, Minion). See the following sample configuration:
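A hedged sample (the exact file depends on which component config you edit):
pinot.timezone=UTC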
This page lists options for importing data into Apache Pinot™ with links to detailed instructions with examples.
There are multiple options for importing data into Apache Pinot™. The pages in this section provide step-by-step instructions for importing records into Pinot, supported by our plugin architecture. The intent is to get you up and running with imported data as quickly as possible.
Pinot supports multiple file input formats without needing to change anything other than the file name. Each example imports a ready-made dataset so you can see how things work without needing to find or create your own dataset.
Pinot Batch Ingestion
These guides show you how to import data from popular big data platforms.
Pinot Stream Ingestion
This guide shows you how to import data using stream ingestion from Apache Kafka topics.
This guide shows you how to import data using stream ingestion with upsert.
This guide shows you how to import data using stream ingestion with deduplication.
This guide shows you how to import data using stream ingestion with CLP.
Pinot file systems
By default, Pinot does not come with a storage layer, so the data sent to Pinot won't be stored in case of a system crash. In order to persistently store the generated segments, you will need to change the controller and server configs to add a deep store. See the file system guides for the related configs.
These guides show you how to import data and persist it in these file systems.
Pinot input formats
This guide shows you how to import data from various Pinot-supported input formats.
This guide shows you how to handle the complex type in the ingested data, such as map and array.
This guide shows additional examples on how to work with complex types.
This guide shows you how to handle records with dynamic schemas, like JSON log events.
Reloading and uploading existing Pinot segments
This guide shows you how to reload Pinot segments from your deep store.
This guide shows you how to upload Pinot segments from an old, closed Pinot instance.
Components
Discover the core components of Apache Pinot, enabling efficient data processing and analytics. Unleash the power of Pinot's building blocks for high-performance data-driven applications.
Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:
Storing data in columnar form to support high-performance scanning
Sharding of data to scale both storage and computation
A distributed architecture designed to scale capacity linearly
A tabular data model read by SQL queries
Components
Learn about the major components and logical abstractions used in Pinot.
Operator reference
Developer reference
Time boundary
Learn about time boundaries in hybrid tables.
Learn about time boundaries in hybrid tables. A hybrid table is the combination of an offline table and a real-time table that share the same name.
When querying these tables, the Pinot broker decides which records to read from the offline table and which to read from the real-time table. It does this using the time boundary.
How is the time boundary determined?
The time boundary is determined by looking at the maximum end time of the offline segments and the segment ingestion frequency specified for the offline table.
If it's set to hourly, then: time boundary = max(all offline segments' end time) - 1 HOUR
Otherwise: time boundary = max(all offline segments' end time) - 1 DAY
It is possible to force the hybrid table to use max(all offline segments' end time) by calling the API (V 0.12.0+)
Note that this will not automatically update the time boundary as more segments are added to the offline table; the API must be called each time a segment with a more recent end time is uploaded to the offline table. You can revert back to using the derived time boundary by calling the API:
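A sketch of the two calls, assuming a controller at localhost:9000; the endpoint paths below are an assumption and should be checked against the controller REST API for your version:
curl -X POST "http://localhost:9000/tables/myHybridTable/timeBoundary"
curl -X DELETE "http://localhost:9000/tables/myHybridTable/timeBoundary"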
Querying
When a Pinot broker receives a query for a hybrid table, the broker sends a time boundary annotated version of the query to the offline and real-time tables.
For example, if we executed the following query:
The broker would send the following query to the offline table:
And the following query to the real-time table:
The results of the two queries are merged by the broker before being returned to the client.
Ensure you have available Pinot Minion instances deployed within the cluster.
Pinot version is 0.11.0 or above
How it works
Parse the query to get the table name, the directory URI, and the list of options for the ingestion job.
Call the controller's minion task execution API endpoint to schedule the task on a minion.
The response contains the table name and the task job ID.
Usage Syntax
INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]*
Example
Insert Rows into Pinot
We are actively developing this feature...
The details will be revealed soon.
Server
Uncover the efficient data processing and storage capabilities of Apache Pinot's server component, optimizing performance for data-driven applications.
Pinot servers provide the primary storage for and perform the computation required to execute queries. A production Pinot cluster contains many servers. In general, the more servers, the more data the cluster can retain in tables, the lower latency the cluster can deliver on queries, and the more concurrent queries the cluster can process.
Servers are typically segregated into real-time and offline workloads, with "real-time" servers hosting only real-time tables, and "offline" servers hosting only offline tables. This is a ubiquitous operational convention, not a difference or an explicit configuration in the server process itself. There are two types of servers:
Offline
Broker
Discover how Apache Pinot's broker component optimizes query processing, data retrieval, and enhances data-driven applications.
Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return results to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.
A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.
Pinot brokers are modeled as Helix spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.
The broker ensures that all the rows of the table are queried exactly once so as to return correct, consistent results for a query. The brokers may also optimize by pruning segments that cannot match the query, without sacrificing the accuracy of the results.
Deep Store
Leverage Apache Pinot's deep store component for efficient large-scale data storage and management, enabling impactful data processing and analysis.
The deep store (or deep storage) is the permanent store for segment files.
It is used for backup and restore operations. New nodes in a cluster will pull down a copy of segment files from the deep store. If the local segment files on a server gets damaged in some way (or accidentally deleted), a new copy will be pulled down from the deep store on server restart.
The deep store stores a compressed version of the segment files and it typically won't include any indexes. These compressed files can be stored on a local file system or on a variety of other file systems. For more details on supported file systems, see .
Note: Deep store by itself is not sufficient for restore operations. Pinot stores metadata such as table config, schema, segment metadata in Zookeeper. For restore operations, both Deep Store as well as Zookeeper metadata are required.
Backfill Data
Batch ingestion of backfill data into Apache Pinot.
Introduction
Pinot batch ingestion involves two parts: a routine ingestion job (hourly or daily) and backfill. Here are some examples to show how routine batch ingestion works for a Pinot offline table:
Troubleshooting Pinot
Find debug information in Pinot
Pinot offers various ways to assist with troubleshooting and debugging problems that might happen.
Start with the table debug API, which will surface many of the commonly occurring problems. The debug API provides information such as tableSize, ingestion status, and error messages related to state transitions on servers.
The table debug API can be invoked via the Swagger UI, as in the following image:
File Systems
This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.
FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).
Pinot uses distributed file systems for the following purposes:
Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to DFS.
Reload a table segment
Reload a table segment in Apache Pinot.
When Pinot writes data to segments in a table, it saves those segments to a deep store location specified in your table configuration, such as a storage drive or Amazon S3 bucket.
If a new column is added to your table or schema configuration during ingestion, incorrect data may appear in the consuming segment(s). To ensure accurate values are reloaded, see how to .
Native text index
This page talks about native text indices and corresponding search functionality in Apache Pinot.
Experimental
This index is experimental and should only be used for testing. It is not recommended for use in production.
Instead, use the Lucene-based text index.
Release notes
The following summarizes Apache Pinot™ releases, from the latest one to the earliest one.
Note
Before upgrading from one version to another one, read the release notes. While the Pinot committers strive to keep releases backward-compatible and introduce new features in a compatible manner, your environment may have a unique combination of configurations/data/schema that may have been somehow overlooked. Before you roll out a new release of Pinot on your cluster, it is best that you run the compatibility test suite that Pinot provides. The tests can be easily customized to suit the configurations and tables in your Pinot cluster(s). As a good practice, you should build your own test suite, mirroring the table configurations, schema, sample data, and queries that are used in your cluster.
Organize raw data into buckets (e.g., /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (e.g., /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro).
Run a Pinot batch ingestion job, which points to a specific date folder like /var/pinot/airlineStats/rawdata/2014/01/01. The segment generation job will convert each such Avro file into a Pinot segment for that day and give it a unique name.
Run a Pinot segment push job to upload those segments with those unique names via a controller API.
IMPORTANT: The segment name is the unique identifier used to uniquely identify that segment in Pinot. If the controller gets an upload request for a segment with the same name - it will attempt to replace it with the new one.
This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.
How to backfill data in Pinot
Pinot supports data modification only at the segment level, which means you must replace entire segments to do backfills. The high-level idea is to repeat steps 2 (segment generation) and 3 (segment upload) mentioned above:
Backfill jobs must run at the same granularity as the daily job. E.g., if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g.: ‘/var/pinot/airlineStats/rawdata/2014/01/01’)
The backfill job will then generate segments with the same name as the original job (with the new data).
When uploading those segments to Pinot, the controller will replace the old segments with the new ones (segment names act like primary keys within Pinot) one by one.
Edge case example
Backfill jobs expect the same number of (or more) data files on the backfill date, so the segment generation job will create the same number of (or more) segments as the original run.
For example, assume table airlineStats has two segments (airlineStats_2014-01-01_2014-01-01_0, airlineStats_2014-01-01_2014-01-01_1) for date 2014/01/01 and the backfill input directory contains only one input file. Then the segment generation job will create just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only segment airlineStats_2014-01-01_2014-01-01_0 is replaced, and stale data in segment airlineStats_2014-01-01_2014-01-01_1 remains.
If the raw data is modified in such a way that the original time bucket has fewer input files than the first ingestion run, backfill will fail.
Pinot supports text indexing and search by building Lucene indices as sidecars to the main Pinot segments. While this is a great technique, it essentially limits the avenues of optimizations that can be done for Pinot specific use cases of text search.
How is Pinot different?
Pinot, like any other database/OLAP engine, does not need to conform to the entire full text search domain-specific language (DSL) that is traditionally used by full-text search (FTS) engines like ElasticSearch and Solr. In traditional SQL text search use cases, the majority of text searches belong to one of three patterns: prefix wildcard queries (like pino*), postfix or suffix wildcard queries (like *inot), and term queries (like pinot).
Native text indices in Pinot
In Pinot, native text indices are built from the ground up. They use a custom text-indexing engine, coupled with Pinot's powerful inverted indices, to provide a fast text search experience.
The benefits are that native text indices are 80-120% faster than Lucene-based indices for the text search use cases mentioned above. They are also 40% smaller on disk.
Native text indices support real-time text search. For REALTIME tables, native text indices allow data to be indexed in memory in the text index, while concurrently supporting text searches on the same index.
Historically, most text indices depend on the in-memory text index being written to first and then sealed, before searches are possible. This limits the freshness of the search, being near-real-time at best.
Native text indices come with a custom in-memory text index, which allows for real-time indexing and search.
Searching Native Text Indices
The TEXT_CONTAINS function supports text search on native text indices.
Examples:
TEXT_CONTAINS can be combined using standard boolean operators.
Note: TEXT_CONTAINS supports regex and term queries and works only on native indices. TEXT_CONTAINS supports standard regex patterns (as used by LIKE in the SQL standard), so there may be some syntactic differences from Lucene queries.
Creating Native Text Indices
Native text indices are created using field configurations. To indicate that an index type is native, specify it using properties in the field configuration:
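A hedged sketch of such a field configuration (the column name is hypothetical; the native engine is selected through the properties map):
"fieldConfigList": [
  {
    "name": "text_col",
    "encodingType": "RAW",
    "indexType": "TEXT",
    "properties": {"fstType": "native"}
  }
]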
SELECT count(*)
FROM events_OFFLINE
WHERE timeColumn <= $timeBoundary
SELECT count(*)
FROM events_REALTIME
WHERE timeColumn > $timeBoundary
SET taskName = 'myTask-s3';
SET input.fs.className = 'org.apache.pinot.plugin.filesystem.S3PinotFS';
SET input.fs.prop.accessKey = 'my-key';
SET input.fs.prop.secretKey = 'my-secret';
SET input.fs.prop.region = 'us-west-2';
INSERT INTO "baseballStats"
FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/'
POST /segments/{tableName}/reload
POST /segments/{tableName}/{segmentName}/reload
{
"status": "200"
}
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, <search_expression>)
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, "foo.*")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, ".*bar")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, "foo")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS ("col1", "foo") AND TEXT_CONTAINS ("col2", "bar")
Offline servers are responsible for downloading segments from the segment store, to host and serve queries from. When a new segment is uploaded to the controller, the controller decides the servers (as many as the replication factor) that will host the new segment and notifies them to download the segment from the segment store. On receiving this notification, the servers download the segment file and load the segment, in order to serve queries from it.
Real-time
Real-time servers directly ingest from a real-time stream (such as Kafka or EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.
Pinot servers are modeled as Helix participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more Helix partitions of one or more helix resources (i.e. one or more segments of one or more tables).
Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.
In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.
Let's take this example: we have real-time data for five days (March 23 to March 27), and offline data has been pushed up to March 25, which is two days behind real-time. The brokers maintain this time boundary.
Suppose we get a query to this table: select sum(metric) from table. The broker will split the query into two queries based on this time boundary, one for offline and one for real-time. This query becomes select sum(metric) from table_REALTIME where date >= Mar 25
and select sum(metric) from table_OFFLINE where date < Mar 25
The broker merges results from both these queries before returning the result to the client.
There are several different ways that segments are persisted in the deep store.
For offline tables, the batch ingestion job writes the segment directly into the deep store, as shown in the diagram below:
Batch job writing a segment into the deep store
The ingestion job then sends a notification about the new segment to the controller, which in turn notifies the appropriate server to pull down that segment.
For real-time tables, by default, a segment is first built in memory by the server. It is then uploaded to the lead controller (as part of the segment completion protocol sequence), which writes the segment into the deep store, as shown in the diagram below:
Server sends segment to Controller, which writes segments into the deep store
Having all segments go through the controller can become a system bottleneck under heavy load, in which case you can use the peer download policy, as described in Decoupling Controller from the Data Path.
When using this configuration, the server will directly write a completed segment to the deep store, as shown in the diagram below:
Server writing a segment into the deep store
Configuring the deep store
For hands-on examples of how to configure the deep store, see the following tutorials:
It can also be invoked directly by accessing the URL as follows. The API requires the tableName, and can optionally take tableType (offline|realtime) and a verbosity level.
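For example, assuming a controller running at localhost:9000 (the table name and parameter values are placeholders):
curl -X GET "http://localhost:9000/debug/tables/myTable?type=offline&verbosity=0" -H "accept: application/json"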
Pinot also provides a variety of operational metrics that can be used for creating dashboards, alerting and monitoring.
Finally, all Pinot components log debug information related to error conditions.
Debug a slow query or a query which keeps timing out
Use the following steps:
If the query executes, look at the query result. Specifically look at numEntriesScannedInFilter and numDocsScanned.
If numEntriesScannedInFilter is very high, consider adding indexes for the corresponding columns being used in the filter predicates. You should also think about partitioning the incoming data based on the dimension most heavily used in your filter queries.
If numDocsScanned is very high, that means the selectivity for the query is low and lots of documents need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.
If the query is not executing, you can extend the query timeout by appending a timeoutMs parameter to the query, for example, select * from mytable limit 10 option(timeoutMs=60000). Then repeat step 1, as needed.
Look at garbage collection (GC) stats for the corresponding Pinot servers. If a particular server seems to be running full GC all the time, you can do a couple of things, such as:
Decrease the total number of segments per server (by partitioning the data in a more efficient way).
To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:
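For example, a controller or server can be started with JVM options like the following (the plugin directory and plugin list are illustrative):
-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet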
You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.
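A hedged example for the s3 scheme, mirroring the [scheme] pattern shown elsewhere in these docs (the region value is illustrative):
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher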
You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:
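A sketch of the relevant part of an ingestion job spec (values are illustrative):
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-west-2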
Learn to build and manage Apache Pinot clusters, uncovering key components for efficient data processing and optimized analysis.
A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.
A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop:
Controller: Maintains cluster metadata and manages cluster resources.
Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.
Broker: Accepts queries from client processes and forwards them to servers for processing.
Server: Provides storage for segment files and compute for query processing.
(Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).
The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.
Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.
Helix is a cluster management solution that maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. Helix constantly monitors the cluster to ensure that the right hardware resources are allocated for the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.
Cluster configuration
For details of cluster configuration settings, see .
Cluster components
Helix divides nodes into logical components based on their responsibilities:
Participant
Participants are the nodes that host distributed, partitioned resources
Pinot servers are modeled as participants. For details about server nodes, see .
Spectator
Spectators are the nodes that observe the current state of each participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
Pinot brokers are modeled as spectators. For details about broker nodes, see .
Controller
The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.
Pinot controllers are modeled as controllers. For details about controller nodes, see .
Logical view
Another way to visualize the cluster is a logical view, where:
A cluster contains tenants,
Tenants contain tables, and
Tables contain segments.
Set up a Pinot cluster
Typically, there is only one cluster per environment/data center. There is no need to create multiple Pinot clusters because Pinot supports multi-tenancy.
To set up a cluster, see one of the following guides:
Running on Azure
This quickstart guide helps you get started running Pinot on Microsoft Azure.
Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
1.2 Install Helm
To install Helm, see .
For Mac users
Check helm version after installation.
This quickstart provides Helm scripts for Helm v3.0.0 and v2.12.1. Pick the script based on your Helm version.
1.3 Install Azure CLI
To install the Azure CLI, follow the instructions in the Azure documentation.
For Mac users
2. (Optional) Log in to your Azure account
This script will open your default browser to sign-in to your Azure Account.
3. (Optional) Create a Resource Group
Use the following script to create a resource group in the eastus location.
4. (Optional) Create a Kubernetes cluster(AKS) in Azure
This script will create a 3 node cluster named pinot-quickstart for demo purposes.
Modify the parameters in the following example command with your resource group and cluster details:
Once the command succeeds, the cluster is ready to be used.
5. Connect to an existing cluster
Run the following command to get the credential for the cluster pinot-quickstart that you just created:
To verify the connection, run the following:
6. Pinot quickstart
Follow this to deploy your Pinot demo.
7. Delete a Kubernetes Cluster
Inverted index
This page describes configuring the inverted index for Apache Pinot
We can define the forward index as a mapping from document IDs (also known as rows) to values. Similarly, an inverted index establishes a mapping from values to a set of document IDs, making it the "inverted" version of the forward index. When you frequently use a column for filtering operations like EQ (equal), IN (membership check), GT (greater than), etc., incorporating an inverted index can significantly enhance query performance.
Pinot supports two distinct types of inverted indexes: bitmap inverted indexes and sorted inverted indexes. Bitmap inverted indexes represent the actual inverted index type, whereas the sorted type is automatically available when the column is sorted. Both types of indexes necessitate the enabling of a dictionary for the respective column.
Bitmap inverted index
When a column is not sorted, and an inverted index is enabled for that column, Pinot maintains a mapping from each value to a bitmap of rows. This design ensures that value lookup operations take constant time, providing efficient querying capabilities.
When an inverted index is enabled for a column, Pinot maintains a map from each value to a bitmap of rows, which makes value lookup take constant time. If you have a column that is frequently used for filtering, adding an inverted index will improve performance greatly. You can create an inverted index on a multi-value column.
Inverted indexes are disabled by default and can be enabled for a column by specifying the index configuration within the table config:
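A hedged sketch using the fieldConfigList indexes block (the column name is hypothetical; verify against your Pinot version):
"fieldConfigList": [
  {
    "name": "theColumnName",
    "indexes": {
      "inverted": {}
    }
  }
]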
The older way to configure inverted indexes can also be used, although it is not actually recommended:
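That older form lists the columns under tableIndexConfig, for example:
"tableIndexConfig": {
  "invertedIndexColumns": ["theColumnName"]
}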
When the index is created
By default, bitmap inverted indexes are not generated when the segment is initially created; instead, they are created when the segment is loaded by Pinot. This behavior is governed by the table configuration option indexingConfig.createInvertedIndexDuringSegmentGeneration, which is set to false by default.
Sorted inverted index
As explained in the forward index section, a column that is both sorted and equipped with a dictionary is encoded in a specialized manner that serves the purpose of implementing both forward and inverted indexes. Consequently, when these conditions are met, an inverted index is effectively created without additional configuration, even if the configuration suggests otherwise. This sorted version of the forward index offers a lookup time complexity of log(n) and leverages data locality.
For instance, consider the following example: if a query includes a filter on the memberId column, Pinot will perform a binary search on memberId values to find the pair of docIds bounding the range for the corresponding filter value. If the query needs to scan values for other columns after filtering, values within that docId range will be located together, which means we can benefit from data locality.
A sorted inverted index indeed offers superior performance compared to a bitmap inverted index, but it's important to note that it can only be applied to sorted columns. In cases where query performance with a regular inverted index is unsatisfactory, especially when a large portion of queries involve filtering on the same column (e.g., memberId), using a sorted index can substantially enhance query performance.
Pinot On Kubernetes FAQ
This page has a collection of frequently asked questions about Pinot on Kubernetes with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
How to increase server disk size on AWS
The following is an example using Amazon Elastic Kubernetes Service (Amazon EKS).
1. Update Storage Class
In the Kubernetes (k8s) cluster, check the storage class: in Amazon EKS, it should be gp2.
Then update the StorageClass to ensure allowVolumeExpansion is set to true.
Once StorageClass is updated, it should look like this:
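A hedged sketch of the updated StorageClass (the provisioner and parameters depend on your EKS setup; the key point is allowVolumeExpansion):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true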
2. Update PVC
Once the storage class is updated, then we can update the PersistentVolumeClaim (PVC) for the server disk size.
Now we want to double the disk size for pinot-server-3.
The following is an example of current disks:
The following is the output of data-pinot-server-3:
Now, let's change the PVC size to 2T by editing the server PVC.
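For example (the namespace is an assumption; adjust to your deployment):
kubectl edit pvc data-pinot-server-3 -n pinot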
Once updated, the specification's PVC size is updated to 2T, but the status's PVC size is still 1T.
3. Restart pod to let it reflect
Restart the pinot-server-3 pod:
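For example (the StatefulSet controller recreates the pod automatically; the namespace is an assumption):
kubectl delete pod pinot-server-3 -n pinot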
Recheck the PVC size:
FST index
The FST index supports regex queries on text and decreases the on-disk index size by 4-6 times.
Only supports regex queries
Only supported on stored or completed Pinot segments (no consuming segments).
Only supported on dictionary-encoded columns.
Works better for prefix queries
Note: Lucene is case-sensitive, so take case into account when querying FST-indexed column(s). For example, Select * from table T where colA LIKE '%Value%', with an FST index on colA, will only return rows containing the string "Value" but not "value".
For more information on the FST construction and code, see .
Enable the FST index
To enable the FST index on a dictionary-encoded column, include the following configuration:
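A hedged sketch of a fieldConfigList entry (the column name is hypothetical; verify the field names against your Pinot version):
"fieldConfigList": [
  {
    "name": "theColumnName",
    "encodingType": "DICTIONARY",
    "indexTypes": ["FST"]
  }
]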
The FST index generates one FST index file (.lucene.fst). If the inverted index is also enabled on the column, the FST index can further take advantage of it.
For more information about enabling the FST index, see the ways to configure indexes in the table config.
Fix the bug that RealtimeToOfflineTask failed to progress with large time bucket gaps ().
The release is based on the release 0.9.1 with the following cherry-picks:
Tenant
Discover the tenant component of Apache Pinot, which facilitates efficient data isolation and resource management within Pinot clusters.
Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data in separate workloads from being stored or processed on the same physical hardware.
By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.
To support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant, which controls the nodes used by the table as servers and brokers. Multi-tenancy lets Pinot group all tables belonging to a particular use case under a single tenant name.
The concept of tenants is very important when multiple use cases are using Pinot and there is a need to provide quotas or some sort of isolation across tenants. For example, consider two tables, Table A and Table B, in the same Pinot cluster.
Running on AWS
This quickstart guide helps you get started running Pinot on Amazon Web Services (AWS).
In this quickstart guide, you will set up a Kubernetes Cluster on
1. Tooling Installation
HDFS as Deep Storage
This guide shows how to set up HDFS as deep storage for a Pinot segment.
To use HDFS as deep storage you need to include HDFS dependency jars and plugins.
Usage: StartServer
-serverHost <String> : Host name for controller. (required=false)
-serverPort <int> : Port number to start the server at. (required=false)
-serverAdminPort <int> : Port number to serve the server admin API at. (required=false)
-dataDir <string> : Path to directory containing data. (required=false)
-segmentDir <string> : Path to directory containing segments. (required=false)
-zkAddress <http> : Http address of Zookeeper. (required=false)
-clusterName <String> : Pinot cluster name. (required=false)
-configFileName <Config File Name> : Broker Starter Config file. (required=false)
-help : Print this message. (required=false)
#CONTROLLER
pinot.controller.storage.factory.class.[scheme]=className of the pinot file system
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
#SERVER
pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
And then export /opt/pinot/lib/hadoop-common-<release-version>.jar in the classpath.
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
brew update && brew install azure-cli
az login
AKS_RESOURCE_GROUP=pinot-demo
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} \
--location ${AKS_RESOURCE_GROUP_LOCATION}
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks create --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME} \
--node-count 3
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
kubectl get nodes
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
Defining tenants for tables
We can configure Table A with server tenant Tenant A and Table B with server tenant Tenant B. We can tag some of the server nodes for Tenant A and some for Tenant B. This will ensure that segments of Table A only reside on servers tagged with Tenant A, and segments of Table B only reside on servers tagged with Tenant B. The same isolation can be achieved at the broker level, by configuring broker tenants for the tables.
Table isolation using tenants
No need to create separate clusters for every table or use case!
Tenant configuration
This tenant is defined in the tenants section of the table config.
This section contains two main fields, broker and server, which determine the tenants used for the broker and server components of this table.
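As a sketch, assuming broker and server tenants named brokerTenantName and serverTenantName, the tenants section of the table config looks like this:
"tenants": {
  "broker": "brokerTenantName",
  "server": "serverTenantName"
}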
In the above example:
The table will be served by brokers that have been tagged as brokerTenantName_BROKER in Helix.
If this were an offline table, the offline segments for the table would be hosted on Pinot servers tagged in Helix as serverTenantName_OFFLINE.
If this were a real-time table, the real-time segments (both consuming and completed ones) would be hosted on Pinot servers tagged in Helix as serverTenantName_REALTIME.
Create a tenant
Broker tenant
Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant by tagging three untagged broker nodes as sampleBrokerTenant_BROKER.
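A minimal sketch of such a broker tenant config (the exact payload format may differ slightly across Pinot versions):
{
  "tenantRole": "BROKER",
  "tenantName": "sampleBrokerTenant",
  "numberOfInstances": 3
}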
To create this tenant, use the following command. The creation will fail if the number of untagged broker nodes is less than numberOfInstances.
Follow instructions in Getting Pinot to get Pinot locally, and then
Check out the table config in the Rest API to make sure it was successfully uploaded.
Server tenant
Here's a sample server tenant config. This will create a server tenant sampleServerTenant by tagging 1 untagged server node as sampleServerTenant_OFFLINE and 1 untagged server node as sampleServerTenant_REALTIME.
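A minimal sketch of such a server tenant config (again, the exact payload format may differ slightly across versions):
{
  "tenantRole": "SERVER",
  "tenantName": "sampleServerTenant",
  "offlineInstances": 1,
  "realtimeInstances": 1
}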
To create this tenant, use the following command. The creation will fail if the number of untagged server nodes is less than offlineInstances + realtimeInstances.
Follow instructions in Getting Pinot to get Pinot locally, and then
Check out the table config in the Rest API to make sure it was successfully uploaded.
Discover the controller component of Apache Pinot, enabling efficient data and query management.
The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of real-time tables and offline tables). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
The Pinot controller is responsible for the following:
Maintaining global metadata (e.g., configs and schemas) of the system with the help of Zookeeper, which is used as the persistent metadata store.
Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions)
Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.
Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.
Serving endpoints for segment uploads, which are used in offline data pushes. The controllers are also responsible for initializing real-time consumption and for periodically coordinating the persistence of real-time segments into the segment store.
Undertaking other management activities, such as managing segment retention and running validations.
For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or .
Running the periodic task manually
The controller runs several periodic tasks in the background to perform activities such as management and validation. Each periodic task has a configurable run frequency with a default value. Each task runs on its own schedule and can also be triggered manually if needed. The task runs on the lead controller for each table.
For period task configuration details, see .
Use the GET /periodictask/names API to fetch the names of all the periodic tasks running on your Pinot cluster.
To manually run a named periodic task, use the GET /periodictask/run API:
The log request ID (api-09630c07) can be used to search the pinot-controller log file for entries related to the execution of the periodic task that was manually run.
If tableName (and its type OFFLINE or REALTIME) is not provided, the task will run against all tables.
Starting a controller
Make sure you've . If you're using Docker, make sure to . To start a controller:
Schema
Explore the Schema component in Apache Pinot, vital for defining the structure and data types of Pinot tables, enabling efficient data processing and analysis.
Each table in Pinot is associated with a schema. A schema defines:
Fields in the table with their data types.
Whether the table uses column-based or table-based null handling. For more information, see Null value support.
The schema is stored in Zookeeper along with the table configuration.
Schema naming in Pinot follows typical database table naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.
Categories
A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:
Category
Description
Pinot does not enforce strict rules on which of these categories a column belongs to; rather, the categories can be thought of as hints that let Pinot perform internal optimizations.
For example, metrics may be stored without a dictionary and can have a different default null value.
The categories are also relevant when doing segment merge and rollups. Pinot uses the dimension and time fields to identify records against which to apply merge/rollups.
Metrics aggregation is another example, where Pinot uses the dimension and time columns as the key and automatically aggregates values for the metric columns.
For configuration details, see .
Date and time fields
Since Pinot doesn't have dedicated DATETIME data type support, you need to input time in STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format, such as an epoch timestamp, to perform operations. You can refer to for more details on supported formats.
Creating a schema
First, make sure your cluster is up and running.
Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.
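A minimal sketch of what such a schema file could look like; the column names below are illustrative placeholders, not the exact columns of the official flight example:
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {"name": "origin", "dataType": "STRING"},
    {"name": "destination", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "price", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "ts", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}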
For more details on constructing a schema file, see the .
Then, we can upload the sample schema provided above using either a Bash command or REST API call.
Check out the schema in the to make sure it was successfully uploaded
Pinot Data Explorer
Pinot Data Explorer is a user-friendly interface in Apache Pinot for interactive data exploration, querying, and visualization.
Once you have set up a cluster, you can start exploring the data and the APIs using the Pinot Data Explorer.
The first screen that you'll see when you open the Pinot Data Explorer is the Cluster Manager. The Cluster Manager provides a UI to operate and manage your cluster.
If you want to view the contents of a server, click on its instance name. You'll then see the following:
To view the baseballStats table, click on its name, which will show the following screen:
From this screen, we can edit or delete the table, edit or adjust its schema, and perform several other operations.
For example, if we want to add yearID to the list of inverted indexes, click on Edit Table, add the extra column, and click Save:
Query Console
Let's run some queries on the data in the Pinot cluster. Navigate to to see the querying interface.
We can see our baseballStats table listed on the left (you will see meetupRSVP or airlineStats if you used the streaming or the hybrid ). Click on the table name to display the names and data types of the table's columns.
You can also execute a sample query select * from baseballStats limit 10 by typing it in the text box and clicking the Run Query button.
Cmd + Enter can also be used to run the query when focused on the console.
Here are some sample queries you can try:
Pinot supports a subset of standard SQL. For more information, see .
Rest API
The contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management including health check, instances management, schema and table management, data segments management.
Let's check out the tables in this cluster by going to , click Try it out, and then click Execute. We can see the baseballStats table listed here. We can also see the exact cURL call made to the controller API.
You can look at the configuration of this table by going to , click Try it out, type baseballStats in the table name, and then click Execute.
Let's check out the schemas in the cluster by going to , click Try it out, and then click Execute. We can see a schema called baseballStats in this list.
Take a look at the schema by going to , click Try it out, type baseballStats in the schema name, and then click Execute.
Finally, let's check out the data segments in the cluster by going to , click Try it out, type in baseballStats in the table name, and then click Execute. There's 1 segment for this table, called baseballStats_OFFLINE_0.
To learn how to upload your own data and schema, see or .
Google Cloud Storage
This guide shows you how to import data from GCP (Google Cloud Platform).
Enable the Google Cloud Storage using the pinot-gcs plugin. In the controller or server, add the config:
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
The GCP file system provides the following options:
projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.
gcpKey - Location of the JSON file containing GCP keys. Refer to the GCP documentation to download the keys.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:
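For example, on the controller (the values shown are placeholders):
pinot.controller.storage.factory.class.gs.projectId=my-gcp-project
pinot.controller.storage.factory.class.gs.gcpKey=/path/to/gcp-keys.json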
Examples
Job spec
Controller config
Server config
Minion config
Flink
Batch ingestion of data into Apache Pinot using Apache Flink.
Pinot supports Apache Flink as a processing framework to push segment files to the database.
The Pinot distribution contains an Apache Flink SinkFunction that can be used as part of an Apache Flink application (streaming or batch) to write directly into a designated Pinot database.
Example
Flink application
Here is an example code snippet showing how to use the PinotSinkFunction in a Flink streaming application:
As shown in the example above, the only required information from the Pinot side is the table schema and the table config.
For a more detailed executable, refer to the .
Table Config
PinotSinkFunction uses mostly the TableConfig object to infer the batch ingestion configuration to start a SegmentWriter and SegmentUploader to communicate with the Pinot cluster.
Note that even though the Flink application in the example above runs in streaming mode, the data is still batched together and flushed/uploaded to Pinot once the flush threshold is reached. It is not a direct streaming write into Pinot.
Here is an example table config.
The only required configurations are:
"outputDirURI": where PinotSinkFunction should write the constructed segment file to
"push.controllerUri": which Pinot cluster (controller) URL PinotSinkFunction should communicate with.
The rest of the configurations are standard for any Pinot table.
HDFS
This guide shows you how to import data from HDFS.
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
The HDFS implementation provides the following options:
hadoop.conf.path: Absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml and core-site.xml.
hadoop.write.checksum: Create a checksum while pushing an object. Default is false.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.hdfs. where node is either controller or server depending on the config
The Kerberos configs should be used only if your Hadoop installation is secured with Kerberos. Refer to the Hadoop documentation for information on how to secure Hadoop using Kerberos.
You must provide proper Hadoop dependencies jars from your Hadoop installation to your Pinot startup scripts.
Push HDFS segment to Pinot Controller
To push HDFS segment files to Pinot controller, send the HDFS path of your newly created segment files to the Pinot Controller. The controller will download the files.
This example curl request tells the controller to download the segment files for the appropriate table:
Examples
Job spec
Standalone Job:
Hadoop Job:
Controller config
Server config
Minion config
Upload a table segment
Upload a table segment in Apache Pinot.
This procedure uploads one or more table segments that have been stored as Pinot segment binary files outside of Apache Pinot, such as if you had to close an original Pinot cluster and create a new one.
Choose one of the following:
If your data is in a location that uses HDFS, create a segment fetcher.
If your data is on a host where you have SSH access, use the Pinot Admin script.
Before you upload, do the following:
or confirm one exists that matches the segment you want to upload.
or confirm one exists that matches the segment you want to upload.
(If needed) Upload the schema and table configs.
Create a segment fetcher
If the data is in a location using HDFS, you can create a segment fetcher, which will push segment files from external systems such as those running Hadoop or Spark. It is possible to implement your own segment fetcher with an external jar by providing a class that extends this interface.
Use the Pinot Admin script to upload segments
To do this, you need to create a JobSpec configuration file. For details, see . This file defines the job, including things like the job type, the input directory or URI, and the table name that the segments will be connected to.
You can upload a Pinot segment using several methods:
Segment tar push
Segment URI push
Segment metadata push
Segment tar push
This is the original and default push mechanism. It requires the segment to be stored locally, or that the segment can be opened as an InputStream on PinotFS, so we can stream the entire segment tar file to the controller.
The push job will upload the entire segment tar file to the Pinot controller.
The Pinot controller will save the segment into the controller segment directory (Local or any PinotFS), then extract segment metadata, and add the segment to the table.
While you can create a JobSpec for this job, in simple instances you can push without one.
Upload segment files to your Pinot server from controller using the Pinot Admin script as follows:
All options should be prefixed with - (hyphen)
Option
Description
Segment URI push
This push mechanism requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
URI push is lightweight on the client side, while the controller side requires the same amount of work as the tar push.
The push job posts this segment tar URI to the Pinot controller.
The Pinot controller saves the segment into the controller segment directory (local or any PinotFS), then extracts segment metadata, and adds the segment to the table.
Upload segment files to your Pinot server using the JobSpec you create and the Pinot Admin script as follows:
Segment metadata push
This push mechanism also requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
Metadata push is lightweight on the controller side. There is no deep store download involved from the controller side.
The push job downloads the segment based on the URI, extracts the metadata, and uploads the metadata to the Pinot controller.
The Pinot controller adds the segment to the table based on the metadata.
Upload segment metadata to your Pinot server using the JobSpec you create and the Pinot Admin script as follows:
Azure Data Lake Storage
This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2)
Enable the Azure Data Lake Storage using the pinot-adls plugin. In the controller or server, add the config:
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
Azure Blob Storage provides the following options:
accountName: Name of the Azure account under which the storage is created.
accessKey: Access key required for the authentication.
fileSystemName
Each of these properties should be prefixed by pinot.[node].storage.factory.class.adl2. where node is either controller or server depending on the config, like this:
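For example, on the controller (the values shown are placeholders):
pinot.controller.storage.factory.class.adl2.accountName=my-account
pinot.controller.storage.factory.class.adl2.accessKey=my-access-key
pinot.controller.storage.factory.class.adl2.fileSystemName=my-container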
Examples
Job spec
Controller config
Server config
Minion config
Timestamp index
Use a timestamp index to speed up time-based queries at different granularities.
This feature is supported from Pinot 0.11+.
Background
The TIMESTAMP data type introduced in the stores the value as a millisecond epoch long value.
Typically, users don't need this low-level granularity for analytics queries. Scanning the data and converting time values can be costly for large datasets.
A common query pattern for timestamp columns is filtering on a time range and then grouping by different time granularities (day, month, and so on).
Typically, this requires the query executor to extract values and apply the transform functions before doing the filter and group-by, with no way to leverage the dictionary or an index.
This was the inspiration for the Pinot timestamp index, which is used to improve query performance for range and group-by queries on TIMESTAMP columns.
Supported data type
A TIMESTAMP index can only be created on the TIMESTAMP data type.
Timestamp Index
You can configure the granularity for a Timestamp data type column. Then:
Pinot will pre-generate one column per time granularity using a forward index and range index. The naming convention is $${ts_column_name}$${ts_granularity}, where the timestamp column ts with granularities DAY, MONTH will have two extra columns generated: $ts$DAY and $ts$MONTH.
Example query usage:
Some preliminary benchmarking shows the query performance across 2.7 billion records improved from 45 secs to 4.2 secs using a timestamp index and a query like this:
vs.
Usage
The timestamp index is configured on a per column basis inside the fieldConfigList section in the table configuration.
Specify the timestampConfig field. This object must contain a field called granularities, which is an array with at least one of the following values:
MILLISECOND
SECOND
MINUTE
Sample config:
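A sketch of the relevant table config fragment, assuming a timestamp column named ts with DAY and MONTH granularities; the encodingType and indexTypes fields shown here follow the usual fieldConfigList conventions:
"fieldConfigList": [
  {
    "name": "ts",
    "encodingType": "DICTIONARY",
    "indexTypes": ["TIMESTAMP"],
    "timestampConfig": {
      "granularities": ["DAY", "MONTH"]
    }
  }
]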
Ingest streaming data from Amazon Kinesis
This guide shows you how to ingest a stream of records from an Amazon Kinesis topic into a Pinot table.
To ingest events from an Amazon Kinesis stream into Pinot, add the following configs to your table config:
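A sketch of the Kinesis-specific part of streamConfigs, using only the properties described below; the stream name, region, and iterator settings are placeholders, and the usual stream ingestion settings (consumer factory, decoder class, flush thresholds) also belong in this map:
"streamConfigs": {
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "my-kinesis-stream",
  "region": "us-west-1",
  "shardIteratorType": "LATEST",
  "maxRecordsToFetch": "20"
}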
where the Kinesis specific properties are:
Property
Description
Hadoop
Batch ingestion of data into Apache Pinot using Apache Hadoop.
Segment Creation and Push
Pinot supports Apache Hadoop as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Hadoop code to process your files and convert and upload them to Pinot.
You can follow the to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
Configure indexes
Learn how to apply indexes to a Pinot table. This guide assumes that you have followed the guide.
Pinot supports a series of different indexes that can be used to optimize query performance. In this guide, we'll learn how to add indexes to the events table that we set up in the guide.
Why do we need indexes?
If no indexes are applied to the columns in a Pinot segment, the query engine needs to scan through every document, checking whether that document meets the filter criteria provided in a query. This can be a slow process if there are a lot of documents to scan.
Stream ingestion with Dedup
Deduplication support in Apache Pinot.
Pinot provides native support for deduplication (dedup) during real-time ingestion (v0.11.0+).
Prerequisites for enabling dedup
To enable dedup on a Pinot table, make the following table configuration and schema changes:
Running on GCP
This quickstart guide helps you get started running Pinot on Google Cloud Platform (GCP).
In this quickstart guide, you will set up a Kubernetes Cluster on
1. Tooling Installation
Query FAQ
This page has a collection of frequently asked questions about queries with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, .
Querying
Segment compaction on upserts
Use segment compaction on upsert-enabled real-time tables.
Overview of segment compaction
Compacting a segment replaces the completed segment with a compacted segment that only contains the latest version of records. For more information about how to use upserts on a real-time table in Pinot, see .
The Pinot upsert feature stores all versions of the record ingested into immutable segments on disk. Even though the previous versions are not queried, they continue to add to the storage overhead. To remove older records (no longer used in query results) and reclaim storage space, we need to compact Pinot segments periodically. Segment compaction is done via a new minion task. To schedule Pinot tasks periodically, see the .
Bloom filter
This page describes configuring the Bloom filter for Apache Pinot
When a column is configured to use this filter, Pinot creates one Bloom filter per segment. The Bloom filter helps prune segments that do not contain any record matching an EQUALITY or IN predicate.
Note: Support for the IN clause is limited to predicates with <= 10 values, to keep the pruning overhead minimal.
This is useful for query patterns like the one below, where a Bloom filter is defined on the playerID column of the table:
Vector index
Overview
Apache Pinot now supports a Vector Index for efficient similarity searches over high-dimensional vector embeddings. This feature introduces the capability to store and query float array columns (multi-valued) using a vector similarity algorithm.
Dimension table
Batch ingestion of data into Apache Pinot using dimension tables.
Dimension tables are a special kind of offline table from which data can be looked up via the lookup UDF, providing join-like functionality.
Dimension tables are replicated on all the hosts for a given tenant to allow faster lookups. When a table is marked as a dimension table, it will be replicated on all the hosts, which means that these tables must be small in size.
A dimension table cannot be part of a .
Configure dimension tables using the following properties in the table configuration:
Dimension columns are typically used in slice and dice operations for answering business queries. Some operations for which dimension columns are used:
GROUP BY - group by one or more dimension columns along with aggregations on one or more metric columns
Filter clauses such as WHERE
Metric
These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns.
Some operations for which metric columns are used:
Aggregation - SUM, MIN, MAX, COUNT, AVG, etc.
Filter clauses such as WHERE
DateTime
This column represents time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config.
The primary time column is used by Pinot to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH .
Common operations that can be done on time column:
: Name of the file system to use, for example, the container name (similar to the bucket name in S3).
enableChecksum: Enable MD5 checksum for verification. Default is false.
Kinesis region e.g. us-west-1
accessKey
Kinesis access key
secretKey
Kinesis secret key
shardIteratorType
Set to LATEST to consume only new records, TRIM_HORIZON to start from the earliest sequence number, or AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER to start consumption from a particular sequence number.
maxRecordsToFetch
... Default is 20.
Kinesis supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and security manager has permission to access the variable,
Instance profile credentials delivered through the Amazon EC2 metadata service
You must provide read access level permissions for Pinot to work with an AWS Kinesis data stream. See the AWS documentation for details.
Although you can also specify the accessKey and secretKey in the properties above, we don't recommend this insecure method. Use it only for non-production proof-of-concept (POC) setups. You can also specify other AWS fields, such as AWS_SESSION_TOKEN, as environment variables or in the config, and they will work.
Resharding
In Kinesis, whenever you reshard a stream, it is done via split or merge operations on shards. If you split a shard, the shard closes and creates 2 new children shards. So if you started with shard0, and then split it, it would result in shard1 and shard2. Similarly, if you merge 2 shards, both those will close and create a child shard. So in the same example, if you merge shards 1 and 2, you'll end up with shard3 as the active shard, while shard0, shard1, shard2 will remain closed forever.
You will see a period where the ideal state shows all segments ONLINE, as the parent shards have naturally completed ingesting and we're waiting for the RealtimeValidationManager to kick off ingestion from the child shards.
ShardID is of the format "shardId-000000000001". We use the numeric part as the partitionId. Our partitionId variable is an integer; if shardIds grow beyond Integer.MAX_VALUE, the partitionId will overflow.
Segment-size-based thresholds for segment completion will not work. They assume that partition "0" always exists. However, once shard 0 is split or merged, partition 0 will no longer exist.
streamType
This should be set to "kinesis"
stream.kinesis.topic.name
Kinesis stream name
region
Next, change the execution config in the job spec to the following:
You can check out the sample job spec here.
Finally, execute the Hadoop job using the following command:
Ensure the environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
Data Preprocessing before Segment Creation
We've seen requests that data be massaged (for example, partitioned, sorted, or resized) before creating and pushing segments to Pinot.
The MapReduce job called SegmentPreprocessingJob is the best fit for this use case, regardless of whether the input data is in AVRO or ORC format.
Check the below example to see how to use SegmentPreprocessingJob.
In Hadoop properties, set the following to enable this job:
In table config, specify the operations in preprocessing.operations that you'd like to enable in the MR job, and then specify the exact configs regarding those operations:
preprocessing.num.reducers
Minimum number of reducers. Optional. Used when partitioning is disabled and resizing is enabled. This parameter avoids producing too many small input files for Pinot, which would leave the Pinot server holding too many small segments and spawning too many threads.
preprocessing.max.num.records.per.file
Maximum number of records per reducer. Optional. Unlike preprocessing.num.reducers, this parameter avoids having too few large input files for Pinot, which would miss the advantage of multi-threading when querying. When not set, each reducer generates one output file. When set (e.g., to M), the original output file is split into multiple files, and each new output file contains at most M records. It does not matter whether partitioning is enabled or not.
For more details on this MR job, refer to this document.
When indexes are applied, the query engine can more quickly work out which documents satisfy the filter criteria, reducing the time it takes to execute the query.
What indexes does Pinot support?
By default, Pinot creates a forward index for every column. The forward index generally stores documents in insertion order.
However, before flushing the segment, Pinot does a single pass over every column to see whether the data is sorted. If data is sorted, Pinot creates a sorted (forward) index for that column instead of the forward index.
For real-time tables you can also explicitly tell Pinot that one of the columns should be sorted. For more details, see the [Sorted Index Documentation](https://docs.pinot.apache.org/basics/indexing/forward-index#real-time-tables).
For filtering documents within a segment, Pinot supports the following indexing techniques:
Inverted index: Used for exact lookups.
Range index: Used for range queries.
Text index: Used for phrase, term, boolean, prefix, or regex queries.
Geospatial index: Based on H3, a hexagon-based hierarchical gridding. Used for finding points that exist within a certain distance from another point.
JSON index: Used for querying columns in JSON documents.
Star-tree index: Pre-aggregates results across multiple columns.
View events table
Let's see how we can apply these indexing techniques to our data. To recap, the events table has the following fields:
Date Time Fields
Dimensions Fields
Metric Fields
ts
uuid
count
We might want to write queries that filter on the ts and uuid columns, so these are the columns on which we would want to configure indexes.
Since the data we're ingesting into the Kafka topic is implicitly ordered by timestamp, the ts column already has a sorted index, so any queries that filter on this column are already optimized.
So that leaves us with the uuid column.
Add an inverted index
We're going to add an inverted index to the uuid column so that queries filtering on that column return more quickly. We need to add the following to the table config:
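Assuming the table uses the standard tableIndexConfig block, the change is a sketch like this:
"tableIndexConfig": {
  "invertedIndexColumns": ["uuid"]
}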
Once you've done that, you'll need to click Reload All Segments and then Yes to apply the indexing change to all segments.
Check the index has been applied
We can check that the index has been applied to all our segments by querying Pinot's REST API. You can find Swagger documentation at localhost:9000/help.
The following query will return the indexes defined on the uuid column:
To be able to dedup records, a primary key is needed to uniquely identify a given record. To define a primary key, add the field primaryKeyColumns to the schema definition.
Note this field expects a list of columns, as the primary key can be composite.
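For example, in the schema (the column name is illustrative):
"primaryKeyColumns": ["id"]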
While ingesting a record, if its primary key is found to be already present, the record will be dropped.
Partition the input stream by the primary key
An important requirement for a Pinot dedup table is to partition the input stream by the primary key. For Kafka messages, this means the producer should set the key in the send API. If the original stream is not partitioned, then a stream processing job (e.g., Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
Use strictReplicaGroup for routing
The dedup Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment for the segments. Moreover, dedup poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use that, configure instanceSelectorType in Routing as follows:
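A sketch of the corresponding routing section in the table config:
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}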
instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.
Other limitations
The high-level consumer is not allowed for the input stream ingestion, which means stream.kafka.consumer.type must be lowLevel.
The incoming stream must be partitioned by the primary key such that all records with a given primary key are consumed by the same Pinot server instance.
Enable dedup in the table configurations
To enable dedup for a REALTIME table, add the following to the table config.
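A sketch of that addition, using the default hash function:
"dedupConfig": {
  "dedupEnabled": true,
  "hashFunction": "NONE"
}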
Supported values for hashFunction are NONE, MD5 and MURMUR3, with the default being NONE.
Metadata TTL
The server stores the existing primary keys in a dedup metadata map kept on the JVM heap. As the dedup metadata grows, heap memory pressure increases, which may affect the performance of ingestion and queries. You can set a positive metadata TTL to enable the TTL mechanism and keep the metadata size bounded. By default, the table's time column is used as the dedup time column. The time unit of the TTL is the same as that of the dedup time column. The TTL should be set long enough so that new records can be deduplicated before their primary keys get removed.
Enable preload for faster server restarts
When ingesting new records, the server has to read the metadata map to check for duplicates. But when a server restarts, the documents in existing segments are already unique, as ensured by the dedup logic during real-time ingestion, so the metadata map can be bootstrapped with write-only operations, which is faster.
This feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config, where N is the number of threads to use for preloading. It is 0 by default, which disables the preloading feature. This preloading thread pool is shared with the preloading of upsert tables.
Best practices
Unlike other real-time tables, a dedup table takes up more memory resources because it needs to keep the primary key and its corresponding segment reference in memory. As a result, it's important to plan the capacity beforehand and monitor resource usage. Here are some recommended practices for using dedup tables.
Create the Kafka topic with more partitions. The number of Kafka partitions determines the number of partitions of the Pinot table. The more partitions you have in the Kafka topic, the more Pinot servers you can distribute the Pinot table across, and therefore the more you can scale the table horizontally.
A dedup table maintains an in-memory map from the primary key to the segment reference, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. In addition, consider the hashFunction config in the dedup config, which can be MD5 or MURMUR3, to store the 128-bit hash code of the primary key instead. This is useful when your primary key takes more space. Keep in mind that this hash may introduce collisions, though the chance is very low.
Monitoring: Set up a dashboard over the metric pinot.server.dedupPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking its growth, which is proportional to the growth in memory usage.
Capacity planning: It's useful to plan capacity beforehand to ensure you don't run into resource constraints later. A simple way is to measure the rate of primary keys in the Kafka throughput per partition and multiply it by the primary key space cost to approximate the memory usage. A heap dump is also useful for checking the memory usage so far on a dedup table instance.
I get the following error when running a query, what does it mean?
This implies that the Pinot broker assigned to the table specified in the query was not found. A common root cause is a typo in the table name in the query. Another, less common reason is that no broker carries the required broker tenant tag for the table.
What are all the fields in the Pinot query's JSON response?
SQL Query fails with "Encountered 'timestamp' was expecting one of..."
"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.
Other commonly encountered reserved keywords are date, time, table.
Filtering on STRING column WHERE column = "foo" does not work?
For filtering on STRING columns, use single quotes:
ORDER BY using an alias doesn't work?
The fields in the ORDER BY clause must be one of the group by clauses or aggregations, BEFORE applying the alias. Therefore, this will not work:
But, this will work:
Does pagination work in GROUP BY queries?
No. Pagination only works for SELECTION queries.
How do I increase timeout for a query ?
You can add this at the end of your query: option(timeoutMs=X). The following example uses a timeout of 20 seconds for the query:
You can also use SET "timeoutMs" = 20000; SELECT COUNT(*) from myTable.
For changing the timeout on the entire cluster, set this property pinot.broker.timeoutMs in either broker configs or cluster configs (using the POST /cluster/configs API from Swagger).
How do I cancel a query?
Add these two configs for the Pinot server and broker to start tracking running queries. Query tracking entries are added and cleaned up as queries start and end, so they should not consume many resources.
Then use the Rest APIs on Pinot controller to list running queries and cancel them via the query ID and broker ID (as query ID is only local to broker), like in the following:
How do I optimize my Pinot table for doing aggregations and group-by on high cardinality columns ?
In order to speed up aggregations, you can enable metrics aggregation on the required column by adding a metric field in the corresponding schema and setting aggregateMetrics to true in the table configuration. You can also use a star-tree index config for columns like these (see here for more about star-tree).
How do I verify that an index is created on a particular column ?
There are two ways to verify this:
Log in to a server that hosts segments of this table. Inside the data directory, locate the segment directory for this table. In this directory, there is a file named index_map which lists all the indexes and other data structures created for each segment. Verify that the requested index is present here.
During a query: Use the column in the filter predicate and check the value of numEntriesScannedInFilter. If this value is 0, then the index is working as expected (this check works for the inverted index).
Does Pinot use a default value for LIMIT in queries?
Yes, Pinot uses a default value of LIMIT 10 in queries. The reason behind this default value is to avoid unintentionally submitting expensive queries that end up fetching or processing a lot of data from Pinot. Users can always overwrite this by explicitly specifying a LIMIT value.
Does Pinot cache query results?
Pinot does not cache query results. Each query is computed in its entirety. Note, though, that running the same or a similar query multiple times will naturally pull segment pages into memory, making subsequent calls faster. Also, for real-time systems, the data is changing in real time, so results cannot be cached. For offline-only systems, a caching layer can be built on top of Pinot, with an invalidation mechanism to invalidate the cache when data is pushed into Pinot.
I'm noticing that the first query is slower than subsequent queries. Why is that?
Pinot memory maps segments. It warms up during the first query, when segments are pulled into the memory by the OS. Subsequent queries will have the segment already loaded in memory, and hence will be faster. The OS is responsible for bringing the segments into memory, and also removing them in favor of other segments when other segments not already in memory are accessed.
How do I determine if the star-tree index is being used for my query?
The query execution engine will prefer to use the star-tree index for all queries where it can be used. The criteria to determine whether the star-tree index can be used is as follows:
All aggregation function + column pairs in the query must exist in the star-tree index.
All dimensions that appear in filter predicates and group-by should be star-tree dimensions.
For queries where the above conditions hold, a star-tree index is used. For other queries, the execution engine defaults to the next-best available index.
Vector Index is implemented using HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor (ANN) search.
Adds support for a predicate and function:
VECTOR_SIMILARITY(v1, v2, [optional topK]) to retrieve the topK closest vectors based on similarity.
The similarity function can be used as part of a query to filter and rank results.
Examples
Below is an example schema designed for a use case involving product reviews with vector embeddings for each review.
Schema
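A sketch of such a schema; the field list is trimmed to the columns discussed below, and the data types are assumptions based on that description:
{
  "schemaName": "fineFoodReviews",
  "dimensionFieldSpecs": [
    {"name": "ProductId", "dataType": "STRING"},
    {"name": "UserId", "dataType": "STRING"},
    {"name": "Text", "dataType": "STRING"},
    {"name": "embedding", "dataType": "FLOAT", "singleValueField": false}
  ]
}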
In this schema:
• The embedding column is a multi-valued float array designed to store high-dimensional vector embeddings (e.g., 1536 dimensions from an NLP model).
• Other fields, such as ProductId, UserId, and Text, store metadata and review text.
Table Config
To enable the Vector Index, configure the table with the appropriate fieldConfigList. The embedding column is specified to use the Vector Index with HNSW for similarity searches.
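A sketch of the fieldConfigList entry for the embedding column; the encodingType and indexTypes values are assumptions that follow the usual fieldConfigList conventions, while the properties match those explained below:
"fieldConfigList": [
  {
    "name": "embedding",
    "encodingType": "RAW",
    "indexTypes": ["VECTOR"],
    "properties": {
      "vectorIndexType": "HNSW",
      "vectorDimension": 1536,
      "vectorDistanceFunction": "COSINE",
      "version": 1
    }
  }
]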
Explanation of Properties:
vectorIndexType:
Specifies the type of vector index to use. Currently supports HNSW.
vectorDimension:
Defines the dimensionality of the vectors stored in the column. (e.g., 1536 for typical embeddings from models like OpenAI or BERT).
vectorDistanceFunction:
Specifies the distance metric for similarity computation. Options include:
INNER_PRODUCT:
• Computes the inner product (dot product) of the two vectors.
• Typically used when vectors are normalized and higher scores indicate greater similarity.
L2:
• Measures the Euclidean distance between vectors.
• Suitable for tasks where spatial closeness in high-dimensional space indicates similarity.
L1:
• Measures the Manhattan distance between vectors (sum of absolute differences of coordinates).
• Useful for some scenarios where simpler distance metrics are preferred.
COSINE:
• Measures cosine similarity, which considers the angle between vectors.
• Ideal for normalized vectors where orientation matters more than magnitude.
version:
Specifies the version of the Vector Index implementation.
Query
VECTOR_SIMILARITY:
A predicate that retrieves the top k closest vectors to the query vector.
Inputs:
embedding: The vector column.
Query vector (literal array).
Optional topK parameter (default: 10).
isDimTable: Set to true.
ingestionConfig.batchIngestionConfig.segmentIngestionType: Set to REFRESH.
dimensionTableConfig.disablePreload: By default, dimension tables are preloaded to allow for fast lookups. Set this to true to trade lookup speed for memory: only the segment reference and docID are stored instead of the whole row in the dimension table's hash map.
controller.dimTable.maxSize: Determines the maximum size quota for a dimension table in a cluster. Table creation will fail if the storage quota exceeds this maximum size.
dimensionFieldSpecs: To look up dimension values, dimension tables need a primary key. For details, see dimensionFieldSpecs.
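Putting the properties above together, the table config of a dimension table might contain a fragment like this sketch (disablePreload is shown with its default value):
"isDimTable": true,
"dimensionTableConfig": {
  "disablePreload": false
},
"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "REFRESH"
  }
}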
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
# name: execution framework name
name: 'hadoop'
# segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
# segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentMetadataPushJobRunner'
# extraConfigs: extra configs for execution framework.
extraConfigs:
# stagingDir is used on the distributed filesystem to host all the segments; this directory is then moved entirely to the output directory.
stagingDir: your/local/dir/staging
SELECT count(colA) as aliasA, colA from tableA GROUP BY colA ORDER BY aliasA
SELECT count(colA) as sumA, colA from tableA GROUP BY colA ORDER BY count(colA)
SELECT COUNT(*) from myTable option(timeoutMs=20000)
pinot.server.enable.query.cancellation=true // false by default
pinot.broker.enable.query.cancellation=true // false by default
GET /queries: to show running queries as tracked by all brokers
Response example: `{
"Broker_192.168.0.105_8000": {
"7": "select G_old from baseballStats limit 10",
"8": "select G_old from baseballStats limit 100"
}
}`
DELETE /query/{brokerId}/{queryId}[?verbose=false/true]: to cancel a running query
with queryId and brokerId. The verbose is false by default, but if set to true,
responses from servers running the query also return.
Response example: `Cancelled query: 8 with responses from servers:
{192.168.0.105:7501=404, 192.168.0.105:7502=200, 192.168.0.105:7500=200}`
SELECT ProductId,
UserId,
l2_distance(embedding, ARRAY[-0.0013143676, -0.011042999, ...]) AS l2_dist,
n_tokens,
combined
FROM fineFoodReviews
WHERE VECTOR_SIMILARITY(embedding, ARRAY[-0.0013143676, -0.011042999, ...], 5)
ORDER BY l2_dist ASC
LIMIT 10;
To compact segments on upserts, complete the following steps:
Ensure task scheduling is enabled and a minion is available.
Add the following to your table configuration. These configurations (except schedule) determine which segments to compact.
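A sketch of such a task configuration, using the thresholds described below; the cron-style schedule value is only an example:
"task": {
  "taskTypeConfigsMap": {
    "UpsertCompactionTask": {
      "schedule": "0 */10 * ? * *",
      "bufferTimePeriod": "7d",
      "invalidRecordsThresholdPercent": "30",
      "invalidRecordsThresholdCount": "100000",
      "tableMaxNumTasks": "100",
      "validDocIdsType": "SNAPSHOT"
    }
  }
}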
bufferTimePeriod: To compact segments as soon as they are complete, set this to "0d". To delay compaction (the configuration above delays it by 7 days, "7d"), specify the number of days to wait after a segment completes.
invalidRecordsThresholdPercent (Optional) Limits the older records allowed in the completed segment as a percentage of the total number of records in the segment. In the example above, the completed segment may be selected for compaction when 30% of the records in the segment are old.
invalidRecordsThresholdCount (Optional) Limits the older records allowed in the completed segment by record count. In the example above, if the segment contains more than 100K records, it may be selected for compaction.
tableMaxNumTasks (Optional) Limits the number of tasks allowed to be scheduled.
validDocIdsType (Optional) Specifies the source of validDocIds to fetch when running the data compaction. The valid types are SNAPSHOT, IN_MEMORY, IN_MEMORY_WITH_DELETE
SNAPSHOT: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot from the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.
WARNING
Using an in-memory based validDocIds type (IN_MEMORY, IN_MEMORY_WITH_DELETE) is dangerous, as it does not guarantee consistency in some edge cases (e.g., fetching the validDocIds bitmap while the server is restarting and updating validDocIds).
Because segment compaction is an expensive operation, we do not recommend setting invalidRecordsThresholdPercent and invalidRecordsThresholdCount too low (close to 1). By default, all configurations above are 0, so no thresholds are applied.
Example
The following example includes a dataset with 24M records and 240K unique keys that have each been duplicated 100 times. After ingesting the data, there are 6 segments (5 completed segments and 1 consuming segment) with a total estimated size of 22.8MB.
Example dataset
Submitting the query “set skipUpsert=true; select count(*) from transcript_upsert” before compaction produces 24,000,000 results:
Results before segment compaction
After the compaction tasks are complete, the Minion Task Manager UI reports the following.
Minion compaction task completed
Segment compaction generates a task for each segment to compact. Five tasks were generated in this case because 90% of the records (3.6–4.5M records) in the completed segments are considered ready for compaction, exceeding the configured thresholds.
If a completed segment only contains old records, Pinot immediately deletes the segment (rather than creating a task to compact it).
Submitting the query again shows the count matches the set of 240K unique keys.
Results after segment compaction
Once segment compaction has completed, the total number of segments remains the same and the total estimated size drops to 2.77MB.
To further improve query latency, merge small segments into larger ones.
A Bloom filter is a probabilistic data structure used to definitively determine if an element is not present in a dataset, but it cannot be employed to determine if an element is present in the dataset. This limitation arises because Bloom filters may produce false positives but never yield false negatives.
An intriguing aspect of these filters is the existence of a mathematical formula that establishes a relationship between their size, the cardinality of the dataset they index, and the rate of false positives.
In Pinot, this cardinality corresponds to the number of unique values expected within each segment. If necessary, the false positive rate and the index size can be configured.
Configuration
Bloom filters are disabled by default, meaning that a column will not have a Bloom filter unless it is explicitly configured in the table configuration.
There are three optional parameters to configure the Bloom filter:
fpp (default: 0.05): False positive probability of the Bloom filter (from 0 to 1).
maxSizeInBytes (default: 0, i.e. unlimited): Maximum size of the Bloom filter.
loadOnHeap (default: false): Whether to load the Bloom filter using heap memory or off-heap memory.
The lower the fpp (false positive probability), the greater the accuracy of the Bloom filter, but this reduction in fpp will also lead to an increase in the index size. It's important to note that maxSizeInBytes takes precedence over fpp. If maxSizeInBytes is set to a value greater than 0 and the calculated size of the Bloom filter, based on the specified fpp, exceeds this size limit, Pinot will adjust the fpp to ensure that the Bloom filter size remains within the specified limit.
Similar to other indexes, a Bloom filter can be explicitly deactivated by setting the special parameter disabled to true.
Example
For example the following table config enables the Bloom filter in the playerId column using the default values:
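A sketch of that table config fragment, assuming the new fieldConfigList.indexes style; an empty bloom object picks up all the defaults:
"fieldConfigList": [
  {
    "name": "playerId",
    "indexes": {
      "bloom": {}
    }
  }
]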
If some parameters need to be customized, they can be included in fieldConfigList.indexes.bloom. Remember that even though the example below customizes all parameters, you can modify just the ones you need.
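A sketch with all three parameters customized (the values are illustrative):
"fieldConfigList": [
  {
    "name": "playerId",
    "indexes": {
      "bloom": {
        "fpp": 0.01,
        "maxSizeInBytes": 1000000,
        "loadOnHeap": true
      }
    }
  }
]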
Older configuration
Use default settings
To use default values, include the name of the column in tableIndexConfig.bloomFilterColumns.
For example:
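A sketch of this older tableIndexConfig form:
"tableIndexConfig": {
  "bloomFilterColumns": ["playerId"]
}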
Customized parameters
To specify custom parameters, add a new entry in tableIndexConfig.bloomFilterConfig object. The key should be the name of the column and the value should be an object similar to the one that can be used in the Bloom section of fieldConfigList.
The Docker instructions on this page are still WIP
This example assumes you have set up your cluster using Pinot in Docker.
Data Stream
First, we need to set up a stream. Pinot has out-of-the-box real-time ingestion support for Kafka. Other streams can be plugged in; see Pluggable Streams.
Let's set up a demo Kafka cluster locally, and create a sample topic transcript-topic.
Start Kafka
Start Kafka cluster on port 9876 using the same Zookeeper from the quick-start examples.
Create a Kafka topic
Creating a schema
If you followed , you have already pushed a schema for your sample table. If not, see to learn how to create a schema for your sample data.
Creating a table configuration
If you followed , you pushed an offline table and schema. To create a real-time table configuration for the sample, use the following table configuration for the transcript table. For a more detailed overview of tables, see .
Uploading your schema and table configuration
Next, upload the table and schema to the cluster. As soon as the real-time table is created, it will begin ingesting from the Kafka topic.
Loading sample data into stream
Use the following sample JSON file for the transcript table data in the next step.
Push the sample JSON file into the Kafka topic, using the Kafka script from the Kafka download.
Ingesting streaming data
As soon as data flows into the stream, the Pinot table will consume it, and the data will be ready for querying. Browse to the Query Console running in your Pinot instance (we use localhost in this link as an example) to examine the real-time data.
Amazon S3
This guide shows you how to import data from files stored in Amazon S3.
Enable the Amazon S3 file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
You can configure the S3 file system using the following options:
Configuration
Description
Each of these properties should be prefixed by pinot.[node].storage.factory.s3. where node is either controller or server depending on the config
e.g.
The S3 file system supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.
Examples
Job spec
Controller config
Server config
Minion config
Stream ingestion with CLP
Support for encoding fields with CLP during ingestion.
This is an experimental feature. Configuration options and usage may change frequently until it is stabilized.
When performing stream ingestion of JSON records using Kafka, users can encode specific fields with CLP by using a CLP-specific StreamMessageDecoder.
CLP is a compressor designed to encode unstructured log messages in a way that makes them more compressible while retaining the ability to search them. It does this by decomposing the message into three fields:
the message's static text, called a log type;
repetitive variable values, called dictionary variables; and
non-repetitive variable values (called encoded variables since we encode them specially if possible).
Searches are similarly decomposed into queries on the individual fields.
Although CLP is designed for log messages, other unstructured text like file paths may also benefit from its encoding.
For example, consider this JSON record:
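A hypothetical record of this shape (the field values are invented for illustration; the float 0.335 is included because it is referenced in the explanation below):
{
  "timestamp": 1672531200000,
  "message": "INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds",
  "logPath": "/mnt/data/application_123/container_15/stdout"
}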
If the user specifies the fields message and logPath should be encoded with CLP, then the StreamMessageDecoder will output:
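Continuing the hypothetical record above, the output would look roughly like the following sketch. The _dictionaryVars suffix and the concrete placeholder bytes and encoded integer values are assumptions here; the exact values are determined by CLP's encoding:
{
  "timestamp": 1672531200000,
  "message_logtype": "INFO Task \x12 assigned to container: [ContainerID:\x12], operation took \x13 seconds",
  "message_dictionaryVars": ["task_12", "container_15"],
  "message_encodedVars": [1234567890123456789],
  "logPath_logtype": "/mnt/data/\x12/\x12/stdout",
  "logPath_dictionaryVars": ["application_123", "container_15"],
  "logPath_encodedVars": []
}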
In the fields with the _logtype suffix, \x11 is a placeholder for an integer variable, \x12 is a placeholder for a dictionary variable, and \x13 is a placeholder for a float variable. In message_encodedVars, the float variable 0.335 is encoded as an integer using CLP's custom encoding.
All remaining fields are processed in the same way as they are in org.apache.pinot.plugin.inputformat.json.JSONRecordExtractor. Specifically, fields in the table's schema are extracted from each record and any remaining fields are dropped.
Configuration
Table Index
Assuming the user wants to encode message and logPath as in the example, they should change/add the following settings to their tableIndexConfig (we omit irrelevant settings for brevity):
stream.kafka.decoder.prop.fieldsForClpEncoding is a comma-separated list of names for fields that should be encoded with CLP.
We use variable-length dictionaries for the logtype and dictionary variables since their length can vary significantly.
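A hedged sketch of those tableIndexConfig settings, assuming the decoder class name org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder (verify the class name against your Pinot version) and the fields message and logPath:

"tableIndexConfig": {
  "streamConfigs": {
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder",
    "stream.kafka.decoder.prop.fieldsForClpEncoding": "message,logPath"
  },
  "varLengthDictionaryColumns": [
    "message_logtype",
    "message_dictionaryVars",
    "logPath_logtype",
    "logPath_dictionaryVars"
  ]
}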
Schema
For the table's schema, users should configure the CLP-encoded fields as follows (we omit irrelevant settings for brevity):
We use the maximum possible length for the logtype and dictionary variable columns.
The dictionary and encoded variable columns are multi-valued columns.
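For illustration only, a hedged sketch of the field specs for the message field (the logPath field follows the same pattern; the exact specs depend on your schema):

"dimensionFieldSpecs": [
  {"name": "message_logtype", "dataType": "STRING", "maxLength": 2147483647},
  {"name": "message_dictionaryVars", "dataType": "STRING", "singleValueField": false, "maxLength": 2147483647},
  {"name": "message_encodedVars", "dataType": "LONG", "singleValueField": false}
]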
Searching and decoding CLP-encoded fields
To decode CLP-encoded fields, use the CLPDECODE transform function.
To search CLP-encoded fields, you can combine CLPDECODE with LIKE. Note, this may decrease performance when querying a large number of rows.
We are working to integrate efficient searches on CLP-encoded columns as another UDF. The development of this feature is being tracked in a separate issue.
Spark
Batch ingestion of data into Apache Pinot using Apache Spark.
Pinot supports Apache Spark (2.x and 3.x) as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Spark code needed to process your files, convert them to segments, and upload them to Pinot.
To set up Spark, do one of the following:
Use the Spark-Pinot Connector. For more information, see the .
Create and update a table configuration
Create and edit a table configuration in the Pinot UI or with the API.
In Apache Pinot, create a table by creating a JSON file, generally referred to as your table config. Update, add, or delete parameters as needed, and then reload the file.
Create a Pinot table configuration
Before you create a Pinot table configuration, you must first have a running Pinot cluster with broker and server tenants.
SELECT count(*),
       dateTrunc('WEEK', ts) AS tsWeek
FROM airlineStats
WHERE dateTrunc('WEEK', ts) > fromDateTime('2014-01-16', 'yyyy-MM-dd')
GROUP BY tsWeek
LIMIT 10
SELECT dateTrunc('YEAR', event_time) AS y,
       dateTrunc('MONTH', event_time) AS m,
       sum(pull_request_commits)
FROM githubEvents
GROUP BY y, m
LIMIT 1000
Option(timeoutMs=3000000)
serverSideEncryption
(Optional) The server-side encryption algorithm used when storing objects in Amazon S3 (currently aws:kms is supported). Set to null to disable SSE.
ssekmsKeyId
(Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.
ssekmsEncryptionContext
(Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and security manager has permission to access the variable,
Instance profile credentials delivered through the Amazon EC2 metadata service
region
The AWS Data center region in which the bucket is located
accessKey
(Optional) AWS access key required for authentication. This should only be used for testing purposes, since the key is stored in plain configuration rather than a secret store.
secretKey
(Optional) AWS secret key required for authentication. This should only be used for testing purposes, since the key is stored in plain configuration rather than a secret store.
endpoint
(Optional) Override endpoint for s3 client.
disableAcl
If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.
You can follow the wiki to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
If you do build Pinot from source, consider using the build-shaded-jar Maven profile by adding -Pbuild-shaded-jar. While Pinot does not bundle Spark into its jar, it does bundle certain Hadoop libraries.
Next, you need to change the execution config in the job spec to the following:
To run Spark ingestion, you need the following jars in your classpath
pinot-batch-ingestion-spark plugin jar - available in plugins-external directory in the package
pinot-all jar - available in lib directory in the package
These jars can be specified using spark.driver.extraClassPath or any other option.
For loading any other plugins that you want to use, use:
The complete spark-submit command should look like this:
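As a hedged sketch of a local run (the jar paths, plugin list, and Spark master are assumptions to adapt to your environment):

spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.include=pinot-s3,pinot-parquet" \
  --conf "spark.driver.extraClassPath=${PINOT_ROOT_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-${PINOT_VERSION}-shaded.jar:${PINOT_ROOT_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  "${PINOT_ROOT_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml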
Ensure environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
Note: You should change the master to yarn and deploy-mode to cluster for production environments.
We stopped including the spark-core dependency in our jars after the 0.10.0 release. If you run into runtime issues, use 0.11.0-SNAPSHOT or a later version of pinot-batch-ingestion-spark; you can either build from source or download the latest master build jars.
Running in Cluster Mode on YARN
If you want to run the Spark job in cluster mode on a YARN/EMR cluster, do the following:
Build Pinot from source with option -DuseProvidedHadoop
Copy Pinot binaries to S3, HDFS or any other distributed storage that is accessible from all nodes.
Copy Ingestion spec YAML file to S3, HDFS or any other distributed storage. Mention this path as part of --files argument in the command
Add --jars options that contain the s3/hdfs paths to all the required plugin and pinot-all jar
Point the classpath to the Spark working directory. Generally, just specifying the jar names without any paths works. Do the same for the main jar and the spec YAML file.
Example
For Spark 3.x, replace pinot-batch-ingestion-spark-2.4 with pinot-batch-ingestion-spark-3.2 in all places in the commands.
Also, ensure the classpath in the ingestion spec is changed from org.apache.pinot.plugin.ingestion.batch.spark. to org.apache.pinot.plugin.ingestion.batch.spark3.
FAQ
Q - I am getting the following exception - Class has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
Since the 0.8.0 release, Pinot binaries are compiled with JDK 11. If you are using Spark along with Hadoop 2.7+, you need the Java 8 version of Pinot. Currently, you need to build the JDK 8 version from source.
Q - I am not able to find pinot-batch-ingestion-spark jar.
For Pinot version prior to 0.10.0, the spark plugin is located in plugin dir of binary distribution. For 0.10.0 and later, it is located in pinot-external dir.
Q - Spark is not able to find the jars, leading to java.nio.file.NoSuchFileException
This means the classpath for the Spark job has not been configured properly. If you are running Spark in a distributed environment such as YARN or k8s, make sure both spark.driver.classpath and spark.executor.classpath are set. Also, the jars in driver.classpath should be added to the --jars argument in spark-submit so that Spark can distribute those jars to all the nodes in your cluster. You also need to provide the appropriate scheme with the file path when running the jar. In this doc we have used local://, but it can be different depending on your cluster setup.
Q - Spark job failing while pushing the segments.
It can be because of misconfigured controllerURI in job spec yaml file. If the controllerURI is correct, make sure it is accessible from all the nodes of your YARN or k8s cluster.
If the push type is already set to APPEND, this is likely due to a missing timeColumnName in your table config. If you can't provide a time column, use the segment name generation configs in the ingestion spec. Generally, using the inputFile segment name generator should fix your issue.
Q - I am getting java.lang.RuntimeException: java.io.IOException: Failed to create directory: pinot-plugins-dir-0/plugins/*
Removing -Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins from spark.driver.extraJavaOptions should fix this. As long as plugins are mentioned in classpath and jars argument it should not be an issue.
Q - Getting a Class not found exception
Check whether the extraClassPath arguments contain all the plugin jars for both driver and executors, and that all the plugin jars are mentioned in the --jars argument. If both of these are correct, check that extraClassPath contains local filesystem classpaths and not S3, HDFS, or other distributed filesystem classpaths.
To update existing data and segments, after you update and save the changes to the table config file, do the following as applicable:
When you add or modify indexes or the table schema, perform a segment reload. To reload all segments:
In the Pinot UI, from the table page, click Reload All Segments.
Using the Pinot API, send POST /segments/{tableName}/reload.
When you re-partition data, perform a segment refresh. To refresh, replace an existing segment with a new one by uploading a segment that reuses the existing filename. Using the Pinot API, send POST /segments?tableName={yourTableName}.
When you change the transform function used to populate a derived field or increase the number of partitions in an upsert-enabled table, perform a table re-bootstrap. One way to do this is to delete and recreate the table:
Using the Pinot API, first send DELETE /tables/{tableName} followed by POST /tables with the new table configuration.
When you change the stream topic or change the Kafka cluster containing the Kafka topic you want to consume from, perform a real-time ingestion pause and resume. To pause and resume real-time ingestion:
Using the Pinot API, first send POST /tables/{tableName}/pauseConsumption followed by POST /tables/{tableName}/resumeConsumption.
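For illustration, hedged curl sketches of these operations, assuming a controller at localhost:9000 and a hypothetical table named myTable:

# Reload all segments after adding or modifying indexes or the schema
curl -X POST "http://localhost:9000/segments/myTable/reload"

# Pause and resume real-time consumption after a stream or Kafka cluster change
curl -X POST "http://localhost:9000/tables/myTable/pauseConsumption"
curl -X POST "http://localhost:9000/tables/myTable/resumeConsumption"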
Update a Pinot table in the UI
To update a table configuration in the Pinot UI, do the following:
In the Cluster Manager, click the Tenant Name of the tenant that hosts the table you want to modify.
Click the Table Name in the list of tables in the tenant.
Click the Edit Table button. This opens a pop-up window containing the table configuration. Edit the contents in this window, then click Save when you are done.
Update a Pinot table using the API
To update a table configuration using the Pinot API, do the following:
Get the current table configuration with GET /tables/{tableName}.
Modify the file locally.
Upload the edited configuration with PUT /tables/{tableName}, passing the modified JSON as the request body.
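A hedged sketch of the round trip with curl, assuming a controller at localhost:9000 and a table named airlineStats (note that the GET response may wrap the config by table type, so extract the inner table config before uploading):

# Fetch the current table config and save it locally
curl -X GET "http://localhost:9000/tables/airlineStats" > airlineStats_table.json

# Edit airlineStats_table.json, then upload the modified config
curl -X PUT -H "Content-Type: application/json" \
  -d @airlineStats_table.json \
  "http://localhost:9000/tables/airlineStats"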
Example Pinot table configuration file
This example comes from the Apache Pinot Quickstart Examples. This table configuration defines a table called airlineStats_OFFLINE, which you can interact with by running the example.
# executionFrameworkSpec: Defines the framework used to run ingestion jobs.
executionFrameworkSpec:
# name: execution framework name
name: 'spark'
# segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
#segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
# extraConfigs: extra configs for execution framework.
extraConfigs:
# stagingDir is used in the distributed filesystem to host all the segments, which are then moved to the output directory.
stagingDir: your/local/dir/staging
IN_MEMORY: Indicates that the validDocIds bitmap is read from the real-time server's in-memory state.
IN_MEMORY_WITH_DELETE: Indicates that the validDocIds bitmap is read from the real-time server's in-memory state and that the valid document ids take deleted records into account. UpsertConfig's deleteRecordColumn must be provided for this type.
First, download the Pinot distribution for this tutorial. You can either download a packaged release or build a distribution from the source code.
Prerequisites
Install with JDK 11 or 21. JDK 17 should work, but it is not officially supported.
For JDK 8 support, Pinot 0.12.1 is the last version that can be compiled from the source code.
Pinot 1.0+ no longer supports JDK 8; build with JDK 11+.
Note that some installations of the JDK do not contain the JNI bindings necessary to run all tests. If you see an error like java.lang.UnsatisfiedLinkError while running tests, you might need to change your JDK.
Download the distribution or build from source by selecting one of the following tabs:
Download the latest binary release from , or use this command:
Extract the TAR file:
Navigate to the directory containing the launcher scripts:
You can also find older versions of Apache Pinot at . For example, to download Pinot 0.10.0, run the following command:
Follow these steps to check out the code and build Pinot locally.
Set up a cluster
Now that we've downloaded Pinot, it's time to set up a cluster. There are two ways to do this: through quick start or through setting up a cluster manually.
Quick start
Pinot comes with quick start commands that launch instances of Pinot components in the same process and import pre-built datasets.
For example, the following quick start command launches Pinot with a baseball dataset pre-loaded:
For a list of all the available quick start commands, see the .
Manual cluster
If you want to play with bigger datasets (more than a few megabytes), you can launch each component individually.
The video below is a step-by-step walk through for launching the individual components of Pinot and scaling them to multiple instances.
You can find the commands that are shown in this video in the .
The examples below assume that you are using Java 11+.
If you are using Java 8, add the following settings inside JAVA_OPTS. So, for example, instead of this:
Use the following:
Start Zookeeper
You can use to browse the Zookeeper instance.
Start Pinot Controller
Start Pinot Broker
Start Pinot Server
Start Pinot Minion
Start Kafka
Once your cluster is up and running, you can head over to to learn how to run queries against the data.
Setup cluster with config files
You can also customize the cluster by modifying the config files and starting each component with its own config file:
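For example, a hedged sketch of starting components with config files (the file names are assumptions):

bin/pinot-admin.sh StartController -configFileName conf/pinot-controller.conf
bin/pinot-admin.sh StartBroker -configFileName conf/pinot-broker.conf
bin/pinot-admin.sh StartServer -configFileName conf/pinot-server.conf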
Start a Pinot component in debug mode with IntelliJ
Set break points and inspect variables by starting a Pinot component with debug mode in IntelliJ.
The following example demonstrates server debugging:
First, start zookeeper, controller, and broker as described in the steps above.
Then, use the following configuration (placed under $PROJECT_DIR$/.run) to start the server, replacing the metrics-core version and cluster name as needed.
This is an example of how to use it.
Complex Type (Array, Map) Handling
Complex type handling in Apache Pinot.
Commonly, ingested data has a complex structure. For example, Avro schemas have records and arrays while JSON supports objects and arrays.
Apache Pinot's data model supports primitive data types (including int, long, float, double, BigDecimal, string, bytes) and limited multi-value types, such as an array of primitive types. Simple data types allow Pinot to build fast indexing structures for good query performance, but they require some handling of complex structures.
There are two options for complex type handling:
Convert the complex-type data into a JSON string and then build a JSON index.
Use the built-in complex-type handling rules in the ingestion configuration.
On this page, we'll show how to handle these complex-type structures with each of these two approaches. We will process some example data, consisting of the field group from the .
This object has two child fields and the child group is a nested array with elements of object type.
JSON indexing
Apache Pinot provides a powerful JSON index to accelerate value lookup and filtering on the column. To convert an object such as group with a complex type to JSON, add the following to your table configuration.
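A hedged sketch of the relevant table configuration, assuming the jsonFormat transform function and a destination column named group_json:

"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "group_json",
      "transformFunction": "jsonFormat(\"group\")"
    }
  ]
},
"tableIndexConfig": {
  "jsonIndexColumns": [
    "group_json"
  ]
}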
The config transformConfigs transforms the object group to a JSON string group_json, which then creates the JSON indexing with configuration jsonIndexColumns. To read the full spec, see .
Also, note that group is a reserved keyword in SQL and therefore needs to be quoted in transformFunction.
The columnName can't use the same name as any of the fields in the source JSON data, for example, if our source data contains the field group and we want to transform the data in that field before persisting it, the destination column name would need to be something different, like group_json.
Note that you do not need to worry about the maxLength of the field group_json on the schema, because "JSON" data type does not have a maxLength and will not be truncated. This is true even though "JSON" is stored as a string internally.
The schema will look like this:
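A hedged sketch of the relevant part of the schema (other fields omitted):

"dimensionFieldSpecs": [
  {
    "name": "group_json",
    "dataType": "JSON"
  }
]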
For the full specification, see .
With this, you can start to query the nested fields under group. For more details about the supported JSON functions, see the JSON functions documentation.
Ingestion configurations
Though JSON indexing is a handy way to process the complex types, there are some limitations:
It’s not performant to group by or order by a JSON field, because JSON_EXTRACT_SCALAR is needed to extract the values in the GROUP BY and ORDER BY clauses, which invokes the function evaluation.
It does not work with Pinot's multi-value functions, such as DISTINCTCOUNTMV.
Alternatively, from Pinot 0.8, you can use the complex-type handling in ingestion configurations to flatten and unnest the complex structure and convert them into primitive types. Then you can reduce the complex-type data into a flattened Pinot table, and query it via SQL. With the built-in processing rules, you do not need to write ETL jobs in another compute framework such as Flink or Spark.
To process this complex type, you can add the configuration complexTypeConfig to the ingestionConfig. For example:
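A hedged sketch of such a configuration, using the option names described below (the unnested field path is an assumption based on the example data):

"ingestionConfig": {
  "complexTypeConfig": {
    "fieldsToUnnest": ["group.group_topics"],
    "delimiter": ".",
    "collectionNotUnnestedToJson": "NON_PRIMITIVE"
  }
}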
With complexTypeConfig, all the map objects will be flattened to direct fields automatically. And with fieldsToUnnest, a record with a nested collection will be unnested into multiple records. For instance, the example at the beginning will transform into two rows with this configuration.
Note that:
The nested field group_id under group is flattened to group.group_id. The default delimiter is the dot character (.). You can choose another delimiter by specifying the delimiter configuration under complexTypeConfig. This flattening rule also applies to maps in the collections to be unnested.
You can find the full specifications of the table config and the table schema .
You can then query the table with primitive values using the following SQL query:
The dot (.) is a reserved character in SQL, so you need to quote the flattened columns in the query.
Infer the Pinot schema from the Avro schema and JSON data
When there are complex structures, it can be challenging and tedious to figure out the Pinot schema manually. To help with schema inference, Pinot provides utility tools to take the Avro schema or JSON data as input and output the inferred Pinot schema.
To infer the Pinot schema from Avro schema, you can use a command like this:
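A hedged sketch of that command; the file paths, schema name, time column, and fields to unnest are placeholders:

bin/pinot-admin.sh AvroSchemaToPinotSchema \
  -timeColumnName hoursSinceEpoch \
  -avroSchemaFile /tmp/test.avsc \
  -pinotSchemaName myTable \
  -outputDir /tmp/schemaOutput \
  -fieldsToUnnest entries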
Note you can input configurations like fieldsToUnnest similar to the ones in complexTypeConfig. And this will simulate the complex-type handling rules on the Avro schema and output the Pinot schema in the file specified in outputDir.
Similarly, you can use a command like the following to infer the Pinot schema from a file of JSON objects.
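A hedged sketch of the JSON variant; again, the paths and parameter values are placeholders:

bin/pinot-admin.sh JsonToPinotSchema \
  -timeColumnName hoursSinceEpoch \
  -jsonFile /tmp/test.json \
  -pinotSchemaName myTable \
  -outputDir /tmp/schemaOutput \
  -fieldsToUnnest payload.commits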
You can check out an example of this run in this .
0.4.0
0.4.0 release introduced the theta-sketch based distinct count function, an S3 filesystem plugin, a unified star-tree index implementation, migration from TimeFieldSpec to DateTimeFieldSpec, etc.
Summary
0.4.0 release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.
Notable New Features
Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
Used time column from table config instead of schema (#5320)
Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392
Major Bug Fixes
Do not release the PinotDataBuffer when closing the index (#5400)
Handled a no-arg function in query parsing and expression tree (#5375)
Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
Work in Progress
Upsert: support overriding data in the real-time table (#4261).
Add pinot upsert features to pinot common (#5175)
Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc
Backward Incompatible Changes
TableConfig no longer supports de-serialization from a JSON string of a nested JSON string (i.e., no \" inside the JSON) (#5194)
The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):
Ingest records with dynamic schemas
Storing records with dynamic schemas in a table with a fixed schema.
Some domains (e.g., logging) generate records where each record can have a different set of keys, whereas Pinot tables have a relatively static schema. For records with varying keys, it's impractical to store each field in its own table column. However, most (if not all) fields may be important, so fields should not be dropped unnecessarily.
Additionally, search patterns on such a table can be complex and change frequently. Exact match, range queries, prefix/suffix match, wildcard search, and aggregation functions may be used on any old or newly created keys or values.
SchemaConformingTransformer
0.5.0
This release includes many new features on Pinot ingestion and connectors, query capability and a revamped controller UI.
Summary
This release includes many new features on Pinot ingestion and connectors (e.g., support for filtering during ingestion which is configurable in table config; support for json during ingestion; proto buf input format support and a new Pinot JDBC client), query capability (e.g., a new GROOVY transform function UDF) and admin functions (a revamped Cluster Manager UI & Query Console UI). It also contains many key bug fixes. See details below.
The release was cut from the following commit:
and the following cherry-picks:
If you're building with JDK 8, add Maven option -Djdk.version=8.
Navigate to the directory containing the setup scripts. Note that Pinot scripts are located under pinot-distribution/target, not the target directory under root.
Pinot can also be installed on Mac OS using the Brew package manager. For instructions on installing Brew, see the Brew documentation.
PINOT_VERSION=1.1.0 # set to the Pinot version you decide to use
wget https://downloads.apache.org/pinot/apache-pinot-$PINOT_VERSION/apache-pinot-$PINOT_VERSION-bin.tar.gz
The nested array group_topics under group is unnested into the top level, converting the output into a collection of two rows. Note the handling of the nested field within group_topics and the eventual top-level field group.group_topics.urlkey. All the collections to unnest must be included in the fieldsToUnnest configuration.
Collections not specified in fieldsToUnnest will be serialized into a JSON string, except for arrays of primitive values, which are ingested as a multi-value column by default. The behavior is defined by the collectionNotUnnestedToJson config, which takes the following values:
NON_PRIMITIVE - Converts the array to a multi-value column. (default)
ALL - Converts the array of primitive values to JSON string.
The SchemaConformingTransformer is a RecordTransformer that can transform records with dynamic schemas so that they can be ingested into a table with a static schema. The transformer takes record fields that don't exist in the schema and stores them in a catchall field. Moreover, it builds a __mergedTextIndex field and takes advantage of Lucene to support text search.
For example, consider this record:
Let's say the table's schema contains the following fields:
arrayField
mapField
nestedFields
nestedFields.stringField
json_data
json_data_no_idx
__mergedTextIndex
Without this transformer, the stringField field and fields ending with _noIdx would be dropped, and the storage of mapField and nestedFields would rely on the global complex-type handling setup without granular customization. With this transformer, however, the record is transformed into the following:
Notice that there are three reserved (and configurable) fields: json_data, json_data_no_idx, and __mergedTextIndex. The transformer does the following:
Flattens nested fields all the way to the leaf nodes and:
Applies special treatment if necessary, according to the config
If the key path matches the schema, puts the data into the dedicated field
Otherwise, puts it into json_data or json_data_no_idx depending on its key suffix
For keys in dedicated columns or json_data, puts them into __mergedTextIndex in the form of "Begin Anchor + value + Separator + key + End Anchor" to power text matching.
Additional functionalities by configurations
Drop fields with fieldPathsToDrop
Preserve a subtree without flattening with fieldPathsToPreserveInput and fieldPathsToPreserveInputWithIndex
Table Configurations
SchemaConformingTransformer Configuration
To use the transformer, add the schemaConformingTransformerConfig option in the ingestionConfig section of your table configuration, as shown in the following example.
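A hedged sketch of that option, wired to the reserved field names and suffix described above; the exact option keys should be checked against the SchemaConformingTransformer reference for your Pinot version:

"ingestionConfig": {
  "schemaConformingTransformerConfig": {
    "indexableExtrasField": "json_data",
    "unindexableExtrasField": "json_data_no_idx",
    "unindexableFieldSuffix": "_noIdx",
    "mergedTextIndexField": "__mergedTextIndex",
    "fieldPathsToDrop": [],
    "fieldPathsToPreserveInput": []
  }
}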
Other index configurations for the three reserved columns can be set as follows:
Specifically, a customizable JSON index can be configured using the JSON index indexPaths option.
Power the text search
Schema Design
With the help of the SchemaConformingTransformer, all data can be kept even without specifying dedicated columns in the table schema. However, to optimize storage and common query patterns, dedicated columns should be created based on usage:
Fields with frequent exact match query, e.g. region, log_level, runtime_env
Fields with range query, e.g. timestamp
High-frequency fields from messages, to reduce JSON index size and optimize group-by queries
Text Search
After putting each key/value pair into the __mergedTextIndex field, you will need a luceneAnalyzerClass to tokenize the document and a luceneQueryParserClass to query by tokens. Some common search patterns and their queries are:
Exact key/value match TEXT_MATCH(__mergedTextIndex, '"value:key"')
Wildcard value search in a key TEXT_MATCH(__mergedTextIndex, '/.* value .*:key/')
Global value exact match TEXT_MATCH(__mergedTextIndex, '/"value"/')
Global value wildcard match TEXT_MATCH(__mergedTextIndex, '/.* value .*/')
The luceneAnalyzerClass and luceneQueryParserClass usually need to use a similar delimiter set. They also need to account for the anchor and separator characters described below.
With the given example, each key/value pair would be stored as "\u0002value\u001ekey\u0003". Prefix and suffix matches on the key or value need to be adjusted accordingly in the luceneQueryParserClass.
Allowing update on an existing instance config: PUT /instances/{instanceName} with Instance object as the pay-load (#PR4952)
Add PinotServiceManager to start Pinot components (#PR5266)
Support for protocol buffers input format. (#PR5293)
Add GenericTransformFunction wrapper for simple ScalarFunctions () — Adding support to invoke any scalar function via GenericTransformFunction
Add Support for SQL CASE Statement ()
Support distinctCountRawThetaSketch aggregation that returns serialized sketch. ()
Add multi-value support to SegmentDumpTool () — add segment dump tool as part of the pinot-tool.sh script
Add json_format function to convert json object to string during ingestion. () — Can be used to store complex objects as a json string (which can later be queries using jsonExtractScalar)
Support escaping single quote for SQL literal () — This is especially useful for DistinctCountThetaSketch because it stores expression as literal E.g. DistinctCountThetaSketch(..., 'foo=''bar''', ...)
Support expression as the left-hand side for BETWEEN and IN clause ()
Add a new field IngestionConfig in TableConfig — FilterConfig: ingestion level filtering of records, based on filter function. () — TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release ().
Allow star-tree creation during segment load () — Introduced a new boolean config enableDynamicStarTreeCreation in IndexingConfig to enable/disable star-tree creation during segment load.
Support for Pinot clients using JDBC connection ()
Support customized accuracy for distinctCountHLL, distinctCountHLLMV functions by adding log2m value as the second parameter in the function. () —Adding cluster config: default.hyperloglog.log2m to allow user set default log2m value.
Add segment encryption on Controller based on table config ()
Add a constraint to the message queue for all instances in Helix, with a large default value of 100000. ()
Support order-by aggregations not present in SELECT () — Example: "select subject from transcript group by subject order by count() desc" This is equivalent to the following query but the return response should not contain count(). "select subject, count() from transcript group by subject order by count() desc"
Add geo support for Pinot queries () — Added geo-spatial data model and geospatial functions
Cluster Manager UI & Query Console UI revamp ( and ) — updated cluster manage UI and added table details page and segment details page
Add Controller API to explore Zookeeper ()
Support BYTES type for distinctCount and group-by ( and ) —Add BYTES type support to DistinctCountAggregationFunction —Correctly handle BYTES type in DictionaryBasedAggregationOperator for DistinctCount
Support for ingestion job spec in JSON format ()
Improvements to RealtimeProvisioningHelper command () — Improved docs related to ingestion and plugins
Added GROOVY transform function UDF () — Ability to run a groovy script in the query as a UDF. e.g. string concatenation: SELECT GROOVY('{"returnType": "INT", "isSingleValue": true}', 'arg0 + " " + arg1', columnA, columnB) FROM myTable
Special notes
Changed the stream and metadata interface (PR#5542) — This PR concludes the work for the issue #5359 to extend offset support for other streams
TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
Config key enable.case.insensitive.pql in Helix cluster config is deprecated, and replaced with enable.case.insensitive. ()
Change default segment load mode to MMAP. () —The load mode for segments currently defaults to heap.
Major Bug fixes
Fix bug in distinctCountRawHLL on SQL path (#5494)
Fix backward incompatibility for existing stream implementations (#5549)
Fix backward incompatibility in StreamFactoryConsumerProvider (#5557)
Fix logic in isLiteralOnlyExpression. ()
Fix double memory allocation during operator setup ()
Allow segment download url in Zookeeper to be deep store uri instead of hardcoded controller uri ()
Fix a backward compatible issue of converting BrokerRequest to QueryContext when querying from Presto segment splits ()
Fix the issue that PinotSegmentToAvroConverter does not handle BYTES data type. ()
Backward Incompatible Changes
PQL queries with HAVING clause will no longer be accepted for the following reasons: (#PR5570) — HAVING clause does not apply to PQL GROUP-BY semantic where each aggregation column is ordered individually — The current behavior can produce inaccurate results without any notice — HAVING support will be added for SQL queries in the next release
Because of the standardization of the DistinctCountThetaSketch predicate strings, upgrade Broker before Server. The new Broker can handle both standard and non-standard predicate strings for backward-compatibility. (#PR5613)
Discover the segment component in Apache Pinot for efficient data storage and querying within Pinot clusters, enabling optimized data processing and analysis.
Pinot tables are stored in one or more independent shards called segments. A small table may be contained in a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ingestion). Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.
Pinot achieves this by breaking the data into smaller chunks known as segments (similar to shards/partitions in relational databases). Segments can be seen as time-based partitions.
A segment is a horizontal shard representing a chunk of table data with some number of rows. The segment stores data for all columns of the table. Each segment packs the data in a columnar fashion, along with the dictionaries and indices for the columns. The segment is laid out in a columnar format so that it can be directly mapped into memory for serving queries.
Columns can be single- or multi-valued, and the following types are supported: STRING, BOOLEAN, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, and BYTES. BIG_DECIMAL is supported only as a single-valued type.
Columns may be declared to be metric or dimension (or specifically a time dimension) in the schema. Columns can have default null values. For example, the default null value of an integer column can be 0. The default value for BYTES columns must be hex-encoded before it's added to the schema.
Pinot uses dictionary encoding to store values as dictionary IDs. Columns may be configured as "no-dictionary" columns, in which case raw values are stored. Dictionary IDs are encoded using the minimum number of bits for efficient storage (e.g., a column with a cardinality of 3 will use only 2 bits per dictionary ID).
A forward index is built for each column and compressed for efficient memory use. In addition, you can optionally configure inverted indices for any set of columns. Inverted indices take up more storage, but improve query performance. Specialized indexes like Star-Tree index are also supported. For more details, see .
Creating a segment
Once the table is configured, we can load some data. Loading data involves generating Pinot segments from raw data and pushing them to the Pinot cluster. Data can be loaded in batch mode or streaming mode. For more details, see the ingestion overview page.
Load data in batch
Prerequisites
Below are instructions to generate and push segments to Pinot via standalone scripts. For a production setup, you should use frameworks such as Hadoop or Spark. For more details on setting up data ingestion jobs, see
Job Spec YAML
To generate a segment, we first need to create a job spec YAML file. This file contains all the information regarding data format, input data location, and Pinot cluster coordinates. Note that this assumes the controller is RUNNING so that the table config and schema can be fetched. If not, you will have to configure the spec to point at their location. For the full configuration reference, see the ingestion job spec documentation.
Create and push segment
To create and push the segment in one go, use the following:
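A hedged sketch of that command, assuming a job spec file named ingestionJobSpec.yaml:

bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile ingestionJobSpec.yaml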
Sample Console Output
Alternately, you can separately create and then push, by changing the jobType to SegmentCreation or SegmentTarPush.
Templating Ingestion Job Spec
The Ingestion job spec supports templating with Groovy Syntax.
This is convenient if you want to generate one ingestion job template file and schedule it on a daily basis with extra parameters updated daily.
e.g. you could set inputDirURI with parameters to indicate the date, so that the ingestion job only processes the data for a particular date. Below is an example that templates the date for input and output directories.
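For instance, a hedged sketch of the templated directory settings in the job spec (the paths are placeholders):

inputDirURI: 'examples/rawdata/${year}/${month}/${day}'
outputDirURI: 'examples/segments/${year}/${month}/${day}'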
You can pass in arguments containing the values for ${year}, ${month}, and ${day} when kicking off the ingestion job: -values key1=value1 key2=value2 ...
This ingestion job only generates segments for date 2014-01-03
Load data in streaming
Prerequisites
Below is an example of how to publish sample data to your stream. As soon as data is available to the real-time stream, it starts getting consumed by the real-time servers.
Kafka
Run below command to stream JSON data into Kafka topic: flights-realtime
Dictionary index
When dealing with extensive datasets, it's common for values to be repeated multiple times. To enhance storage efficiency and reduce query latencies, we strongly recommend employing a dictionary index for repetitive data. This is the reason Pinot enables dictionary encoding by default, even though it is advisable to disable it for columns with high cardinality.
Influence on other indexes
In Pinot, dictionaries serve as both an index and actual encoding. Consequently, when dictionaries are enabled, the behavior or layout of certain other indexes undergoes modification. The relationship between dictionaries and other indexes is outlined in the following table:
Index
Conditional
Description
Configuration
Deterministically enable or disable dictionaries
Unlike many other indexes, dictionary indexes are enabled by default, under the assumption that the count of unique values will be significantly lower than the number of rows.
If this assumption does not hold true, you can deactivate the dictionary for a specific column by setting the disabled property to true within indexes.dictionary:
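A hedged sketch of that setting inside fieldConfigList (the column name is a placeholder):

"fieldConfigList": [
  {
    "name": "myHighCardinalityColumn",
    "indexes": {
      "dictionary": {
        "disabled": true
      }
    }
  }
]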
Alternatively, the encodingType property can be changed. For example:
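A hedged sketch of the equivalent change via encodingType (again, the column name is a placeholder):

"fieldConfigList": [
  {
    "name": "myHighCardinalityColumn",
    "encodingType": "RAW"
  }
]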
You may choose the option you prefer, but it's essential to maintain consistency, as Pinot will reject table configurations where the same column and index are defined in different locations.
Heuristically enable dictionaries
Most of the time, the domain expert who creates the table knows whether a dictionary will be useful. For example, a column with random values or public IPs will probably have a large cardinality, so it can immediately be targeted as raw encoded, while a column like employee id will have a small cardinality and can easily be recognized as a good dictionary candidate. But sometimes the decision may not be clear. To help in these situations, Pinot can be configured to heuristically create the dictionary depending on the actual values and a relation factor.
When this heuristic is enabled, Pinot calculates a saving factor for each candidate column: the ratio between the forward index size when encoded as raw and the size when encoded as a dictionary. If the saving factor for a candidate column is less than the configured saving ratio, the dictionary is not created.
In order to be considered as a candidate for the heuristic, a column must:
Be marked as dictionary encoded (columns marked as raw are always encoded as raw).
Be single valued (multi-valued columns are never considered by the heuristic).
Be of a fixed size type such as int, long, double, timestamp, etc. Variable size types like json, strings or bytes are never considered by the heuristic.
Optionally this feature can be applied only to metric columns, skipping dimension columns.
This functionality can be enabled within the indexingConfig object within the table configuration. The parameters that govern these heuristics are:
Parameter
Default
Description
It's important to emphasize that:
These parameters are configured for all columns within the table.
optimizeDictionary takes precedence over optimizeDictionaryForMetrics.
Parameters
Dictionaries can be configured with the following options
Parameter
Default
Description
Variable length dictionaries
The useVarLengthDictionary parameter only impacts columns whose values vary in the number of bytes they occupy. This includes column types that require a variable number of bytes, such as strings, bytes, or big decimals, and scenarios where not all values within a segment occupy the same number of bytes. For example, even though strings in general require a variable number of bytes, if a segment contains only the values "a", "b", and "c", Pinot will identify that all values in the segment can be represented with the same number of bytes.
By default, useVarLengthDictionary is set to false, which means Pinot will calculate the length of the largest value contained within the segment. This length will then be used for all values. This approach ensures that all values can be stored efficiently, resulting in faster access and a more compressed layout when the lengths of values are similar.
If your dataset includes a few very large values and a multitude of very small ones, it is advisable to instruct Pinot to utilize variable-length encoding by setting useVarLengthDictionary to true. When variable encoding is employed, Pinot is required to store the length of each entry. Consequently, the cost of storing an entry becomes its actual size plus an additional 4 bytes for the offset.
On-heap dictionaries
Dictionary data is always stored off-heap. In general, it is recommended to keep dictionaries that way. However, in cases where the cardinality is small, and the on-heap memory usage is acceptable, you can copy them into memory by setting the onHeap parameter to true.
Remember: On-heap dictionaries are not recommended.
On-heap dictionaries can slightly reduce latency but will significantly increase the heap memory used by Pinot and increase garbage collection times, which may result in out of memory issues.
When off-heap dictionaries are used, data is deserialized each time it is accessed. This isn't a problem with primitive types (such as int or long), but with complex types (like strings or bytes) it means new objects are created on each access. On-heap dictionaries solve this problem by keeping the data in memory in deserialized form, so no allocations are needed at query time.
However, on-heap dictionaries have a cost in terms of memory usage and that cost is proportional to the number of segments that are accessed concurrently. It is important to note that, as with all other indexes, the dictionary scope is limited to segments. This means that if we have a table with 1,000 segments and a dictionary for a column, we may have 1,000 dictionaries in memory. This can be a waste of memory in cases where unique values are repeated across segments. To solve this problem, Pinot can retain a cache of the dictionary values and reuse them across segments. This cache is not shared between different tables or columns and its maximum size is controlled by the dictionary.intern.capacity option.
Only string and byte columns can be interned. Pinot ignores the intern configuration when used on columns with a different data type.
Here's an example of configuring a dictionary to use on-heap dictionaries with intern mode enabled:
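A hedged sketch of such a configuration; the column name and intern capacity are placeholders:

"fieldConfigList": [
  {
    "name": "myLowCardinalityStringColumn",
    "indexes": {
      "dictionary": {
        "onHeap": true,
        "intern": {
          "capacity": 32000
        }
      }
    }
  }
]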
Running Pinot in Docker
This guide will show you how to run a Pinot cluster using Docker.
Get started setting up a Pinot cluster with Docker using the guide below.
Prerequisites:
Install Docker.
Configure Docker memory with the following minimum resources:
Quick Start Examples
This section describes quick start commands that launch all Pinot components in a single process.
Pinot ships with QuickStart commands that launch Pinot components in a single process and import pre-built datasets. These quick start examples are a good place to start if you're just getting started with Pinot. The examples begin with the Batch Processing example, after the following notes:
Prerequisites
You must have either a local Pinot installation or Docker. The examples are available for each option and work the same. The decision of which to choose depends on your installation preference and how you generally like to work. If you don't know which to choose, using Docker will make your cleanup easier after you are done with the examples.
Ingest streaming data from Apache Pulsar
This guide shows you how to ingest a stream of records from an Apache Pulsar topic into a Pinot table.
Pinot supports consuming data from Apache Pulsar via the pinot-pulsar plugin. You need to enable this plugin so that the Pulsar-specific libraries are present in the classpath.
Enable the Pulsar plugin with the following config at the time of Pinot setup: -Dplugins.include=pinot-pulsar
Geospatial
This page talks about geospatial support in Pinot.
Pinot supports SQL/MM geospatial data and is compliant with the . This includes:
Geospatial data types, such as point, line and polygon;
Geospatial functions, for querying of spatial properties and relationships.
The Docker-based examples on this page use pinot:latest, which instructs Docker to pull and use the most recent release of Apache Pinot. If you prefer to use a specific release instead, you can designate it by replacing latest with the release number, like this: pinot:0.12.1.
The local install-based examples that are run using the launcher scripts will use the Apache Pinot version you installed.
Stopping a running example
To stop a running example, enter Ctrl+C in the same terminal where you ran the docker run command to start the example.
macOS Monterey Users
By default the Airplay receiver server runs on port 7000, which is also the port used by the Pinot Server in the Quick Start. You may see the following error when running these examples:
If you disable the Airplay receiver server and try again, you shouldn't see this error message anymore.
Batch Processing
This example demonstrates how to do batch processing with Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the baseballStats table
Launches a standalone data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Batch JSON
This example demonstrates how to import and query JSON documents in Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the githubEvents table
Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Batch with complex data types
This example demonstrates how to do batch processing in Pinot where the data items have complex fields that need to be unnested. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the githubEvents table
Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Streaming
This example demonstrates how to do stream processing with Pinot. The command:
Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot
Issues sample queries to Pinot
Streaming with minion cleanup
This example demonstrates how to do stream processing in Pinot with RealtimeToOfflineSegmentsTask and MergeRollupTask minion tasks continuously optimizing segments as data gets ingested. The command:
Publishes data to a Kafka topic githubEvents that is subscribed to by Pinot.
Issues sample queries to Pinot
Streaming with complex data types
This example demonstrates how to do stream processing in Pinot where the stream contains items that have complex fields that need to be unnested. The command:
Launches a standalone data ingestion job that builds segments under a given directory of Avro files for the airlineStats table and pushes the segments to the Pinot Controller.
Launches a stream of flights stats
Publishes data to a Kafka topic airlineStatsEvents that is subscribed to by Pinot.
Issues sample queries to Pinot
Join
This example demonstrates how to do joins in Pinot using the Lookup UDF. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server in the same container.
Creates the baseballStats table
Launches a data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.
Creates the dimBaseballTeams table
Launches a data ingestion job that builds one segment for a given CSV data file for the dimBaseballTeams table and pushes the segment to the Pinot Controller.
The quick start scripts launch Pinot with minimal resources. If you want to play with bigger datasets (more than a few MB), you can launch each of the Pinot components individually.
Note that these are sample configurations to be used as references. You will likely want to customize them to meet your needs for production use.
Docker
Create a Network
Create an isolated bridge network in docker
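For example, assuming the network name pinot-demo:

docker network create -d bridge pinot-demo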
Export Docker Image tags
Export the necessary docker image tags for Pinot, Zookeeper, and Kafka.
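A hedged sketch using the image versions shown in the sample console output below; the variable names are assumptions:

export PINOT_IMAGE=apachepinot/pinot:1.2.0
export ZOOKEEPER_IMAGE=zookeeper:3.9.2
export KAFKA_IMAGE=bitnami/kafka:3.6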
Start Zookeeper
Start Zookeeper in daemon mode. This is a single node zookeeper setup. Zookeeper is the central metadata store for Pinot and should be set up with replication for production use. For more information, see Running Replicated Zookeeper.
Start Pinot Controller
Start Pinot Controller in daemon mode and connect to Zookeeper.
The command below expects a 4GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.
Start Pinot Broker
Start Pinot Broker in daemon mode and connect to Zookeeper.
The command below expects a 4GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.
Start Pinot Server
Start Pinot Server in daemon mode and connect to Zookeeper.
The command below expects a 16GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.
Start Kafka
Optionally, you can also start Kafka for setting up real-time streams. This brings up the Kafka broker on port 9092.
Now all Pinot-related components are started as an empty cluster.
Run the below command to check container status:
Sample Console Output
Docker Compose
Export Docker Image tags
Optionally, export the necessary docker image tags for Pinot, Zookeeper, and Kafka.
Create docker-compose.yml file
Create a file called docker-compose.yml that contains the following:
Launch the components
Run the following command to launch all the required components:
OR, optionally, run the following command to launch all the components, including kafka:
Run the below command to check the container status:
Sample Console Output
Once your cluster is up and running, see Exploring Pinot to learn how to run queries against the data.
Failed to start a Pinot [SERVER]
java.lang.RuntimeException: java.net.BindException: Address already in use
at org.apache.pinot.core.transport.QueryServer.start(QueryServer.java:103) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
at org.apache.pinot.server.starter.ServerInstance.start(ServerInstance.java:158) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:110) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da2113
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
accc70bc7f07 bitnami/kafka:3.6 "/opt/bitnami/script…" About a minute ago Up About a minute 0.0.0.0:9092->9092/tcp kafka
1b8b80395959 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" About a minute ago Up About a minute 8096-8097/tcp, 8099/tcp, 9000/tcp, 0.0.0.0:8098->8098/tcp pinot-server
134a67eec957 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" About a minute ago Up About a minute 8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp pinot-broker
4fcc72cb7302 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" About a minute ago Up About a minute 8096-8099/tcp, 0.0.0.0:9000->9000/tcp pinot-controller
144304524f6c zookeeper:3.9.2 "/docker-entrypoint.…" About a minute ago Up About a minute 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp pinot-zookeeper
export KAFKA_REPLICAS=1
docker compose --project-name pinot-demo up
docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f34a046ac69f bitnami/kafka:3.6 "/opt/bitnami/script…" 9 minutes ago Up About a minute (healthy) 0.0.0.0:9092->9092/tcp kafka
f28021bd5b1d apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" 18 minutes ago Up About a minute (healthy) 8096-8097/tcp, 8099/tcp, 9000/tcp, 0.0.0.0:8098->8098/tcp pinot-server
e938453054b0 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" 18 minutes ago Up About a minute (healthy) 8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp pinot-broker
e0d0c71303a8 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" 18 minutes ago Up About a minute (healthy) 8096-8099/tcp, 0.0.0.0:9000->9000/tcp pinot-controller
4be5f168f252 zookeeper:3.9.2 "/docker-entrypoint.…" 18 minutes ago Up About a minute (healthy) 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp pinot-zookeeper
FST
Requires dictionary.
Incompatible with dictionary.
Not indexed by text index or JSON index (as they are only useful when cardinality is very large).
optimizeDictionary
false
Enables the heuristic for all columns and activates some extra rules.
optimizeDictionaryForMetrics
false
Enables the heuristic for metric columns.
noDictionarySizeRatioThreshold
0.85
The saving ratio used in the heuristics.
onHeap
false
Specifies whether the index should be loaded on heap or off heap.
useVarLengthDictionary
false
Determines how to store variable-length values.
intern
empty object
Configuration for interning. Only for on-heap dictionaries. Read about that below.
intern.capacity
Disables dictionary.
null
The pinot-pulsar plugin is not part of the official 0.10.0 binary. You can download the plugin separately and add it to the libs or plugins directory in Pinot.
Set up Pulsar table
Here is a sample Pulsar stream config. You can use the streamConfigs section from this sample and make changes for your corresponding table.
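A hedged sketch of such a streamConfigs section; the topic name, broker URL, and consumer factory class are placeholders to verify against your setup:

"streamConfigs": {
  "streamType": "pulsar",
  "stream.pulsar.topic.name": "my-topic",
  "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
  "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
  "stream.pulsar.metadata.populate": "true"
}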
Pulsar configuration options
You can change the following Pulsar-specific configurations for your tables:
Property
Description
streamType
This should be set to "pulsar"
stream.pulsar.topic.name
Your pulsar topic name
stream.pulsar.bootstrap.servers
Comma-separated broker list for Apache Pulsar
stream.pulsar.metadata.populate
Set to true to populate metadata.
stream.pulsar.metadata.fields
Set to a comma-separated list of metadata fields to extract.
Authentication
The Pinot-Pulsar connector supports authentication using security tokens. To generate a token, follow the instructions in Pulsar documentation. Once generated, add the following property to streamConfigs to add an authentication token for each request:
OAuth2 Authentication
The Pinot-Pulsar connector supports authentication using OAuth2, for example, if connecting to a StreamNative Pulsar cluster. For more information, see how to Configure OAuth2 authentication in Pulsar clients. Once configured, you can add the following properties to streamConfigs:
TLS support
The Pinot-pulsar connector also supports TLS for encrypted connections. You can follow the official pulsar documentation to enable TLS on your pulsar cluster. Once done, you can enable TLS in pulsar connector by providing the trust certificate file location generated in the previous step.
Also, make sure to change the broker URL from pulsar://localhost:6650 to pulsar+ssl://localhost:6650 so that secure connections are used.
Pinot currently relies on Pulsar client version 2.7.2. Make sure the Pulsar broker is compatible with this client version.
Extract record headers as Pinot table columns
Pinot's Pulsar connector supports automatically extracting record headers and metadata into the Pinot table columns. Pulsar supports a large amount of per-record metadata. Reference the official Pulsar documentation for the meaning of the metadata fields.
The following table shows the mapping for record header/metadata to Pinot table column names:
Pulsar Message field → Pinot table column (comments; availability):
key (String) → __key (String). Available by default.
properties (Map<String, String>) → each header key is listed as a separate column: __header$HeaderKeyName (String). Available by default.
publishTime (Long) → __metadata$publishTime (String). Available by default.
To enable metadata extraction in a Pulsar table, set the stream config metadata.populate to true. The fields eventTime, publishTime, brokerPublishTime, and key are populated by default. If you would like to extract additional fields from the Pulsar Message, populate the metadataFields config with a comma-separated list of fields. The fields are referenced by the field name in the Pulsar Message. For example, setting:
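(a sketch; the exact field names accepted by the metadataFields config may vary by release)
"stream.pulsar.metadata.populate": "true",
"stream.pulsar.metadata.fields": "messageId,messageBytes,eventTime,topicName"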
will make the __metadata$messageId, __metadata$messageBytes, __metadata$eventTime, and __metadata$topicName fields available for mapping to columns in the Pinot schema.
In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.
For example, to add only the message key and one metadata field as dimension columns in your Pinot table, list them in the schema as follows:
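A partial sketch of the schema's dimensionFieldSpecs (illustrated here with the message key and the message ID column from the mapping above):
"dimensionFieldSpecs": [
  {"name": "__key", "dataType": "STRING"},
  {"name": "__metadata$messageId", "dataType": "STRING"}
]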
Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.
Geospatial indexing, used for efficient processing of spatial operations
Geospatial data types
Geospatial data types abstract and encapsulate spatial structures such as boundary and dimension. In many respects, spatial data types can be understood simply as shapes. Pinot supports the Well-Known Text (WKT) and Well-Known Binary (WKB) forms of geospatial objects, for example:
It is common to have data in which the coordinates are geographic, that is, latitude/longitude. Unlike coordinates in Mercator or UTM, geographic coordinates are not Cartesian coordinates.
Geographic coordinates do not represent a linear distance from an origin as plotted on a plane. Rather, these spherical coordinates describe angular coordinates on a globe.
Spherical coordinates specify a point by the angle of rotation from a reference meridian (longitude), and the angle from the equator (latitude).
You can treat geographic coordinates as approximate Cartesian coordinates and continue to do spatial calculations. However, measurements of distance, length and area will be nonsensical. Since spherical coordinates measure angular distance, the units are in degrees.
Pinot supports both geometry and geography types, which can be constructed with the corresponding constructor functions. For geography types, measurement functions such as ST_Distance and ST_Area calculate the spherical distance and area on Earth, respectively.
Geospatial functions
For manipulating geospatial data, Pinot provides a set of functions for analyzing geometric components, determining spatial relationships, and manipulating geometries. In particular, geospatial functions that begin with the ST_ prefix support the SQL/MM specification.
The following geospatial functions are available out of the box in Pinot:
Aggregations
ST_Union(geometry[] g1_array) → Geometry: This aggregate function returns a MULTI geometry or NON-MULTI geometry from a set of geometries. It ignores NULL geometries.
ST_Area(Geometry/Geography g) → double For geometry type, it returns the 2D Euclidean area of a geometry. For geography, returns the area of a polygon or multi-polygon in square meters using a spherical model for Earth.
ST_Distance(Geometry/Geography g1, Geometry/Geography g2) → double For geometry type, returns the 2-dimensional Cartesian minimum distance (based on spatial ref) between two geometries in projected units. For geography, returns the great-circle distance in meters between two SphericalGeography points. Note that g1 and g2 must have the same type.
ST_Contains(Geometry/Geography, Geometry/Geography) → boolean Returns true if and only if no points of the second geometry/geography lie in the exterior of the first geometry/geography, and at least one point of the interior of the first geometry lies in the interior of the second geometry. Warning: ST_Contains on Geography only gives a close approximation.
ST_Equals(Geometry, Geometry) → boolean Returns true if the given geometries represent the same geometry/geography.
ST_Within(Geometry, Geometry) → boolean Returns true if the first geometry is completely inside the second geometry.
Geospatial index
Geospatial functions are typically expensive to evaluate, and using geoindex can greatly accelerate the query evaluation. Geoindexing in Pinot is based on Uber’s H3, a hexagon-based hierarchical gridding.
A given geospatial location (longitude, latitude) maps to one hexagon (represented as an H3Index), and its neighbors in H3 can be approximated by a ring of hexagons. To quickly identify the distance between any two geospatial locations, we can convert the two locations to their H3 indexes and then check the H3 distance between them. H3 distance is measured as the number of hexagons.
For example, in the diagram below, the red hexagons are within a distance of 1 from the central hexagon. The size of the hexagons is determined by the resolution of the indexing. Check this table for the levels of resolution and the corresponding precision (measured in km).
Hexagonal grid in H3
How to use geoindex
To use the geoindex, first declare the geolocation field as bytes in the schema, as in the QuickStart example.
Note the use of a transformFunction that converts the created point into SphericalGeography format, which is needed by the ST_Distance function.
Next, declare the H3 index in the table configuration. It is recommended to do this using the indexes section:
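A sketch of the recommended form, using the unified indexes block inside fieldConfigList (the resolutions shown are illustrative):
"fieldConfigList": [
  {
    "name": "location_st_point",
    "encodingType": "RAW",
    "indexes": {
      "h3": {
        "resolutions": [13, 5, 6]
      }
    }
  }
]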
Alternatively, the older way to configure H3 indexes is still supported:
The query below will use the geoindex to filter the Starbucks stores within 5km of the given point in the Bay Area.
How geoindex works
The Pinot geoindex accelerates query evaluation while maintaining accuracy. Currently, geoindex supports the ST_Distance function in the WHERE clause.
At a high level, the geoindex is used to retrieve the records within the nearby hexagons of the given location, and then ST_Distance is used to accurately filter the matched results.
Geoindex example
As in the example diagram above, if we want to find all relevant points within a given distance around San Francisco (area within the red circle), then the algorithm with geoindex will:
First, find the H3 distance x that covers the query range (the red circle).
Then, for the points within that H3 distance (those covered by hexagons completely inside kRing(x)), directly accept those points without filtering.
Finally, for the points contained in the hexagons of kRing(x) that intersect the edge of the red circle, filter them by evaluating the condition ST_Distance(loc1, loc2) < x to keep only those that are within the circle.
Understand how the components of Apache Pinot™ work together to create a scalable OLAP database that can deliver low-latency, high-concurrency queries at scale.
Apache Pinot™ is a distributed OLAP database designed to serve real-time, user-facing use cases, which means handling large volumes of data and many concurrent queries with very low query latencies. Pinot supports the following requirements:
Ultra low-latency queries (as low as 10ms P95)
High query concurrency (as many as 100,000 queries per second)
High data freshness (streaming data available for query immediately upon ingestion)
Large data volume (up to petabytes)
Distributed design principles
To accommodate large data volumes with stringent latency and concurrency requirements, Pinot is designed as a distributed database that supports the following requirements:
Highly available: Pinot has no single point of failure. When tables are configured for replication and a node goes down, the cluster is able to continue processing queries.
Horizontally scalable: Operators can scale a Pinot cluster by adding new nodes when the workload increases. There are even two node types (servers and brokers) to scale query volume, query complexity, and data size independently.
Immutable data
Core components
As described in the Pinot architecture, Pinot has four node types: controller, broker, server, and minion.
Apache Helix and ZooKeeper
Distributed systems do not maintain themselves, and in fact require sophisticated scheduling and resource management to function. Pinot uses Apache Helix for this purpose. Helix exists as an independent project, but it was designed by the original creators of Pinot for Pinot's own cluster management purposes, so the architectures of the two systems are well-aligned. Helix takes the form of a process on the controller, plus embedded agents on the brokers and servers. It uses Apache ZooKeeper as a fault-tolerant, strongly consistent, durable state store.
Helix maintains a picture of the intended state of the cluster, including the number of servers and brokers, the configuration and schema of all tables, connections to streaming ingest sources, currently executing batch ingestion jobs, the assignment of table segments to the servers in the cluster, and more. All of these configuration items are potentially mutable quantities, since operators routinely change table schemas, add or remove streaming ingest sources, begin new batch ingestion jobs, and so on. Additionally, physical cluster state may change as servers and brokers fail or suffer network partition. Helix works constantly to drive the actual state of the cluster to match the intended state, pushing configuration changes to brokers and servers as needed.
There are three physical node types in a Helix cluster:
Participant: These nodes do things, like store data or perform computation. Participants host resources, which are Helix's fundamental storage abstraction. Because Pinot servers store segment data, they are participants.
Spectator: These nodes see things, observing the evolving state of the participants through events pushed to the spectator. Because Pinot brokers need to know which servers host which segments, they are spectators.
In addition, Helix defines two logical components to express its storage abstraction:
Partition. A unit of data storage that lives on at least one participant. Partitions may be replicated across multiple participants. A Pinot segment is a partition.
Resource. A logical collection of partitions, providing a single view over a potentially large set of data stored across a distributed system. A Pinot table is a resource.
In summary, the Pinot architecture maps onto Helix components as follows:
Pinot Component
Helix Component
Helix uses ZooKeeper to maintain cluster state. ZooKeeper sends Helix spectators notifications of changes in cluster state (which correspond to changes in ZNodes). ZooKeeper stores the following information about the cluster:
Resource
Stored Properties
Because ZooKeeper is a first-class citizen of a Pinot cluster, you may use its well-known ZNode structure for operations and troubleshooting purposes. Be advised that this structure can change in future Pinot releases.
Controller
The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time and offline data). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
Fault tolerance
Only one controller can be active at a time, so when multiple controllers are present in a cluster, they elect a leader. When that controller instance becomes unavailable, the remaining instances automatically elect a new leader. Leader election is achieved using Apache Helix. A Pinot cluster can serve queries without an active controller, but it can't perform any metadata-modifying operations, like adding a table or consuming a new segment.
Controller REST interface
The controller provides a REST interface that allows read and write access to all logical storage resources (e.g., servers, brokers, tables, and segments). See for more information on the web-based admin tool.
Broker
The broker's responsibility is to route queries to the appropriate server instances, or in the case of multi-stage queries, to compute a complete query plan and distribute it to the servers required to execute it. The broker collects and merges the responses from all servers into a final result, then sends the result back to the requesting client. The broker exposes an HTTP endpoint that accepts SQL queries in JSON format and returns the response in JSON.
Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured strategy for a table. The default strategy is to balance the query load across all available servers.
Advanced routing strategies are available, such as replica-aware routing, partition-based routing, and minimal server selection routing.
Query processing
Every query processed by a broker uses either the single-stage engine or the multi-stage engine. For single-stage queries, the broker does the following:
Computes query routes based on the routing strategy defined in the configuration.
Computes the list of segments to query on each server (see the segment routing discussion for further details on this process).
Sends the query to each of those servers for local execution against their segments.
For multi-stage queries, the broker performs the following:
Computes a query plan that runs on multiple sets of servers. The servers selected for the first stage are selected based on the segments required to execute the query, which are determined in a process similar to single-stage queries.
Sends the relevant portions of the query plan to one or more servers in the cluster for each stage of the query plan.
The servers that received query plans each execute their part of the query. For more details on this process, read about the multi-stage query engine.
Server
Servers host segments on locally attached storage and process queries on those segments. By convention, operators speak of "real-time" and "offline" servers, although there is no difference in the server process itself, or even its configuration, that distinguishes between the two. This is merely a convention reflected in the segment assignment strategy to confine the two different kinds of workloads to two groups of physical instances, since the performance-limiting factors differ between the two kinds of workloads. For example, offline servers might optimize for larger storage capacity, whereas real-time servers might optimize for memory and CPU cores.
Offline servers
Offline servers host segments created by ingesting batch data. The controller assigns these segments to the offline servers according to the table's replication factor and segment assignment strategy. Typically, the controller writes new segments to the deep store, and affected servers download the segment from the deep store. The controller then notifies brokers that a new segment exists and is available to participate in queries.
Because offline tables tend to have long retention periods, offline servers tend to scale based on the size of the data they store.
Real-time servers
Real-time servers ingest data from streaming sources, like Apache Kafka®, Apache Pulsar®, or AWS Kinesis. Streaming data ends up in conventional segment files just like batch data, but is first accumulated in an in-memory data structure known as a consuming segment. Each message consumed from a streaming source is written immediately to the relevant consuming segment, and is available for query processing from the consuming segment immediately, since consuming segments participate in query processing as first-class citizens. Consuming segments get flushed to disk periodically based on a completion threshold, which can be calculated by row count, ingestion time, or segment size. A flushed segment on a real-time table is called a completed segment, and is functionally equivalent to a segment created during offline ingest.
Real-time servers tend to be scaled based on the rate at which they ingest streaming data.
Minion
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function without minions, they are typically present to support routine tasks like ingesting batch data.
Data ingestion overview
Pinot tables exist in two varieties: offline (or batch) and real-time. Offline tables contain data from batch sources like CSV, Avro, or Parquet files, and real-time tables contain data from streaming sources like Apache Kafka®, Apache Pulsar®, or AWS Kinesis.
Offline (batch) ingest
Pinot ingests batch data using an ingestion job, which follows a process like this:
The job transforms a raw data source (such as a CSV file) into segments. This is a potentially complex process resulting in a file that is typically several hundred megabytes in size.
The job then transfers the segment file to the cluster's deep store and notifies the controller that a new segment exists.
The controller (in its capacity as a Helix controller) updates the ideal state of the cluster in its cluster metadata map.
Real-time ingest
Ingestion is established at the time a real-time table is created, and continues as long as the table exists. When the controller receives the metadata update to create a new real-time table, the table configuration specifies the source of the streaming input data—often a topic in a Kafka cluster. This kicks off a process like this:
The controller picks one or more servers to act as direct consumers of the streaming input source.
The controller creates consuming segments for the new table. It does this by creating an entry in the global metadata map for a new consuming segment for each of the real-time servers selected in step 1.
Through Helix functionality on the controller and the relevant servers, the servers proceed to create consuming segments in memory and establish a connection to the streaming input source. When this input source is Kafka, each server acts as a Kafka consumer directly, with no other components involved in the integration.
Ingestion FAQ
This page has a collection of frequently asked questions about ingestion with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
Data processing
What is a good segment size?
While Apache Pinot can work with segments of various sizes, for optimal use of Pinot, you want to get your segments sized in the 100MB to 500MB (un-tarred/uncompressed) range. Having too many (thousands or more) tiny segments for a single table creates overhead in terms of the metadata storage in Zookeeper as well as in the Pinot servers' heap. At the same time, having too few really large (GBs) segments reduces parallelism of query execution, as on the server side, the thread parallelism of query execution is at segment level.
Can multiple Pinot tables consume from the same Kafka topic?
Yes. Each table can be independently configured to consume from any given Kafka topic, regardless of whether there are other tables that are also consuming from the same Kafka topic.
If I add a partition to a Kafka topic, will Pinot automatically ingest data from this partition?
Pinot automatically detects new partitions in Kafka topics. It checks for new partitions whenever the RealtimeSegmentValidationManager periodic job runs and starts consumers for new partitions.
You can configure the interval for this job using the controller.realtime.segment.validation.frequencyPeriod property in the controller configuration.
Does Pinot support partition pruning on multiple partition columns?
Pinot supports multi-column partitioning for offline tables. Map multiple columns under tableIndexConfig.segmentPartitionConfig.columnPartitionMap. Pinot assigns the input data to each partition according to the partition configuration individually for each column.
The following example partitions the segment based on two columns, memberID and caseNumber. Note that each partition column is handled separately, so in this case the segment is partitioned on memberID (partition ID 1) and also partitioned on caseNumber (partition ID 2).
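A sketch of what the corresponding segmentPartitionConfig could look like (the partition counts are illustrative):
"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "memberID": {
        "functionName": "Murmur",
        "numPartitions": 2
      },
      "caseNumber": {
        "functionName": "Murmur",
        "numPartitions": 4
      }
    }
  }
}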
For multi-column partitioning to work, you must also set routing.segmentPrunerTypes as follows:
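A sketch of the routing section:
"routing": {
  "segmentPrunerTypes": ["partition"]
}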
How do I enable partitioning in Pinot when using Kafka stream?
Set up partitioner in the Kafka producer:
The partitioning logic in the stream should match the partitioning config in Pinot. Kafka uses murmur2, and the equivalent in Pinot is the Murmur function.
Set the partitioning configuration using the same column used in Kafka (see the segmentPartitionConfig example in the previous answer), and also set the partition-based segment pruner in the routing section.
To learn how partition works, see .
How do I store BYTES column in JSON data?
For JSON, you can use a hex encoded string to ingest BYTES.
How do I flatten my JSON Kafka stream?
See the function that can store a top-level JSON field as a STRING in Pinot.
Then you can use JSON functions at query time to extract fields from the JSON string.
NOTE
This works well if some of your fields are nested JSON, but most of your fields are top-level JSON keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as one column, which is not ideal.
How do I escape Unicode in my Job Spec YAML file?
To use explicit code points, you must double-quote (not single-quote) the string, and escape the code point via "\uHHHH", where HHHH is the four digit hex code for the character. See for more details.
Is there a limit on the maximum length of a string column in Pinot?
By default, Pinot limits the length of a String column to 512 bytes. If you want to overwrite this value, you can set the maxLength attribute in the schema as follows:
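For example, a field spec with an increased limit could look like this (the column name is illustrative):
{
  "name": "textColumn",
  "dataType": "STRING",
  "maxLength": 1000
}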
When are new events queryable when getting ingested into a real-time table?
Events are available to queries as soon as they are ingested. This is because events are instantly indexed in memory upon ingestion.
The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.
However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all its replicas are guaranteed to be consistent.
How to reset a CONSUMING segment stuck on an offset which has expired from the stream?
This typically happens if:
The consumer is lagging a lot.
The consumer was down (server down, cluster down), and the stream moved on, resulting in the offset not being found when the consumer comes back up.
In case of Kafka, to recover, set property "auto.offset.reset":"earliest" in the streamConfigs section and reset the CONSUMING segment. See for more details about the configuration.
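For example, within the table's streamConfigs (depending on your Pinot and Kafka plugin versions, the property may need the stream.kafka.consumer.prop. prefix):
"streamConfigs": {
  "auto.offset.reset": "earliest"
}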
You can also use the "Resume Consumption" endpoint with the "resumeFrom" parameter set to "smallest" (or "largest" if you want). See for more details.
Indexing
How to set inverted indexes?
Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. For more info on table configuration, see . For an example showing how to configure an inverted index, see .
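A minimal sketch of the relevant table config section (column names are illustrative):
"tableIndexConfig": {
  "invertedIndexColumns": ["purchasedProductId", "userId"]
}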
Applying inverted indexes to a table configuration will generate an inverted index for all new segments. To apply the inverted indexes to all existing segments, see the next question.
How to apply an inverted index to existing segments?
Add the columns you want to index to the tableIndexConfig-> invertedIndexColumns list. To update the table configuration use the Pinot Swagger API: .
Invoke the reload API: .
Once you've done that, you can check whether the index has been applied by querying the segment metadata API at . Don't forget to include the names of the columns on which you have applied the index.
The output from this API should look something like the following:
Can I retrospectively add an index to any segment?
Not all indexes can be retrospectively applied to existing segments.
If you want to add or change the sorted index column, or adjust the dictionary encoding of the default forward index, you will need to manually re-load any existing segments.
How to create star-tree indexes?
Star-tree indexes are configured in the table config under the tableIndexConfig -> starTreeIndexConfigs (list) and enableDefaultStarTree (boolean). See here for more about how to configure star-tree indexes:
The new segments will have star-tree indexes generated after applying the star-tree index configurations to the table configuration.
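A sketch of what such a configuration could look like (dimension and metric names are illustrative):
"tableIndexConfig": {
  "enableDefaultStarTree": false,
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "browser"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__impressions"],
      "maxLeafRecords": 10000
    }
  ]
}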
Handling time in Pinot
How does Pinot’s real-time ingestion handle out-of-order events?
Pinot does not require ordering of event timestamps. Out-of-order events are still consumed and indexed into the "currently consuming" segment. In a pathological case, if you have a two-day-old event come in "now", it will still be stored in the segment that is open for consumption "now". There is no strict time-based partitioning for segments, but star-tree indexes and hybrid tables will handle this as appropriate.
See the hybrid table documentation for more details about how hybrid tables handle this. Specifically, the time boundary is computed as max(OfflineTime) - 1 unit of granularity. Pinot does store the min-max time for each segment and uses it for pruning segments, so segments with multiple time intervals may not be perfectly pruned.
When generating star-tree indexes, the time column will be part of the star-tree, so the tree can still be efficiently queried for segments with multiple time intervals.
Why does a hybrid table use an offset, instead of max(OfflineTime), to determine the time boundary?
This lets you have a late event come in without building complex offline pipelines that perfectly partition your events by event timestamps. With this offset, even if your offline data pipeline produces segments with a maximum timestamp, Pinot will not use the offline dataset for that last chunk of segments. The expectation is that when you process the next time range of data offline, your data pipeline will include any late events.
Why are segments not strictly time-partitioned?
It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time intervals for the performance of hybrid tables and time-filtered data.
When generating offline segments, the segments are generated such that each segment only contains one time interval and is well partitioned by the time column.
Batch import example
Step-by-step guide for pushing your own data into the Pinot cluster
This example assumes you have set up your cluster using .
Preparing your data
Let's gather our data files and put them in pinot-quick-start/rawdata.
Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, ORC. If you don't have sample data, you can use this sample CSV.
Complex Type Examples
Additional examples that demonstrate handling of complex types.
Unnest Root Level Collection
In this example, we look at unnesting JSON records that are batched together under a single key at the root level. We make use of the complex-type handling configs to persist the individual student records as separate rows in Pinot.
0.6.0
This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals
Summary
This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals in GROUP BY and ORDER BY clause, array transform functions, adding push job type of segment metadata only mode, and some new APIs like updating instance tags, new health check endpoint. It also contains many key bug fixes. See details below.
The release was cut from the following commit:
and the following cherry-picks:
{
"fieldConfigList": [{
"name": "location_st_point",
"encodingType":"RAW", // this actually disables the dictionary
"indexTypes":["H3"],
"properties": {
"resolutions": "13, 5, 6" // Here resolutions must be a string with ints separated by commas
}
}],
...
}
SELECT address, ST_DISTANCE(location_st_point, ST_Point(-122, 37, 1))
FROM starbucksStores
WHERE ST_DISTANCE(location_st_point, ST_Point(-122, 37, 1)) < 5000
limit 1000
Additional Pulsar message metadata mappings (continuing the table above):
__metadata$publishTime (String): publish time as determined by the producer. Available by default.
brokerPublishTime (Optional) → __metadata$brokerPublishTime (String): publish time as determined by the broker. Available by default.
eventTime (Long) → __metadata$eventTime (String). Available by default.
messageId (MessageId, as String) → __metadata$messageId (String): string representation of the MessageId field, in the format ledgerId:entryId:partitionIndex.
messageId (MessageId, as bytes) → __metadata$messageBytes (String): Base64-encoded version of the bytes returned from calling MessageId.toByteArray().
Immutable data: Pinot assumes all stored data is immutable, which helps simplify the parts of the system that handle data storage and replication. However, Pinot still supports upserts on streaming entity data and background purges of data to comply with data privacy regulations.
Dynamic configuration changes: Operations like adding new tables, expanding a cluster, ingesting data, modifying an existing table, and adding indexes do not impact query availability or performance.
Controller: This node observes and manages the state of participant nodes. The controller is responsible for coordinating all state transitions in the cluster and ensures that state constraints are satisfied while maintaining cluster stability.
Receives the results from each server and merges them.
Sends the query result to the client.
The broker receives a complete result set from the final stage of the query, which is always a single server.
The broker sends the query result to the client.
The controller then assigns the segment to one or more "offline" servers (depending on replication factor) and notifies them that new segments are available.
The servers then download the newly created segments directly from the deep store.
The cluster's brokers, which watch for state changes as Helix spectators, detect the new segments and update their segment routing tables accordingly. The cluster is now able to query the new offline segments.
Through Helix functionality on the controller and all of the cluster's brokers, the brokers become aware of the consuming segments, and begin including them in query routing immediately.
The consuming servers simultaneously begin consuming messages from the streaming input source, storing them in the consuming segment.
When a server decides its consuming segment is complete, it commits the in-memory consuming segment to a conventional segment file, uploads it to the deep store, and notifies the controller.
The controller and the server create a new consuming segment to continue real-time ingestion.
The controller marks the newly committed segment as online. Brokers then discover the new segment through the Helix notification mechanism, allowing them to route queries to it in the usual fashion.
Pinot component → Helix component:
Segment → Helix Partition
Table → Helix Resource
Controller → Helix Controller, or Helix agent that drives the overall state of the cluster
Server → Helix Participant
Broker → A Helix Spectator that observes the cluster for changes in the state of segments and servers. To support multi-tenancy, brokers are also modeled as Helix Participants.
Minion → Helix Participant that performs computation rather than storing data
ZooKeeper stores the following properties for each resource:
Controller: the controller that is assigned as the current leader.
Servers and Brokers: the list of servers and brokers, the configuration of all current servers and brokers, and the health status of all current servers and brokers.
Tables: the list of tables, table configurations, table schema, and the list of each table's segments.
Segment: the exact server locations of a segment and the state of each segment (online/offline/error/consuming).
Schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.
Columns are categorized into three types:
Dimensions: Typically used in filters and group by, for slicing and dicing into data.
Metrics: Typically used in aggregations; represents the quantitative data.
Time: Optional column; represents the timestamp associated with each row.
In our example transcript schema, the studentID, firstName, lastName, gender, and subject columns are the dimensions, the score column is the metric, and timestampInEpoch is the time column.
Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the following reference.
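Based on the columns above, the transcript schema could look roughly like this (a sketch; the exact quickstart file may differ slightly):
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "STRING"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}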
Creating a table configuration
A table configuration is used to define the configuration related to the Pinot table. A detailed overview of the table can be found in Table.
Here's the table configuration for the sample CSV file. You can use this as a reference to build your own table configuration. Edit the tableName and schemaName.
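A minimal sketch of such an offline table configuration (assuming the transcript table and schema names; adjust replication and tenants to your cluster):
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "metadata": {}
}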
Uploading your table configuration and schema
Review the directory structure so far.
Upload the table configuration using the following command.
Use the Rest API that is running on your Pinot instance to review the table configuration and schema and make sure it was successfully uploaded. This link uses localhost as an example.
Creating a segment
Pinot table data is stored as Pinot segments. A detailed overview of segments can be found in Segment.
To generate a segment, first create a job specification (JobSpec) YAML file. A JobSpec YAML file contains all the information regarding data format, input data location, and Pinot cluster coordinates. Copy the following job specification file (example from the Pinot quickstart). If you're using your own data, be sure to do the following:
Replace transcript with your table name
Set the correct recordReaderSpec
Depending if you're using Docker or a launcher script, choose one of the following commands to generate a segment to upload to Pinot:
Here is some sample output.
Querying your data
If everything worked, find your table in the Query Console to run queries against it.
Allow modifying/removing existing star-trees during segment reload ()
Implement off-heap bloom filter reader ()
Support for multi-threaded Group By reducer for SQL. ()
Add OnHeapGuavaBloomFilterReader ()
Support using ordinals in GROUP BY and ORDER BY clause ()
Merge common APIs for Dictionary ()
Add table level lock for segment upload ([#6165])
Added recursive functions validation check for group by ()
Add StrictReplicaGroupInstanceSelector ()
Add IN_SUBQUERY support ()
Add IN_PARTITIONED_SUBQUERY support ()
Some UI features (, , , )
Special notes
Brokers should be upgraded before servers in order to maintain backward compatibility:
Change group key delimiter from '\t' to '\0' (#5858)
Support for exact distinct count for non int data types (#5872)
Pinot Components have to be deployed in the following order:
(PinotServiceManager -> Bootstrap services in role ServiceRole.CONTROLLER -> All remaining bootstrap services in parallel)
Starts Broker and Server in parallel when using ServiceManager ()
New settings introduced and old ones deprecated:
This aggregation function is still in beta version. This PR involves change on the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version:
$ ls /tmp/pinot-quick-start
rawdata transcript-schema.json transcript-table-offline.json
$ ls /tmp/pinot-quick-start/rawdata
transcript.csv
SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**\/*.csv
inputDirURI: /tmp/pinot-quick-start/rawdata/
jobType: SegmentCreationAndTarPush
outputDirURI: /tmp/pinot-quick-start/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader,
configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig,
configs: null, dataFormat: csv}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/transcript/schema', tableConfigURI: 'http://localhost:9000/tables/transcript',
tableName: transcript}
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed bytes value dictionary for column: studentID, size: 9
Created dictionary for STRING column: studentID with cardinality: 3, max length in bytes: 3, range: 200 to 202
Using fixed bytes value dictionary for column: firstName, size: 12
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Using fixed bytes value dictionary for column: lastName, size: 15
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Using fixed bytes value dictionary for column: gender, size: 12
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Using fixed bytes value dictionary for column: subject, size: 21
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Created dictionary for LONG column: timestampInEpoch with cardinality: 4, range: 1570863600000 to 1572418800000
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to v3 format
v3 segment location for segment: transcript_OFFLINE_1570863600000_1572418800000_0 is /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3
Deleting files in v1 segment directory: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]
Generated 3 star-tree records from 4 segment records
Finished constructing star-tree, got 9 tree nodes and 4 records under star-node
Finished creating aggregated documents, got 6 aggregated records
Finished building star-tree in 10ms
Finished building 1 star-trees in 27ms
Computed crc = 3454627653, based on files [/var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/columns.psf, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/index_map, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/metadata.properties, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index_map]
Driver, record read time : 0
Driver, stats collector time : 0
Driver, indexing time : 0
Tarring segment from: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz
Size for segment: transcript_OFFLINE_1570863600000_1572418800000_0, uncompressed: 6.73KB, compressed: 1.89KB
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/tmp/pinot-quick-start/segments/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz]... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@243c4f91] for table transcript
Pushing segment: transcript_OFFLINE_1570863600000_1572418800000_0 to location: http://localhost:9000 for table transcript
Sending request: http://localhost:9000/v2/segments?tableName=transcript to controller: nehas-mbp.hsd1.ca.comcast.net, version: Unknown
Response for pushing table transcript segment transcript_OFFLINE_1570863600000_1572418800000_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: transcript_OFFLINE_1570863600000_1572418800000_0 of table: transcript"}
How much heap should I allocate for my Pinot instances?
Typically, Apache Pinot components try to use as much off-heap (MMAP/DirectMemory) wherever possible. For example, Pinot servers load segments in memory-mapped files in MMAP mode (recommended), or direct memory in HEAP mode. Heap memory is used mostly for query execution and storing some metadata. We have seen production deployments with high throughput and low-latency work well with just 16 GB of heap for Pinot servers and brokers. The Pinot controller may also cache some metadata (table configurations etc) in heap, so if there are just a few tables in the Pinot cluster, a few GB of heap should suffice.
DR
Does Pinot provide any backup/restore mechanism?
Pinot relies on deep-storage for storing a backup copy of segments (offline as well as real-time). It relies on Zookeeper to store metadata (table configurations, schema, cluster state, and so on). It does not explicitly provide tools to take backups or restore these data, but relies on the deep-storage (ADLS/S3/GCP/etc), and ZK to persist these data/metadata.
Alter Table
Can I change a column name in my table, without losing data?
Changing a column name or data type is considered a backward-incompatible change. While Pinot does support schema evolution for backward-compatible changes, it does not support backward-incompatible changes like changing the name or data type of a column.
How to change number of replicas of a table?
You can change the number of replicas by updating the table configuration's segmentsConfig section. Make sure you have at least as many servers as the replication.
Note that if you are using replica groups, it's expected that this replication setting equals numReplicaGroups. If they do not match, Pinot will use numReplicaGroups.
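For example (the value is illustrative):
"segmentsConfig": {
  "replication": "3"
}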
How to set or change table retention?
By default, there is no retention set for a table in Apache Pinot. You may, however, set retention by setting the following properties in the segmentsConfig section inside the table config (see the example after the list):
retentionTimeUnit
retentionTimeValue
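A sketch of the retention settings (values are illustrative):
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30"
}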
Updating the retention value in the table config should be good enough; there is no need to rebalance the table or reload its segments.
Why does my real-time table not use the new nodes I added to the cluster?
Likely explanation: num partitions * num replicas < num servers.
In real-time tables, segments of the same partition always remain on the same node. This sticky assignment is needed for replica groups and is critical if using upserts. For instance, if you have 3 partitions, 1 replica, and 4 nodes, only 3 of the 4 nodes will be used: all of p0's segments will be on one node, all of p1's on another, and all of p2's on another. One server will be unused, and will remain unused through rebalances.
There's nothing we can do about CONSUMING segments; they will continue to use only 3 nodes if you have 3 partitions. But we can rebalance such that completed segments use all nodes. If you want to force the completed segments of the table to use the new server, use this config:
Segments
How to control the number of segments generated?
The number of segments generated depends on the number of input files. If you provide only one input file, you will get one segment. If you break up the input file into multiple files, you will get as many segments as there are input files.
What are the common reasons my segment is in a BAD state?
This typically happens when the server is unable to load the segment. Possible causes: out of memory, no disk space, inability to download the segment from the deep store, and other similar errors. Check the server logs for more information.
How to reset a segment when it runs into a BAD state?
Use the segment reset controller REST API to reset the segment:
What's the difference between Reset, Refresh, and Reload?
Reset: Gets a segment in ERROR state back to ONLINE or CONSUMING state. Behind the scenes, the Pinot controller takes the segment to the OFFLINE state, waits for External View to stabilize, and then moves it back to ONLINE or CONSUMING state, thus effectively resetting segments or consumers in error states.
Refresh: Replaces the segment with a new one, with the same name but often different data. Under the hood, the Pinot controller sets new segment metadata in ZooKeeper and notifies brokers and servers to check their local states about this segment and update accordingly. Servers also download the new segment to replace the old one when the two have different checksums. There is no separate REST API for refreshing; it is done as part of the SegmentUpload API.
Reload: Loads the segment again, often to generate a new index as updated in the table configuration. Under the hood, the Pinot server gets the new table configuration from ZooKeeper and uses it to guide the segment reloading. In fact, the last step of REFRESH as explained above is to load the segment into memory to serve queries. There is a dedicated REST API for reloading. By default, it doesn't download segments, but an option is provided to force the server to download the segment to replace the local one cleanly.
In addition, RESET brings the segment OFFLINE temporarily; while REFRESH and RELOAD swap the segment on server atomically without bringing down the segment or affecting ongoing queries.
Tenants
How can I make brokers/servers join the cluster without the DefaultTenant tag?
Set this property in your controller.conf file:
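A sketch of the controller.conf entry (this is the tenant isolation flag; verify the exact property name against your Pinot version):
cluster.tenant.isolation.enable=false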
Now your brokers and servers should join the cluster as broker_untagged and server_untagged. You can then directly use the POST /tenants API to create the tenants you want, as in the following:
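For example, a broker tenant creation payload could look like this (the tenant name and instance count are illustrative):
{
  "tenantRole": "BROKER",
  "tenantName": "sampleBrokerTenant",
  "numberOfInstances": 3
}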
Minion
How do I tune minion task timeout and parallelism on each worker?
There are two task configurations, but they are set as part of cluster configurations, like in the following example. One controls the task's overall timeout (1hr by default) and one sets how many tasks to run on a single minion worker (1 by default). The <taskType> is the task to tune, such as MergeRollupTask or RealtimeToOfflineSegmentsTask etc.
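A sketch of such cluster configs, assuming the property name patterns <taskType>.timeoutMs and <taskType>.numConcurrentTasksPerInstance (values are illustrative):
{
  "RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
  "RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
}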
Yes, replica groups work for real-time tables. There are two parts to enabling replica groups:
Replica groups segment assignment.
Replica group query routing.
Replica group segment assignment
Replica group segment assignment is achieved in real-time tables if the number of servers is a multiple of the number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups.
For example, consider we have 6 partitions, 2 replicas, and 4 servers. The assignment looks like this (partition: replica r1, replica r2):
p1: S0, S1
p2: S2, S3
p3: S0, S1
p4: S2, S3
As you can see, the set (S0, S2) contains r1 of every partition, and (S1, S3) contains r2 of every partition. The query will only be routed to one of the sets, and not span every server.
If you are adding or removing servers from an existing table setup, you have to run a rebalance for segment assignment changes to take effect.
Replica group query routing
Once replica group segment assignment is in effect, query routing can take advantage of it. For replica-group-based query routing, set the following in the table config's routing section, and then restart the brokers:
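A sketch of the routing section:
"routing": {
  "instanceSelectorType": "replicaGroup"
}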
Overwrite index configs at tier level
When using tiered storage, you may want to use different encoding and indexing types for a column in different tiers to balance query latency and cost savings more flexibly. For example, segments in the hot tier can use dictionary encoding, bloom filters, and all kinds of relevant index types for very fast query execution. But for segments in the cold tier, where cost savings matter more than low query latency, you may want to use raw values and bloom filters only.
The following two examples show how to overwrite encoding type and index configs for tiers. Similar changes are also demonstrated in the MultiDirQuickStart example.
Overwriting single-column index configs using fieldConfigList. All top level fields in FieldConfig class can be overwritten, and fields not overwritten are kept intact.
Overwriting star-tree index configurations using tableIndexConfig. The StarTreeIndexConfigs is overwritten as a whole. In fact, all top level fields defined in IndexingConfig class can be overwritten, so single-column index configs defined in tableIndexConfig can also be overwritten but it's less clear than using fieldConfigList.
Credential
How do I update credentials for real-time upstream without downtime?
Explore the table component in Apache Pinot, a fundamental building block for organizing and managing data in Pinot clusters, enabling effective data processing and analysis.
Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Every row has the same columns, whose names and data types are defined in the table's schema.
Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
Pinot table types include:
real-time: Ingests data from a streaming source like Apache Kafka®
offline: Loads data from a batch source
hybrid: Loads data from both a batch source and a streaming source
Pinot breaks a table into multiple segments and stores these segments in a deep store such as Hadoop Distributed File System (HDFS) as well as on Pinot servers.
In the Pinot cluster, a table is modeled as a Helix resource, and each segment of a table is modeled as a Helix partition.
Table naming in Pinot follows typical naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.
The user querying the database does not need to know the type of the table. They only need to specify the table name in the query.
For example, regardless of whether we have an offline table myTable_OFFLINE, a real-time table myTable_REALTIME, or a hybrid table containing both of these, the query simply references the table as myTable.
A table configuration is used to define the table properties, such as name, type, indexing, routing, and retention. It is written in JSON format and is stored in ZooKeeper, along with the table schema.
Use the following properties to make your tables faster or leaner:
Segment
Indexing
Tenants
Segments
A table is comprised of small chunks of data known as segments. Learn more about how Pinot creates and manages segments .
For offline tables, segments are built outside of Pinot and uploaded using a distributed executor such as Spark or Hadoop. For details, see .
For real-time tables, segments are built at specific intervals inside Pinot. You can tune the following for real-time segments.
Flush
The Pinot real-time consumer ingests the data, creates the segment, and then flushes the in-memory segment to disk. Pinot allows you to configure when to flush the segment in the following ways (a sample configuration follows the list):
Number of consumed rows: After consuming the specified number of rows from the stream, Pinot will persist the segment to disk.
Number of rows per segment: Pinot learns and then estimates the number of rows that need to be consumed. The learning phase starts by setting the number of rows to 100,000 (this value can be changed) and adjusts it to reach the appropriate segment size. Because Pinot corrects the estimate as it goes along, the segment size might go significantly over the correct size during the learning phase. You should set this value to optimize the performance of queries.
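These thresholds are typically set in the table's streamConfigs; a sketch (values are illustrative):
"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "0",           // 0 lets Pinot estimate the row count to reach the target segment size
  "realtime.segment.flush.threshold.segment.size": "200M",
  "realtime.segment.flush.threshold.time": "6h"
}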
Replicas
A segment can have multiple replicas to provide higher availability. You can configure the number of replicas for a table segment .
Completion Mode
By default, if the in-memory segment on the non-winner server is equivalent to the committed segment, then the non-winner server builds and replaces the segment. If the available segment is not equivalent to the committed segment, the server just downloads the committed segment from the controller.
However, in certain scenarios, the segment build can get very memory-intensive. In these cases, you might want to enforce the non-committer servers to just download the segment from the controller instead of building it again. You can do this by setting completionMode: "DOWNLOAD" in the table configuration.
For details, see .
Download Scheme
A Pinot server might fail to download segments from the deep store, such as HDFS, after segment completion. However, you can configure servers to download these segments from peer servers instead of the deep store. Currently, only HTTP and HTTPS download schemes are supported. More methods, such as gRPC/Thrift, are planned to be added in the future.
For more details about peer segment download during real-time ingestion, refer to this design doc on
Indexing
You can create multiple indices on a table to increase the performance of the queries. The following types of indices are supported:
Dictionary-encoded forward index with bit compression
Raw value forward index
For more details on each indexing mechanism and corresponding configurations, see .
Set up Bloom filters on columns to make queries faster. You can also keep segments in off-heap instead of on-heap memory for faster queries.
Pre-aggregation
Aggregate the real-time stream data as it is consumed to reduce segment sizes. We add the metric column values of all rows that have the same values for all dimension and time columns and create a single row in the segment. This feature is only available on REALTIME tables.
The only supported aggregation is SUM. The columns to pre-aggregate need to satisfy the following requirements:
All metrics should be listed in noDictionaryColumns.
No multi-value dimensions
All dimension columns are treated to have a dictionary, even if they appear as noDictionaryColumns in the config.
The following table config snippet shows an example of enabling pre-aggregation during real-time ingestion:
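A sketch of such a snippet (metric column names are illustrative):
"tableIndexConfig": {
  "noDictionaryColumns": ["metricColA", "metricColB"],
  "aggregateMetrics": true
}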
Tenants
Each table is associated with a tenant. A segment resides on the server, which has the same tenant as itself. For details, see .
Optionally, override if a table should move to a server with different tenant based on segment status. The example below adds a tagOverrideConfig under the tenants section for real-time tables to override tags for consuming and completed segments.
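A sketch of such a tenants section (tenant names are illustrative):
"tenants": {
  "broker": "brokerTenantName",
  "server": "serverTenantName",
  "tagOverrideConfig": {
    "realtimeConsuming": "serverTenantName_REALTIME",
    "realtimeCompleted": "serverTenantName_OFFLINE"
  }
}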
In the above example, the consuming segments will still be assigned to serverTenantName_REALTIME hosts, but once they are completed, the segments will be moved to serverTenantName_OFFLINE.
You can specify the full name of any tag in this section. For example, you could decide that completed segments for this table should be on Pinot servers tagged as allTables_COMPLETED. To learn more, see the section.
Hybrid table
A hybrid table is a table composed of two tables, one offline and one real-time, that share the same name. In a hybrid table, offline segments can be pushed periodically. The retention on the offline table can be set to a high value because segments are coming in on a periodic basis, whereas the retention on the real-time part can be small.
Once an offline segment is pushed to cover a recent time period, the brokers automatically switch to using the offline table for segments for that time period and use the real-time table only for data not available in the offline table.
To learn how time boundaries work for hybrid tables, see .
A typical use case for hybrid tables is pushing deduplicated, cleaned-up data into an offline table every day while consuming real-time data as it arrives. Data can remain in offline tables for as long as a few years, while the real-time data would be cleaned every few days.
Examples
Create a table config for your data, or see for all possible batch/streaming tables.
Prerequisites
Offline table creation
Sample console output
Check out the table config in the Rest API to make sure it was successfully uploaded.
Streaming table creation
Start Kafka
Create a Kafka topic
Create a streaming table
Sample output
Start Kafka-Zookeeper
Start Kafka
Check out the table config in the Rest API to make sure it was successfully uploaded.
Hybrid table creation
To create a hybrid table, you have to create the offline and real-time tables individually. You don't need to create a separate hybrid table.
0.7.1
This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations.
Summary
This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations and improvements.
It also adds several new APIs to better manage the segments and upload data to the offline table. It also contains many key bug fixes. See details below.
The release was cut from the following commit:
and the following cherry-picks:
Notable New Features
Add a server metric, queriesDisabled, to check whether queries are disabled. ()
Optimization on GroupKey to save the overhead of ser/de the group keys () ()
Support validation for jsonExtractKey
Special notes
Pinot controller metrics prefix is fixed to add a missing dot (). This is a backward-incompatible change; JMX queries on controller metrics must be updated.
Legacy group key delimiter (\t) was removed to be backward-compatible with release 0.5.0 ()
Upgrade zookeeper version to 3.5.8 to fix ZOOKEEPER-2184: Zookeeper Client should re-resolve hosts when connection attempts fail. ()
Major Bug fixes
Fix the SIGSEGV for large index ()
Handle creation of segments with 0 rows so segment creation does not fail if data source has 0 rows. ()
Fix QueryRunner tool for multiple runs ()
Running in Kubernetes
Pinot quick start in Kubernetes
Get started running Pinot in Kubernetes.
Note: The examples in this guide are sample configurations to be used as reference. For a production setup, you may want to customize them to your needs.
Ingest streaming data from Apache Kafka
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.
Learn how to ingest data from Kafka, a stream processing platform. You should have a local cluster up and running, following the instructions in .
Install and Launch Kafka
Let's start by downloading Kafka to our local machine.
To pull down the latest Docker image, run the following command:
Indexing
This page describes the indexing techniques available in Apache Pinot
Apache Pinot™ supports the following indexing techniques:
Input formats
This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.
Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent doing serialization-deserialization and speed up the ingestion.
Configuring input formats
To change the input format, adjust the recordReaderSpec config in the ingestion job specification.
Use the POST /cluster/configs API on the CLUSTER tab in Swagger, with this payload:
{
"<taskType>.timeoutMs": "600000",
"<taskType>.numConcurrentTasksPerInstance": "4"
}
Real Time Provisioning Helper tool improvement to take data characteristics as input instead of an actual segment (#6546)
Add the isolation level config isolation.level to Kafka consumer (2.0) to ingest transactionally committed messages only (#6580)
Enhance StarTreeIndexViewer to support multiple trees (#6569)
Improves ADLSGen2PinotFS with service principal based auth and auto-creation of the container on the initial run. It's backward compatible with key based auth. (#6531)
Add api for cluster manager to get table state (#6211)
Perf optimization for SQL GROUP BY ORDER BY (#6225)
Add support using environment variables in the format of ${VAR_NAME:DEFAULT_VALUE} in Pinot table configs. (#6271)
Add TLS-support for client-pinot and pinot-internode connections (#6418) Upgrades to a TLS-enabled cluster can be performed safely and without downtime. To achieve a live-upgrade, go through the following steps:
First, configure alternate ingress ports for https/netty-tls on brokers, controllers, and servers. Restart the components with a rolling strategy to avoid cluster downtime.
Second, verify manually that https access to controllers and brokers is live. Then, configure all components to prefer TLS-enabled connections (while still allowing unsecured access). Restart the individual components.
Third, disable insecure connections via configuration. You may also have to set controller.vip.protocol and controller.vip.port and update the configuration files of any ingestion jobs. Restart components a final time and verify that insecure ingress via http is not available anymore.
Apache Pinot has adopted SQL syntax and semantics. Legacy PQL (Pinot Query Language) is deprecated and no longer supported. Use SQL syntax to query Pinot on the broker endpoint /query/sql and the controller endpoint /sql.
Use URL encoding for the generated segment tar name to handle characters that cannot be parsed to URI. (#6571)
Fix a bug of miscounting the top nodes in StarTreeIndexViewer (#6569)
Fix the raw bytes column in real-time segment (#6574)
Fixes a bug to allow using JSON_MATCH predicate in SQL queries (#6535)
Fix the overflow issue when loading the large dictionary into the buffer (#6476)
The Pinot repository has pre-packaged Helm charts for Pinot and Presto. The Helm repository index file is here.
Note: Specify StorageClass based on your cloud vendor. Don't mount a blob store (such as AzureFile, GoogleCloudStorage, or S3) as the data serving file system. Use only Amazon EBS/GCP Persistent Disk/Azure Disk-style disks.
For AWS: "gp2"
For GCP: "pd-ssd" or "standard"
For Azure: "AzureDisk"
For Docker-Desktop: "hostpath"
1.1.1 Update Helm dependency
1.1.2 Start Pinot with Helm
Check Pinot deployment status
Load data into Pinot using Kafka
Bring up a Kafka cluster for real-time data ingestion
Check Kafka deployment status
Ensure the Kafka deployment is ready before executing the scripts in the following steps. Run the following command:
Below is an example output showing the deployment is ready:
Create Kafka topics
Run the scripts below to create two Kafka topics for data ingestion:
Load data into Kafka and create Pinot schema/tables
The script below does the following:
Ingests 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec
Ingests 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec
Uploads Pinot schema airlineStats
Creates Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime
Creates Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro
Query with the Pinot Data Explorer
Pinot Data Explorer
The following script (located at ./pinot/helm/pinot) performs local port forwarding, and opens the Pinot query console in your default web browser.
Query Pinot with Superset
Bring up Superset using Helm
Install the SuperSet Helm repository:
Get the Helm values configuration file:
For Superset to install Pinot dependencies, edit the /tmp/superset-values.yaml file to add a pinotdb pip dependency into the bootstrapScript field.
You can also build your own image with this dependency or use the image apachepinot/pinot-superset:latest instead.
Replace the default admin credentials inside the init section with a meaningful user profile and stronger password.
Install Superset using Helm:
Ensure your cluster is up by running:
Access the Superset UI
Run the below command to port forward Superset to your localhost:18088.
Navigate to Superset in your browser with the admin credentials you set in the previous section.
Create a new database connection with the following URI: pinot+http://pinot-broker.pinot-quickstart:8099/query?controller=http://pinot-controller.pinot-quickstart:9000/
Once the database is added, you can add more data sets and explore the dashboard options.
Access Pinot with Trino
Deploy Trino
Deploy Trino with the Pinot plugin installed:
See the charts in the Trino Helm chart repository:
In order to connect Trino to Pinot, you'll need to add the Pinot catalog, which requires extra configurations. Run the below command to get all the configurable values.
To add the Pinot catalog, edit the additionalCatalogs section by adding:
Pinot is deployed at namespace pinot-quickstart, so the controller serviceURL is pinot-controller.pinot-quickstart:9000
After modifying the /tmp/trino-values.yaml file, deploy Trino with:
Once you've deployed Trino, check the deployment status:
Query Pinot with the Trino CLI
Once Trino is deployed, run the below command to get a runnable Trino CLI.
Download the Trino CLI:
Port forward Trino service to your local if it's not already exposed:
Use the Trino console client to connect to the Trino service:
Query Pinot data using the Trino CLI, like in the sample queries below.
Sample queries to execute
List all catalogs
List all tables
Show schema
Count total documents
Access Pinot with Presto
Deploy Presto with the Pinot plugin
First, deploy Presto with default configurations:
To customize your deployment, run the below command to get all the configurable values.
After modifying the /tmp/presto-values.yaml file, deploy Presto:
Once you've deployed the Presto instance, check the deployment status:
Sample Output of K8s Deployment Status
Query Presto using the Presto CLI
Once Presto is deployed, you can run the below command from here, or follow the steps below.
Download the Presto CLI:
Port forward presto-coordinator port 8080 to localhost port 18080:
Start the Presto CLI with the Pinot catalog:
Query Pinot data with the Presto CLI, like in the sample queries below.
Sample queries to execute
List all catalogs
List all tables
Show schema
Count total documents
Delete a Pinot cluster in Kubernetes
To delete your Pinot cluster in Kubernetes, run the following command:
Note: The --network pinot-demo flag is optional and assumes that you have a Docker network named pinot-demo that you want to connect the Kafka container to.
We're going to generate some JSON messages from the terminal using the following script:
datagen.py
If you run this script (python datagen.py), you'll see the following output:
Ingesting Data into Kafka
Let's now pipe that stream of messages into Kafka, by running the following command:
We can check how many messages have been ingested by running the following command:
Output
And we can print out the messages themselves by running the following command:
Output
Schema
A schema defines what fields are present in the table along with their data types in JSON format.
Create a file called /tmp/pinot/schema-stream.json and add the following content to it.
Table Config
A table is a logical abstraction that represents a collection of related data. It is composed of columns and rows (known as documents in Pinot). The table config defines the table's properties in JSON format.
Create a file called /tmp/pinot/table-config-stream.json and add the following content to it.
Create schema and table
Create the table and schema by running the appropriate command below:
Querying
Navigate to localhost:9000/#/query and click on the events table to run a query that shows the first 10 rows in this table.
Querying the events table
Kafka ingestion guidelines
Kafka versions in Pinot
Pinot supports two versions of the Kafka library: kafka-0.9 and kafka-2.x for low level consumers.
Post release 0.10.0, we have started shading kafka packages inside Pinot. If you are using our latest tagged docker images or master build, you should replace org.apache.kafka with shaded.org.apache.kafka in your table config.
Upgrade from Kafka 0.9 connector to Kafka 2.x connector
Update table config for low level consumer: stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.
Pinot does not support using high-level Kafka consumers (HLC). Pinot uses low-level consumers to ensure accurate results, reduce operational complexity, improve scalability, and minimize storage overhead.
How to consume from a Kafka version > 2.0.0
This connector is also suitable for Kafka lib versions higher than 2.0.0. In the Kafka 2.0 connector pom.xml, changing kafka.lib.version from 2.0.0 to 2.1.1 will make this connector work with Kafka 2.1.1.
Kafka configurations in Pinot
Use Kafka partition (low) level consumer with SSL
Here is an example config which uses SSL-based authentication to talk with Kafka and the schema registry. Notice there are two sets of SSL options: the ones starting with ssl. are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
Consume transactionally-committed messages
The connector with Kafka library 2.0+ supports Kafka transactions. The transaction support is controlled by config kafka.isolation.level in Kafka stream config, which can be read_committed or read_uncommitted (default). Setting it to read_committed will ingest transactionally committed messages in Kafka stream only.
For example,
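A sketch of the relevant streamConfigs entries, assuming the stream.kafka. prefix used by the Kafka 2.x connector (the topic name and broker list are placeholders):

"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "myTopic",
  "stream.kafka.broker.list": "localhost:9092",
  "stream.kafka.isolation.level": "read_committed"
}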
Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.
Use Kafka partition (low) level consumer with SASL_SSL
Here is an example config which uses SASL_SSL-based authentication to talk with Kafka and the schema registry. Notice there are two sets of SSL options: some are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
Extract record headers as Pinot table columns
Pinot's Kafka connector supports automatically extracting record headers and metadata into the Pinot table columns. The following table shows the mapping for record header/metadata to Pinot table column names:
| Kafka Record | Pinot Table Column | Description |
| --- | --- | --- |
| Record key: any type <K> | __key : String | For simplicity of design, we assume that the record key is always a UTF-8 encoded String |
| Record Headers: Map<String, String> | Each header key is listed as a separate column: __header$HeaderKeyName : String | For simplicity of design, we directly map the string headers from the Kafka record to Pinot table columns |
| Record metadata - offset : long | __metadata$offset : String | |
| Record metadata - partition : int | __metadata$partition : String | |
In order to enable the metadata extraction in a Kafka table, you can set the stream config metadata.populate to true.
In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.
For example, if you want to add only the offset and key as dimension columns in your Pinot table, they can be listed in the schema as follows:
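A minimal sketch of the relevant dimensionFieldSpecs entries (the rest of the schema is omitted; both columns are strings, per the mapping table above):

"dimensionFieldSpecs": [
  {
    "name": "__key",
    "dataType": "STRING"
  },
  {
    "name": "__metadata$offset",
    "dataType": "STRING"
  }
]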
Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.
To avoid errors like The Avro schema must be provided, designate the location of the schema in your streamConfigs section. For example, if your current section contains the following:
Then add the key "stream.kafka.decoder.prop.schema" followed by a value that denotes the location of your schema.
By default, Pinot creates a dictionary-encoded forward index for each column.
Enabling indexes
There are two ways to enable indexes for a Pinot table.
As part of ingestion, during Pinot segment generation
Indexing is enabled by specifying the column names in the table configuration. More details about how to configure each type of index can be found in the respective index's section linked above or in the table configuration reference.
Dynamically added or removed
Indexes can also be dynamically added to or removed from segments at any point. Update your table configuration with the latest set of indexes you want to have.
For example, if you have an inverted index on the foo field and now want to also include the bar field, you would update your table configuration from this:
To this:
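A minimal sketch of that change, assuming the inverted index is declared via invertedIndexColumns under tableIndexConfig:

// before: inverted index only on foo
"tableIndexConfig": {
  "invertedIndexColumns": ["foo"]
}

// after: inverted index on foo and bar
"tableIndexConfig": {
  "invertedIndexColumns": ["foo", "bar"]
}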
The updated index configuration won't be picked up unless you invoke the reload API. This API sends reload messages via Helix to all servers, as part of which indexes are added or removed from the local segments. This happens without any downtime and is completely transparent to the queries.
When adding an index, only the new index is created and appended to the existing segment. When removing an index, its related states are cleaned up from Pinot servers. You can find this API under the Segments tab on Swagger:
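As a sketch, the reload can also be triggered from the command line (the controller address and table name below are placeholders):

# Reload all segments of the table so the updated index config takes effect
curl -X POST "http://localhost:9000/segments/myTable_OFFLINE/reload"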
className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.
configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configs.
configs: Key-value pairs for format-specific configurations. This field is optional. A sketch of a complete recordReaderSpec is shown below.
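A sketch of a recordReaderSpec for CSV inside the ingestion job spec (the class names follow the pinot-csv input-format plugin; adjust them for your format):

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ','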
Supported input formats
Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.
CSV
CSV Record Reader supports the following configs:
fileFormat: default, rfc4180, excel, tdf, mysql
header: Header of the file. The columnNames should be separated by the delimiter mentioned in the configuration.
delimiter: The character separating the columns.
multiValueDelimiter: The character separating multiple values in a single column. This can be used to split a column into a list.
skipHeader: Skip header record in the file. Boolean.
ignoreEmptyLines: Ignore empty lines (instead of filling them with default values). Boolean.
ignoreSurroundingSpaces: Ignore spaces around column names and values. Boolean.
quoteCharacter: Single character used for quotes in CSV files.
recordSeparator: Character used to separate records in the input file. Default is \n or \r\n depending on the platform.
nullStringValue: String value that represents null in CSV files. Default is empty string.
skipUnParseableLines : Skip lines that cannot be parsed. Note that this would result in data loss. Boolean.
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config.
multiValueDelimiter: ''
Avro
The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, the Avro record reader only supports primitive types. To enable support for the rest of the Avro data types, set enableLogicalTypes to true.
We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in org.apache.avro.Conversions.
| Avro Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT | INT | |
| LONG | LONG | |
| FLOAT | FLOAT | |
JSON
Thrift
Thrift requires the class generated from the .thrift file to parse the data. The .class file should be available in Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.
Parquet
Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the parquet file footer and, if present, uses the Avro reader.
You can change the record reader manually in case of a misconfiguration.
For the support of DECIMAL and other parquet native data types, always use ParquetNativeRecordReader.
| Parquet Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT96 | LONG | Parquet INT96 type converts nanoseconds to Pinot INT64 type of milliseconds |
| INT64 | LONG | |
| INT32 | INT | |
| FLOAT | FLOAT | |
| DOUBLE | | |
For ParquetAvroRecordReader, you can refer to the Avro section above for the type conversions.
ORC
ORC record reader supports the following data types -
| ORC Data Type | Java Data Type |
| --- | --- |
| BOOLEAN | String |
| SHORT | Integer |
| INT | Integer |
| LONG | Integer |
| FLOAT | Float |
In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.
Protocol Buffers
The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using the command -
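For example, a sketch using protoc (paths are placeholders):

# Generate a descriptor file, including any imported .proto files
protoc --include_imports --descriptor_set_out=/tmp/sample.desc sample.proto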
Batch Ingestion
Batch ingestion of data into Apache Pinot.
With batch ingestion you create a table using data already present in a file system such as S3. This is particularly useful when you want to use Pinot to query across large data with minimal latency or to test out new features using a simple data file.
To ingest data from a filesystem, perform the following steps, which are described in more detail in this page:
Create schema configuration
Create table configuration
Upload schema and table configs
Upload data
Batch ingestion currently supports the following mechanisms to upload the data:
Standalone
Here's an example using standalone local processing.
First, create a table using the following CSV data.
Create schema configuration
In our data, the only column on which aggregations can be performed is score. Secondly, timestampInEpoch is the only timestamp column. So, in our schema, we mark score as a metric and timestampInEpoch as the timestamp column.
Here, we have also defined two extra fields: format and granularity. The format specifies the formatting of our timestamp column in the data source. Currently, it's in milliseconds, so we've specified 1:MILLISECONDS:EPOCH.
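A sketch of such a schema, assuming the CSV contains a score metric and a timestampInEpoch time column (the dimension field names are illustrative):

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}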
Create table configuration
We define a table transcript and map the schema created in the previous step to the table. For batch data, we keep the tableType as OFFLINE.
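A sketch of such an offline table config (replication and the other settings are illustrative):

{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "metadata": {}
}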
Upload schema and table configs
Now that we have both the configs, upload them and create a table by running the following command:
Check out the table config and schema in the Rest API to make sure they were successfully uploaded.
Upload data
We now have an empty table in Pinot. Next, upload the CSV file to this empty table.
A table is composed of multiple segments. The segments can be created in the following three ways:
Minion based ingestion
Upload API
Ingestion jobs
Minion-based ingestion
Refer to
Upload API
There are 2 controller APIs that can be used for a quick ingestion test using a small file.
When these APIs are invoked, the controller has to download the file and build the segment locally.
Hence, these APIs are NOT meant for production environments or for large input files.
/ingestFromFile
This API creates a segment using the given file and pushes it to Pinot. All steps happen on the controller.
Example usage:
To upload a JSON file data.json to a table called foo_OFFLINE, use the command below.
Note that query params need to be URLEncoded. For example, {"inputFormat":"json"} in the command below needs to be converted to %7B%22inputFormat%22%3A%22json%22%7D.
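A hedged sketch of such a call (the controller address and multipart file field are assumptions; adjust to your setup):

curl -X POST -F file=@data.json \
  -H "Content-Type: multipart/form-data" \
  "http://localhost:9000/ingestFromFile?tableNameWithType=foo_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"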
The batchConfigMapStr can be used to pass in additional properties needed for decoding the file. For example, in the case of CSV, you may need to provide the delimiter.
/ingestFromURI
This API creates a segment using file at the given URI and pushes it to Pinot. Properties to access the FS need to be provided in the batchConfigMap. All steps happen on the controller.
Example usage:
Ingestion jobs
Segments can be created and uploaded using tasks known as DataIngestionJobs. A job also needs a config of its own. We call this config the JobSpec.
For our CSV file and table, the JobSpec should look like this:
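A sketch of a standalone JobSpec for this CSV ingestion (the directories, file pattern, and controller address are placeholders):

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'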
For more detail, refer to .
Now that we have the job spec for our table transcript, we can trigger the job using the following command:
Once the job successfully finishes, head over to the \[query console] and start playing with the data.
Segment push job type
There are 3 ways to upload a Pinot segment:
Segment tar push
Segment URI push
Segment metadata push
Segment tar push
This is the original and default push mechanism.
Tar push requires the segment to be stored locally or to be openable as an InputStream on PinotFS, so the entire segment tar file can be streamed to the controller.
The push job will:
Upload the entire segment tar file to the Pinot controller.
Pinot controller will:
Save the segment into the controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
Segment URI push
This push mechanism requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
URI push is lightweight on the client side, while the controller side requires the same amount of work as the tar push.
The push job will:
POST this segment tar URI to the Pinot controller.
Pinot controller will:
Download segment from the URI and save it to controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
Segment metadata push
This push mechanism also requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
Metadata push is lightweight on the controller side; no deep store download is involved on the controller side.
The push job will:
Download the segment based on URI.
Extract metadata.
Upload metadata to the Pinot Controller.
Pinot Controller will:
Add the segment to the table based on the metadata.
Segment metadata push with copyToDeepStore
This extends the original segment metadata push for cases where the segments are pushed to a location not used as the deep store. The ingestion job can still do a metadata push but ask the Pinot controller to copy the segments into the deep store. These use cases usually happen when the ingestion jobs don't have direct access to the deep store but still want to use metadata push for its efficiency, so they use a staging location to keep the segments temporarily.
NOTE: the staging location and the deep store have to use the same storage scheme, for example both on S3. This is because the copy is done via the PinotFS.copyDir interface, which assumes so; it also means the copy happens on the storage system side, so segments don't need to go through the Pinot controller at all.
To make this work, grant Pinot controllers access to the staging location. For example on AWS, this may require adding an access policy like this example for the controller EC2 instances:
Then use metadata push to add one extra config like this one:
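A hedged sketch of that extra config, assuming the copyToDeepStoreForMetadataPush flag in the job spec's pushJobSpec:

pushJobSpec:
  # Ask the controller to copy segments from the staging location into the deep store
  copyToDeepStoreForMetadataPush: true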
Consistent data push and rollback
Pinot supports atomic updates at the segment level, which means that when data consisting of multiple segments is pushed to a table, as segments are replaced one at a time, queries to the broker during this upload phase may produce inconsistent results due to interleaving of old and new data.
See for how to enable this feature.
Segment fetchers
When Pinot segment files are created in external systems (Hadoop/Spark/etc.), there are several ways to push that data to the Pinot controller and server:
Push segment to shared NFS and let pinot pull segment files from the location of that NFS. See .
Push segment to a Web server and let pinot pull segment files from the Web server with HTTP/HTTPS link. See .
Push segment to PinotFS(HDFS/S3/GCS/ADLS) and let pinot pull segment files from PinotFS URI. See and .
The first three options are supported out of the box within the Pinot package. As long as your remote jobs send the Pinot controller the corresponding URI to the files, it will pick up the files and allocate them to the proper Pinot servers and brokers. To enable Pinot support for PinotFS, you'll need to provide configuration and the proper Hadoop dependencies.
Persistence
By default, Pinot does not come with a storage layer, so the data sent won't be retained in case of a system crash. In order to persistently store the generated segments, you will need to change the controller and server configs to add deep storage. Check out for all the info and related configs.
Tuning
Standalone
Since Pinot is written in Java, you can set the following basic Java configurations to tune the segment runner job:
Log4j2 file location with -Dlog4j2.configurationFile
Plugin directory location with -Dplugins.dir=/opt/pinot/plugins
JVM props, like -Xmx8g -Xms4G
If you are using Docker, you can set the following under the JAVA_OPTS variable.
Hadoop
You can set -D mapreduce.map.memory.mb=8192 to set the mapper memory size when submitting the Hadoop job.
Spark
You can add config spark.executor.memory to tune the memory usage for segment creation when submitting the Spark job.
0.8.0
This release introduced several new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins.
Summary
This release introduced several awesome new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins (AWS Kinesis, Apache Pulsar). It contains a lot of query enhancements such as new timestamp and boolean type support and flexible numerical column comparison. It also includes many key bug fixes. See details below.
The release was cut from the following commit: fe83e95aa9124ee59787c580846793ff7456eaa5
and the following cherry-picks:
Notable New Features
Extract time handling for SegmentProcessorFramework ()
Add Apache Pulsar low level and high level connector ()
Enable parallel builds for compat checker ()
Special notes
After the 0.8.0 release, we will officially support jdk 11, and can now safely start to use jdk 11 features. Code is still compilable with jdk 8 ()
RealtimeToOfflineSegmentsTask config has some backward incompatible changes ()
— timeColumnTransformFunction is removed (backward-incompatible, but rollup is not supported anyway)
— Deprecate collectorType and replace it with mergeType
— Add roundBucketTimePeriod and partitionBucketTimePeriod to config the time bucket for round and partition
Major Bug fixes
Fix race condition in MinionInstancesCleanupTask ()
Fix custom instance id for controller/broker/minion ()
Fix UpsertConfig JSON deserialization. ()
Forward index
The forward index is the mechanism Pinot employs to store the values of each column. At a conceptual level, the forward index can be thought of as a mapping from document IDs (also known as row indices) to the actual column values of each row.
Forward indexes are enabled by default, meaning that columns will have a forward index unless explicitly disabled. Disabling the forward index can save storage space when other indexes sufficiently cover the required data patterns. For information on how to disable the forward index and its implications, refer to .
Pinot Minion SegmentGenerationAndPush task: PinotFS configs inside taskSpec are always temporary and have higher priority than the default PinotFS created by the minion server configs (#6744)
DataTable V3 implementation and measure data table serialization cost on server (#6710)
add uploadLLCSegment endpoint in TableResource (#6653)
Recover the segment from controller when LLC table cannot load it (#6647)
Adding a new API for validating specified TableConfig and Schema (#6620)
Introduce a metric for query/response size on broker. (#6590)
Adding a controller periodic task to clean up dead minion instances (#6543)
Adding new validation for Json, TEXT indexing (#6541)
Always return a response from query execution. (#6596)
Regex path for pluggable MinionEventObserverFactory is changed from org.apache.pinot.*.event.* to org.apache.pinot.*.plugin.minion.tasks.* (#6980)
Moved all pinot built-in minion tasks to the pinot-minion-builtin-tasks module and package them into a shaded jar (#6618)
Reloading consuming segment flag pinot.server.instance.reload.consumingSegment will be true by default (#7078)
Move JSON decoder from pinot-kafka to pinot-json package. (#7021)
Backward incompatible schema change through controller rest API PUT /schemas/{schemaName} will be blocked. (#6737)
Deprecated /tables/validateTableAndSchema in favor of the new configs/validate API and introduced new APIs for /tableConfigs to operate on the real-time table config, offline table config and schema in one shot. (#6840)
Fix the memory issue for selection query with large limit (#7112)
Fix the deleted segments directory not exist warning (#7097)
Fixing docker build scripts by providing JDK_VERSION as parameter (#7095)
How forward indexes are implemented depends on the index encoding and whether the column is sorted.
When the encoding is set to RAW, the forward index is implemented as an array, where the indices correspond to document IDs and the values represent the actual row values. For more details, refer to the raw value forward index section.
In the case of DICTIONARY encoding, the forward index doesn't store the actual row values but instead stores dictionary IDs. This introduces an additional level of indirection when reading values, but it allows for more efficient physical layouts when the number of unique values in the column is significantly smaller than the number of rows.
The DICTIONARY encoding can be even more efficient if the segment is sorted by the indexed column. You can learn more about the dictionary encoded forward index and the sorted forward index in their respective sections.
When working out whether a column should use dictionary encoded or raw value encoding, the following comparison table may help:
| Dictionary | Raw Value |
| --- | --- |
| Provides compression when low to medium cardinality. | Eliminates padding overhead |
| Allows for indexing (esp. inverted index). | No inverted index (only JSON/Text/FST index) |
| Adds one level of dereferencing, so can increase disk seeks | Eliminates additional dereferencing, so good when all docs of interest are contiguous |
| For Strings, adds padding to make all values equal length in the dictionary | Chunk de-compression overhead when the docs selected don't have spatial locality |
Dictionary-encoded forward index with bit compression (default)
In this approach, each unique value in a column is assigned an ID, and a dictionary is constructed to map these IDs back to their corresponding values. Instead of storing the actual values, the default forward index stores these bit-compressed IDs. This method is particularly effective when dealing with columns containing few unique values, as it significantly improves space efficiency.
The diagram below illustrates dictionary encoding for two columns with different data types (integer and string). For colA, dictionary encoding leads to significant space savings due to duplicated values. However, for colB, which contains mostly unique values, the compression effect is limited, and padding overhead may be high.
When using the dictionary-encoded forward index for a multi-value column, to further compress the forward index for repeated multi-value entries, enable the MV_ENTRY_DICT compression type, which adds another level of dictionary encoding on the multi-value entries. This may be useful, for example, in cases where you pre-join a fact table with a dimension table, where the multi-value entries in the dimension table are repeated after joining with the fact table.
It can be enabled with parameter:
| Parameter | Default | Description |
| --- | --- | --- |
| dictIdCompressionType | null | The compression that will be used for the dictionary-encoded forward index |
Sorted forward index with run-length encoding
When a column is physically sorted, Pinot employs a sorted forward index with run-length encoding, which builds upon dictionary encoding. Instead of storing dictionary IDs for each document ID, this approach stores pairs of start and end document IDs for each unique value.
Sorted forward index
(For simplicity, this diagram does not include the dictionary encoding layer.)
Sorted forward indexes offer the benefits of efficient compression and data locality and can also serve as an inverted index. They are active when two conditions are met: the segment is sorted by the column, and the dictionary is enabled for that column. Refer to the dictionary documentation for details on enabling the dictionary.
When dealing with multiple segments, it's crucial to ensure that data is sorted within each segment. Sorting across segments is not necessary.
To guarantee that a segment is sorted by a particular column, follow these steps:
For real-time tables, use the tableIndexConfig.sortedColumn property. If there is exactly one column specified in that array, Pinot will sort the segment by that column upon committing.
For offline tables, you must pre-sort the data by the specified column before ingesting it into Pinot.
It's crucial to note that for offline tables, the tableIndexConfig.sortedColumn property is indeed ignored.
Additionally, for real-time tables, even though this property is specified as a JSON array, at most one column should be included. Using an array with more than one column is incorrect and will not result in segments being sorted by all the columns listed in the array.
When a real-time segment is committed, rows will be sorted by the sorting column and it will be transformed into an offline segment.
During the creation of an offline segment, which also applies when a real-time segment is committed, Pinot scans the data in each column. If it detects that all values within a column are sorted in ascending order, Pinot concludes that the segment is sorted based on that particular column. In case this happens on more than one column, all of them are considered as sorting columns. Consequently, whether a segment is sorted by a column or not solely depends on the actual data distribution within the segment and entirely disregards the value of the sortedColumn property. This approach also implies that two segments belonging to the same table may have a different number of sorting columns. In the extreme scenario where a segment contains only one row, Pinot will consider all columns within that segment as sorting columns.
Here is an example of a table configuration that illustrates these concepts:
Checking sort status
You can check the sorted status of a column in a segment by running the following:
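As a hedged sketch, one way is to inspect the per-column isSorted flag recorded in the segment's metadata.properties (the column name and segment path below are placeholders):

grep "column.myColumn.isSorted" /path/to/table/segmentName/v3/metadata.properties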
Alternatively, for offline tables and for committed segments in real-time tables, you can retrieve the sorted status from the getServerMetadata endpoint. The following example is based on the Batch Quick Start:
Raw value forward index
The raw value forward index stores actual values instead of IDs. This means that it eliminates the need for dictionary lookups when fetching values, which can result in improved query performance. Raw forward index is particularly effective for columns with a large number of unique values, where dictionary encoding doesn't provide significant compression benefits.
As shown in the diagram below, dictionary encoding can lead to numerous random memory accesses for dictionary lookups. In contrast, the raw value forward index allows for sequential value scanning, which can enhance query performance when applied appropriately.
Note: The raw value forward index currently does not support an inverted index (all others, such as JSON/TEXT/range indexes, are supported). Also, since reading a value from this index requires reading the entire chunk into memory and decompressing it, it is not suitable for heavy random reads.
When using the raw format, you can configure the following parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| chunkCompressionType | null | The compression that will be used. Replaced by compressionCodec since release 1.2.0 |
| compressionCodec | null | The compression that will be used. Introduced in release 1.2.0 |
| deriveNumDocsPerChunk | false | Modifies the behavior when storing variable length values (like string or bytes) |
| rawIndexWriterVersion | 2 | |
The compressionCodec parameter has the following valid values:
PASS_THROUGH
SNAPPY
ZSTANDARD
LZ4
GZIP (Introduced in release 1.2.0)
null (the JSON null value, not "null"), which is the default. In this case, PASS_THROUGH will be used for metrics and LZ4 for other columns.
deriveNumDocsPerChunk is only used when the datatype may have a variable length, such as with string, big decimal, bytes, etc. By default, Pinot uses a fixed number of elements that was chosen empirically. If changed to true, Pinot will use a heuristic value that depends on the column data.
rawIndexWriterVersion changes the algorithm used to create the index. This changes the actual data layout, but modern versions of Pinot can read indexes written in older versions. The latest version right now is 4.
targetDocsPerChunk changes the target number of docs to store in a chunk. For rawIndexWriterVersion versions 2 and 3, this will store exactly targetDocsPerChunk per chunk. For rawIndexWriterVersion version 4, this config is used in conjunction with targetMaxChunkSize and chunk size is determined with the formula min(lengthOfLongestDocumentInSegment * targetDocsPerChunk, targetMaxChunkSize). A negative value will disable dynamic chunk sizing and use the static targetMaxChunkSize.
targetMaxChunkSize changes the target max chunk size. For rawIndexWriterVersion versions 2 and 3, this can only be used with deriveNumDocsPerChunk. For rawIndexWriterVersion version 4, this sets the upper bound for a dynamically calculated chunk size. Documents larger than the targetMaxChunkSize will be given their own 'huge' chunk, therefore, it is recommended to size this such that huge chunks are avoided.
Raw forward index configuration
The recommended way to configure the forward index using raw format is by including the parameters explained above in the indexes.forward object. For example:
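A sketch of such a configuration (the column name is a placeholder and the parameter values are illustrative):

{
  "fieldConfigList": [
    {
      "name": "theColumnName",
      "encodingType": "RAW",
      "indexes": {
        "forward": {
          "compressionCodec": "LZ4",
          "deriveNumDocsPerChunk": false,
          "rawIndexWriterVersion": 4
        }
      }
    }
  ]
}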
Deprecated
An alternative method to configure the raw format parameters is available. This older approach can still be used, although it is not recommended. Here are the details of this older method:
chunkCompressionType: This parameter can be defined as a sibling of name and encodingType in the fieldConfigList section.
deriveNumDocsPerChunk: You can configure this parameter with the property deriveNumDocsPerChunkForRawIndex. Note that in properties, all values must be strings, so valid values for this property are "true" and "false".
rawIndexWriterVersion: This parameter can be configured using the property rawIndexWriterVersion. Again, in properties, all values must be strings, so valid values for this property are "2", "3", and so on.
For example:
While this older method is still supported, it is not the recommended way to configure these parameters. There are no plans to remove support for this older method, but keep in mind that any new parameters added in the future may only be configurable in the forward JSON object.
Disabling the forward index
Traditionally the forward index has been a mandatory index for all columns in the on-disk segment file format.
However, certain columns may only be used as a filter in the WHERE clause for all queries. In such scenarios, the forward index is not necessary, as other indexes and structures in the segments can provide the required SQL query functionality. The forward index then just takes up extra storage space and can ideally be freed up.
Thus, to provide users an option to save storage space, a knob to disable the forward index is now available.
The forward index on one or more column(s) in your Pinot table can be disabled, with the following limitations:
Only supported for immutable (offline) segments.
If the column has a range index then the column must be of single-value type and use range index version 2.
MV columns with duplicates within a row will lose the duplicated entries on forward index regeneration. The ordering of data with an MV row may also change on regeneration. A backfill is required in such scenarios (to preserve duplicates or ordering).
If forward index regeneration support on reload (i.e. re-enabling the forward index for a forward index disabled column) is required then the dictionary and inverted index must be enabled on that particular column.
Sorted columns will allow the forward index to be disabled, but this operation will be treated as a no-op and the index (which acts as both a forward index and inverted index) will be created.
To disable the forward index, in table config under fieldConfigList, set the disabled property to true as shown below:
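A sketch of such a fieldConfigList entry, assuming the disabled flag lives under the forward entry of the indexes object (the column name is a placeholder):

"fieldConfigList": [
  {
    "name": "columnA",
    "indexes": {
      "forward": {
        "disabled": true
      }
    }
  }
]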
The older way to do so is still supported, but not recommended.
A table reload operation must be performed for the above config to take effect. Enabling / disabling other indexes on the column can be done via the usual table config options.
The forward index can also be regenerated for a column where it is disabled by enabling the index and reloading the segment. The forward index can only be regenerated if the dictionary and inverted index have been enabled for the column. If either has been disabled, the only way to get the forward index back is to regenerate the segments via the offline jobs and re-push / refresh the data.
Warning:
For multi-value (MV) columns the following invariants cannot be maintained after regenerating the forward index for a forward index disabled column:
Ordering guarantees of the MV values within a row
If entries within an MV row are duplicated, the duplicates will be lost. Regenerate the segments via your offline jobs and re-push / refresh the data to get back the original MV data with duplicates.
We will work on removing the second invariant in the future.
Examples of queries which will fail after disabling the forward index for an example column, columnA, can be found below:
Select
Forward index disabled columns cannot be present in the SELECT clause even if filters are added on it.
Group By Order By
Forward index disabled columns cannot be present in the GROUP BY and ORDER BY clauses. They also cannot be part of the HAVING clause.
Aggregation Queries
A subset of the aggregation functions work when the forward index is disabled, such as MIN, MAX, DISTINCTCOUNT, DISTINCTCOUNTHLL, and more. Some other aggregation functions will not work, such as the ones below:
Distinct
Forward index disabled columns cannot be present in the SELECT DISTINCT clause.
Range Queries
To run queries on single-value columns where the filter clause contains operators such as >, <, >=, <=, a version 2 range index must be present. Without the range index, such queries will fail as shown below:
Explore the minion component in Apache Pinot, empowering efficient data movement and segment generation within Pinot clusters.
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.
Starting a minion
Make sure you've . If you're using Docker, make sure to . To start a minion:
Interfaces
Pinot task generator
The Pinot task generator interface defines the APIs for the controller to generate tasks for minions to execute.
PinotTaskExecutorFactory
Factory for PinotTaskExecutor which defines the APIs for Minion to execute the tasks.
MinionEventObserverFactory
Factory for MinionEventObserver which defines the APIs for task event callbacks on minion.
Built-in tasks
SegmentGenerationAndPushTask
The PushTask can fetch files from an input folder (e.g., an S3 bucket) and convert them into segments. The PushTask converts one file into one segment and keeps the file name in the segment metadata to avoid duplicate ingestion. Below is an example task config to put in the TableConfig to enable this task. The task is scheduled every 10 minutes to keep ingesting remaining files, with at most 10 parallel tasks and 1 file per task.
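A sketch of such a task config (the schedule below is a Quartz cron expression for every 10 minutes; the input location is a placeholder, and only the properties discussed in this section are shown):

"task": {
  "taskTypeConfigsMap": {
    "SegmentGenerationAndPushTask": {
      "schedule": "0 */10 * * * ?",
      "tableMaxNumTasks": "10",
      "inputDirURI": "s3://my-bucket/my-input-folder/"
    }
  }
}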
NOTE: You may want to simply omit "tableMaxNumTasks" due to this caveat: the task generates one segment per file, and derives segment name based on the time column of the file. If two files happen to have same time range and are ingested by tasks from different schedules, there might be segment name conflict. To overcome this issue for now, you can omit “tableMaxNumTasks” and by default it’s Integer.MAX_VALUE, meaning to schedule as many tasks as possible to ingest all input files in a single batch. Within one batch, a sequence number suffix is used to ensure no segment name conflict. Because the sequence number suffix is scoped within one batch, tasks from different batches might encounter segment name conflict issue said above.
When performing ingestion at scale remember that Pinot will list all of the files contained in the `inputDirURI` every time a `SegmentGenerationAndPushTask` job gets scheduled. This could become a bottleneck when fetching files from a cloud bucket like GCS. To prevent this make `inputDirURI` point to the least number of files possible.
RealtimeToOfflineSegmentsTask
See for details.
MergeRollupTask
See for details.
Enable tasks
Tasks are enabled on a per-table basis. To enable a certain task type (e.g. myTask) on a table, update the table config to include the task type:
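A sketch, using the taskTypeConfigsMap section of the table config (the property names are placeholders):

"task": {
  "taskTypeConfigsMap": {
    "myTask": {
      "myProperty1": "value1",
      "myProperty2": "value2"
    }
  }
}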
Under each enabled task type, custom properties can be configured for the task type.
There are also two task configs to be set as part of cluster configs, like below. One controls a task's overall timeout (1 hour by default) and the other controls how many tasks can run on a single minion worker (1 by default).
Schedule tasks
Auto-schedule
There are 2 ways to enable task scheduling:
Controller level schedule for all minion tasks
Tasks can be scheduled periodically for all task types on all enabled tables. Enable auto task scheduling by configuring the schedule frequency in the controller config with the key controller.task.frequencyPeriod. This takes period strings as values, e.g. 2h, 30m, 1d.
Per table and task level schedule
Tasks can also be scheduled based on cron expressions. The cron expression is set in the schedule config for each task type separately. The controller config controller.task.scheduler.enabled should be set to true to enable cron scheduling.
As shown below, the RealtimeToOfflineSegmentsTask will be scheduled at the first second of every minute (following Quartz cron syntax).
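A sketch of the corresponding table config entry:

"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "schedule": "0 * * * * ?"
    }
  }
}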
Manual schedule
Tasks can be manually scheduled using the following controller rest APIs:
Rest API
Description
Schedule task on specific instances
Tasks can be scheduled on specific instances using the following config at task level:
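A hedged sketch, assuming the minionInstanceTag property at the task level (the tag name is a placeholder):

"task": {
  "taskTypeConfigsMap": {
    "myTask": {
      "minionInstanceTag": "tag1_MINION"
    }
  }
}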
By default, the value is minion_untagged for backward compatibility. This allows users to schedule tasks on specific nodes and isolate tasks among tables / task types.
Rest API
Description
Task level advanced configs
allowDownloadFromServer
When a task is executed on a segment, the minion node fetches the segment from deepstore. If the deepstore is not accessible, the minion node can download the segment from the server node. This is controlled by the allowDownloadFromServer config in the task config. By default, this is set to false.
We can also set this config at the minion instance level with pinot.minion.task.allow.download.from.server (default is false). This instance-level config helps enforce this behavior when the number of tables / tasks is high and we want to enable it for all of them. Note: the task-level config will override the instance-level config value.
Plug-in custom tasks
To plug in a custom task, implement PinotTaskGenerator, PinotTaskExecutorFactory and MinionEventObserverFactory (optional) for the task type (all of them should return the same string for getTaskType()), and annotate them with the following annotations:
Implementation
Annotation
After annotating the classes, put them under the package of name org.apache.pinot.*.plugin.minion.tasks.*, then they will be auto-registered by the controller and minion.
Example
See where the TestTask is plugged-in.
Task Manager UI
In the Pinot UI, there is a Minion Task Manager tab under the Cluster Manager page. From that tab, you can find a lot of task-related info for troubleshooting. That info is mainly collected from the Pinot controller that schedules tasks, or from Helix, which tracks task runtime status. There are also buttons to schedule tasks in an ad hoc way. Below are brief introductions to some pages under the Minion Task Manager tab.
This page shows which Minion task types have been used, that is, which task types have created their task queues in Helix.
Clicking into a task type, you can see the tables using that task, along with a few buttons to stop the task queue, clean up ended tasks, and so on.
Then, clicking into any table in this list, you can see how the task is configured for that table, and the task metadata if there is any in ZK. For example, MergeRollupTask tracks a watermark in ZK. If the task is cron scheduled, the current and next schedules are also shown on this page, like below.
At the bottom of this page is a list of tasks generated for this table for this specific task type. Here, one MergeRollup task has been generated and completed.
Clicking into a task from that list, we can see its start and end times, and the subtasks generated for that task (as context, one minion task can have multiple subtasks to process data in parallel). In this example, there happened to be one subtask, and the page shows when it started and stopped and which minion worker it ran on.
Clicking into this subtask, one can see more details about it like the input task configs and error info if the task failed.
Task-related metrics
There is a controller job that runs every 5 minutes by default and emits metrics about Minion tasks scheduled in Pinot. The following metrics are emitted for each task type:
NumMinionTasksInProgress: Number of running tasks
NumMinionSubtasksRunning: Number of running sub-tasks
NumMinionSubtasksWaiting: Number of waiting sub-tasks (unassigned to a minion as yet)
The controller also emits metrics about how tasks are cron scheduled:
cronSchedulerJobScheduled: Number of current cron schedules registered to be triggered regularly according to their cron expressions. It's a Gauge.
cronSchedulerJobTrigger: Number of cron schedules triggered, as a Meter.
cronSchedulerJobSkipped: Number of late cron schedules skipped, as a Meter.
For each task, the minion will emit these metrics:
TASK_QUEUEING: Task queueing time (task_dequeue_time - task_inqueue_time), assuming the time drift between the Helix controller and the Pinot minion is minor; otherwise the value may be negative.
TASK_EXECUTION: Task execution time, which is the time spent on executing the task
NUMBER_OF_TASKS: Number of tasks in progress on that minion. Whenever a minion starts a task, the gauge is increased by 1; whenever a minion completes a task (either succeeded or failed), it is decreased by 1.
{
"tableName": "somePinotTable",
"fieldConfigList": [
{
"name": "playerID",
"encodingType": "RAW",
"chunkCompressionType": "PASS_THROUGH", // it can also be defined here
"properties": {
"deriveNumDocsPerChunkForRawIndex": "false", // here the string value has to be used
"rawIndexWriterVersion": "2" // here the string value has to be used
}
},
...
],
...
}
NumMinionSubtasksError: Number of error sub-tasks (completed with an error/exception)
PercentMinionSubtasksInQueue: Percent of sub-tasks in waiting or running states
PercentMinionSubtasksInError: Percent of sub-tasks in error
cronSchedulerJobExecutionTimeMs: Time used to complete task generation, as a Timer.
NUMBER_TASKS_EXECUTED: Number of tasks executed, as a Meter.
NUMBER_TASKS_COMPLETED: Number of tasks completed, as a Meter.
NUMBER_TASKS_CANCELLED: Number of tasks cancelled, as a Meter.
NUMBER_TASKS_FAILED: Number of tasks failed, as a Meter. Unlike a fatal failure, the task encountered an error that cannot be recovered from in this run, but it may still succeed if the task is retried.
NUMBER_TASKS_FATAL_FAILED: Number of tasks fatally failed, as a Meter. Unlike a failure, the task encountered an error that will not be recoverable even if the task is retried.
POST /tasks/schedule
Schedule tasks for all task types on all enabled tables
POST /tasks/schedule?taskType=myTask
Schedule tasks for the given task type on all enabled tables
POST /tasks/schedule?tableName=myTable_OFFLINE
Schedule tasks for all task types on the given table
POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE
Schedule tasks for the given task type on the given table
POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE&minionInstanceTag=tag1_MINION
Schedule tasks for the given task type of the given table on the minion nodes tagged as tag1_MINION.
This guide shows you how to ingest a stream of records into a Pinot table.
Apache Pinot lets users consume data from streams and push it directly into the database. This process is called stream ingestion. Stream ingestion makes it possible to query data within seconds of publication.
Stream ingestion supports checkpoints to prevent data loss.
To set up stream ingestion, perform the following steps, which are described in more detail later on this page:
Create schema configuration
Create table configuration
Create ingestion configuration
Upload table and schema spec
Here's an example where we assume the data to be ingested is in the following format:
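For instance, a record might look like this (the field names below are illustrative assumptions, matching the transcript example used later on this page):
{
  "studentID": 205,            // illustrative field
  "firstName": "Natalie",
  "lastName": "Jones",
  "gender": "Female",
  "subject": "Maths",
  "score": 3.8,
  "timestampInEpoch": 1571900400000
}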
Create schema configuration
The schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamps. For more details, see the schema configuration reference.
For our sample data, the schema configuration looks like this:
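A minimal sketch of such a schema, using the illustrative fields assumed above (adjust names and types to your actual data):
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "INT"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}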
Create table configuration with ingestion configuration
The next step is to create a table where all the ingested data will flow and can be queried. For details about each table component, see the reference.
The table configuration contains an ingestion configuration (ingestionConfig), which specifies how to ingest streaming data into Pinot. For details, see the reference.
Example table config with ingestionConfig
For our sample data and schema, the table config will look like this:
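A sketch of a REALTIME table config with an ingestionConfig for a Kafka stream; the broker address, topic name, and flush thresholds below are placeholders to adapt to your environment:
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "schemaName": "transcript",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "tenants": {},
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic",   // placeholder topic
          "stream.kafka.broker.list": "localhost:9092",    // placeholder broker list
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "100M"
        }
      ]
    }
  },
  "metadata": {}
}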
Example ingestionConfig for multi-topic ingestion
In recent releases, Pinot supports ingesting data from multiple stream topics into a single table. (This feature is currently in beta and only supports multiple Kafka topics; other stream types will be supported in the near future.) For our sample data and schema, assume the data is duplicated to two topics, transcript-topic1 and transcript-topic2. If we want to ingest from both topics, the table config will look like this:
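As a minimal sketch (the exact shape depends on your Pinot version's multi-topic support), the ingestionConfig carries one stream config map per topic, with everything else in the table config unchanged:
{
  ...
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic1",  // first topic
          "stream.kafka.broker.list": "localhost:9092",    // placeholder broker list
          ...
        },
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic2",  // second topic
          "stream.kafka.broker.list": "localhost:9092",
          ...
        }
      ]
    }
  },
  ...
}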
With multi-topic ingestion (refer to the multi-topic ingestion documentation for details):
All transform functions apply to the ingestion from both topics.
The existing instance assignment strategies work as usual.
Other existing behaviors are handled in the same way.
Upload schema and table config
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.
Tune the stream config
Throttle stream consumption
There are some scenarios where the message rate in the input stream comes in bursts, which can lead to long GC pauses on the Pinot servers or affect the ingestion rate of other real-time tables on the same server. If this happens to you, throttle the consumption rate during stream ingestion to better manage overall performance.
Stream consumption throttling can be tuned using the stream config topic.consumption.rate.limit which indicates the upper bound on the message rate for the entire topic.
Here is the sample configuration on how to configure the consumption throttling:
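A minimal sketch of the relevant stream config entries (1000 is an arbitrary example value, in messages per second for the whole topic):
{
  "streamType": "kafka",
  "stream.kafka.topic.name": "transcript-topic",  // placeholder topic
  "topic.consumption.rate.limit": "1000"          // upper bound on messages/sec for the entire topic
}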
Some things to keep in mind while tuning this config are:
Since this configuration applies to the entire topic, internally the rate is divided by the number of partitions in the topic and applied to each partition's consumer.
In a multi-tenant deployment (where you have more than one table on the same server instance), make sure that the rate limit on one table doesn't starve the rate limiting of another table. So, when there is more than one table on the same server (which is likely), you may need to re-tune the throttling threshold for all the streaming tables.
Once throttling is enabled for a table, you can verify by searching for a log that looks similar to:
In addition, you can monitor the consumption rate utilization with the metric CONSUMPTION_QUOTA_UTILIZATION.
Note that any configuration change for topic.consumption.rate.limit in the stream config will NOT take effect immediately. The new configuration will be picked up from the next consuming segment. In order to enforce the new configuration, you need to trigger the forceCommit API. Refer to the "Pause stream ingestion" section below for more details.
Custom ingestion support
You can also write an ingestion plugin if the platform you are using is not supported out of the box. For a walkthrough, see the documentation on writing a stream ingestion plugin.
Pause stream ingestion
There are some scenarios in which you may want to pause real-time ingestion while your table remains available for queries. For example, if there is a problem with the stream ingestion, you may want queries to keep executing on the already ingested data while you troubleshoot the issue. For these scenarios, first issue a Pause request to a controller host. After troubleshooting the stream is done, issue another request to the controller to resume consumption.
When a Pause request is issued, the controller instructs the real-time servers hosting your table to commit their consuming segments immediately. However, the commit process may take some time to complete. Note that Pause and Resume requests are async. An OK response means that the instructions for pausing or resuming have been successfully sent to the real-time server. If you want to know whether consumption has actually stopped or resumed, issue a pause status request.
It's worth noting that consuming segments on real-time servers are stored in volatile memory, and their resources are allocated when the consuming segments are first created. These resources cannot be altered if consumption parameters are changed midway through consumption. It may take hours before these changes take effect. Furthermore, if the parameters are changed in an incompatible way (for example, changing the underlying stream with a completely new set of offsets, or changing the stream endpoint from which to consume messages), it will result in the table getting into an error state.
The pause and resume feature is helpful in these instances. When a pause request is issued by the operator, consuming segments are committed without starting new mutable segments. Instead, new mutable segments are started only when the resume request is issued. This mechanism provides the operators as well as developers with more flexibility. It also enables Pinot to be more resilient to the operational and functional constraints imposed by underlying streams.
There is another feature called Force Commit which uses the primitives of the pause and resume feature. When the operator issues a force commit request, the current mutable segments are committed and new ones started right away. Operators can use this feature to make all compatible table config parameter changes take effect immediately.
(v 0.12.0+) Once submitted, the forceCommit API returns a jobId that can be used to get the current progress of the forceCommit operation. A sample response and status API call:
The forceCommit request simply triggers a regular commit before the consuming segments reach their end criteria, so it follows the same mechanism as a regular commit. It is a one-shot request and is not retried automatically upon failure. However, it is idempotent, so you can keep issuing it until it succeeds if needed.
This API is async in that it doesn't wait for the segment commit to complete. However, a status entry is written to ZK to track when the request was issued and which consuming segments were included. The consuming segments tracked in the status entry are compared with the latest IdealState to indicate the progress of the forceCommit. This status is not updated or deleted upon commit success or failure, so it can become stale. Currently, the most recent 100 status entries are kept in ZK; the oldest entries are only deleted when the total number is about to exceed 100.
For incompatible parameter changes, an option is added to the resume request to handle the case of a completely new set of offsets. Operators can now follow a three-step process: First, issue a pause request. Second, change the consumption parameters. Finally, issue the resume request with the appropriate option. These steps will preserve the old data and allow the new data to be consumed immediately. All through the operation, queries will continue to be served.
Handle partition changes in streams
If a Pinot table is configured to consume using a (partition-based) stream type, then it is possible that the partitions of the table change over time. In Kafka, for example, the number of partitions may increase. In Kinesis, the number of partitions may increase or decrease -- some partitions could be merged to create a new one, or existing partitions split to create new ones.
Pinot runs a periodic task called RealtimeSegmentValidationManager that monitors such changes and starts consumption on new partitions (or stops consumption from old ones) as necessary. Since this is a periodic task that runs on the controller, it may take some time for Pinot to recognize new partitions and start consuming from them. This may delay the data in new partitions from appearing in the results that Pinot returns.
If you want to recognize the new partitions sooner, manually trigger the RealtimeSegmentValidationManager periodic task so that the new partitions are picked up immediately.
Infer ingestion status of real-time tables
Often, it is important to understand the rate of ingestion of data into your real-time table. This is commonly done by looking at the consumption lag of the consumer. The lag itself can be observed in many dimensions. Pinot supports observing consumption lag along the offset dimension and time dimension, whenever applicable (as it depends on the specifics of the connector).
The ingestion status of a connector can be observed by querying either the /consumingSegmentsInfo API or the table's /debug API, as shown below:
A sample response from a Kafka-based real-time table is shown below. The ingestion status is displayed for each of the CONSUMING segments in the table.
Monitor real-time ingestion
Real-time ingestion includes 3 stages of message processing: Decode, Transform, and Index.
In each of these stages, a failure can happen which may or may not result in an ingestion failure. The following metrics are available to investigate ingestion issues:
Decode stage -> an error here is recorded as INVALID_REALTIME_ROWS_DROPPED
Transform stage -> possible errors here are:
When a message gets dropped due to the transform, it is recorded as REALTIME_ROWS_FILTERED
There is yet another metric called ROWS_WITH_ERROR which is the sum of all error counts in the 3 stages above.
Furthermore, the metric REALTIME_CONSUMPTION_EXCEPTIONS gets incremented whenever there is a transient/permanent stream exception seen during consumption.
These metrics can be used to understand why ingestion failed for a particular table partition before diving into the server logs.
Usage: StartMinion
-help : Print this message. (required=false)
-minionHost <String> : Host name for minion. (required=false)
-minionPort <int> : Port number to start the minion at. (required=false)
-zkAddress <http> : HTTP address of Zookeeper. (required=false)
-clusterName <String> : Pinot cluster name. (required=false)
-configFileName <Config File Name> : Minion Starter Config file. (required=false)
public interface PinotTaskGenerator {
/**
* Initializes the task generator.
*/
void init(ClusterInfoAccessor clusterInfoAccessor);
/**
* Returns the task type of the generator.
*/
String getTaskType();
/**
* Generates a list of tasks to schedule based on the given table configs.
*/
List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs);
/**
* Returns the timeout in milliseconds for each task, 3600000 (1 hour) by default.
*/
default long getTaskTimeoutMs() {
return JobConfig.DEFAULT_TIMEOUT_PER_TASK;
}
/**
* Returns the maximum number of concurrent tasks allowed per instance, 1 by default.
*/
default int getNumConcurrentTasksPerInstance() {
return JobConfig.DEFAULT_NUM_CONCURRENT_TASKS_PER_INSTANCE;
}
/**
* Performs necessary cleanups (e.g. remove metrics) when the controller leadership changes.
*/
default void nonLeaderCleanUp() {
}
}
public interface PinotTaskExecutorFactory {
/**
* Initializes the task executor factory.
*/
void init(MinionTaskZkMetadataManager zkMetadataManager);
/**
* Returns the task type of the executor.
*/
String getTaskType();
/**
* Creates a new task executor.
*/
PinotTaskExecutor create();
}
public interface PinotTaskExecutor {
/**
* Executes the task based on the given task config and returns the execution result.
*/
Object executeTask(PinotTaskConfig pinotTaskConfig)
throws Exception;
/**
* Tries to cancel the task.
*/
void cancel();
}
public interface MinionEventObserverFactory {
/**
* Initializes the event observer factory.
*/
void init(MinionTaskZkMetadataManager zkMetadataManager);
/**
* Returns the task type of the event observer.
*/
String getTaskType();
/**
* Creates a new task event observer.
*/
MinionEventObserver create();
}
public interface MinionEventObserver {
/**
* Invoked when a minion task starts.
*
* @param pinotTaskConfig Pinot task config
*/
void notifyTaskStart(PinotTaskConfig pinotTaskConfig);
/**
* Invoked when a minion task succeeds.
*
* @param pinotTaskConfig Pinot task config
* @param executionResult Execution result
*/
void notifyTaskSuccess(PinotTaskConfig pinotTaskConfig, @Nullable Object executionResult);
/**
* Invoked when a minion task gets cancelled.
*
* @param pinotTaskConfig Pinot task config
*/
void notifyTaskCancelled(PinotTaskConfig pinotTaskConfig);
/**
* Invoked when a minion task encounters exception.
*
* @param pinotTaskConfig Pinot task config
* @param exception Exception encountered during execution
*/
void notifyTaskError(PinotTaskConfig pinotTaskConfig, Exception exception);
}
Using "POST /cluster/configs" API on CLUSTER tab in Swagger, with this payload
{
"RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
"RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
}
The underlying ingestion still works in LOWLEVEL mode, where:
transcript-topic1 segments would be named like transcript__0__0__20250101T0000Z
transcript-topic2 segments would be named like transcript__10000__0__20250101T0000Z
When the transform pipeline sets the $INCOMPLETE_RECORD_KEY$ key in the message, it is recorded as INCOMPLETE_REALTIME_ROWS_CONSUMED, but only when the continueOnError configuration is enabled. If continueOnError is not enabled, the ingestion fails.
Index stage -> When there is a failure at this stage, the ingestion typically stops and marks the partition as ERROR.
currentOffsetsMap: Current consuming offset position per partition.
latestUpstreamOffsetMap: (Wherever applicable) Latest offset found in the upstream topic partition.
recordsLagMap: (Wherever applicable) How far behind the current record's offset/pointer is from the latest upstream record. This is calculated as the difference between the latestUpstreamOffset and currentOffset for the partition when the lag computation request is made.
recordsAvailabilityLagMap: (Wherever applicable) How soon after being ingested upstream a record was consumed by Pinot. This is calculated as the difference between the time the record was consumed and the time at which the record was ingested upstream.
A consumption rate limiter is set up for topic <topic_name> in table <tableName> with rate limit: <rate_limit> (topic rate limit: <topic_rate_limit>, partition count: <partition_count>)
$ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
$ curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption
$ curl -X POST {controllerHost}/tables/{tableName}/pauseStatus
$ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=smallest
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=largest
# GET /tables/{tableName}/consumingSegmentsInfo
curl -X GET "http://<controller_url:controller_admin_port>/tables/meetupRsvp/consumingSegmentsInfo" -H "accept: application/json"
# GET /debug/tables/{tableName}
curl -X GET "http://localhost:9000/debug/tables/meetupRsvp?type=REALTIME&verbosity=1" -H "accept: application/json"
{
"_segmentToConsumingInfoMap": {
"meetupRsvp__0__0__20221019T0639Z": [
{
"serverName": "Server_192.168.0.103_7000",
"consumerState": "CONSUMING",
"lastConsumedTimestamp": 1666161593904,
"partitionToOffsetMap": { // <<-- Deprecated. See currentOffsetsMap for same info
"0": "6"
},
"partitionOffsetInfo": {
"currentOffsetsMap": {
"0": "6" // <-- Current consumer position
},
"latestUpstreamOffsetMap": {
"0": "6" // <-- Upstream latest position
},
"recordsLagMap": {
"0": "0" // <-- Lag, in terms of #records behind latest
},
"recordsAvailabilityLagMap": {
"0": "2" // <-- Lag, in terms of time
}
}
}
],
Text search support
This page talks about support for text search in Pinot.
This text index method is recommended over the experimental native text index.
Pinot supports super-fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index.
This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:
For arbitrary text data that falls into the BLOB/CLOB territory, we need more than exact matches. This often involves using regex, phrase, fuzzy queries on BLOB like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large BLOB of text using the TEXT_MATCH function, like this:
where <column_name> is the column the text index is created on and <search_expression> conforms to one of the following:
Current restrictions
Pinot supports text search with the following requirements:
The column type should be STRING.
The column should be single-valued.
Using a text index in coexistence with other Pinot indexes is not supported.
Sample Datasets
Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.
Apache Access Log
Consider the following snippet from an Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.
Let's say the following snippet of data is stored in the ACCESS_LOG_COL column in a Pinot table.
Here are some examples of search queries on this data:
Count the number of GET requests.
Count the number of POST requests that have administrator in the URL (administrator/index)
Count the number of POST requests that have a particular URL and handled by Firefox browser
Resume text
Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.
This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.
Here are some examples of search queries on this data:
Count the number of candidates that have "machine learning" and "gpu processing": This is a phrase search (more on this further in the document) where we are looking for exact match of phrases "machine learning" and "gpu processing", not necessarily in the same order in the original data.
Count the number of candidates that have "distributed systems" and either 'Java' or 'C++': This is a combination of searching for exact phrase "distributed systems" along with other terms.
Query Log
Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY_LOG_COL column in a Pinot table.
Here are some examples of search queries on this data:
Count the number of queries that have GROUP BY
Count the number of queries that have the SELECT count... pattern
Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY
Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.
A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.
The text index is an addition to the types of per-column indexes users can create in Pinot. However, a text index can only be created on a RAW column, not on a dictionary-encoded column.
Enable a text index
Enable a text index on a column in the table configuration by adding a new section with the name "fieldConfigList".
Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:
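A minimal sketch of both sections, assuming a text index on a column named SKILLS_COL:
{
  "fieldConfigList": [
    {
      "name": "SKILLS_COL",
      "encodingType": "RAW",
      "indexTypes": ["TEXT"]
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "SKILLS_COL"
    ]
  }
}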
You can configure text indexes in the following scenarios:
Adding a new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling a text index on an existing column.
When you're using a text index, add the indexed column to the noDictionaryColumns columns list to reduce unnecessary storage overhead.
For instructions on that configuration property, see the documentation.
Text index creation
Once the text index is enabled on one or more columns through the table configuration, segment generation code will automatically create the text index (per column).
Text index is supported for both offline and real-time segments.
Text parsing and tokenization
The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.
Pinot's text index is built on top of Lucene. Lucene's standard English text tokenizer generally works well for most classes of text. To suit particular user requirements, a custom text parser and tokenizer can be made configurable on a per-column text-index basis.
There is a default set of "stop words" built into Pinot's text index. This is a set of high-frequency English words that are excluded for search efficiency and index size, including:
Any occurrence of these words will be ignored by the tokenizer during index creation and search.
In some cases, users might want to customize this set. A good example is when IT (Information Technology) appears in the text and collides with the stop word "it", or when context-specific words are not informative for search. To do this, configure the words to include in, or exclude from, the default stop words in the fieldConfig:
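A sketch of such a fieldConfigList entry; the property names stopWordInclude and stopWordExclude are assumptions to verify against your Pinot version's text index reference:
{
  "fieldConfigList": [
    {
      "name": "text_col_1",          // illustrative column name
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "stopWordInclude": "incl1, incl2, incl3",  // extra words to treat as stop words
        "stopWordExclude": "it"                    // words to remove from the default stop-word set
      }
    }
  ]
}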
The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.
Writing text search queries
The TEXT_MATCH function enables using text search in SQL/PQL.
TEXT_MATCH(text_column_name, search_expression)
text_column_name - name of the column to do text search on.
search_expression - search query
You can use TEXT_MATCH function as part of queries in the WHERE clause, like this:
You can also use the TEXT_MATCH filter clause with other filter operators. For example:
You can combine multiple TEXT_MATCH filter clauses:
TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot.
Selection query which projects one or more columns
Users can also include the text column name in the select list.
Aggregation query
The search expression (the second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.
Phrase query
This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.
The following example reuses the earlier resume text data containing 14 documents to walk through queries. In this context, "document" means the column value. The data is stored in the SKILLS_COL column and we have created a text index on this column.
This example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "Distributed systems":
The search expression is '\"Distributed systems\"'
The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
'\"<your phrase>\"'
The above query will match the following documents:
But it won't match the following document:
This is because the phrase query looks for the phrase occurring in the original document "as is". The terms specified by the user in the phrase must appear in exactly the same order in the original document for it to be considered a match.
NOTE: Matching is always done in a case-insensitive manner.
The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "query processing":
The above query will match the following documents:
Term query
Term queries are used to search for individual terms.
This example will query the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.
As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.
Composite query using Boolean operators
The Boolean operators AND and OR are supported and can be used to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner.
This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrases "machine learning" and "tensor flow". This combines two phrases using the AND Boolean operator.
The above query will match the following documents:
This example queries the SKILLS_COL column to look for documents where each document MUST contain the phrase "machine learning" and the terms 'gpu' and 'python'. This combines a phrase and two terms using Boolean operators.
The above query will match the following documents:
When using Boolean operators to combine term(s) and phrase(s) or both, note that:
The matching document can contain the terms and phrases in any order.
The matching document may not have the terms adjacent to each other (if this is needed, use appropriate phrase query).
Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using AND operator in the search expression, the OR operator is used by default:
This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:
phrase "distributed systems" OR
term 'java' OR
term 'C++'.
Grouping using parentheses is supported:
This example queries the SKILLS_COL column to look for documents where each document MUST contain
phrase "distributed systems" AND
at least one of the terms Java or C++
Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND is used to combine this with phrase "distributed systems"
Prefix query
Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.
This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams etc
The above query will match the following documents:
Regular Expression Query
Phrase and term queries work on the fundamental logic of looking up the terms in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized, and individual "indexable" terms are extracted. These terms are inserted into the index.
Based on the nature of the original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.
Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.
Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).
The above query will match any text document containing "exception".
Phrase search with wildcard term matching
Phrase search with wildcard and prefix term matching can match a pattern like "pache pino" against the text "Apache Pinot" directly. This kind of query is very common in use cases like log search, where users need to search for substrings across term boundaries in long text. To enable such searches (which can be more costly because, by default, Lucene does not allow * at the start of a pattern in order to avoid expensive term matching), add a new config key to the column's text index config:
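A sketch of the field config property expected to enable this; the key name enablePrefixSuffixMatchingInPhraseQueries is an assumption to verify against your Pinot version:
{
  "fieldConfigList": [
    {
      "name": "SKILLS_COL",
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "enablePrefixSuffixMatchingInPhraseQueries": "true"  // assumed key name
      }
    }
  ]
}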
With this config enabled, you can perform a phrase wildcard search using syntax like the following
to match the string "Apache Pinot" in SKILLS_COL. Boolean expressions like 'pache pino AND apche luce' are also supported.
Deciding Query Types
Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.
The key thing to remember is that phrases should be used when the order of terms in the document is important and when separating the phrase into individual terms doesn't make sense from end user's perspective.
An example would be phrase "machine learning".
However, if we are searching for documents matching the terms Java and C++, using the phrase query "Java C++" will actually return partial results (possibly empty), since we are now relying on the user having specified these skills in exactly the same order (adjacent to each other) in the resume text.
A term query using the Boolean AND operator is more appropriate for such cases.
Text Index Tuning
To improve Lucene index creation time, some configs have been provided. The field config properties luceneUseCompoundFile and luceneMaxBufferSizeMB can provide faster index writing but may increase file descriptor usage and/or memory pressure.
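For example, these properties might be set on the text-indexed column's field config roughly as follows (the values shown are illustrative, not recommendations):
{
  "fieldConfigList": [
    {
      "name": "text_col_1",          // illustrative column name
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "luceneUseCompoundFile": "false",  // skip compound-file packing for faster writes
        "luceneMaxBufferSizeMB": "128"     // larger in-memory buffer before flushing
      }
    }
  ]
}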
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'GET')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index AND firefox')
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "gpu processing"')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1545436800000 AND 1553212800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1537228800000 AND 1537660800000 GROUP BY dimensionCol3 TOP 2500
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1561366800000 AND 1561370399999 AND dimensionCol3 = 2019062409 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563807600000 AND 1563811199999 AND dimensionCol3 = 2019072215 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563811200000 AND 1563814799999 AND dimensionCol3 = 2019072216 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1566327600000 AND 1566329400000 AND dimensionCol3 = 2019082019 LIMIT 10000
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560834000000 AND 1560837599999 AND dimensionCol3 = 2019061805 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560870000000 AND 1560871800000 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560871800001 AND 1560873599999 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560873600000 AND 1560877199999 AND dimensionCol3 = 2019061816 LIMIT 0
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(QUERY_LOG_COL, '"group by"')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(QUERY_LOG_COL, '"select count"')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(QUERY_LOG_COL, '"timestamp between" AND "group by"')
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...)
SELECT * FROM Foo WHERE TEXT_MATCH(...)
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000 AND some_other_column_2 < 100000
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(text_col_1, ....) AND TEXT_MATCH(text_col_2, ...)
Java, C++, worked on open source projects, coursera machine learning
Machine learning, Tensor flow, Java, Stanford university,
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Distributed systems"')
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
Distributed data processing, systems design experience
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"query processing"')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, 'Java')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "Tensor Flow"')
Machine learning, Tensor flow, Java, Stanford university,
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND gpu AND python')
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" Java C++')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, 'stream*')
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
SELECT SKILLS_COL
FROM MyTable
WHERE text_match(SKILLS_COL, '/.*Exception/')
SELECT SKILLS_COL
FROM MyTable
WHERE text_match(SKILLS_COL, '*pache pino*')
TEXT_MATCH(column, '"machine learning"')
TEXT_MATCH(column, '"Java C++"')
TEXT_MATCH(column, 'Java AND C++')
JSON index
This page describes configuring the JSON index for Apache Pinot.
The JSON index can be applied to JSON string columns to accelerate value lookups and filtering for the column.
When to use JSON index
JSON strings can be used to represent arrays, maps, and nested fields without forcing a fixed schema. While JSON strings are flexible, filtering on JSON string columns is expensive, so consider the use case.
Suppose we have some JSON records similar to the following sample record stored in the person column:
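An illustrative record consistent with the queries used later on this page (only the field paths matter here; the specific values are assumptions):
{
  "name": "adam",
  "age": 30,
  "phone": "111-111-1111",
  "skills": ["english", "programming"],
  "addresses": [
    {"number": 112, "street": "main st", "country": "us"},
    {"number": 2, "street": "second st", "country": "ca"}
  ]
}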
Without an index, to look up the key and filter records based on the value, Pinot must scan and reconstruct the JSON object from the JSON string for every record, look up the key and then compare the value.
For example, in order to find all persons whose name is "adam", the query will look like:
The JSON index is designed to accelerate the filtering on JSON string columns without scanning and reconstructing all the JSON objects.
Enable and configure a JSON index
To enable the JSON index, you can configure the following options in the table configuration:
Recommended way to configure
The recommended way to configure a JSON index is in the fieldConfigList.indexes object, within the json key.
All options are optional, so the following is a valid configuration that uses the default parameter values:
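A minimal sketch, assuming the JSON column is named person; the empty json object keeps every parameter at its default:
{
  "fieldConfigList": [
    {
      "name": "person",        // JSON string column to index
      "indexes": {
        "json": {}             // all JSON index options left at their defaults
      }
    }
  ],
  ...
}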
Deprecated ways to configure JSON indexes
There are two older, deprecated ways to configure the index, both in the tableIndexConfig section of the table config.
The first uses the same JSON options explained above, but defined under tableIndexConfig.jsonIndexConfigs.<column name>; as in the previous case, all parameters are optional.
The second does not support configuring any parameters: simply add the name of the column to tableIndexConfig.jsonIndexColumns, as in this example:
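Sketches of the two deprecated forms for the same person column; the first nests JSON index options under jsonIndexConfigs, the second simply lists the column:
{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "person": {
        "maxLevels": 2           // illustrative option; any of the options above may be set
      }
    }
  }
}
{
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "person"
    ]
  }
}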
Example:
With the following JSON document:
Using the default setting, we will flatten the document into the following records:
With maxValueLength set to 9:
With maxLevels set to 1:
With maxLevels set to 2:
With excludeArray set to true:
With disableCrossArrayUnnest set to true:
When cross-array un-nesting is disabled, the number of documents produced during JSON flattening is the sum of all array sizes, e.g. 2+2 = 4 in the example above.
With disableCrossArrayUnnest set to false:
When cross-array un-nesting is enabled, the number of documents produced during JSON flattening is the product of all array sizes, e.g. 2*2 = 4 in the example above. If the JSON contains multiple large nested arrays, it might be necessary to disable cross-array un-nesting (disableCrossArrayUnnest=true) to avoid hitting the 100k flattened-document limit and triggering the 'Got too many combinations' error.
With includePaths set to ["$.name", "$.addresses[*].country"]:
With excludePaths set to ["$.age", "$.addresses[*].number"]:
With excludeFields set to ["age", "street"]:
With indexPaths set to ["*", "address..country"]:
With skipInvalidJson set to true, if we corrupt the original JSON, e.g. to
then flattening will produce:
Note that the JSON index can only be applied to STRING/JSON columns whose values are JSON strings.
To reduce unnecessary storage overhead when using a JSON index, we recommend that you add the indexed column to the noDictionaryColumns columns list.
For instructions on that configuration property, see the documentation.
How to use the JSON index
The JSON index can be used via the JSON_MATCH predicate for filtering: JSON_MATCH(<column>, '<filterExpression>'). For example, to find every entry with the name "adam":
Note that the quotes within the filter expression need to be escaped.
The JSON index can also be used via the JSON_EXTRACT_INDEX function for value extraction (optionally with filtering): JSON_EXTRACT_INDEX(<column>, '<jsonPath>', ['resultsType'], ['filter']). For example, to extract every value for path $.name when the path $.id is less than 10:
More in-depth examples can be found in the function reference.
Supported filter expressions
Simple key lookup
Find all persons whose name is "adam":
Chained key lookup
Find all persons who have an address (one of the addresses) with number 112:
Find all persons who have at least one address that is not in the US:
Regex based lookup
Find all persons who have an address (one of the addresses) where the street contains the term 'st':
Range lookup
Find all persons whose age is greater than 18:
Nested filter expression
Find all persons whose name is "adam" and also have an address (one of the addresses) with number 112:
NOT IN and != can't be used in nested filter expressions in Pinot versions older than 1.2.0. Note that IS NULL cannot be used in nested filter expressions currently.
Array access
Find all persons whose first address has number 112:
Since the JSON index works on flattened JSON documents, if cross-array un-nesting is disabled (disableCrossArrayUnnest = true), then querying more than one array in a single JSON_MATCH function call returns an empty result, e.g.
In such cases, the expression should be split into multiple JSON_MATCH calls, e.g.
Existence check
Find all persons who have a phone field within the JSON:
Find all persons whose first address does not contain floor field within the JSON:
JSON context is maintained
The JSON context is maintained for object elements within an array, meaning the filter won't cross-match different objects in the array.
To find all persons who live on "main st" in "ca":
This query won't match "adam" because none of his addresses matches both the street and the country.
If you don't want JSON context, use multiple separate JSON_MATCH predicates. For example, to find all persons who have addresses on "main st" and have addresses in "ca" (matches need not have the same address):
This query will match "adam" because one of his addresses matches the street and another one matches the country.
The array index is maintained as a separate entry within the element, so in order to query different elements within an array, multiple JSON_MATCH predicates are required. For example, to find all persons who have first address on "main st" and second address on "second st":
Supported JSON values
Object
See examples above.
Array
To find the records with array element "item1" in "arrayCol":
To find the records with second array element "item2" in "arrayCol":
Value
To find the records with value 123 in "valueCol":
Null
To find the records with null in "nullableCol":
Limitations
The key (left-hand side) of the filter expression must be the leaf level of the JSON object, for example, "$.addresses[*]"='main st' won't work.
0.9.0
Summary
This release introduces a new feature, Segment Merge and Rollup, to simplify users' day-to-day operational work. A new metrics plugin is added to support Dropwizard. As usual, there is new functionality and many UI and performance improvements.
The release was cut from the following commit: 13c9ee9 and the following cherry-picks: 668b5e0, ee887b9
Support Segment Merge and Roll-up
LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This was leading to slow operations in Helix/ZooKeeper, long-running queries due to having too many tasks to process, and greater space usage because of a lack of compression.
To solve this problem, they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.
At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.
Major Changes:
Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor ()
Merge/Rollup task scheduler for offline tables. ()
Fix MergeRollupTask uploading segments not updating their metadata ()
UI Improvement
This release also sees improvements to Pinot’s query console UI.
Cmd+Enter shortcut to run query in query console ()
Showing tooltip in SQL Editor ()
Make the SQL Editor box expandable ()
SQL Improvements
There have also been improvements and additions to Pinot’s SQL implementation.
New functions:
IN ()
LASTWITHTIME ()
ID_SET on MV columns ()
New predicates are supported:
LIKE()
REGEXP_EXTRACT()
FILTER()
Query compatibility improvements:
Infer data type for Literal ()
Support logical identifier in predicate ()
Support JSON queries with top-level array path expression. ()
Performance Improvements
This release contains many performance improvements that you may notice in your day-to-day queries. Thanks to all the great contributions listed below:
Reduce the disk usage for segment conversion task ()
Simplify association between Java Class and PinotDataType for faster mapping ()
Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation ()
Other Notable New Features and Changes
Human Readable Controller Configs ()
Add the support of geoToH3 function ()
Add Apache Pulsar as Pinot Plugin () ()
Major Bug fixes
Fix null pointer exception for non-existent metric columns in schema for JDBC driver ()
Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD ()
includePaths (Set<String>, default: null, i.e. include all paths): Only include the given paths, e.g. "$.a.b", "$.a.c[*]" (mutually exclusive with excludePaths). Paths under the included paths will be included, e.g. "$.a.b.c" will be included when "$.a.b" is configured to be included.
excludePaths (Set<String>, default: null, i.e. include all paths): Exclude the given paths, e.g. "$.a.b", "$.a.c[*]" (mutually exclusive with includePaths). Paths under the excluded paths will also be excluded, e.g. "$.a.b.c" will be excluded when "$.a.b" is configured to be excluded.
excludeFields (Set<String>, default: null, i.e. include all fields): Exclude the given fields, e.g. "b", "c", even if they are under the included paths.
indexPaths (Set<String>, default: null, which is equivalent to **, i.e. index all fields): Index the given paths, e.g. *.*, a.**. Paths matching the indexed paths will be indexed, e.g. a.** will index everything whose first layer is "a", and *.* will index everything with maxLevels=2. This config can work together with other configs, e.g. includePaths, excludePaths, and maxLevels, but usually does not have to, because it should be flexible enough to cover any scenario.
maxValueLength (int, default: 0, i.e. disabled): If the value of a JSON node (not the whole document) is longer than the given value, replace it with $SKIPPED$ before indexing.
skipInvalidJson (boolean, default: false, i.e. disabled): If set, while adding JSON to the index, instead of throwing an exception, replace ill-formed JSON with an empty key/path and a $SKIPPED$ value.
maxLevels (int, default: -1, i.e. unlimited): Max levels to flatten the JSON object (an array is also counted as one level).
excludeArray (boolean, default: false, i.e. include arrays): Whether to exclude arrays when flattening the object.
disableCrossArrayUnnest (boolean): Whether to not unnest multiple arrays (unique combinations of all elements in those arrays). If the document contains two arrays holding, respectively, M and N elements, then flattening produces M*N documents. If the number of such combinations reaches 100k, an error with the "Got too many combinations" message is thrown.
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')
SELECT jsonextractindex(repo, '$.name', 'STRING', 'dummyValue', '"$.id" < 10')
FROM mytable
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].number"=112')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country" != ''us''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, 'REGEXP_LIKE("$.addresses[*].street", ''.*st.*'')')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.age" > 18')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam'' AND "$.addresses[*].number"=112')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].number"=112')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us'' AND "$.skills[*]"=''english''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us''')
AND JSON_MATCH(person, '"$.skills[*]"=''english''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.phone" IS NOT NULL')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].floor" IS NULL')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st'' AND "$.addresses[*].country"=''ca''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st''')
AND JSON_MATCH(person, '"$.addresses[*].country"=''ca''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].street"=''main st''')
AND JSON_MATCH(person, '"$.addresses[1].street"=''second st''')
["item1", "item2", "item3"]
SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[*]"=''item1''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[1]"=''item2''')
123
1.23
"Hello World"
SELECT ...
FROM mytable
WHERE JSON_MATCH(valueCol, '"$"=123')
null
SELECT ...
FROM mytable
WHERE JSON_MATCH(nullableCol, '"$" IS NULL')
Star-tree index
This page describes the indexing techniques available in Apache Pinot.
In this page you will learn what a star-tree index is and gain a conceptual understanding of how one works.
Unlike other index techniques which work on a single column, the star-tree index is built on multiple columns and utilizes pre-aggregated results to significantly reduce the number of values to be processed, resulting in improved query performance.
One of the biggest challenges in real-time OLAP systems is achieving and maintaining tight SLAs on latency and throughput on large data sets. Existing techniques such as sorted indexes or inverted indexes help improve query latencies, but speed-ups are still limited by the number of documents that need to be processed to compute results. On the other hand, pre-aggregating the results ensures a constant upper bound on query latencies, but can lead to storage space explosion.
Use the star-tree index to utilize pre-aggregated documents to achieve both low query latencies and efficient use of storage space for aggregation and group-by queries.
Existing solutions
Consider the following data set, which is used here as an example to discuss these indexes:
Country   Browser   Locale   Impressions
CA        Chrome    en       400
CA        Firefox   fr       200
MX        Safari    es
Sorted index
In this approach, data is sorted on a primary key, which is likely to appear as filter in most queries in the query set.
This reduces the time to search the documents for a given primary key value from linear scan O(n) to binary search O(logn), and also keeps good locality for the documents selected.
While this is a significant improvement over linear scan, there are still a few issues with this approach:
While sorting on one column does not require additional space, sorting on additional columns requires additional storage space to re-index the records for the various sort orders.
While search time is reduced from O(n) to O(logn), overall latency is still a function of the total number of documents that need to be processed to answer a query.
Inverted index
In this approach, for each value of a given column, we maintain a list of document id’s where this value appears.
Below are the inverted indexes for columns ‘Browser’ and ‘Locale’ for our example data set:
| Browser | Doc Id |
| --- | --- |
| Firefox | 1,5,6 |
| Chrome | 0,4 |
| Safari | 2,3 |

| Locale | Doc Id |
| --- | --- |
| en | 0,3,4,6 |
| es | 2,5 |
| fr | 1 |
For example, if we want to get all the documents where ‘Browser’ is ‘Firefox’, we can look up the inverted index for ‘Browser’ and identify that it appears in documents [1, 5, 6].
Using an inverted index, we can reduce the search time to constant time O(1). The query latency, however, is still a function of the selectivity of the query: it increases with the number of documents that need to be processed to answer the query.
Pre-aggregation
In this technique, we pre-compute the answer for a given query set upfront.
In the example below, we have pre-aggregated the total impressions for each country:
| Country | Impressions |
| --- | --- |
| CA | 600 |
| MX | 400 |
| USA | 1200 |
With this approach, answering queries about total impressions for a country becomes a simple value lookup, because we have eliminated the need to process a large number of documents. However, answering queries with multiple predicates requires pre-aggregating for various combinations of different dimensions, which leads to an exponential increase in storage space.
Star-tree solution
On one end of the spectrum we have indexing techniques that improve search times with a limited increase in space, but don't guarantee a hard upper bound on query latencies. On the other end of the spectrum, we have pre-aggregation techniques that offer a hard upper bound on query latencies, but suffer from an exponential explosion of storage space.
The star-tree data structure offers a configurable trade-off between space and time and lets us achieve a hard upper bound for query latencies for a given use case. The following sections cover the star-tree data structure, and explain how Pinot uses this structure to achieve low latencies with high throughput.
Definitions
Tree structure
The star-tree index stores data in a structure that consists of the following properties:
Star-tree index structure
Root node (Orange): Single root node, from which the rest of the tree can be traversed.
Leaf node (Blue): A leaf node can contain at most T records, where T is configurable.
Non-leaf node (Green): Nodes with more than T records are further split into children nodes.
Star node (Yellow): Non-leaf nodes can also have a special child node called the star node. This node contains the pre-aggregated records after removing the dimension on which the data was split for this level.
Dimensions split order ([D1, D2]): Nodes at a given level in the tree are split into children nodes on all values of a particular dimension. The dimensions split order is an ordered list of dimensions that is used to determine the dimension to split on for a given level in the tree.
Node properties
The properties stored in each node are as follows:
Dimension: The dimension that the node is split on
Start/End Document Id: The range of documents this node points to
Aggregated Document Id: One single document that is the aggregation result of all documents pointed by this node
Index generation
The star-tree index is generated in the following steps:
The data is first projected as per the dimensionsSplitOrder. Only the dimensions from the split order are retained; the others are dropped. For each unique combination of retained dimensions, metrics are aggregated per the configuration. The aggregated documents are written to a file and serve as the initial star-tree documents (separate from the original documents).
Sort the star-tree documents based on the dimensionsSplitOrder. It is primary-sorted on the first dimension in this list, and then secondary sorted on the rest of the dimensions based on their order in the list. Each node in the tree points to a range in the sorted documents.
The tree structure can be created recursively (starting at root node) as follows:
If a node has more than T records, it is split into multiple children nodes, one for each value of the dimension in the split order corresponding to current level in the tree.
A star node can be created (per configuration) for the current node, by dropping the dimension being split on, and aggregating the metrics for rows containing dimensions with identical values. These aggregated documents are appended to the end of the star-tree documents.
If there is only one value for the current dimension, a star node won’t be created because the documents under the star node would be identical to those under the single child node.
The above step is repeated recursively until there are no more nodes to split.
Multiple star-trees can be generated based on different configurations (dimensionsSplitOrder, aggregations, T)
Aggregation
Aggregation is configured as a pair of aggregation functions and the column to apply the aggregation.
All aggregation functions that have a bounded-size intermediate result are supported.
Supported functions
COUNT
MIN
MAX
SUM
SUM_PRECISION
The maximum precision can be optionally configured in functionParameters using the key precision. For example: {"precision": 20}.
AVG
MIN_MAX_RANGE
PERCENTILE_EST
PERCENTILE_RAW_EST
PERCENTILE_TDIGEST
The compression factor for the TDigest histogram can be optionally configured in functionParameters using the key compressionFactor. For example: {"compressionFactor": 200}. If not configured, the default value of 100 will be used.
PERCENTILE_RAW_TDIGEST
The compression factor for the TDigest histogram can be optionally configured in functionParameters using the key compressionFactor. For example: {"compressionFactor": 200}. If not configured, the default value of 100 will be used.
DISTINCT_COUNT_BITMAP
NOTE: The intermediate result RoaringBitmap is not bounded in size; use carefully on high-cardinality columns.
DISTINCT_COUNT_HLL
The log2m value for the HyperLogLog structure can be optionally configured in functionParameters , for example: {"log2m": 16}. If not configured, the default value of 8 will be used. Remember that a larger log2m value leads to better accuracy but also a larger memory footprint.
DISTINCT_COUNT_RAW_HLL
The log2m value for the HyperLogLog structure can be optionally configured in functionParameters , for example: {"log2m": 16}. If not configured, the default value of 8 will be used. Remember that a larger log2m value leads to better accuracy but also a larger memory footprint.
DISTINCT_COUNT_HLL_PLUS
The p (precision value of normal set) and sp (precision value of sparse set) values for the HyperLogLogPlus structure can be optionally configured in functionParameters, for example: {"p": 16, "sp": 32}. If not configured, p will have the default value of 14
DISTINCT_COUNT_RAW_HLL_PLUS
The p (precision value of normal set) and sp (precision value of sparse set) values for the HyperLogLogPlus structure can be optionally configured in functionParameters, for example: {"p": 16, "sp": 32}. If not configured, p will have the default value of 14
DISTINCT_COUNT_THETA_SKETCH
The nominalEntries value for the Theta Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_RAW_THETA_SKETCH
The nominalEntries value for the Theta Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
SUM_VALUES_INTEGER_SUM_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
AVG_VALUE_INTEGER_SUM_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_CPC_SKETCH
The lgK value for the CPC Sketch can be optionally configured in functionParameters, for example: {"lgK": 13}. If not configured, the default value of 12 will be used. Note that the nominalEntries provided at query time should be 2 ^ lgK in order for a star-tree index to be used. For instance, a star-tree index built with {"lgK": 13} can only serve queries specifying nominalEntries of 8192 (2^13).
DISTINCT_COUNT_RAW_CPC_SKETCH
DISTINCT_COUNT_ULL
The p value (precision parameter) for the UltraLogLog structure can be optionally configured in functionParameters, for example: {"p": 20}. If not configured, the default value of 12 will be used.
DISTINCT_COUNT_RAW_ULL
The p value (precision parameter) for the UltraLogLog structure can be optionally configured in functionParameters, for example: {"p": 20}. If not configured, the default value of 12 will be used.
Unsupported functions
DISTINCT_COUNT
Intermediate result Set is unbounded.
SEGMENT_PARTITIONED_DISTINCT_COUNT
Intermediate result Set is unbounded.
PERCENTILE
Intermediate result List is unbounded.
Functions to be supported
ST_UNION
Index generation configuration
Multiple index generation configurations can be provided to generate multiple star-trees. Each configuration should contain the following properties:
Property
Description
dimensionsSplitOrder
An ordered list of dimension names can be specified to configure the split order. Only the dimensions in this list are reserved in the aggregated documents. The nodes will be split based on the order of this list. For example, split at level i is performed on the values of dimension at index i in the list.
- The star-tree dimension does not have to be a dimension column in the table, it can also be time column, date-time column, or metric column if necessary.
- The star-tree dimension column should be dictionary encoded in order to generate the star-tree index.
- All columns in the filter and group-by clause of a query should be included in this list in order to use the star-tree index.
skipStarNodeCreationForDimensions
(Optional, default empty): A list of dimension names for which to not create the Star-Node.
functionColumnPairs
A list of aggregation function and column pairs (split by double underscore “__”). E.g. SUM__Impressions (SUM of column Impressions) or COUNT__*.
aggregationConfigs
See the AggregationConfigs section below.
maxLeafRecords
(Optional, default 10000): The threshold T to determine whether to further split each node.
`functionColumnPairs` and `aggregationConfigs` are interchangeable. Consider using `aggregationConfigs` since it supports additional parameters like compression.
AggregationConfigs
All aggregations of a query should be included in `aggregationConfigs` or in `functionColumnPairs` in order to use the star-tree index.
| Property | Description |
| --- | --- |
| columnName | (Required) Name of the column to aggregate. The column can be either dictionary encoded or raw. |
| aggregationFunction | (Required) Name of the aggregation function to use. |
| compressionCodec | (Optional, default PASS_THROUGH, introduced in release 1.1.0) Used to configure the compression enabled on the star-tree index. Useful when aggregating on columns that contain big values, for example a BYTES column containing serialized HLL counters used to calculate DISTINCTCOUNTHLL. In this case setting "compressionCodec": "LZ4" can significantly reduce the space used by the index. Equivalent to compressionCodec in the forward index config. |
| deriveNumDocsPerChunk | (Optional, introduced in release 1.2.0) Equivalent to deriveNumDocsPerChunk in the forward index config. |
| indexVersion | (Optional, introduced in release 1.2.0) Equivalent to rawIndexWriterVersion in the forward index config. |
| targetMaxChunkSize | (Optional, introduced in release 1.2.0) Equivalent to targetMaxChunkSize in the forward index config. |
| targetDocsPerChunk | (Optional, introduced in release 1.2.0) Equivalent to targetDocsPerChunk in the forward index config. |
| functionParameters | (Optional) A configuration map used to pass in additional configurations to the aggregation function. For example, on DISTINCTCOUNTHLL, this could look like {"log2m": 16} in order to build the star-tree index using DISTINCTCOUNTHLL with a non-default value for log2m. Note that the index will only be used for queries using the same value for log2m with DISTINCTCOUNTHLL. |
Default index generation configuration
A default star-tree index can be added to a segment by using the boolean config enableDefaultStarTree under the tableIndexConfig.
A default star-tree will have the following configuration:
All dictionary-encoded single-value dimensions with cardinality smaller or equal to a threshold (10000) will be included in the dimensionsSplitOrder, sorted by their cardinality in descending order.
All dictionary-encoded Time/DateTime columns will be appended to the dimensionsSplitOrder following the dimensions, sorted by their cardinality in descending order. Here we assume that time columns will be included in most queries as the range filter column and/or the group by column, so for better performance, we always include them as the last elements in the dimensionsSplitOrder.
Include COUNT(*) and SUM for all numeric metrics in the functionColumnPairs.
Use default maxLeafRecords (10000).
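As a minimal sketch, enabling the default star-tree in the table config might look like the following; the defaults described above are then derived automatically:

```json
{
  "tableIndexConfig": {
    "enableDefaultStarTree": true
  }
}
```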
Example
For our example data set, in order to efficiently answer aggregation queries that filter or group on Country, Browser, and Locale and sum up Impressions, we may configure a star-tree index on those columns.
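A sketch of such a configuration, listed under tableIndexConfig.starTreeIndexConfigs and using the properties introduced above; the dimension names come from the example data set, so treat it as illustrative rather than the exact original example:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["Country", "Browser", "Locale"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["SUM__Impressions"],
        "maxLeafRecords": 1
      }
    ]
  }
}
```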
Alternatively, you can use aggregationConfigs instead of functionColumnPairs and enable compression on the aggregation:
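A corresponding sketch with LZ4 compression enabled on the aggregated column (again illustrative, using the aggregationConfigs properties described above):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["Country", "Browser", "Locale"],
        "skipStarNodeCreationForDimensions": [],
        "aggregationConfigs": [
          {
            "columnName": "Impressions",
            "aggregationFunction": "SUM",
            "compressionCodec": "LZ4"
          }
        ],
        "maxLeafRecords": 1
      }
    ]
  }
}
```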
Note: In the example configs above, maxLeafRecords is set to 1 so that all of the dimension combinations are pre-aggregated, for clarity in the visual below.
The resulting star-tree and documents will look something like the following:
Tree structure
The values in the parentheses are the aggregated sum of Impressions for all the documents under the node.
Star-tree documents
| Country | Browser | Locale | SUM__Impressions |
| --- | --- | --- | --- |
| CA | Chrome | en | 400 |
| CA | Firefox | fr | 200 |
| MX | Safari | en | |
Query execution
For query execution, the idea is to first check metadata to determine whether the query can be solved with the star-tree documents, then traverse the Star-Tree to identify documents that satisfy all the predicates. After applying any remaining predicates that were missed while traversing the star-tree to the identified documents, apply aggregation/group-by on the qualified documents.
The algorithm to traverse the tree can be described as follows:
Start from root node.
For each level, what child node(s) to select depends on whether there are any predicates/group-by on the split dimension for the level in the query.
If there is no predicate or group-by on the split dimension, select the Star-Node if one exists, or all child nodes to traverse further.
If there are predicate(s) on the split dimension, select the child node(s) that satisfy the predicate(s).
If there is no predicate, but there is a group-by on the split dimension, select all child nodes except Star-Node.
Recursively repeat the previous step until all leaf nodes are reached, or all predicates are satisfied.
Collect all the documents pointed by the selected nodes.
If all predicates and group-by's are satisfied, pick the single aggregated document from each selected node.
Otherwise, collect all the documents in the document range from each selected node.
Predicates
Supported Predicates
EQ (=)
NOT EQ (!=)
IN
NOT IN
RANGE (>, >=, <, <=, BETWEEN)
AND
Unsupported Predicates
REGEXP_LIKE: It is intentionally left unsupported because it requires scanning the entire dictionary.
IS NULL: Currently NULL value info is not stored in star-tree index, and the dimension will be indexed as default value. A workaround is to do col = <default> instead.
IS NOT NULL: Same as IS NULL. A workaround is to do col != <default>.
Limited Support Predicates
OR
It can be applied to predicates on the same dimension, e.g. WHERE d1 < 10 OR d1 > 50
It CANNOT be applied to predicates on multiple dimensions, because the star-tree index would double count with the pre-aggregated results.
NOT (Added since 1.2.0)
It can be applied to a simple predicate, or to another NOT.
It CANNOT be applied on top of AND/OR, because the star-tree index would double count with the pre-aggregated results.
If a transform is applied to a column that is in the dimension split order (which should include all columns used as a predicate or group-by column in the target queries) and the transformed expression is used in a group-by, then the star-tree index is applied automatically. If the transform is applied to a column used in a predicate (WHERE clause), then the star-tree index won't apply.
For example, if a query contains round(colA, 600) AS roundedValue ... GROUP BY roundedValue and colA is included in dimensionsSplitOrder, then Pinot will use the pre-aggregated records to first scan the matching records and then apply the round() transform to derive roundedValue.
Apache Pinot 0.11.0 introduces many new features that extend its query capabilities: the multi-stage query engine enables Pinot to do distributed joins, and more SQL syntax (DML support), query functions, and indexes (text index, timestamp index) are supported for new use cases. As always, there are more integrations with other systems (e.g., Spark 3, Flink).
Note: there is a major upgrade of Apache Helix to 1.0.4, so make sure you upgrade the system in the following order:
The new multi-stage query engine (a.k.a V2 query engine) is designed to support more complex SQL semantics such as JOIN, OVER window, MATCH_RECOGNIZE and eventually, make Pinot support closer to full ANSI SQL semantics.
More to read:
Pause Stream Consumption on Apache Pinot
Pinot operators can pause real-time consumption of events while queries are being executed, and then resume consumption when ready to do so again.
More to read:
Gap-filling function
The gapfilling functions allow users to interpolate data and perform powerful aggregations and data processing over time series data.
More to read:
Add support for Spark 3.x ()
A long-awaited feature: segment generation on Spark 3.x.
Add Flink Pinot connector ()
Similar to the Spark Pinot connector, this allows Flink users to dump data from the Flink application to Pinot.
Show running queries and cancel query by id ()
This feature allows finer-grained control over Pinot queries.
Timestamp Index ()
This allows users to have better query performance on the timestamp column for lower granularity. See:
Native Text Indices ()
Want to search text in real time? The new text indexing engine in Pinot supports the following capabilities:
New operator: LIKE
New operator: CONTAINS
Native text index, built from the ground up, focusing on Pinot’s time series use cases and utilizing existing Pinot indices and structures (inverted index, bitmap storage).
Real Time Text Index
Read more:
Adding DML definition and parse SQL InsertFile ()
Now you can use INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]* to load data into Pinot from a file using Minion. See:
Deduplication ()
This feature supports enabling deduplication for real-time tables, via a top-level table config. At a high level, primaryKey (as defined in the table schema) hashes are stored into in-memory data structures, and each incoming row is validated against it. Duplicate rows are dropped.
The expectation while using this feature is for the stream to be partitioned by the primary key, strictReplicaGroup routing to be enabled, and the configured stream consumer type to be low level. These requirements are therefore mandated via table config API's input validations.
Functions support and changes:
Add support for functions arrayConcatLong, arrayConcatFloat, arrayConcatDouble ()
Add support for regexpReplace scalar function ()
Add support for Base64 Encode/Decode Scalar Functions ()
The full list of features introduced in this release
add query cancel APIs on controller backed by those on brokers ()
Add an option to search input files recursively in ingestion job. The default is set to true to be backward compatible. ()
Adding endpoint to download local log files for each component ()
Vulnerability fixes
Pinot has resolved all the high-severity vulnerability issues:
Add a new workflow to check vulnerabilities using trivy ()
Disable Groovy function by default ()
Upgrade netty due to security vulnerability ()
Bug fixes
Nested arrays and map not handled correctly for complex types ()
Fix empty data block not returning schema ()
Allow mvn build with development webpack; fix instances default value ()
Optimize LIKE to regexp conversion to not include unnecessary ^.* and .*$ (#8893)
Stream ingestion with Upsert
Upsert support in Apache Pinot.
Pinot provides native support of upserts during real-time ingestion. There are scenarios where records need modifications, such as correcting a ride fare or updating a delivery status.
Partial upserts are convenient as you only need to specify the columns where values change, and you ignore the rest.
Overview of upserts in Pinot
See an overview of how upserts work in Pinot 1.0.
To update a record, you need a primary key to uniquely identify the record. To define a primary key, add the field primaryKeyColumns to the schema definition. For example, the schema definition of UpsertMeetupRSVP in the quick start example has this definition.
Note this field expects a list of columns, as the primary key can be a composite.
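The schema snippet itself is not reproduced above; as a minimal sketch (assuming the quick start's event_id column serves as the primary key), the relevant part of the schema looks like:

```json
{
  "primaryKeyColumns": ["event_id"]
}
```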
When two records of the same primary key are ingested, the record with the greater comparison value (timeColumn by default) is used. When records have the same primary key and event time, then the order is not determined. In most cases, the later ingested record will be used, but this may not be true in cases where the table has a column to sort by.
Partition the input stream by the primary key
An important requirement for the Pinot upsert table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the send API. If the original stream is not partitioned, then a streaming processing job (such as with Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
Additionally, if using segmentPartitionConfig to leverage broker segment pruning, it's important to ensure that the partition function used matches on both the Kafka producer side and the Pinot side. In Kafka, the default for the Java client is a 32-bit murmur2 hash, while for all other languages, such as Python, it is CRC32 (Cyclic Redundancy Check, 32-bit).
Enable upsert in the table configurations
To enable upsert, add the following settings to the table config.
Upsert modes
Full upsert
The upsert mode defaults to FULL. FULL upsert means that a new record will completely replace the older record if they have the same primary key. Example config:
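A minimal sketch of the relevant upsertConfig section:

```json
{
  "upsertConfig": {
    "mode": "FULL"
  }
}
```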
Partial upserts
Partial upsert lets you choose to update only specific columns and ignore the rest.
To enable the partial upsert, set the mode to PARTIAL and specify partialUpsertStrategies for partial upsert columns. Since release-0.10.0, OVERWRITE is used as the default strategy for columns without a specified strategy. defaultPartialUpsertStrategy is also introduced to change the default strategy for all columns.
Note that null handling must be enabled for partial upsert to work.
For example:
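The snippet below is an illustrative sketch; the column names (rsvp_count, group_name, venue_name) are placeholders in the style of the meetup quick start data and not prescriptive:

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "rsvp_count": "INCREMENT",
      "group_name": "UNION",
      "venue_name": "APPEND"
    }
  }
}
```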
Pinot supports the following partial upsert strategies:
| Strategy | Description |
| --- | --- |
| OVERWRITE | Overwrite the column of the last record |
| INCREMENT | Add the new value to the existing values |
| APPEND | Add the new item to the Pinot unordered set |
| UNION | Add the new item to the Pinot unordered set if it does not exist |
| IGNORE | Ignore the new value, keep the existing value (v0.10.0+) |
| MAX | Keep the maximum value between the existing value and the new value (v0.12.0+) |
With partial upsert, if the value is null in either the existing record or the new coming record, Pinot will ignore the upsert strategy and the null value:
(null, newValue) -> newValue
(oldValue, null) -> oldValue
(null, null) -> null
None upserts
If the mode is set to NONE, upsert is disabled.
Comparison column
By default, Pinot uses the value in the time column (timeColumn in the table config) to determine the latest record. That means, for two records with the same primary key, the record with the larger value of the time column is picked as the latest update. However, there are cases when users need to use another column to determine the order. In such cases, you can use the comparisonColumn option to override the column used for comparison. For example:
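A sketch of the override, where anotherTimeColumn is a placeholder for whatever column should drive the ordering (note the plural comparisonColumns form recommended below):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["anotherTimeColumn"]
  }
}
```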
For a partial upsert table, out-of-order events won't be consumed or indexed. For example, for two records with the same primary key, if the record with the smaller value of the comparison column arrives later than the other record, it will be skipped.
NOTE: Use comparisonColumns even for a single comparison column; comparisonColumn is deprecated. You may see unrecognizedProperties when using the old config, but it is converted to comparisonColumns automatically when the table is added.
Multiple comparison columns
In some cases, especially where partial upsert might be employed, there may be multiple producers of data each writing to a mutually exclusive set of columns, sharing only the primary key. In such a case, it may be helpful to use one comparison column per producer group so that each group can manage its own specific versioning semantics without the need to coordinate versioning across other producer groups.
Documents written to Pinot are expected to have exactly 1 non-null value out of the set of comparisonColumns; if more than 1 of the columns contains a value, the document will be rejected. When new documents are written, whichever comparison column is non-null will be compared against only that same comparison column seen in prior documents with the same primary key. Consider the following examples, where the documents are assumed to arrive in the order specified in the array.
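The original example documents are not reproduced here; the following sketch (with assumed event_id, description, and epoch values) is consistent with the outcomes listed below:

```json
[
  { "event_id": "aa", "orderReceived": 1, "description": "first",  "secondsSinceEpoch": 1567205394 },
  { "event_id": "aa", "orderReceived": 2, "description": "update", "secondsSinceEpoch": 1567205397 },
  { "event_id": "aa", "orderReceived": 3, "description": "update", "secondsSinceEpoch": 1567205396 },
  { "event_id": "aa", "orderReceived": 4, "description": "update", "otherComparisonColumn": 1567205395 },
  { "event_id": "aa", "orderReceived": 5, "description": "update", "otherComparisonColumn": 1567205392 },
  { "event_id": "aa", "orderReceived": 6, "description": "update", "otherComparisonColumn": 1567205398 }
]
```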
The following would occur:
orderReceived: 1
Result: persisted
Reason: first doc seen for primary key "aa"
orderReceived: 2
Result: persisted (replacing orderReceived: 1)
Reason: comparison column (secondsSinceEpoch) larger than that previously seen
orderReceived: 3
Result: rejected
Reason: comparison column (secondsSinceEpoch) smaller than that previously seen
orderReceived: 4
Result: persisted (replacing orderReceived: 2)
Reason: comparison column (otherComparisonColumn) larger than previously seen (never seen previously), despite the value being smaller than that seen for secondsSinceEpoch
orderReceived: 5
Result: rejected
Reason: comparison column (otherComparisonColumn) smaller than that previously seen
orderReceived: 6
Result: persist (replacing orderReceived: 4)
Reason: comparison column (otherComparisonColumn) larger than that previously seen
Metadata time-to-live (TTL)
In Pinot, the metadata map is stored in heap memory. To decrease in-memory data and improve performance, minimize the time primary key entries are stored in the metadata map (metadata time-to-live (TTL)). Limiting the TTL is especially useful for primary keys with high cardinality and frequent updates.
Since the metadata TTL is applied on the first comparison column, the time unit of upsert TTL is the same as the first comparison column.
Configure how long primary keys are stored in metadata
To configure how long primary keys are stored in metadata, specify the length of time in upsertTTL. For example:
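The example config is not shown above. As an illustrative sketch, assuming the field exposed in upsertConfig is metadataTTL (what the prose above calls the upsert TTL; verify the exact name for your release) and the first comparison column is in epoch seconds, one day corresponds to 86400:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["secondsSinceEpoch"],
    "enableSnapshot": true,
    "metadataTTL": 86400
  }
}
```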
In this example, Pinot will retain primary keys in metadata for 1 day.
Note that enabling upsert snapshot is required for metadata TTL for in-memory validDocsIDs recovery.
Delete column
An upsert Pinot table can support soft deletes of primary keys. This requires the incoming record to contain a dedicated single-value boolean column that serves as a delete marker for a primary key. Once the real-time engine encounters a record with the delete column set to true, the primary key will no longer be part of the queryable set of documents. This means the primary key will not be visible in queries, unless explicitly requested via the query option skipUpsert=true.
Note that the delete column has to be a single-value boolean column.
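A sketch of the config, where deleted is a placeholder name for the boolean delete-marker column defined in the schema:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted"
  }
}
```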
Note that when deleteRecordColumn is added to an existing table, it will require a server restart to actually pick up the upsert config changes.
A deleted primary key can be revived by ingesting a record with the same primary key, but with a higher comparison column value(s).
Note that when reviving a primary key in a partial upsert table, the revived record will be treated as the source of truth for all columns. This means any previous updates to the columns will be ignored and overwritten with the new record's values.
Deleted Keys time-to-live (TTL)
The above config deleteRecordColumn only soft-deletes the primary key. To decrease in-memory data and improve performance, minimize the time deleted-primary-key entries are stored in the metadata map (deletedKeys time-to-live (TTL)). Limiting the TTL is especially useful for deleted-primary-keys where there are no future updates foreseen.
Configure how long deleted-primary-keys are stored in metadata
To configure how long deleted primary keys are stored in metadata, specify the length of time in deletedKeysTTL. For example:
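An illustrative sketch, again assuming an epoch-seconds comparison column so that one day is 86400, and reusing the placeholder deleted column from above:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted",
    "deletedKeysTTL": 86400
  }
}
```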
In this example, Pinot will retain the deleted-primary-keys in metadata for 1 day.
Note that the value of deletedKeysTTL uses the same unit as the comparison column. If your comparison column holds values in seconds, this config should also be specified in seconds (see the example above).
Data consistency with deletes and compaction together
When using deletedKeysTTL together with UpsertCompactionTask, there can be a scenario where a segment containing a deleted record (where deleteRecordColumn = true was set for the primary key) gets compacted first while a previous old record is not yet compacted. During a server restart, the old record is then added to the metadata manager map and treated as non-deleted. To prevent data inconsistencies in this scenario, a new config enableDeletedKeysCompactionConsistency has been added; when set to true, it ensures that deleted records are not compacted until all the previous records from all other segments have been compacted for the deleted primary key.
Data consistency when queries and upserts happen concurrently
Upserts in Pinot enable real-time updates and ensure that queries always retrieve the latest version of a record, making them a powerful feature for managing mutable data efficiently. However, in applications with extremely high QPS and high ingestion rates, queries and upserts happening concurrently can sometimes lead to inconsistencies in query results.
For example, consider a table with 1 million primary keys. A distinct count query should always return 1 million, regardless of how new records are ingested and older records are invalidated. However, at high ingestion and query rates, the query may occasionally return a count slightly above or below 1 million. This happens because queries determine valid records by acquiring validDocIds bitmaps from multiple segments, which indicate which documents are currently valid. Since acquiring these bitmaps is not atomic with respect to ongoing upserts, a query may capture an inconsistent view of the data, leading to overcounting or undercounting of valid records.
This is a classic concurrency issue where reads and writes happen simultaneously, leading to temporary inconsistencies. Typically, such issues are resolved using locks or snapshots to maintain a stable view of the data during query execution. To address this, two new consistency modes - SYNC and SNAPSHOT - have been introduced for upsert enabled tables to ensure consistent query results even when queries and upserts occur concurrently and at very high throughput.
By default, the consistency mode is NONE, meaning the system operates as before. The SYNC mode ensures consistency by blocking upserts while queries execute, guaranteeing that queries always see a stable upserted data view. However, this can introduce write latency. Alternatively, the SNAPSHOT mode creates a consistent snapshot of validDocIds bitmaps for queries to use. This allows upserts to continue without blocking queries, making it more suitable for workloads with both high query and write rates.
These new consistency modes provide flexibility, allowing applications to balance consistency guarantees against performance trade-offs based on their specific requirements.
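The exact field name is not spelled out above; assuming it is consistencyMode under upsertConfig (verify against the config reference for your release), a sketch looks like:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "consistencyMode": "SNAPSHOT"
  }
}
```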
Use strictReplicaGroup for routing
The upsert Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment implicitly for the segments. Moreover, upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use it, configure instanceSelectorType in routing as follows:
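A sketch of the routing section of the table config:

```json
{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```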
Using the implicit partitioned replica-group assignment from the low-level consumer won't persist the instance assignment (the mapping from partition to servers) to ZooKeeper, and newly added servers will be automatically included without explicitly reassigning instances (usually through a rebalance). This can cause new segments of the same partition to be assigned to a different server and break the upsert requirement.
To prevent this, we recommend using explicit partitioned replica-group instance assignment to ensure the instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.
Enable validDocIds snapshots for upsert metadata recovery
Upsert snapshot support is also added in release-0.12.0. To enable the snapshot, set the enableSnapshot to true. For example:
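A minimal sketch of the snapshot setting:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "enableSnapshot": true
  }
}
```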
Upsert maintains metadata in memory containing which docIds are valid in a particular segment (ValidDocIndexes). This metadata gets lost during server restarts and needs to be recreated again.
ValidDocIndexes cannot be recovered easily after out-of-TTL primary keys get removed. Enabling snapshots addresses this problem by adding functions to store and recover validDocIds snapshots for immutable segments.
The snapshots are taken on every segment commit to ensure that they are consistent with the persisted data in case of abrupt shutdown.
We recommend that you enable this feature so as to speed up server boot times during restarts.
The lifecycle of validDocIds snapshots is as follows:
If snapshot is enabled, snapshots for existing segments are taken or refreshed when the next consuming segment gets started.
The snapshot files are kept on disk until the segments get removed, e.g. due to data retention or manual deletion.
If snapshot is disabled, the existing snapshot for a segment is cleaned up when the segment gets loaded by the server, e.g. when the server restarts.
Enable preload for faster server restarts
The upsert preload feature can make it faster to restore the upsert state when a server restarts. To enable the preload feature, set enablePreload to true. To enable preloading, enableSnapshot: true should also be set in the table config. For example:
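A minimal sketch combining both flags:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "enableSnapshot": true,
    "enablePreload": true
  }
}
```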
Under the hood, it uses the validDocIds snapshots to identify the valid docs and restore their upsert metadata quickly instead of performing a whole upsert comparison flow. The flow is triggered before the server is marked as ready, after which the server starts to load the remaining segments without snapshots (hence the name preload).
The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. It's 0 by default to disable the preloading feature.
A bug was introduced in v1.2.0: when the enablePreload and enableSnapshot flags are set to true but max.segment.preload.threads is left at 0, the preloading mechanism is still enabled but segments fail to load because there are no threads for preloading. This was fixed in newer versions, but for v1.2.0, if enablePreload and enableSnapshot are set to true, remember to also set max.segment.preload.threads to a positive value. A server restart is needed for the max.segment.preload.threads config change to take effect.
Handle out-of-order events
There are 2 configs added related to handling out-of-order events.
dropOutOfOrderRecord
To enable dropping of out-of-order records, set dropOutOfOrderRecord to true. For example:
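A minimal sketch:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "dropOutOfOrderRecord": true
  }
}
```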
This feature doesn't persist any out-of-order event to the consuming segment. If not specified, the default value is false.
When false, the out-of-order record gets persisted to the consuming segment, but the MetadataManager mapping is not updated thus this record is not referenced in query or in any future updates. You can still see the records when using skipUpsert query option.
When true, the out-of-order record doesn't get persisted at all and the MetadataManager mapping is not updated so this record is not referenced in query or in any future updates. You cannot see the records when using skipUpsert query option.
outOfOrderRecordColumn
This config identifies out-of-order events programmatically. To enable it, add a boolean field to your table schema (say, isOutOfOrder) and reference it via this config. For example:
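A sketch, assuming the schema defines a boolean column named isOutOfOrder:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "outOfOrderRecordColumn": "isOutOfOrder"
  }
}
```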
This feature persists a true / false value to the isOutOfOrder field based on the orderness of the event. You can filter out out-of-order events while using skipUpsert to avoid any confusion. For example:
Use custom metadata manager
Pinot supports custom PartitionUpsertMetadataManager implementations that handle record and segment updates.
Adding custom upsert managers
You can add custom PartitionUpsertMetadataManager as follows:
Create a new java project. Make sure you keep the package name as org.apache.pinot.segment.local.upsert.xxx
In your java project include the dependency
Add your custom partition manager that implements PartitionUpsertMetadataManager interface
Add your custom TableUpsertMetadataManager that implements BaseTableUpsertMetadataManager interface
Place the compiled JAR in the /plugins directory in pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the custom upsert manager in table configs as follows:
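The exact config key is not shown above; assuming it is metadataManagerClass under upsertConfig (verify against your Pinot version), and with a purely hypothetical class name, a sketch looks like:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "metadataManagerClass": "org.apache.pinot.segment.local.upsert.custom.CustomTableUpsertMetadataManager"
  }
}
```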
⚠️ The upsert manager class name is case-insensitive as well.
Upsert table limitations
There are some limitations for the upsert Pinot tables.
The upsert feature is supported for Real-time tables only, and not for Hybrid or Offline tables.
The high-level consumer is not allowed for the input stream ingestion, which means stream.[consumerName].consumer.type must always be lowLevel.
The star-tree index cannot be used for indexing, as the star-tree index performs pre-aggregation during the ingestion.
Unlike append-only tables, out-of-order events (where the comparison value in the incoming record is less than the latest available value) won't be consumed or indexed by a Pinot partial upsert table; these late events will be skipped.
Best practices
Unlike other real-time tables, an upsert table takes up more memory because it needs to bookkeep the record locations in memory. As a result, it's important to plan the capacity beforehand and monitor the resource usage. Here are some recommended practices for using upsert tables.
Create the topic/stream with more partitions.
The number of partitions in the input stream determines the partition count of the Pinot table. The more partitions you have in the input topic/stream, the more Pinot servers you can distribute the Pinot table to, and therefore the more you can scale the table horizontally. Note that you can't increase the number of partitions later for upsert-enabled tables, so start with enough partitions (at least 2-3x the number of Pinot servers).
Memory usage
An upsert table maintains an in-memory map from the primary key to the record location, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. Beware of using a JSON column as the primary key: the same key-values in a different order are considered different primary keys. In addition, consider the hashFunction config in the upsert config, which can be MD5 or MURMUR3, to store the 128-bit hash code of the primary key instead. This is useful when your primary key takes more space. But keep in mind that this hash may introduce collisions, though the chance is very low.
Monitoring
Set up a dashboard over the metric pinot.server.upsertPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking growth, which is proportional to the growth in memory usage. The total memory usage by upsert is roughly (primaryKeysCount * (sizeOfKeyInBytes + 24)).
Capacity planning
It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the rate of the primary keys in the input stream per partition and extrapolate the data to a specific time period (based on table retention) to approximate the memory usage. A heap dump is also useful to check the memory usage so far on an upsert table instance.
Example
Putting these together, you can find the table configurations of the quick start examples as the following:
Pinot server maintains a primary key to record location map across all the segments served in an upsert-enabled table. As a result, when updating the config for an existing upsert table (e.g. change the columns in the primary key, change the comparison column), servers need to be restarted in order to apply the changes and rebuild the map.
Quick Start
To illustrate how full upsert works, the Pinot binary comes with a quick start example. Use the following command to create a real-time upsert table meetupRSVP.
You can also run the partial upsert demo with the following command.
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.
Query the upsert table
For partial upsert, you can see that only the values of the configured columns change, based on the specified partial upsert strategy.
Query the partial upsert table
An example for partial upsert is shown below: each event_id remains unique during ingestion, while the value of rsvp_count is incremented.
Explain partial upsert table
To see the difference from the non-upsert table, you can use a query option skipUpsert to skip the upsert effect in the query result.
Disable the upsert during query via query option
FAQ
Can I change primary key columns in existing upsert table?
Yes, you can add or delete primary key columns as long as the input stream is partitioned on one of the primary key columns. However, you need to restart all Pinot servers so they can rebuild the primary-key-to-record-location map with the new columns.
0.10.0
Summary
This release introduces some great new features, performance enhancements, UI improvements, and bug fixes, which are described in detail in the following sections.
The release was cut from this commit fd9c58a.
Dependency Graph
The dependency graph for plug-and-play architecture that was introduced in release has been extended and now it contains new nodes for Pinot Segment SPI.
SQL Improvements
Implement NOT Operator
Add DistinctCountSmartHLLAggregationFunction which automatically store distinct values in Set or HyperLogLog based on cardinality
Add LEAST and GREATEST functions
UI Enhancements
Show Reported Size and Estimated Size in human readable format in UI
Make query console state URL based
Improve query console to not show query result when multiple columns have the same name
Performance Improvements
Reuse regex matcher in dictionary based LIKE queries
Early terminate orderby when columns already sorted
Do not do another pass of Query Automaton Minimization
Other Notable Features
Adding NoopPinotMetricFactory and corresponding changes
Allow to specify fixed segment name for SegmentProcessorFramework
Move all prestodb dependencies into a separated module
This release brings significant improvements, including enhancements to the multistage query engine and the introduction of an experimental time series query engine for efficient analysis. Key features include database query quotas, cursor-based pagination for large result sets, multi-stream ingestion, and new function support for URL and GeoJson. Security vulnerabilities and several bug fixes and performance enhancements have been addressed, ensuring a more robust and versatile platform.
Multistage Engine Improvements
Reuse common expressions in a query (spool) ,
Refines query plan reuse in Apache Pinot by allowing reuse across stages instead of subtrees. Stages are natural boundaries in the query plan, divided into pull-based operators. To execute queries, Pinot introduces stages connected by MailboxSendOperator and MailboxReceiveOperator. The proposal modifies MailboxSendOperator to send data to multiple stages, transforming stage connections into a Directed Acyclic Graph (DAG) for greater efficiency and flexibility.
Segment Plan for MultiStage Queries ,
It focuses on providing comprehensive execution plans, including physical operator details. The new explain mode aligns with Calcite terminology and uses a broker-server communication flow to analyze and transform query plans into explained physical plans without executing them. A new ExplainedPlanNode is introduced to enrich query execution plans with physical details, ensuring better transparency and debugging capabilities for users.
DataBlock Serde Performance Improvements ,
Improve the performance of DataBlock building, serialization, and deserialization by reducing memory allocation and copies without altering the binary format. Benchmarks show 1x to 3x throughput gains, with significant reductions in memory allocation, minimizing GC-related latency issues in production. The improvement is achieved by changes to the buffers and the addition of a couple of stream classes.
Notable Improvements and Bug Fixes
Allow adding and subtracting timestamp types.
Remove PinotAggregateToSemiJoinRule to avoid mistakenly removing DISTINCT from the IN clause.
Support the use of timestamp indexes.
Timeseries Engine Support in Pinot
Introduction of a Generic Time Series Query Engine in Apache Pinot, enabling native support for various time-series query languages (e.g., PromQL, M3QL) through a pluggable framework. This enhancement addresses limitations in Pinot’s current SQL-based query engines for time-series analysis, providing optimized performance and usability for observability use cases, especially those requiring high-cardinality metrics.
NOTE: Timeseries Engine support in Pinot is currently in an Experimental state.
Key Features
Pluggable Time-Series Query Language:
Pinot will support multiple time-series query languages, such as PromQL and Uber’s M3QL, via plugins like pinot-m3ql.
Example queries:
Plot hourly order counts for specific merchants.
Pluggable Time-Series Operators:
Custom operators specific to each query language (e.g., nonNegativeDerivative or holt_winters) can be implemented within language-specific plugins without modifying Pinot’s core code.
Extensible operator abstractions will allow stakeholders to define unique time-series analysis functions.
Advantages of the New Engine:
Optimized for Time-Series Data: Processes data in series rather than rows, improving performance and simplifying the addition of complex analysis functions.
Reduced Complexity in Pinot Core: The engine reuses existing components like the Multi-Stage Engine (MSE) Query Scheduler, Query Dispatcher, and Mailbox. At the same time, language parsers and planners remain modular in plugins.
Improved Usability: Users can run concise and powerful time-series queries in their preferred language, avoiding the verbosity and limitations of SQL.
Impact on Observability Use Cases:
This new engine significantly enhances Pinot’s ability to handle complex time-series analyses efficiently, making it an ideal database for high-cardinality metrics and observability workloads.
The improvement is a step forward in transforming Pinot into a robust and versatile platform for time-series analytics, enabling seamless integration of diverse query languages and custom operators.
Here are some of the key PRs that have been merged as part of this feature:
Pinot time series engine SPI.
Add combine and segment level operators for time series.
Working E2E quickstart for time series engine.
Database Query Quota
Introduces the ability to impose query rate limits at the database level, covering all queries made to tables within a database. A database-level rate limiter is implemented, and a new method, acquireDatabase(databaseName), is added to the QueryQuotaManager interface to check database query quotas.
Database Query Quota Configuration
Query and storage quotas are now provisioned similarly to table quotas but managed separately in a DatabaseConfig znode.
Details about the DatabaseConfig znode:
It does not represent a logical database entity.
Default and Override Quotas
A default query quota (databaseMaxQueriesPerSecond: 1000) is provided in ClusterConfig.
Overrides for specific databases can be configured via znodes (e.g., PROPERTYSTORE/CONFIGS/DATABASE/).
APIs for Configuration
Method
Path
Description
Dynamic Quota Updates
Quotas are determined by a combination of default cluster-level quotas and database-specific overrides.
Per-broker quotas are adjusted dynamically based on the number of live brokers.
Updates are handled via:
This feature provides fine-grained control over query rate limits, ensuring scalability and efficient resource management for databases within Pinot.
Binary Workload Scheduler for Constrained Execution
Introduction of the BinaryWorkloadScheduler, which categorizes queries into two distinct workloads to ensure cluster stability and prioritize critical operations:
Workload Categories:
1. Primary Workload:
Default category for all production traffic.
Queries are executed using an unbounded FCFS (First-Come, First-Served) scheduler.
Designed for high-priority, critical queries to maintain consistent availability and performance.
2. Secondary Workload:
Reserved for ad-hoc queries, debugging tools, dashboards/notebooks, development environments, and one-off tests.
Imposes several constraints to minimize impact on the primary workload:
Limited concurrent queries: Caps the number of in-progress queries, with excess queries queued.
Key Benefits:
Prioritization: Guarantees the primary workload remains unaffected by resource-intensive or long-running secondary queries.
Stability: Protects cluster availability by preventing incidents caused by poorly optimized or excessive ad-hoc queries.
Scalability: Efficiently manages traffic in multi-tenant clusters, maintaining service reliability across workloads.
Cursors Support ,
Cursor support allows Pinot clients to consume query results in smaller chunks. This lets clients work with fewer resources, especially memory, and makes application logic more straightforward; for example, an app UI can paginate through results in a table or a graph. Cursor support is implemented using APIs.
API
Method
Path
Description
SPI
The feature provides two SPIs to extend the feature to support other implementations:
ResponseSerde: Serialize/Deserialize the response.
ResponseStore: Store responses in a storage system. Both SPIs use Java SPI and the default ServiceLoader to find implementations of the SPIs. All implementations should be annotated with AutoService to help generate the files needed to discover the implementations.
URL Functions Support
Implemented various URL functions to handle multiple aspects of URL processing, including extraction, encoding/decoding, and manipulation, making them useful for tasks involving URL parsing and modification
URL Extraction Methods
urlProtocol(String url): Extracts the protocol (scheme) from the URL.
urlDomain(String url): Extracts the domain from the URL.
urlDomainWithoutWWW(String url): Extracts the domain without the leading "www." if present.
URL Manipulation Methods
urlEncode(String url): Encodes a string into a URL-safe format.
urlDecode(String url) Decodes a URL-encoded string.
urlEncodeFormComponent(String url): Encodes the URL string following RFC-1866 standards, with spaces encoded as +.
Multi Stream Ingestion Support ,
Add support to ingest from multiple sources into a single table
Use existing interface (TableConfig) to define multiple streams
Separate the partition id definition between Stream and Pinot segment
New Scalar Functions Support.
intDiv and intDivOrZero: Perform integer division, with intDivOrZero returning zero for division by zero or when dividing a minimal negative number by minus one.
isFinite, isInfinite, and isNaN: Check if a double value is finite, infinite, or NaN, respectively.
GeoJSON Support
Add support for GeoJSON Scalar functions:
Supported data types:
Point
LineString
Polygon
MultiPoint
Improved Implementation of Distinct Operators.
Main optimizations:
Add per data type DistinctTable and utilize primitive type if possible
Specialize single-column case to reduce overhead
Allow processing null values with dictionary based operators
Upsert Improvements
Features and Improvements
Track New Segments for Upsert Tables
Improvement for addressing a race condition where newly uploaded segments may be processed by the server before brokers add them to the routing table, potentially causing queries to miss valid documents.
Introduce a configurable newSegmentTrackingTimeMs (default 10s) to track new segments on the server side, allowing them to be accessed as optional segments until brokers update their routing tables.
Ensure Upsert Deletion Consistency with Compaction Flow Enabled
Enhancement addresses inconsistencies in upsert-compaction by introducing a mechanism to track the distinct segment count for primary keys. By ensuring a record exists in only one segment before compacting deleted records, it prevents older non-deleted records from being incorrectly revived during server restarts, ensuring consistent table state.
Consistent Segments Tracking for Consistent Upsert View
This improves consistent upsert view handling by addressing segment tracking and query inconsistencies. Key changes include:
Complete and Consistent Segment Tracking: Introduced a new Set to track segments before registration to the table manager, ensuring synchronized segment membership and validDocIds access.
Improved Segment Replacement: Added DuoSegmentDataManager to register both mutable and immutable segments during replacement, allowing queries to access a complete data view without blocking ingestion.
Query Handling Enhancements: Queries now acquire the latest consuming segments to avoid missing newly ingested data if the broker's routing table isn't updated.
Other Notable Improvements and Bug Fixes
Config for max output segment size in UpsertCompactMerge task.
Add config for ignoreCrcMismatch for upsert-compaction task.
Upsert small segment merger task in minions.
Lucene and Text Search Improvements
Store index metadata file for Lucene text indexes.
Runtime configurability for Lucene analyzers and query parsers, enabling dynamic text tokenization and advanced log search capabilities like case-sensitive/insensitive searches.
Security Improvements and Vulnerability Fixes
Force SSL cert reload daily using the scheduled thread.
Allow configuring TLS between brokers and servers for the multi-stage engine.
Strip Matrix parameter from BasePath checking.
Miscellaneous Improvements
Allow setting ForwardIndexConfig default settings via cluster config.
Extend Merge Rollup Capabilities for Datasketches.
Skip task validation during table creation with schema.
Bug Fixes
Fix typo in RefreshSegmentTaskExecutor logger.
Fix to avoid handling JSON_ARRAY as multi-value JSON during transformation.
Fix for partition-enabled instance assignment with minimized movement.
Apache Pinot 1.0 Upserts overview
Support for polymorphic scalar comparison functions (=, !=, >, >=, <, <=). #13711
Optimized MergeEqInFilterOptimizer by reducing the hash computation of expressions. #14732
Add support for is_enable_group_trim aggregate option. #14664
Add support for is_leaf_return_final_result aggregate option. #14645
Override the return type from NOW to TIMESTAMP. #14614
Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG). #14689
Add cluster configuration to limit the number of multi-stage queries running concurrently. #14574
The absence of a database config does not prevent table creation under a database.
Deleting a database config does not remove the tables within that database.
A custom DatabaseConfigRefreshMessage is sent to brokers upon database config changes.
A ClusterConfigChangeListener in ClusterChangeMediator to process updates in cluster configs.
Adjustments to per-broker quotas upon broker resource changes.
Creation of database rate limiters during the OFFLINE -> ONLINE state transition of tables in BrokerResourceOnlineOfflineStateModel.
Thread restrictions: Limits the number of worker threads per query and across all queries in the secondary workload.
Queue pruning: Queries stuck in the queue too long are pruned based on time or queue length.
DELETE /resultStore/{requestId}/ - Delete the results of a query.
urlTopLevelDomain(String url): Extracts the top-level domain (TLD) from the URL.
urlFirstSignificantSubdomain(String url): Extracts the first significant subdomain from the URL.
cutToFirstSignificantSubdomain(String url): Extracts the first significant subdomain and the top-level domain from the URL.
cutToFirstSignificantSubdomainWithWWW(String url): Returns the part of the domain that includes top-level subdomains up to the "first significant subdomain", without stripping "www.".
urlPort(String url): Extracts the port from the URL.
urlPath(String url): Extracts the path from the URL without the query string.
urlPathWithQuery(String url): Extracts the path from the URL with the query string.
urlQuery(String url): Extracts the query string without the initial question mark (?) and excludes the fragment (#) and everything after it.
urlFragment(String url): Extracts the fragment identifier (without the hash symbol) from the URL.
urlQueryStringAndFragment(String url): Extracts the query string and fragment identifier from the URL.
extractURLParameter(String url, String name): Extracts the value of a specific query parameter from the URL.
extractURLParameters(String url): Extracts all query parameters from the URL as an array of name=value pairs.
extractURLParameterNames(String url): Extracts all parameter names from the URL query string.
urlHierarchy(String url): Generates a hierarchy of URLs truncated at path and query separators.
urlPathHierarchy(String url): Generates a hierarchy of path elements from the URL, excluding the protocol and host.
urlDecodeFormComponent(String url): Decodes the URL string following RFC-1866 standards, with + decoded as a space.
urlNetloc(String url): Extracts the network locality (username:password@host:port) from the URL.
cutWWW(String url): Removes the leading "www." from a URL’s domain.
cutQueryString(String url): Removes the query string, including the question mark.
cutFragment(String url): Removes the fragment identifier, including the number sign.
cutQueryStringAndFragment(String url): Removes both the query string and fragment identifier.
cutURLParameter(String url, String name): Removes a specific query parameter from a URL.
cutURLParameters(String url, String[] names): Removes multiple specific query parameters from a URL.
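For illustration, here is a minimal query sketch combining a few of the URL functions above; the table and column names (clickEvents, pageUrl) are hypothetical:
SELECT
  urlDomainWithoutWWW(pageUrl) AS domain,             -- domain without the leading "www."
  urlPath(pageUrl) AS path,                           -- path without the query string
  extractURLParameter(pageUrl, 'utm_source') AS utmSource,
  cutQueryStringAndFragment(pageUrl) AS canonicalUrl  -- strip the ?query and #fragment parts
FROM clickEvents
WHERE urlTopLevelDomain(pageUrl) = 'com'
LIMIT 10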
Compatible with existing stream partition auto-expansion logic. The feature does not change any existing interfaces. Users can define the table config in the same way and combine it with any other transform functions or instance assignment strategies.
ifNotFinite: Returns a default value if the given value is not finite.
moduloOrZero and positiveModulo: Variants of the modulo operation, with moduloOrZero returning zero for division by zero or when dividing the minimal negative number by minus one.
negate: Returns the negation of a double value.
gcd and lcm: Calculate the greatest common divisor and least common multiple of two long values, respectively.
hypot: Computes the hypotenuse of a right-angled triangle given the lengths of the other two sides.
byteswapInt and byteswapLong: Perform byte swapping on integer and long values.
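As a hedged usage sketch of the arithmetic helpers above (the table and columns are hypothetical):
SELECT
  intDivOrZero(bytesSent, requestCount) AS avgBytesPerRequest,  -- returns 0 instead of failing on divide-by-zero
  gcd(colA, colB) AS gcdValue,
  lcm(colA, colB) AS lcmValue,
  hypot(deltaX, deltaY) AS distance
FROM metrics
WHERE isFinite(score)
LIMIT 10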
MultiLineString
MultiPolygon
GeometryCollection
Feature
FeatureCollection
Specialize unlimited LIMIT case
Do not create priority queue before collecting LIMIT values
Add support for null ordering
Misc Fixes: Addressed edge cases, such as updating _numDocsIndexed before metadata updates, returning empty bitmaps instead of null, and preventing bitmap re-acquisition outside locking logic. These changes, gated by the new feature flag upsertConfig.consistencyMode, are tested with unit and stress tests in a staging environment to ensure reliability.
Fix to acquire segmentLock before taking segment snapshot. #14179
Update upsert TTL watermark in replaceSegment. #14147
Fix checks on largest comparison value for upsert ttl and allow to add segments out of ttl. #14094
More observability and metrics to track the upsert rate of deletion. #13838
Disable replacing environment variables and system properties in get table configs REST API. #14002
Upgrade the hadoop version to 3.3.6 to fix vulnerabilities. #12561
Fix vulnerabilities for msopenjdk 11 pinot-base-runtime image. #14030
Add capability to configure sketch precision / accuracy for different rollup buckets. Helpful for saving space in use cases where historical data does not require high accuracy. #14373
Add support for application-level query quota. #14226
Improvement to allow setting ForwardIndexConfig default settings via cluster config. #14773
Enhanced the mutable index class to be pluggable. #14609
Improvement to allow configurable initial capacity for IndexedTable. #14620
Add a new segment reload API for flexible control, allowing specific segments to be reloaded on designated servers and enabling workload management through batch processing and replica group targeting. #14544
Add a server API to list segments that need to be refreshed for a table. #14544
Introduced the ability to erase dimension values before rollup in merged segments, reducing cardinality and optimizing space for less critical historical data. #14355
Add support for immutable CLPForwardIndex creator and related classes. #14288
Add support for Minion Task to support automatic Segment Refresh. #14300
Add support for hex decimal to long scalar functions. #14435
Remove emitting null value fields during data transformation for SchemaConformingTransformer. #14351
Improved CSV record reader to skip unparseable lines. #14396
Add the ability to specify a target instance for segment reloading and improve API response messages when segments are not found on the target instances. #14393
Improvement for MSQ explain and stageStats when dealing with empty tables. #14374
Improvement for dynamically adjusting GroupByResultHolder's initial capacity based on filter predicates to optimize resource allocation and improve performance for filtered group-by queries. #14001
Improvement to ensure consistent index configuration by constructing IndexLoadingConfig and SegmentGeneratorConfig from table config and schema, fixing inconsistencies and honouring FieldConfig.EncodingType. #14258
Add usage of CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency. #14241
Add null handling support for aggregations grouped by MV columns. #14071
Add support to enable the capability to specify zstd and lz4 segment compression via config. #14008
Improvement for allowing usage of star-tree index with null handling enabled when no null values in segment columns. #14177
Improvement for avoiding using a setter in IndexLoadingConfig for consuming segments. #14190
Implement consistent data push for Spark3 segment generation and metadata push jobs. #14139
Improvement in addressing ingestion delays in real-time tables with many partitions by mitigating simultaneous segment commits across consumers. #14170
Improve query options validation and error handling. #14158
Add support for an arbitrary number of WHEN THEN clauses in the scalar CASE function. #14125
Add support for configuring Theta and Tuple aggregation functions. #14167
Add support for Map type in complex schema. #13906
Add TTL watermark storage/loading for the dedup feature to prevent stale metadata from being added to the store when loading segments. #14137
Polymorphic scalar function implementation for BETWEEN. #14113
Allow the building of an index on the preserved field in SchemaConformingTransformer. #13993
Add support to differentiate null and emptyLists for multi-value columns in avro decoder. #13572
Broker config to set default query null handling behavior. #13977
Moves the untarring method to BaseTaskExecutor to enable downloading and untarring from a peer server if deepstore untarring fails and allows DownloadFromServer to be enabled. #13964
New SPI to support custom executor services, providing default implementations for cached and fixed thread pools. #13921
Introduction of shared IdealStateUpdaterLock for PinotLLCRealtimeSegmentManager to prevent race conditions and timeouts during large segment updates. #13947
Support for configuring aggregation function parameters in the star-tree index. #13835
Write support for creating Pinot segments in the Pinot Spark connector. #13748
Array flattening support in SchemaConformingTransformer. #13890
Allow table names in TableConfigs with or without database name when database context is passed. #13934
Improvement in null handling performance for nullable single input aggregation functions. #13791
Improvement in column-based null handling by refining method naming, adding documentation and updating validation and constructor logic to support column-specific null strategies. #13839
Enhanced the noRawDataForTextIndex config to skip writing raw data when re-using the mutable index is enabled, fixing a global disable issue and improving ingestion performance. #13776
Improvements to polymorphic scalar comparison functions for better backward compatibility. #13870
Add TablePauseStatus to track the pause details. #13803
Check stale dedup metadata when adding new records/segments. #13848
Improve error messages with star-tree indexes creation. #13818
Adds support for ZStandard and LZ4 compression in tar archives, enhancing efficiency and reducing CPU bottlenecks for large-scale data operations. #13782
Fix for using PropertiesWriter to escape index_map keys properly. #12018
Fix query option validation for group-by queries. #14618
Fix for making RecordExtractor preserve empty array/map and map entries with empty values. #14547
Fix CRC mismatch during deep store upload retry task. #14506
Fix for allowing reload for UploadedRealtimeSegmentName segments. #14494
Fix default value handling in REGEXP_EXTRACT transform function. #14489
Fix for Spark upsert table backfill support. #14443
Fix long value parsing in jsonextractscalar. #14337
Fix deep store upload retry for infinite retention tables. #14406
Fix to ensure deterministic index processing order across server replicas and runs to prevent inconsistent segment data file layouts and unnecessary synchronization. #14391
Fix for real-time validation NPE when stream partition is no longer available. #14392
Fix for handling NULL values encountered in CLPDecodeTransformFunction. #14364
Fix for TextMatchFilterOptimizer grouping for the inner compound query. #14299
Fix for removing redundant API calls on the home page. #14295
Fix the missing precondition check for the V5 writer version in BaseChunkForwardIndexWriter. #14265
Fix for computing all groups for the group by queries with only filtered aggregations. #14211
Fix for race condition in IdealStateGroupCommit. #14237
Fix default column handling when the forward index is disabled. #14215
Fix bug with server return final aggregation result when null handling is enabled. #14181
Fix Kubernetes Routing Issue in Helm chart. #13450
Fix implementing a table-level lock to prevent parallel updates to the SegmentLineage ZK record and align real-time table ideal state updates with minion task locking for consistency. #13735
Fix INT overflow issue for FixedByteSVMutableForwardIndex with large segment size. #13717
Fix preload enablement checks to consider the preload executor and refine numMissedSegments logging to exclude unchanged segments, preventing incorrect missing segment reports. #13747
Fix a bug in resource status evaluation during service startup, ensuring resources return GOOD when servers have no assigned segments, addressing issues with small tables and segment redistribution. #13541
Fix RealtimeProvisioningHelperCommand to allow using just schemaFile along with sampleCompletedSegmentDir. #13727
This page covers the latest changes included in the Apache Pinot™ 1.0.0 release, including new features, enhancements, and bug fixes.
1.0.0 (2023-09-19)
This release includes several new features, enhancements, and bug fixes, including the following highlights:
Multi-stage query engine: new features, enhancements, and bug fixes (detailed below). Learn more about how the multi-stage query engine works.
Multi-stage query engine new features
Support for
Initial (phase 1) Query runtime for window functions with ORDER BY within the OVER() clause (#10449)
Multi-stage query engine enhancements
Turn on v2 engine by default ()
Introduced the ability to stream leaf stage blocks for more efficient data processing ().
Early terminate SortOperator if there is a limit ()
Multi-stage query engine bug fixes
Fix Predicate Pushdown by Using Rule Collection ()
Try fixing mailbox cancel race condition ()
Catch Throwable to Propagate Proper Error Message ()
Index SPI
Add the ability to include new index types at runtime in Apache Pinot. This opens up the ability to add third-party indexes, including proprietary indexes. More details
Null value support for pinot queries
NULL support for ORDER BY, DISTINCT, GROUP BY, value transform functions and filtering.
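A minimal sketch of null-aware querying, assuming the enableNullHandling query option is used to turn the behavior on (table and columns are hypothetical):
SET enableNullHandling = true;
SELECT userId, lastLogin
FROM users
WHERE lastLogin IS NOT NULL
ORDER BY lastLogin DESC
LIMIT 20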
Upsert enhancements
Delete support in upsert enabled tables ()
Support added to extend upserts and allow deleting records from a realtime table. The design details can be found .
Preload segments with upsert snapshots to speedup table loading ()
Adds a feature to preload segments from a table that uses the upsert snapshot feature. The segments with validDocIds snapshots can be preloaded in a more efficient manner to speed up the table loading (thus server restarts).
TTL configs for upsert primary keys ()
Adds support for specifying expiry TTL for upsert primary key metadata cleanup.
Segment compaction for upsert real-time tables ()
Adds a new minion task to compact segments belonging to a real-time table with upserts.
Pinot Spark Connector for Spark3 ()
Added spark3 support for Pinot Spark Connector ()
Also added support to pass pinot query options to spark connector ()
PinotDataBufferFactory and new PinotDataBuffer implementations ()
Adds new implementations of PinotDataBuffer that uses Unsafe java APIs and foreign memory APIs. Also added support for PinotDataBufferFactory to allow plugging in custom PinotDataBuffer implementations.
Query functions enhancements
Add PercentileKLL aggregation function ()
Support for ARG_MIN and ARG_MAX Functions ()
Refactor argmin/max to exprmin/max and make it Calcite compliant ()
JSON and CLP encoded message ingestion and querying
Add clpDecode transform function for decoding CLP-encoded fields. ()
Add CLPDecodeRewriter to make it easier to call clpDecode with a column-group name rather than the individual columns. ()
Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. ()
Tier level index config override ()
Allows overriding index configs at tier level, allowing for more flexible index configurations for different tiers.
Ingestion connectors and features
Kinesis stream header extraction ()
Extract record keys, headers and metadata from Pulsar sources ()
Realtime pre-aggregation for Distinct Count HLL & Big Decimal ()
UI enhancements
Adds persistence of authentication details in the browser session. This means that even if you refresh the app, you will still be logged in until the authentication session expires ()
AuthProvider logic updated to decode the access token and extract user name and email. This information will now be available in the app for features to consume. ()
Pinot docker image improvements and enhancements
Make Pinot base build and runtime images support Amazon Corretto and MS OpenJDK ()
Support multi-arch pinot docker image ()
Update dockerfile with recent jdk distro changes ()
Operational improvements
Rebalance
Rebalance status API ()
Tenant level rebalance API: tenant rebalance and status tracking APIs ()
Config to use customized broker query thread pool ()
Added new configuration options that allow use of a bounded thread pool and allocation of capacity for it.
This feature allows better management of broker resources.
Drop results support ()
Adds a parameter to queryOptions to drop the resultTable from the response. This mode can be used to troubleshoot a query (which may have sensitive data in the result) using metadata only.
Make column order deterministic in segment ()
In segment metadata and index map, store columns in alphabetical order so that the result is deterministic. Segments generated before/after this PR will have different CRC, so during the upgrade, we might get segments with different CRC from old and new consuming servers. For the segment consumed during the upgrade, some downloads might be needed.
Allow configuring helix timeouts for EV dropped in Instance manager ()
Adds options to configure Helix timeouts:
external.view.dropped.max.wait.ms - The duration in milliseconds to wait for the external view to be dropped. Default: 20 minutes.
external.view.check.interval.ms - The period in milliseconds at which to poll ZK for the latest EV state.
Enable case insensitivity by default ()
This PR makes Pinot case-insensitive by default and removes the deprecated property enable.case.insensitive.pql.
Newly added APIs and client methods
Add Server API to get tenant pools ()
Add new broker query point for querying multi-stage engine ()
Add a new controller endpoint for segment deletion with a time window ()
Cleanup and backward incompatible changes
High level consumers are no longer supported
Cleanup HLC code ()
Remove support for High level consumers in Apache Pinot ()
Type information preservation of query literals
[feature] [backward-incompat] [null support # 2] Preserve null literal information in literal context and literal transform ()
String versions of numerical values are no longer accepted. For example, "123" won't be treated as a number anymore.
Controller job status ZNode path update
Moving Zk updates for reload, force_commit to their own Znodes which … ()
The status of previously completed reload jobs will not be available after this change is deployed.
Metric names for mutable indexes to change
Implement mutable index using index SPI ()
Due to a change in the IndexType enum used for some logs and metrics in mutable indexes, the metric names may change slightly.
Update in controller API to enable / disable / drop instances
Update getTenantInstances call for controller and separate POST operations on it ()
Change in substring query function definition
Change substring to comply with standard sql definition ()
Full list of features added
Allow queries on multiple tables of same tenant to be executed from controller UI
Encapsulate changes in IndexLoadingConfig and SegmentGeneratorConfig
[Index SPI] IndexType ()
Vulnerability fixes, bugfixes, cleanups and deprecations
Remove support for High level consumers in Apache Pinot ()
Fix JDBC driver check for username ()
[Clean up] Remove getColumnName() from AggregationFunction interface ()
Support for the ranking ROW_NUMBER() window function (#10527, #10587)
Set operations support:
Support SetOperations (UNION, INTERSECT, MINUS) compilation in query planner (#10535)
Timestamp and Date Operations
Support TIMESTAMP type and date ops functions (#11350)
This release comes with several improvements and bug fixes for the Multistage Engine, Upserts, and Compaction, along with many other small features and general bug fixes.
Multistage Engine Improvements
Features
New Window Functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE
LEAD allows you to access values after the current row in a frame.
LAG allows you to access values before the current row in a frame.
FIRST_VALUE and LAST_VALUE return the respective extremal values in the frame.
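A hedged sketch of the new window functions; clickEvents, userId, eventTime, and pageUrl are hypothetical names:
SELECT
  userId,
  eventTime,
  LAG(eventTime) OVER (PARTITION BY userId ORDER BY eventTime) AS previousEventTime,   -- value from the prior row in the frame
  LEAD(eventTime) OVER (PARTITION BY userId ORDER BY eventTime) AS nextEventTime,      -- value from the following row in the frame
  FIRST_VALUE(pageUrl) OVER (PARTITION BY userId ORDER BY eventTime) AS firstPageVisited
FROM clickEvents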
Support for Logical Database in V2 Engine
V2 Engine now supports a "database" construct, enabling table namespace isolation within the same Pinot cluster.
Improves user experience when multiple users are using the same Pinot Cluster.
Access control policies can be set at the database level.
Improved Multi-Value (MV) and Array Function Support
Added array sum aggregation functions for point-wise array operations .
Added support for valueIn MV transform function .
Fixed bug in numeric casts for MV columns in filters .
Support for WITHIN GROUP Clause and ListAgg
WITHIN GROUP Clause can be used to process rows in a given order within a group.
One of the most common use-cases for this is the ListAgg function, which when combined with WITHIN GROUP can be used to concatenate strings in a given order.
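For example, assuming the standard LISTAGG ... WITHIN GROUP syntax (table and columns are hypothetical), ordered string concatenation per group could look like:
SELECT
  userId,
  LISTAGG(pageUrl, ',') WITHIN GROUP (ORDER BY eventTime) AS visitedPagesInOrder  -- concatenate in visit order
FROM clickEvents
GROUP BY userId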
Scalar/Transform Function and Set Operation Improvements
Added Geospatial Scalar Function support for use in intermediate stage in the v2 query engine .
Fix 'WEEK' transform function .
Support EXTRACT as a scalar function .
Improved Literal Handling Support
Fixed bug in handling literal arguments in aggregation functions like Percentile .
Allow INT and FLOAT literals .
Fixed literal handling for all types .
Metrics Improvements
Added new metrics for tracking queries executed globally and at the table level .
New metrics to track join counts and window function counts .
Multiple meters and timers to track Multistage Engine Internals .
Notable Improvements and Bug Fixes
Improved Window operators resiliency, with new checks to make sure the window doesn't grow too large .
Optimized Group Key generation .
Fixed SortedMailboxReceiveOperator to honor convention of pulling at most 1 EOS block .
Upsert Compaction and Minion Improvements
Features and Improvements
Minion Resource Isolation
Minions now support resource isolation based on an instance tag.
Instance tag is configured at table level, and can be set for each task on a table.
This enables you to implement arbitrary resource isolation strategies, e.g., you can dedicate a set of Minion nodes to running any set of tasks across any set of tables.
Greedy Upsert Compaction Scheduling
Upsert compaction now schedules segments for compaction based on the number of invalid docs.
This helps the compaction task to handle arbitrary temporal distribution of invalid docs.
Notable Improvements
Minions can now download segments from servers when deepstore copy is missing. This feature is enabled via a cluster level config allowDownloadFromServer .
Added support for TLS Port in Minions .
New metrics added for Minions to track segment/record processing information .
Bug Fixes
Minions can now handle invalid instance tags in Task Configs gracefully. Prior to this change, Minions would be stuck in IN_PROGRESS state until task timeout .
Fix bug to return validDocIDsMetadata from all servers .
Fixed upsert compaction not retaining maxLength information and trimming string fields .
Upsert Improvements
Features and Improvements
Consistent Table View for Upsert Tables
Adds different modes of consistency guarantees for Upsert tables.
Adds a new UpsertConfig called consistencyMode which can be set to NONE, SYNC, SNAPSHOT.
SYNC is optimized for data freshness but can lead to elevated query latencies and is best for low-qps use-cases. In this mode, the ingestion threads will take a WLock when updating validDocID bitmaps.
Pluggable Partial Upsert Merger
Partial Upsert merges the old record and the new incoming record to generate the final ingested record.
Pinot now allows users to customize how this merge of an old row and the new row is computed.
This allows a column value in the new row to be an arbitrary function of the old and the new row.
Support for Uploading Externally Partitioned Segments for Upsert Backfill
Segments uploaded for Upsert Backfill can now explicitly specify the Kafka partition they belong to.
This enables backfilling an Upsert table where the externally generated segments are partitioned using an arbitrary hash function on an arbitrary primary key.
Misc Improvements and Bug Fixes
Fixed a Bug in Handling Equal Comparison Column Values in Upsert, which could lead to data inconsistency ()
Upsert snapshot will now snapshot only those segments which have updates. .
Notable Features
JSON Support Improvements
JSON Index can now be used for evaluating Regex and Range Predicates.
jsonExtractIndex now supports contextual array filters. .
JSON column type now supports filter predicates such as =, !=, IN, and NOT IN. This is convenient for scenarios where the JSON values are very small.
Lucene and Text Search Improvements
Improved Segment Build Time for Lucene Text Index by 40-60%. This improvement is realized when a consuming segment commits and is converted to an ImmutableSegment. This significantly helps in lowering ingestion lag at commit time due to a large text index .
Phrase Search can run 3x faster when the Lucene Index Config enablePrefixSuffixMatchingInPhraseQueries is set to true. This is achieved by rewriting phrase search query to a wildcard and prefix matching query .
New Funnel Functions
Added funnelMaxStep function which can be used to calculate max funnel steps for a given sliding window .
Added funnelCompleteCount to calculate the number of completed funnels, and funnelMatchStep to get the funnel match array.
Support for Interning for OnHeapByteDictionary
This can reduce the heap usage of a dictionary encoded byte column, for a certain distribution of duplicate values. See for details.
Column Major Builder On By Default for New Tables
Prior to this feature, on a segment commit, Pinot would convert all the columnar data from the Mutable Segment to row-major, and then re-build column major Immutable Segments.
This feature skips the row-major conversion and is expected to be both space and time efficient.
It can help lower ingestion lag from segment commits, especially helpful when your segments are large.
Support for SQL Formatting in Query Editor
You can now prettify SQL right in the Controller UI!
Hash Function for UUID Primary Keys
Added a new lossless hash-function for Upsert Primary Keys optimized for UUIDs.
The hash function can reduce Old Gen by up to 30%.
It maps a UUID to a 16-byte array, versus encoding it as a UTF string, which would take 36 bytes.
Column Level Index Skip Query Option
Convenient for debugging impact of indexes on query performance or results.
You can add the skipIndexes option to your query to skip any number of indexes. e.g. SET skipIndexes=inverted,range;
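For example, to check how results or latency change when specific indexes are bypassed (table and column names are hypothetical):
SET skipIndexes=inverted,range;
SELECT COUNT(*)
FROM clickEvents
WHERE statusCode BETWEEN 400 AND 499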
New UDFs and Scalar Functions
New GeoHash functions: encodeGeoHash, decodeGeoHash, decodeGeoHashLatitude and decodeGeoHashLongitude.
dateBin can be used to align a timestamp to the nearest time bucket.
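A hedged sketch of the GeoHash helpers, assuming encodeGeoHash takes (latitude, longitude, precision); the table and columns are hypothetical:
SELECT
  encodeGeoHash(latitude, longitude, 7) AS geoCell,  -- 7-character geohash cell
  COUNT(*) AS pickups
FROM rideEvents
GROUP BY encodeGeoHash(latitude, longitude, 7)
ORDER BY pickups DESC
LIMIT 10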
CLP Compression Codec in Forward Indexes
CLP is a compressed log processor with a very high compression ratio for certain log types.
To enable this, you can set the compressionCodec in the fieldConfigList of the column you want to target.
Misc. Improvements
Enable segment preloading at partition level .
Use Temurin instead of AdoptOpenJdk
Adding record reader config/context param to record transformer
Bug Fixes
Use gte(lte) to replace between() which has a bug
Fix the ConcurrentModificationException for And/Or DocIdSet
Upgrade RoaringBitmap to 1.0.5 to pick up the fix for RangeBitmap.between()
Database can be selected in a query using a SET statement, such as SET database=my_db;.
Fixed NPE in ArrayAgg when a column contains no data .
Fixed array literal handling .
Added support for ALL modifier for INTERSECT and EXCEPT Set Operations .
Fixed null literal handling for null intolerant functions .
Improvement in how execution stats are handled .
Use Protobuf instead of Reflection for Plan Serialization .
SNAPSHOT mode can handle high-qps/high-ingestion use-cases by getting the list of valid docs from a snapshot of validDocID. The snapshot can be refreshed every few seconds and the tolerance can be set via a query option upsertViewFreshnessMs.
JSON_MATCH now supports exclusive predicates correctly. For instance, you can use predicates such as JSON_MATCH(person, '"$.addresses[*].country" != ''us''' to find all people who have at least one address that is not in the US. .
jsonExtractIndex supports extracting Multi-Value JSON Fields, and also supports providing any default value when the key doesn't exist. .
Added isJson UDF which increases your options to handle invalid JSONs. This can be used in queries and for filtering invalid json column values in ingestion. .
Fix ArrayIndexOutOfBoundsException in jsonExtractIndex. .
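As a small sketch, isJson can be used to filter out malformed payloads at query time (table and column names are hypothetical); the same function can also back an ingestion-time filter:
SELECT payload
FROM rawEvents
WHERE isJson(payload) = true  -- keep only rows whose payload parses as valid JSON
LIMIT 10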
Fixed bug in TextMatchFilterOptimizer that was not applying precedence to the filter expressions properly, which could lead to incorrect results. .
Fixed bug in handling NOT text_match which could have returned incorrect results. .
Added SchemaConformingTransformerV2 to enhance text search abilities. .
Added metrics to track Lucene NRT Refresh Delay .
Switched to NRTCachingDirectory for Realtime segments and prevented duplicates in the Realtime Lucene Index to avoid IndexOutOfBounds query time exceptions. .
Lucene Version is upgraded to 9.11.1. .
Added prefixes, suffixes, and uniqueNgrams UDFs for generating the respective string subsequences from a string input.
splitPart UDF has minor improvements. .
Removing legacy commons-lang dependency
12508: Feature add segment rows flush config
ADSS Race Condition and update to client error codes
Add ExceptionMapper to convert Exception to Response Object for Broker REST API's
Add FunnelMaxStepAggregationFunction and FunnelCompleteCountAggregationFunction
Add GZIP Compression Codec (#11434)
Add PodDisruptionBudgets to the Pinot Helm chart
Add Postgres compliant name aliasing for String Functions.
Add SchemaConformingTransformerV2 to enhance text search abilities
Add a benchmark to measure multi-stage block serde cost
Add a plan version field to QueryRequest Protobuf Message
Add a post-validator visitor that verifies there are no cast to bytes
Add a safe version of CLStaticHttpHandler that disallows path traversal.
Add ability to track filtered messages offset
Add back 'numRowsResultSet' to BrokerResponse, and retain it when result table id hidden
Add back profile for shade
Add back some exclude deps from hadoop-mapreduce-client-core
Add backward compatibility regression test suite for multi-stage query engine
Add base class for custom object accumulator
Add clickstream example table for funnel analysis
Add config option for timezone
Add config to skip record ingestion on string column length exceeding configured max schema length
Add controller API to get allLiveInstances
Add isJson UDF
Add list of collaborators to asf.yaml
Add locking logic to get consistent table view for upsert tables
Add metric to track number of segments missed in upsert-snapshot
Add metrics for SEGMENTS_WITH_LESS_REPLICAS monitoring
Add mode to allow adding dummy events for non-matching steps
Add offset based lag metrics
Add protobuf codegen decoder
Add retry policy to wait for job id to persist during rebalancing
Add round-robin logic during downloadSegmentFromPeer
Add schema as input to the decoder.
Add splitPartWithLimit and splitPartFromEnd UDFs
Add support for creating raw derived columns during segment reload
Add support for raw JSON filter predicates
Add the possibility of configuring ForwardIndexes with compressionCodec
Add upsert-snapshot timer metric
Add validation check for forward index disabled if it's a REALTIME table
Added PR compatability test against release 1.1.0
Added kafka partition number to metadata.
Added pinot-error-code header in query response
Added tests for additional data types in SegmentPreProcessorTest.java
Adding a cluster config to enable instance pool and replica group configuration in table config
Adding batch api support for WindowFunction
Adding bytes string data type integration tests
Adding registerExtraComponents to allow registering additional components in various services
Adding support of insecure TLS
Adding support to insecure TLS when creating SSLFactory
Adds AGGREGATE_CASE_TO_FILTER rule
Adds per-column, query-time index skip option
Allow Aggregations in Case Expressions
Allow PinotHelixResourceManager subclasses to be used in the controller starter by providing an overridable PinotHelixResourceManager object creator function
Allow RequestContext to consider http-headers case-insensitivity
Allow Server throttling just before executing queries on server to allow max CPU and disk utilization
Allow all raw index config in star-tree index
Allow apply both environment variables and system properties to user and table configs, Environment variables take precedence over system properties
Allow configurable queryWorkerThreads in Pinot server side GrpcQueryServer
Allow dynamically setting the log level even for loggers that aren't already explicitly configured
Allow passing custom record reader to be inited/closed in SegmentProcessorFramework
Allow passing database context through database http header
Allow stop to interrupt the consumer thread and safely release the resource
Allow user configurable regex library for queries
Allow using 'serverReturnFinalResult' to optimize server partitioned table
Assign default value to newly added derived column upon reload
Avoid port conflict in integration tests
Better handling of null tableNames
CLP as a compressionCodec
Change helm app version to 1.0.0 for Apache Pinot latest release version
Clean Google Dependencies
Clean up BrokerRequestHandler and BrokerResponse
Clean up arbitrary sleep in /GrpcBrokerClusterIntegrationTest
Cleaning up vector index comments and exceptions
Cleanup HTTP components dependencies and upgrade Thrift
Cleanup Javax and Jakarta dependencies
Cleanup deprecated query options
Cleanup the consumer interfaces and legacy code
Cleanup unnecessary dependencies under pinot-s3
Cleanup unused aggregate internal hint
Consistency in API response for live broker
Consolidate bouncycastle libraries
Consolidate nimbus-jose-jwt version to 9.37.3
ControllerRequestClient accepts headers. Useful for authN tests
Custom configuration property reader for segment metadata files
Delete database API
Deprecate PinotHelixResourceManager#getAllTables() in favour of getAllTables(String databaseName)
Detect expired messages in Kafka. Log and set a gauge.
Do not hard code resource class in BaseClusterIntegrationTest
Do not pause ingestion when upsert snapshot flow errors out
Don't drop original field during flatten
Don't enforce -realTimeInstanceCount and -offlineInstanceCount options when creating broker tenants
Egalpin/skip indexes minor changes
Emit Metrics for Broker Adaptive Server Selector type
Emit table size related metrics only in lead controller
Enable complexType handling in SegmentProcessFramework
Enable more integration tests to run on the v2 multi-stage query engine
Enabling avroParquet to read Int96 as bytes
Enhance Kinesis consumer
Enhance Parquet Test
Enhance ProtoSerializationUtils to handle class move
Enhance Pulsar consumer
Enhance PulsarConsumerTest
Enhance commit threshold to accept size threshold without setting rows to 0
Enhance json index to support regexp and range predicate evaluation
Enhancement: Sketch value aggregator performance
Ensure FieldConfig.getEncodingType() is never null
Ensure all the lists used in PinotQuery are ArrayList
Ensure brokerId and requestId are always set in BrokerResponse
Enter segment preloading at partition level
Exclude dimensions from star-tree index stored type check
Expose more helper API in TableDataManager
Extend compatibility verifier operation timeout from 1m to 2m to reduce flakiness
Extract json individual array elements from json index for the transform function jsonExtractIndex
Fetch query quota capacity utilization rate metric in a callback function
First with time
GitHub Actions checkout v4
Gzip compression, ensure uncompressed size can be calculated from compressed buffer
Handle errors gracefully during multi-stage stats collection in the broker
Handle shaded classes in all methods of kafka factory
Hash Function for UUID Primary Keys
Ignore case when checking for Direct Memory OOM
Improve Retention Manager Segment Lineage Clean Up
Improve error message for max rows in join limit breach
Improve exception logging when we fail to index / transform message
Improve logging in range index handler for index updates
Improve upsert compaction threshold validations
Improve warn logs for requesting validDocID snapshots
Improved metrics for server grpc query
Improved null check for varargs
Improved segment build time for Lucene text index realtime to offline conversion
In ClusterTest, make start port higher to avoid potential conflict with Kafka
Introduce PinotLogicalAggregate and remove internal hint
Introduce retries while creating stream message decoder for more robustness
Isolate bad server configs during broker startup phase
Issue #12367
Json extract index filter support
Json extract index mv
Keep get tables API with and without database
Lint failure
Logging a warn message instead of throwing exception
Made the error message around dimension table size clearer
Make Helix state transition handling idempotent
Make KafkaConsumerFactory method less restrictive to avoid incompatibility
Make task manager APIs database aware
Metric for count of tables configured with various tier backends
Metric for upsert tables count
Metrics for Realtime Rows Fetched and Stream Consumer Create Exceptions
Minmaxrange null
Modify consumingSegmentsInfo endpoint to indicate how many servers failed
Move offset validation logic to consumer classes
Move package org.apache.calcite to org.apache.pinot.calcite
Move resolveComparisonTies from addOrReplaceSegment to base class
Move some mispositioned tests under pinot-core
Move wildfly-openssl dependency management to root pom
Moving deleteSegment call from POST to DELETE call
Optimize unnecessary extra array allocation and conversion for raw derived column during segment reload
Pass explicit TypeRef when evaluating MV jsonPath
Percentile operations supporting null
Prepare for next development iteration
Propagate Disable User Agent Config to Http Client
Properly handle complex type transformer in segment processor framework
Properly return response if SegmentCompletion is aborted
Publish helm 0.2.8
Publish helm 0.2.9
Pull janino dependency to root pom
Pull pulsar version definition into root POM
Query response opt
Re-enable the Spotless plugin for Java 21
Readme - How to setup Pinot UI for development
Record enricher
Refactor PinotTaskManager class
Refactored CommonsConfigurationUtils for loading properties configuration.
Refactored compatibility-verifier module
Refactoring removeSegment flow in upsert
Refine PeerServerSegmentFinder
Refine SegmentFetcherFactory
Replace custom fmpp plugin with fmpp-maven-plugin
Reposition query submission spot for adaptive server selection
Reset controller port when stopping the controller in ControllerTest
Rest Endpoint to Create ZNode
Return clear error message when no common broker found for multi-stage query with tables from different tenants
Returning tables names failing authorization in Exception of Multi State Engine Queries
Revert " Adding record reader config/context param to record transformer (#12520)"
Revert "Using local copy of segment instead of downloading from remote (#12863)"
Short circuit SubPlanFragmenter because we don't support multiple sub-plans yet
Simplify Google dependencies by importing BOM
Specify version for commons-validator
Support NOT in StarTree Index
Support empty strings as json nodes
Supporting human-readable format when configuring broker response size
Use ArrayList instead of LinkedList in SortOperator
Use a two server setup for multi-stage query engine backward compatibility regression test suite
Use more efficient variants of URLEncoder::encode and URLDecoder::decode
Use parameterized log messages instead of string concatenation
Use separate action for /tasks/scheduler/jobDetails API
Use try-with-resources to close file walk stream in LocalPinotFS
Using local copy of segment instead of downloading from remote
[Adaptive Server Selector] Add metrics for Stats Manager Queue Size
[Cleanup] Move classes in pinot-common to the correct package
[Feature] Add Support for SQL Formatting in Query Editor
[HELM]: Added additional probes options and startup probe.
[HELM]: Added checksum config annotation in stateful set for broker, controller and server
[HELM]: Added namespace support in K8s deployment.
[HELM]: zookeeper chart upgrade to version 13.2.0
[Minor] Add Nullable annotation to HttpHeaders in BrokerRequestHandler
[Minor] Small refactor of raw index creator constructor to be more clear
[Multi-stage] Clean up RelNode to Operator handling
[null-aggr] Add null handling support in mode aggregation
[partial-upsert] configure early release of _partitionGroupConsumerSemaphore in RealtimeSegmentDataManager
[spark-connector] Add option to fail read when there are invalid segments
add Netty arm64 dependencies
add Netty unit test
add SegmentContext to collect validDocIds bitmaps for many segments together
add skipUnavailableServers query option
add insecure mode when Pinot uses TLS connections
add instrumentation to json index getMatchingFlattenedDocsMap()
add jmx to promethues metric exporting rule for realtimeRowsFiltered
add metrics for IdeaState update
add some metrics for upsert table preloading
add some tests on jsonPathString
add test cases in RequestUtilsTest
add unit test for JsonAsyncHttpPinotClientTransport
add unit test for QueryServer
add unit test for ServerChannels
add unit test for StringFunctions encodeUrl
add unit tests for pinot-jdbc-client
add url assertion to SegmentCompletionProtocolTest
adjust the llc partition consuming metric reporting logic
allow passing null http headers object to translateTableName
allow to set segment when use SegmentProcessorFramework
auto renew jvm default sslconext when it's loaded from files
avoid useless intermediate byte array allocation for VarChunkV4Reader's getStringMV
aws sdk 2.25.3
build-helper-maven-plugin 3.5.0
cache ssl contexts and reuse them
clean up jetbrain nullable annotation
cleanup: maven no transfer progress
close JDBC connections
do not fail on duplicate relaxed vars (#13214)
dropwizard metrics 4.2.25
dynamic chunk sizing for v4 raw forward index
enable Netty leak detection
enable parallel Maven in pinot linter script
ensure inverse And/OrFilterOperator implementations match the query
exclude .mvn directory from source assembly
extend CompactedPinotSegmentRecordReader so that it can skip deleteRecord
get startTime outside the executor task to avoid flaky time checks
handle absent segments so that catchup checker doesn't get stuck on them
handle overflow for MutableOffHeapByteArrayStore buffer starting size
handle segments not tracked by partition mgr and add skipUpsertView query option
handle table name translation on missed api resources
hash4j version upgrade to 0.17.0
including the underlying exception in the logging output
int96 parity with native parquet reader
jsonExtractIndex support array of default values
log the log rate limiter rate for dropped broker logs
make http listener ssl config swappable
make reflection calls compatible with 0.9.11 (#12958)
maven: no transfer progress
missed to delete the temp dir
move shouldReplaceOnComparisonTie to base class to be more reusable
reduce Java enum .values() usage in TimerContext
reduce logging for SpecialValueTransformer
reduce regex pattern compilation in Pinot jdbc
refactor TlsUtils class
refine when to registerSegment while doing addSegment and replaceSegment for upsert tables for better data consistency
reformat AdminConsoleIntegrationTest.java
reformat ClusterTest.java
release segment mgrs more reliably
replaced getServer with getServers
report rebalance job status for the early returns like noops
require noDictionaryColumns with aggregationConfigs
share the same table config object
track segments for snapshotting even if they lost all comparisons
untrack the segment out of TTL
update ControllerJobType from enum to string
update RewriterConstants so that expr min max would not collide with columns start with "parent"
update access control check error handling to catch throwable and log errors
bugfix: do not move src ByteBuffer position for LZ4 length prefixed decompress
Bug Fix createDictionaryForColumn does not take into account inverted index
fix Cluster Manager error
fix for quick start Cluster Manager issue
Adding config for having suffix for client ID for realtime consumer
Addressed comments and fixed tests from pull request 12389. /uptime and /start-time endpoints working all components
Bugfix. Added missing paramName
Bug fix: Do not ignore scheme property
Bug fix: Handle missing shade config overwrites for Kafka
BugFix: Fix merge result from more than one server
Bugfix. Allow tenant rebalance with downtime as true
Bugfix. Avoid passing null table name input to translation util
Bugfix. Correct wrong method call from scheduleTask() to scheduleTaskForDatabase()
Bugfix. Maintain literal data type during function evaluation
Cleanup: Fix grammar in error message, also improve readability.
Fix Bug in Handling Equal Comparison Column Values in Upsert
Fix ColumnMinMaxValueGenerator
Fix JavaEE related dependencies
Fix Logging Location for CPU-Based Query Killing
Fix PulsarUtils to not share buffer
Fix URI construction so that AddSchema command line tool works when override flag is set to true
Fix [Type]ArrayList elements() method usage
Fix a typo when calculating query freshness
Fix an overflow in PinotDataBuffer.readFrom
Fix bug in logging in UpsertCompaction task
Fix bug to return validDocIDsMetadata from all servers
Fix connection issues if using JDBC and Hikari (#12267)
Fix controller host / port / protocol CLI option description for admin commands
Fix environment variables not applied when creating table
Fix error message for insufficient number of untagged brokers during tenant creation
Fix few metric rules which were affected by the database prefix handling
Fix file handle leaks in Pinot Driver (apache#12263)
Fix flakiness of ControllerPeriodicTasksIntegrationTest
Fix issue with startree index metadata loading for columns with '__' in name
Fix metric rule pattern regex
Fix pinot-parquet NoClassFound issue
Fix segment size check in OfflineClusterIntegrationTest
Fix some resource leak in tests
Fix the NPE from IS update metrics
Fix the NPE when metadataTTL is enabled without delete column
Fix the ServletConfig loading issue with swagger.
Fix the issue that map flatten shouldn't remove the map field from the record
Fix the race condition for H3InclusionIndexFilterOperator
Fix the time segment pruner on TIMESTAMP data type
Fix time stats in SegmentIndexCreationDriverImpl
Fixed infer logical type name from avro union schema
Fixing instance type to resolve and
Helm: bug fix for chart rendering issue.
Try to amend kafka common package with pinot shaded package prefix
Update getValidDocIdsMetadataFromServer to make call in batches to servers and other bug fixes
Upgrade com.microsoft.azure:msal4j from 1.3.5 to 1.3.10 for CVE fixing
[bugfix] Handling null value for kafka client id suffix
bugfix: fixing jdbc client sql feature not supported exception
bugfix: re-add support for not text_match
bugfix: reduce enum array allocation in QueryLogger
bugfix: use consumerDir during lucene realtime segment conversion
cleanup: fix apache rat violation
fix GuavaRateLimiter acquire method
fix fieldsToRead class not in decoder
fix flakey test, avoid early finalization
fix merging null multi value in partial upsert
fix race condition in ScalingThreadPoolExecutor
fix shared buffer, tests
fix(build): update node version to 16
fixing CVE critical issues by resolving kerby/jline and wildfly libraries
This release comes with several features, including SQL, UI, and performance enhancements. Also included are bug fixes across multiple features such as the V2 multi-stage query engine, ingestion, storage format, and SQL support.
Multi-stage query engine
Features
Support RelDistribution-based trait planning (,)
Adds support for RelDistribution optimization for more accurate leaf-stage direct exchange/shuffle. Also extends partition optimization beyond leaf stage to entire query plan.
Applies optimization based on distribution trait in the mailbox/worker assignment stage
Fixes the previous direct exchange, which was decided based on the table partition hint. Now direct exchange is decided via the distribution trait: it will be applied if and only if the propagated trait matches the exchange requirement.
Leaf stage planning with multi-semi join support ()
Solves the limitation of pinotQuery, which supports only a limited set of PlanNodes.
Float type column is treated as Double in the multistage engine, so FLOAT type is not supported.
Supports the data types BOOLEAN, INT, and LONG
Enhancements
Canonicalize SqlKind.OTHERS and SqlKind.OTHER_FUNCTIONS and support concat as the || operator ()
Capability for constant filter in QueryContext, with support for servers to handle it ()
Bugfixes, refactoring, cleanups, tests
Bugfix for evaluation of chained literal functions ()
Fixes to sort copy rule ( and )
Fixes duplicate results for literal queries ()
Notable features
Server-level throttling for realtime consumption ()
Use server config pinot.server.consumption.rate.limit to enable this feature
Server rate limiter is disabled by default (default value 0)
Reduce segment generation disk footprint for Minion Tasks ()
Supported in MergeRollupTask and RealtimeToOfflineSegmentsTask minion tasks
Use taskConfig segmentMapperFileSizeThresholdInBytes to specify the threshold size
Support for swapping of TLS keystore/truststore (, )
Security feature that makes the keystore/truststore swappable.
Auto-reloads keystore/truststore (without need for a restart) if they are local files
Sticky query routing ()
Adds support for deterministic and sticky routing for a query / table / broker. This setting would lead to same server / set of servers (for MultiStageReplicaGroupSelector) being used for all queries of a given table.
Query option (takes precedence over fixed routing setting at table / broker config level)
SET "useFixedReplica"=true;
Table config (takes precedence over fixed routing setting at broker config level)
Table Config to disallow duplicate primary key for dimension tables ()
Use tableConfig dimensionTableConfig.errorOnDuplicatePrimaryKey=true to enable this behavior
Disabled by default
Partition-level ForceCommit for realtime tables ()
Support to force-commit specific partitions of a realtime table.
Partitions can be specified to the forceCommit API as a comma separated list of partition names or consuming segment names
Support initializing broker tags from config ()
Support to give the broker initial tags on startup.
Automatically updates brokerResource when broker joins the cluster for the first time
Broker tags are provided as comma-separated values in pinot.broker.instance.tags
Support for StreamNative OAuth2 authentication for Pulsar ()
StreamNative (the cloud SAAS offering of Pulsar) uses OAuth2 to authenticate clients to their Pulsar clusters.
For more information, see how to
Can be configured by adding the following properties to streamConfigs:
Introduce low disk mode to table rebalance ()
Introduces a new table rebalance boolean config lowDiskMode.Default value is false.
Applicable for rebalance with downtime=false.
When enabled, segments will first be offloaded from servers, then added to servers after offload is done. It may increase the total time of the rebalance, but can be useful when servers are low on disk space, and we want to scale up the cluster and rebalance the table to more servers.
Support vector index and hierarchical navigable small worlds (HNSW) ()
Supports Vector Index on float array/multi-value columns
Add predicate and function to retrieve topK closest vector. Example query
The function l2_distance will return a double value where the first parameter is the embedding column and the second parameter is the search term embedding literal.
Since VectorSimilarity is a predicate, once the topK is configured, the predicate returns the top-K rows per segment. If you use this index together with other predicates, you may not get the expected number of rows, since records matching the other predicates might not be in the top-K rows.
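Putting the description above together, a hedged example query could look like the following; the table, column, and embedding literal are hypothetical, and the exact predicate spelling may differ:
SELECT
  productId,
  l2_distance(embedding, ARRAY[0.12, 0.34, 0.56, 0.78]) AS l2Dist   -- distance between the embedding column and the search embedding
FROM products
WHERE VectorSimilarity(embedding, ARRAY[0.12, 0.34, 0.56, 0.78], 10)  -- topK = 10 rows per segment
ORDER BY l2Dist
LIMIT 10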
Support for retention on deleted keys of upsert tables ()
Adds an upsert config deletedKeysTTL which will remove deleted keys from in-memory hashmap and mark the validDocID as invalid after the deletedKeysTTL threshold period.
Disabled by default. Enabled only if a valid value for deletedKeysTTL is set.
Configurable Lucene analyzer ()
Introduces the capability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis.
Sample usage
Default Behavior falls back to using the standardAnalyzer unless the luceneAnalyzerClass property is specified.
Support for murmur3 as a partition function ()
Murmur3 support with optional fields seed and variant for the hash in the functionConfig field of columnPartitionMap. Default value for seed is 0.
Added support for 2 variants of Murmur3: x86_32
New optimized MV forward index to only store unique MV values
Adds new MV dictionary encoded forward index format that only stores the unique MV entries.
This new index format can significantly reduce the index size when the MV entries repeat a lot
The new index format can be enabled during index creation, derived column creation, and segment reload
Support for explicit null handling modes ()
Adds support for 2 possible ways to handle null:
Table mode - which already exists
Column mode, which means that each column specifies its own nullability in the FieldSpec
Support tracking out of order events in Upsert ()
Adds a new upsert config outOfOrderRecordColumn
When set to a non-null value, we check whether an event is OOO or not and then accordingly update the corresponding column value to true / false.
This will help in tracking which event is out-of-order while using skipUpsert
Compression configuration support for aggregationConfigs to StartreeIndexConfigs ()
Can be used to save space, for example, when a functionColumnPairs entry has an output type of bytes, such as when you use distinctcountrawhll.
Sample config
Preconfiguration based mirror instance assignment ()
Supports instance assignment based pre-configured instance assignment map.
The assignment will always respect the mirrored servers in the pre-configured map
More details
Support for listing dimension tables ()
Adds dimension as a valid option to table "type" in the /tables controller API
Support in upsert for dropping out of order events ()
This patch adds a new config for upsert: dropOutOfOrderRecord
If set to true, Pinot doesn't persist out-of-order events in the segment.
This feature is useful to
Support to retry failed table rebalance tasks ()
New configs for the RebalanceChecker periodic task:
controller.rebalance.checker.frequencyPeriod: 5 minutes by default; set to -1 to disable
controller.rebalanceChecker.initialDelayInSeconds
Support for UltraLogLog ()
UltraLogLog aggregations for Count Distinct (distinctCountULL and distinctCountRawULL)
UltraLogLog creation via Transform Function
UltraLogLog merging in MergeRollup
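For instance, an approximate distinct count using the new UltraLogLog aggregation (table and column names are hypothetical):
SELECT distinctCountULL(userId) AS approxDistinctUsers  -- UltraLogLog-based estimate
FROM clickEvents
WHERE country = 'US'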
Support for Apache Datasketches CPC sketch ()
Ingestion via transformation function
Extracting estimates via query aggregation functions
Segment rollup aggregation
Support to reduce DirectMemory OOM chances on broker ()
Broadly there are two configs that will enable this feature:
maxServerResponseSizeBytes: Maximum serialized response size across all servers for a query. This value is equally divided across all servers processing the query.
maxQueryResponseSizeBytes: Maximum length of the serialized response per server for a query
UI support to allow schema to be created with JSON config ()
This is helpful when the user has the entire JSON handy.
The UI still keeps the form-based way to add a schema, along with the JSON view.
Support in JSON index for ignoring values longer than a given length ()
Use the maxValueLength option in jsonIndexConfig to restrict the length of indexed values.
A value of 0 (or when the key is omitted) means there is no restriction
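A hedged sketch, assuming a JSON column named payload configured under tableIndexConfig.jsonIndexConfigs with an illustrative 1000-character limit:
"jsonIndexConfigs": {
  "payload": {
    "maxValueLength": 1000
  }
}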
Support for MultiValue VarByte V4 index writer ()
Supports serializing and writing MV columns in VarByteChunkForwardIndexWriterV4
Supports V4 reader that can be used to read SV var length, MV fixed length and MV var length buffers encoded with V4 writer
Improved scalar function support for multi-value columns (, )
Support for FrequentStringsSketch and FrequentLongsSketch aggregation functions ()
Approximation aggregation functions for estimating the frequencies of items in a dataset in a memory-efficient way. More details in the library documentation.
Controller API for table index ()
Table index API to get the aggregate index details of all segments for a table.
URL: /tables/{tableName}/indexes
Response format
Support for configurable rebalance delay at lead controller ()
The lead controller rebalance delay is now configurable with controller.resource.rebalance.delay_ms
Changing rebalance configurations will now update the lead controller resource
Support for configuration through environment variables ()
Adds support for Pinot configuration through environment variables with dynamic mapping.
More details are in the related issue.
Sample configs through ENV
Add hyperLogLogPlus aggregation function for distinct count ()
HLL++ has higher accuracy than HLL when the cardinality of the dimension is in the 10k-100k range.
More details
Support for clpMatch
Adds query rewriting logic to transform a "virtual" UDF, clpMatch, into a boolean expression on the columns of a CLP-encoded field.
To use the rewriter, modify broker config to add org.apache.pinot.sql.parsers.rewriter.ClpRewriter to pinot.broker.query.rewriter.class.names.
Support for DATETIMECONVERTWINDOWHOP function ()
Support for JSON_EXTRACT_INDEX transform function to leverage json index for json value extraction ()
Support for ArrayAgg aggregation function ()
GenerateData command support for generating data in JSON format ()
Enhancements
SQL
Support ARRAY function as a literal evaluation ()
Support for ARRAY literal transform functions ()
Theta Sketch Aggregation enhancements ()
UI
Async rendering of UI elements, resulting in faster page loads ()
Make the table name link clickable in task details ()
Swagger UI enhancements to resumeConsumption API call ()
Misc
Enhancement to reduce the heap usage of String Dictionaries that are loaded on-heap ()
Wire soft upsert delete for Compaction task ()
Upsert compaction debuggability APIs for validDocId metadata ()
Bug fixes, refactoring, cleanups, deprecations
Upsert bugfix in "rewind()" for CompactedPinotSegmentRecordReader ()
Fix error message format for Preconditions.checks failures ()
Bugfix to distribute Pinot as a multi-release JAR (, )
Backward incompatible Changes
Fix a race condition for upsert compaction (). Notes on backward incompatibility below:
This PR introduces backward incompatibility for UpsertCompactionTask. Previously, the compaction task could be configured without the snapshot enabled. We found that using in-memory validDocIds is risky because it does not guarantee consistency (e.g., fetching the validDocIds bitmap while the server is restarting and updating validDocIds).
We now enforce enableSnapshot=true for UpsertCompactionTask; advanced users who want to run the compaction with the in-memory validDocIds bitmap should see the invalidDocIdsType options described later in these notes.
Library upgrades and dependencies
Update maven-jar-plugin and maven-enforcer-plugin version (#11637)
Update testng as the test provider explicitly instead of relying on the classpath. ()
Update compatibility verifier version ()
As a side effect, the is_colocated_by_join_keys query option is reintroduced to ensure dynamic broadcast, which can also benefit from the direct exchange optimization.
Allows propagation of partition distribution trait info across the tree to be used during the physical planning phase. It can be used in several scenarios (to be followed up in separate PRs).
Note on backward incompatibility:
The is_colocated_by_join_keys hint is now required for colocated joins.
This should only affect semi-joins, because they are the only joins utilizing broadcast exchange that were pulled to act as direct exchange.
Inner/left/right/full joins automatically apply colocation, so the backward incompatibility should not affect them.
Any remaining nodes that cannot be planned into PinotQuery will be run locally, together with the LeafStageTransferrableBlockOperator as the input.
Bugfix for IN and NOT IN filters within case statements (#12305)
Broker conf - pinot.broker.use.fixed.replica=true
#12112 adds the UI capability to toggle this option
The variant is configurable using the variant field in functionConfig. If no variant is provided, the x86_32 variant is kept, as it was part of the original implementation.
Examples of functionConfig:
If no functionConfig is configured, the seed value will be 0 and the variant will be x86_32.
If the seed is configured as 9001 but no variant is provided, x86_32 will be picked.
If the variant is specified, Murmur3 will use the x64_32 variant with 9001 as the seed.
Note for users using Debezium with Murmur3 as the partitioning function:
The partitioning key should be set up on a byte[], String, or long[] column.
On the Pinot side, the variant should be set to x64_32 and the seed to 9001.
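A hedged sketch of the third case above (explicit seed and variant) inside segmentPartitionConfig; the column name and partition count are illustrative:
"segmentPartitionConfig": {
  "columnPartitionMap": {
    "memberId": {
      "functionName": "Murmur3",
      "numPartitions": 4,
      "functionConfig": {
        "seed": "9001",
        "variant": "x64_32"
      }
    }
  }
}
Omitting functionConfig entirely gives the first case (seed 0, x86_32), and providing only the seed gives the second case (x86_32 with the configured seed).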
To enable the new MV forward index format described above, set the compression codec in the FieldConfig, or configure it through the new index JSON.
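A hedged sketch of the FieldConfig route; the column name is illustrative, and the codec name (MV_ENTRY_DICT) is an assumption that should be checked against the forward index docs:
"fieldConfigList": [
  {
    "name": "tags",
    "encodingType": "DICTIONARY",
    "compressionCodec": "MV_ENTRY_DICT"
  }
]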
Column mode can be enabled with the config below.
The default value for enableColumnBasedNullHandling is false. When set to true, Pinot ignores TableConfig.IndexingConfig.nullHandlingEnabled, and columns are nullable if and only if FieldSpec.notNull is false, which is also the default value.
Sample config
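A hedged sketch of column mode, assuming enableColumnBasedNullHandling is set at the schema level alongside per-column notNull flags (the schema and column names are illustrative):
{
  "schemaName": "transactions",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    { "name": "description", "dataType": "STRING", "notNull": false },
    { "name": "transactionId", "dataType": "LONG", "notNull": true }
  ]
}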
The dropOutOfOrderRecord feature described earlier is useful to:
Save disk usage.
Avoid confusion when using skipUpsert for partial-upsert tables, where nulls start showing up for columns that previously held non-null values and it is not clear whether the event was out-of-order.
New configs added for RebalanceConfig (for the table rebalance retry support above):
heartbeatIntervalInMs: 300_000 i.e. 5min
heartbeatTimeoutInMs: 3600_000 i.e. 1hr
maxAttempts: 3 by default, i.e. the original run plus two retries
retryInitialDelayInMs: 300_000 i.e. 5min, for exponential backoff w/ jitters
New metrics to monitor rebalance and its retries:
TABLE_REBALANCE_FAILURE("TableRebalanceFailure", false), emit from TableRebalancer.rebalanceTable()
TABLE_REBALANCE_EXECUTION_TIME_MS("tableRebalanceExecutionTimeMs", false), emit from TableRebalancer.rebalanceTable()
TABLE_REBALANCE_FAILURE_DETECTED("TableRebalanceFailureDetected", false), emit from RebalanceChecker
TABLE_REBALANCE_RETRY("TableRebalanceRetry", false), emit from RebalanceChecker
New REST API
DELETE /tables/{tableName}/rebalance to stop a rebalance. In comparison, POST /tables/{tableName}/rebalance starts one.
Support for UltraLogLog in Star-Tree indexes (StarTree aggregation)
The configs are available as a query option, table config, and broker config, with a defined priority of enforcement among them.
Adds configuration options for DistinctCountThetaSketchAggregationFunction
Respects ordering for existing Theta sketches to use "early-stop" optimisations for unions
Add query option override for Broker MinGroupTrimSize (#11984)
Support for 2 new scalar functions for bytes: toUUIDBytes and fromUUIDBytes (#11988)
Config option to make groupBy trim size configurable at Broker (#11958)
Pre-aggregation support for distinct count hll++ (#11747)
Add float type into literal thrift to preserve literal type conforming to SQL standards (#11697)
Enhancement to add query function override for Aggregate functions of multi valued columns (#11307)
Perf optimization in IN clause evaluation (#11557)
Add TextMatchFilterOptimizer to maximally push down text_match filters to Lucene (#12339)
Adds support for CTRL key as a modifier for Query shortcuts (#12087)
UI enhancement to show partial index in reload (#11913)
UI improvement to add Links to Instance in Table and Segment View (#11807)
Fixes reload to use the right indexes API instead of fetching all segment metadata (#11793)
Enhancement to add toggle to hide/show query exceptions (#11611)
Make server resource classes configurable (#12324)
Shared aggregations for Startree index - mapping from aggregation used in the query to aggregation used to store pre-aggregated values (#12164)
Increased fetch timeout for Kinesis to prevent stuck Kinesis consumers
Allow users to pass custom RecordTransformers to SegmentProcessorFramework (#11887)
Add isPartialResult flag to broker response (#11592)
Add new configs to Google Cloud Storage (GCS) connector: jsonKey (#11890)
jsonKey is the GCP credential key in string format (either a plain string or a base64-encoded string). Refer to Creating and managing service account keys to download the keys.
Performance enhancement to build segments in column orientation (#11776)
Disabled by default. Can be enabled by setting the table config columnMajorSegmentBuilderEnabled.
Observability enhancements to emit metrics for grpc request and multi-stage leaf stage (#11838)
pinot.server.query.log.maxRatePerSecond: query log max rate (QPS, default 10K)
pinot.server.query.log.droppedReportMaxRatePerSecond: dropped query log report max rate (QPS, default 1)
Observability improvement to expose GRPC metrics (#11842)
Improvements to response format for reload API to be pretty printed (#11608)
Add more information in RequestContext class (#11708)
Support to read exact buffer byte ranges corresponding to a given forward index doc id (#11729)
Enhance Broker reducer to handle expression format change (#11762)
Capture build scans on ge.apache.org to benefit from deep build insights (#11767)
Performance enhancement in multiple places by updating initial capacity of HashMap (#11709)
Support for building indexes post segment file creation, allowing indexes that may depend on a completed segment to be built as part of the segment creation process (#11711)
Support excluding time values in SimpleSegmentNameGenerator (#11650)
Perf enhancement to reduce cpu usage by avoiding throwing an exception during query execution (#11715)
Added framework for supporting nulls in ScalarTransformFunctionWrapper in the future (#11653)
Observability change to metrics to export netty direct memory used and max (#11575)
Observability change to add a metric to measure total thread cpu time for a table (#11713)
Observability change to use SlidingTimeWindowArrayReservoir in dropwizard metrics (#11695)
Fix the bug of using push time to identify a newly created segment (#11599)
Bugfix in CSVRecordReader when using line iterator (#11581)
Remove split commit and some deprecated config for real-time protocol on controller (#11663)
Improved validation for single argument aggregation functions (#11556)
Fix to not emit lag once the TableDataManager is shut down (#11534)
Bugfix to fail reload if derived columns can't be created (#11559)
Fix the double unescape of property value (#12405)
Fix for the backward compatible issue that existing metadata may contain unescaped characters (#12393)
Skip invalid json string rather than throwing error during json indexing (#12238)
Fixing the multiple files concurrent write issue when reloading SSLFactory (#12384)
Fix memory leaking issue by making thread local variable static (#12242)
Simplify kafka build and remove old kafka 0.9 files (#11638)
Add comments for docker image tags and make a hyperlink to helmChart from the root directory (#11646)
Improve the error response on the controller (#11624)
Simplify authorization for table config get (#11640)
Bugfix to remove segments with empty download url in UpsertCompactionTask (#12320)
Test changes to make taskManager resources protected for derived classes to override in their setUp() method. (#12335)
Also, invalidDocIdsType can now be configured for UpsertCompactionTask for advanced users. The supported types are:
snapshot: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot from the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.
onHeap: the validDocIds bitmap will be fetched from the server.
onHeapWithDelete: the validDocIds bitmap will be fetched from the server. This will also take into account the deleted documents. UpsertConfig's deleteRecordColumn must be provided for this type.
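A hedged sketch of the task config with the new option; the schedule is illustrative:
"task": {
  "taskTypeConfigsMap": {
    "UpsertCompactionTask": {
      "schedule": "0 */10 * * * ?",
      "invalidDocIdsType": "snapshot"
    }
  }
}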
Removal of the feature flag allow.table.name.with.database (#12402)
Error handling to throw exception when schema name doesn't match table name during table creation (#11591)
Fix type cast issue with dateTimeConvert scalar function (#11839, #11971)
Incompatible API fix to remove table state update operation in GET call (#11621)
Use string to represent BigDecimal datatype in JSON response (#11716)
Single quoted literal will not have its type auto-derived to maintain SQL compatibility (#11763)
Changes to always use split commit on server and disables the option to disable it (#11680, #11687)
Change to not allow NaN as default value for Float and Double in Schemas (#11661)
Code cleanup and refactor that removes TableDataManagerConfig (#12189)
Fix partition handling for consistency of values between query and segment (#12115)
Changes for migration to commons-configuration2 (#11985)
Cleanup to simplify the upsert metadata manager constructor (#12120)
Example query for the vector index and VECTOR_SIMILARITY predicate described at the top of these notes:
SELECT ProductId, UserId, l2_distance(embedding, ARRAY[-0.0013143676,-0.011042999,...]) AS l2_dist, n_tokens, combined
FROM fineFoodReviews
WHERE VECTOR_SIMILARITY(embedding, ARRAY[-0.0013143676,-0.011042999,...], 5)
ORDER BY l2_dist ASC
LIMIT 10