Explore the fundamental concepts of Apache Pinot™ as a distributed OLAP database.
Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:
Storing data in columnar form to support high-performance scanning
Sharding of data to scale both storage and computation
A distributed architecture designed to scale capacity linearly
A tabular data model read by SQL queries
To learn about Pinot components, terminology, and gain a conceptual understanding of how data is stored in Pinot, review the following sections:
The segment threshold determines when a segment is committed in real-time tables.
When data is first ingested from a streaming provider like Kafka, Pinot stores the data in a consuming segment.
This segment is on the disk of the server(s) processing a particular partition from the streaming provider.
However, it's not until a segment is committed that the segment is written to the deep store. The segment threshold decides when that should happen.
Why is the segment threshold important?
The segment threshold is important because it ensures segments are a reasonable size.
When queries are processed, smaller segments may increase query latency due to more overhead (number of threads spawned, metadata processing, and so on).
Larger segments may cause servers to run out of memory. And when a server is restarted, its consuming segment must start consuming from the first row again, so an oversized consuming segment also causes a longer lag between Pinot and the streaming provider.
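As a rough sketch (the keys below are the commonly documented stream-level flush settings and should be verified against your Pinot version), the segment threshold is controlled in the real-time table's streamConfigs:
"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.segment.size": "150M"
}
Setting the row threshold to 0 lets Pinot size segments by the target segment size instead of a fixed row count.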
Mark Needham explains the segment threshold
Frequently Asked Questions (FAQs)
This page lists the frequently asked questions pages, with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics, and perfect for user-facing analytical workloads.
Apache Pinot™ is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch data sources (including Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage).
Ultra low-latency analytics even at extremely high throughput.
Columnar data store with several smart indexing and pre-aggregation techniques.
Scaling up and out with no upper bound.
Consistent performance based on the size of your cluster and an expected query per second (QPS) threshold.
It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.
User-facing real-time analytics
User-facing analytics refers to the analytical tools exposed to the end users of your product. In a user-facing analytics application, all users receive personalized analytics on their devices, resulting in hundreds of thousands of queries per second. Query volume grows quickly in proportion to the number of active users on the app, and the underlying event streams can reach millions of events per second. Data ingested into Pinot is immediately available for analytics, with latencies under one second.
User-facing real-time analytics requires the following:
Fresh data. The system needs to be able to ingest data in real time and make it available for querying, also in real time.
Support for high-velocity, highly dimensional event data from a wide range of actions and from multiple sources.
Low latency. Queries are triggered by end users interacting with apps, resulting in hundreds of thousands of queries per second with arbitrary patterns.
Why Pinot?
Pinot is designed to execute OLAP queries with low latency. It works well where you need fast analytics, such as aggregations, on both mutable and immutable data.
User-facing, real-time analytics
Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications, such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a user-facing analytics app built with Pinot.
Real-time dashboards for business metrics
Pinot can perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. Connect various business intelligence (BI) tools such as Superset, Tableau, or PowerBI to visualize data in Pinot.
Enterprise business intelligence
For analysts and data scientists, Pinot works well as a highly-scalable data platform for business intelligence. Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.
Enterprise application development
For application developers, Pinot works well as an aggregate store that sources events from streaming data sources, such as Kafka, and makes them available for querying with SQL. You can also use Pinot to aggregate data across a microservice architecture into one easily queryable view of the domain.
Pinot prevents any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent.
Get started
If you're new to Pinot, take a look at our Getting Started guide:
To start importing data into Pinot, see how to import batch and stream data:
To start querying data in Pinot, check out our Query guide:
Learn
For a conceptual overview that explains how Pinot works, check out the Concepts guide:
To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:
Pinot storage model
Apache Pinot™ uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system, including:
Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. To achieve this, all data needs to be distributed across multiple nodes. Pinot achieves this by breaking data into smaller chunks known as segments (similar to shards/partitions in HA relational databases). Segments can also be seen as time-based partitions.
Table
Similar to traditional databases, Pinot has the concept of a table—a logical abstraction that refers to a collection of related data. As is the case with relational database management systems (RDBMS), a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a schema, which defines the columns in a table as well as their data types.
As opposed to RDBMS schemas, multiple tables can be created in Pinot (real-time or batch) that inherit a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, and replication.
Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Columns have a name and a data type, and these definitions together are known as the table's schema.
Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
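For illustration only (the schema and column names here are hypothetical), a minimal schema file looks like this:
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [{"name": "studentID", "dataType": "STRING"}],
  "metricFieldSpecs": [{"name": "score", "dataType": "FLOAT"}],
  "dateTimeFieldSpecs": [{"name": "timestampInEpoch", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}]
}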
Pinot table types include:
real-time: Ingests data from a streaming source like Apache Kafka®
offline: Loads data from a batch source
hybrid: Loads data from both a batch source and a streaming source
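As a sketch (table and schema names are hypothetical, and a real config typically carries more settings), the type is declared with the tableType field in the table config:
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tableIndexConfig": {},
  "tenants": {},
  "metadata": {}
}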
Segment
Pinot tables are stored in one or more independent shards called segments. A small table may be contained by a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see the ingestion guides). Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.
Tenant
To support multi-tenancy, Pinot has first-class support for tenants. A table is associated with a tenant. This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation between tenants provides different namespaces for applications and teams to prevent sharing tables or schemas. Development teams building applications do not have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.
Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data from separate workloads from being stored or processed on the same physical hardware.
By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.
Cluster
A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.
Physical architecture
A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop.
Controller: Maintains cluster metadata and manages cluster resources.
Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.
Broker: Accepts queries from client processes and forwards them to servers for processing.
Server: Provides storage for segment files and compute for query processing.
(Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).
The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.
Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.
Helix is a cluster management solution created by the authors of Pinot. Helix maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. It constantly monitors the cluster to ensure that the right hardware resources are allocated to implement the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.
Controller
A controller is the core orchestrator that drives the consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and have visibility into the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent that is a participant responsible for observing and driving state changes that are subscribed to by other components.
The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time and offline tables). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
Server
Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can be either a real-time server or an offline server.
Real-time and offline servers have very different resource usage requirements: real-time servers continually consume new messages from external systems (such as Kafka topics), which are ingested and allocated to segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.
Broker
Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return them to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.
A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.
Pinot minion
Pinot minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation) compliance. Because Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a GDPR-compliant solution for this purpose while optimizing Pinot segments and building additional indexes that guarantee performance even when data may be deleted. You can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (the minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.
Getting Started
This section contains quick start guides to help you get up and running with Pinot.
Running Pinot
To simplify the getting started experience, Apache Pinot™ ships with quick start guides that launch Pinot components in a single process and import pre-built datasets.
For a full list of these guides, see .
Running on public clouds
This page links to multiple quick start guides for deploying Pinot to different public cloud providers.
These quickstart guides show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.
General
This page has a collection of frequently asked questions of a general nature with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
How does Apache Pinot use deep storage?
Segment retention
In this Apache Pinot concepts guide, we'll learn how segment retention works.
Segments in Pinot tables have a retention time, after which the segments are deleted. Typically, offline tables retain segments for a longer period of time than real-time tables.
The removal of segments is done by the retention manager. By default, the retention manager runs once every 6 hours.
The retention manager purges two types of segments:
Expired segments: Segments whose end time has exceeded the retention period.
0.9.1
Summary
This release fixes a major issue and a pinot-admin exit code bug ().
The release is based on the release 0.9.0 with the following cherry-picks:
0.9.3
Summary
This is a bug-fixing release that contains:
Update Log4j to 2.17.0 to address ()
0.12.1
Summary
This is a bug-fixing release that contains:
use legacy case-when format ()
Replaced segments: Segments that have been replaced as part of the merge rollup task.
There are a couple of scenarios where segments in offline tables won't be purged:
If the segment doesn't have an end time. This would happen if the segment doesn't contain a time column.
If the segment's table has a segmentIngestionType of REFRESH.
If the retention period isn't specified, segments aren't purged from tables.
The retention manager initially moves these segments into a Deleted Segments area, from where they will eventually be permanently removed.
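A hedged sketch of how retention is usually declared in the table's segmentsConfig (verify the keys against your Pinot version):
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30"
}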
Getting data into Pinot is easy. Take a look at these two quick start guides which will help you get up and running with sample data for offline and real-time tables.
This page describes configuring the range index for Apache Pinot
Range indexing allows you to get better performance for queries that involve filtering over a range.
It would be useful for a query like the following:
SELECT COUNT(*)
FROM baseballStats
WHERE hits > 11
A range index is a variant of an inverted index, where instead of creating a mapping from values to document IDs, we create a mapping from ranges of values to document IDs. You can use the range index by setting the following config in the table configuration.
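A minimal sketch, reusing the hits column from the query above (verify the exact key against your Pinot version):
"tableIndexConfig": {
  "rangeIndexColumns": ["hits"]
}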
Range index is supported for dictionary encoded columns of any type as well as raw encoded columns of a numeric type. Note that the range index can also be used on a dictionary encoded time column using STRING type, since Pinot only supports datetime formats that are in lexicographical order.
A good rule of thumb is to use a range index when you want to apply range predicates on metric columns that have a very large number of unique values. Using an inverted index for such columns would create a very large index that is inefficient in terms of storage and performance.
When data is pushed to Apache Pinot, Pinot makes a backup copy of the data and stores it on the configured deep storage (S3/GCP/ADLS/NFS/etc.). This copy is stored as tar.gz Pinot segments. Note that Pinot servers also keep an untarred copy of the segments on their local disk, for performance reasons.
How does Pinot use Zookeeper?
Pinot uses Apache Helix for cluster management, which in turn is built on top of Zookeeper. Helix uses Zookeeper to store the cluster state, including Ideal State, External View, Participants, and so on. Pinot also uses Zookeeper to store information such as Table configurations, schemas, Segment Metadata, and so on.
Why am I getting "Could not find or load class" error when running Quickstart using 0.8.0 release?
Check the JDK version you are using. You may be getting this error if you are using an older version than the current Pinot binary release was built on. If so, you have two options: switch to the same JDK release as Pinot was built with or download the source code for the Pinot release and build it locally.
How to change TimeZone when running Pinot?
Pinot uses the local timezone by default. To change the timezone, set the pinot.timezone value in the .conf config file. It is set once for all Pinot components (Controller, Broker, Server, Minion). See the following sample configuration:
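A hedged sample (the exact file depends on which component config you edit):
pinot.timezone=UTC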
This page lists options for importing data into Apache Pinot™ with links to detailed instructions with examples.
There are multiple options for importing data into Apache Pinot™. The pages in this section provide step-by-step instructions for importing records into Pinot, supported by our plugin architecture. The intent is to get you up and running with imported data as quickly as possible.
Pinot supports multiple file input formats without needing to change anything other than the file name. Each example imports a ready-made dataset so you can see how things work without needing to find or create your own dataset.
Pinot Batch Ingestion
These guides show you how to import data from popular big data platforms.
Pinot Stream Ingestion
This guide shows you how to import data using stream ingestion from Apache Kafka topics.
This guide shows you how to import data using stream ingestion with upsert.
This guide shows you how to import data using stream ingestion with deduplication.
This guide shows you how to import data using stream ingestion with CLP.
Pinot file systems
By default, Pinot does not come with a storage layer, so the data sent to Pinot won't be stored in case of a system crash. In order to persistently store the generated segments, you will need to change the controller and server configs to add a deep store. See the file system guides for the related configs.
These guides show you how to import data and persist it in these file systems.
Pinot input formats
This guide shows you how to import data from various Pinot-supported input formats.
This guide shows you how to handle the complex type in the ingested data, such as map and array.
This guide shows additional examples on how to work with complex types.
This guide shows you how to handle records with dynamic schemas, like JSON log events.
Reloading and uploading existing Pinot segments
This guide shows you how to reload Pinot segments from your deep store.
This guide shows you how to upload Pinot segments from an old, closed Pinot instance.
Components
Discover the core components of Apache Pinot, enabling efficient data processing and analytics. Unleash the power of Pinot's building blocks for high-performance data-driven applications.
Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:
Storing data in columnar form to support high-performance scanning
Sharding of data to scale both storage and computation
A distributed architecture designed to scale capacity linearly
A tabular data model read by SQL queries
Components
Learn about the major components and logical abstractions used in Pinot.
Operator reference
Developer reference
Time boundary
Learn about time boundaries in hybrid tables.
Learn about time boundaries in hybrid tables. A hybrid table is the combination of an offline table and a real-time table that share the same name.
When querying these tables, the Pinot broker decides which records to read from the offline table and which to read from the real-time table. It does this using the time boundary.
How is the time boundary determined?
The time boundary is determined by looking at the maximum end time of the offline segments and the segment ingestion frequency specified for the offline table.
If it's set to hourly, then: time boundary = max(all offline segments' end time) - 1 HOUR
Otherwise: time boundary = max(all offline segments' end time) - 1 DAY
It is possible to force the hybrid table to use max(all offline segments' end time) by calling the API (V 0.12.0+)
Note that this will not automatically update the time boundary as more segments are added to the offline table; the API must be called each time a segment with a more recent end time is uploaded to the offline table. You can revert back to using the derived time boundary by calling the API:
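A sketch of the two calls, assuming a controller at localhost:9000; the endpoint paths below are an assumption and should be checked against the controller REST API for your version:
curl -X POST "http://localhost:9000/tables/myHybridTable/timeBoundary"
curl -X DELETE "http://localhost:9000/tables/myHybridTable/timeBoundary"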
Querying
When a Pinot broker receives a query for a hybrid table, the broker sends a time boundary annotated version of the query to the offline and real-time tables.
For example, if we executed the following query:
The broker would send the following query to the offline table:
And the following query to the real-time table:
The results of the two queries are merged by the broker before being returned to the client.
Ensure you have available Pinot Minion instances deployed within the cluster.
Pinot version is 0.11.0 or above
How it works
Parse the query to get the table name, the directory URI, and the list of options for the ingestion job.
Call the controller's minion task execution API endpoint to schedule the task on a minion.
The response contains the table name and the task job ID.
Usage Syntax
INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]*
Example
Insert Rows into Pinot
We are actively developing this feature...
The details will be revealed soon.
Server
Uncover the efficient data processing and storage capabilities of Apache Pinot's server component, optimizing performance for data-driven applications.
Pinot servers provide the primary storage for and perform the computation required to execute queries. A production Pinot cluster contains many servers. In general, the more servers, the more data the cluster can retain in tables, the lower latency the cluster can deliver on queries, and the more concurrent queries the cluster can process.
Servers are typically segregated into real-time and offline workloads, with "real-time" servers hosting only real-time tables, and "offline" servers hosting only offline tables. This is a ubiquitous operational convention, not a difference or an explicit configuration in the server process itself. There are two types of servers:
Offline
Broker
Discover how Apache Pinot's broker component optimizes query processing, data retrieval, and enhances data-driven applications.
Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return results to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.
A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.
Pinot brokers are modeled as Helix spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.
The broker ensures that all the rows of the table are queried exactly once so as to return correct, consistent results for a query. The brokers may also optimize by pruning segments that cannot match the query, without sacrificing the accuracy of the results.
Deep Store
Leverage Apache Pinot's deep store component for efficient large-scale data storage and management, enabling impactful data processing and analysis.
The deep store (or deep storage) is the permanent store for segment files.
It is used for backup and restore operations. New nodes in a cluster will pull down a copy of segment files from the deep store. If the local segment files on a server gets damaged in some way (or accidentally deleted), a new copy will be pulled down from the deep store on server restart.
The deep store stores a compressed version of the segment files and it typically won't include any indexes. These compressed files can be stored on a local file system or on a variety of other file systems. For more details on supported file systems, see .
Note: Deep store by itself is not sufficient for restore operations. Pinot stores metadata such as table config, schema, segment metadata in Zookeeper. For restore operations, both Deep Store as well as Zookeeper metadata are required.
Backfill Data
Batch ingestion of backfill data into Apache Pinot.
Introduction
Pinot batch ingestion involves two parts: a routine ingestion job (hourly or daily) and backfill. Here are some examples to show how routine batch ingestion works for a Pinot offline table:
Troubleshooting Pinot
Find debug information in Pinot
Pinot offers various ways to assist with troubleshooting and debugging problems that might happen.
Start with the table debug API, which will surface many of the commonly occurring problems. The debug API provides information such as tableSize, ingestion status, and error messages related to state transitions on servers.
The table debug API can be invoked via the Swagger UI, as in the following image:
File Systems
This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.
FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).
Pinot uses distributed file systems for the following purposes:
Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to DFS.
Reload a table segment
Reload a table segment in Apache Pinot.
When Pinot writes data to segments in a table, it saves those segments to a deep store location specified in your table configuration, such as a storage drive or Amazon S3 bucket.
If a new column is added to your table or schema configuration during ingestion, incorrect data may appear in the consuming segment(s). To ensure accurate values are reloaded, see how to .
Native text index
This page talks about native text indices and corresponding search functionality in Apache Pinot.
Experimental
This index is experimental and should only be used for testing. It is not recommended for use in production.
Instead, use the Lucene-based text index.
Release notes
The following summarizes Apache Pinot™ releases, from the latest one to the earliest one.
Note
Before upgrading from one version to another one, read the release notes. While the Pinot committers strive to keep releases backward-compatible and introduce new features in a compatible manner, your environment may have a unique combination of configurations/data/schema that may have been somehow overlooked. Before you roll out a new release of Pinot on your cluster, it is best that you run the compatibility test suite that Pinot provides. The tests can be easily customized to suit the configurations and tables in your Pinot cluster(s). As a good practice, you should build your own test suite, mirroring the table configurations, schema, sample data, and queries that are used in your cluster.
Organize raw data into buckets (e.g., /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (e.g., /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro).
Run a Pinot batch ingestion job, which points to a specific date folder like /var/pinot/airlineStats/rawdata/2014/01/01. The segment generation job will convert each such Avro file into a Pinot segment for that day and give it a unique name.
Run a Pinot segment push job to upload those segments with those unique names via a controller API.
IMPORTANT: The segment name is the unique identifier used to uniquely identify that segment in Pinot. If the controller gets an upload request for a segment with the same name - it will attempt to replace it with the new one.
This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.
How to backfill data in Pinot
Pinot supports data modification only at the segment level, which means you must replace entire segments to do backfills. The high-level idea is to repeat steps 2 (segment generation) and 3 (segment upload) mentioned above:
Backfill jobs must run at the same granularity as the daily job. E.g., if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g.: ‘/var/pinot/airlineStats/rawdata/2014/01/01’)
The backfill job will then generate segments with the same name as the original job (with the new data).
When uploading those segments to Pinot, the controller will replace the old segments with the new ones (segment names act like primary keys within Pinot) one by one.
Edge case example
Backfill jobs expect the same number of (or more) data files on the backfill date, so the segment generation job will create the same number of (or more) segments as the original run.
For example, assume table airlineStats has two segments (airlineStats_2014-01-01_2014-01-01_0, airlineStats_2014-01-01_2014-01-01_1) for date 2014/01/01 and the backfill input directory contains only one input file. Then the segment generation job will create just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only segment airlineStats_2014-01-01_2014-01-01_0 is replaced, and stale data in segment airlineStats_2014-01-01_2014-01-01_1 remains.
If the raw data is modified in such a way that the original time bucket has fewer input files than the first ingestion run, backfill will fail.
Pinot supports text indexing and search by building Lucene indices as sidecars to the main Pinot segments. While this is a great technique, it essentially limits the avenues of optimizations that can be done for Pinot specific use cases of text search.
How is Pinot different?
Pinot, like any other database/OLAP engine, does not need to conform to the entire full text search domain-specific language (DSL) that is traditionally used by full-text search (FTS) engines like ElasticSearch and Solr. In traditional SQL text search use cases, the majority of text searches belong to one of three patterns: prefix wildcard queries (like pino*), postfix or suffix wildcard queries (like *inot), and term queries (like pinot).
Native text indices in Pinot
In Pinot, native text indices are built from the ground up. They use a custom text-indexing engine, coupled with Pinot's powerful inverted indices, to provide a fast text search experience.
The benefits are that native text indices are 80-120% faster than Lucene-based indices for the text search use cases mentioned above. They are also 40% smaller on disk.
Native text indices support real-time text search. For REALTIME tables, native text indices allow data to be indexed in memory in the text index, while concurrently supporting text searches on the same index.
Historically, most text indices depend on the in-memory text index being written to first and then sealed, before searches are possible. This limits the freshness of the search, being near-real-time at best.
Native text indices come with a custom in-memory text index, which allows for real-time indexing and search.
Searching Native Text Indices
The TEXT_CONTAINS function supports text search on native text indices.
Examples:
TEXT_CONTAINS can be combined using standard boolean operators.
Note: TEXT_CONTAINS supports regex and term queries and works only on native indices. TEXT_CONTAINS supports standard regex patterns (as used by LIKE in the SQL standard), so there may be some syntactic differences from Lucene queries.
Creating Native Text Indices
Native text indices are created using field configurations. To indicate that an index type is native, specify it using properties in the field configuration:
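A hedged sketch of such a field configuration (the column name is hypothetical; the native engine is selected through the properties map):
"fieldConfigList": [
  {
    "name": "text_col",
    "encodingType": "RAW",
    "indexType": "TEXT",
    "properties": {"fstType": "native"}
  }
]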
SELECT count(*)
FROM events_OFFLINE
WHERE timeColumn <= $timeBoundary
SELECT count(*)
FROM events_REALTIME
WHERE timeColumn > $timeBoundary
SET taskName = 'myTask-s3';
SET input.fs.className = 'org.apache.pinot.plugin.filesystem.S3PinotFS';
SET input.fs.prop.accessKey = 'my-key';
SET input.fs.prop.secretKey = 'my-secret';
SET input.fs.prop.region = 'us-west-2';
INSERT INTO "baseballStats"
FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/'
POST /segments/{tableName}/reload
POST /segments/{tableName}/{segmentName}/reload
{
"status": "200"
}
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, <search_expression>)
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, "foo.*")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, ".*bar")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, "foo")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS ("col1", "foo") AND TEXT_CONTAINS ("col2", "bar")
Offline servers are responsible for downloading segments from the segment store, to host and serve queries from. When a new segment is uploaded to the controller, the controller decides the servers (as many as the replication factor) that will host the new segment and notifies them to download the segment from the segment store. On receiving this notification, the servers download the segment file and load the segment, in order to serve queries from it.
Real-time
Real-time servers directly ingest from a real-time stream (such as Kafka or EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.
Pinot servers are modeled as Helix participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more Helix partitions of one or more helix resources (i.e. one or more segments of one or more tables).
Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.
In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.
Let's take this example: we have real-time data for five days (March 23 to March 27), and offline data has been pushed up to March 25, which is two days behind real-time. The brokers maintain this time boundary.
Suppose we get a query to this table: select sum(metric) from table. The broker will split the query into two queries based on this time boundary, one for offline and one for real-time. This query becomes select sum(metric) from table_REALTIME where date >= Mar 25
and select sum(metric) from table_OFFLINE where date < Mar 25
The broker merges results from both these queries before returning the result to the client.
There are several different ways that segments are persisted in the deep store.
For offline tables, the batch ingestion job writes the segment directly into the deep store, as shown in the diagram below:
Batch job writing a segment into the deep store
The ingestion job then sends a notification about the new segment to the controller, which in turn notifies the appropriate server to pull down that segment.
For real-time tables, by default, a segment is first built in memory by the server. It is then uploaded to the lead controller (as part of the segment completion protocol sequence), which writes the segment into the deep store, as shown in the diagram below:
Server sends segment to Controller, which writes segments into the deep store
Having all segments go through the controller can become a system bottleneck under heavy load, in which case you can use the peer download policy, as described in Decoupling Controller from the Data Path.
When using this configuration, the server will directly write a completed segment to the deep store, as shown in the diagram below:
Server writing a segment into the deep store
Configuring the deep store
For hands-on examples of how to configure the deep store, see the following tutorials:
It can also be invoked directly by accessing the URL as follows. The API requires the tableName, and can optionally take tableType (offline|realtime) and a verbosity level.
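For example, assuming a controller running at localhost:9000 (the table name and parameter values are placeholders):
curl -X GET "http://localhost:9000/debug/tables/myTable?type=offline&verbosity=0" -H "accept: application/json"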
Pinot also provides a variety of operational metrics that can be used for creating dashboards, alerting and monitoring.
Finally, all Pinot components log debug information related to error conditions.
Debug a slow query or a query which keeps timing out
Use the following steps:
If the query executes, look at the query result. Specifically look at numEntriesScannedInFilter and numDocsScanned.
If numEntriesScannedInFilter is very high, consider adding indexes for the corresponding columns being used in the filter predicates. You should also think about partitioning the incoming data based on the dimension most heavily used in your filter queries.
If numDocsScanned is very high, that means the selectivity for the query is low and lots of documents need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.
If the query is not executing, you can extend the query timeout by appending a timeoutMs parameter to the query, for example, select * from mytable limit 10 option(timeoutMs=60000). Then repeat step 1, as needed.
Look at garbage collection (GC) stats for the corresponding Pinot servers. If a particular server seems to be running full GC all the time, you can do a couple of things, such as:
Decrease the total number of segments per server (by partitioning the data in a more efficient way).
To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:
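For example, a controller or server can be started with JVM options like the following (the plugin directory and plugin list are illustrative):
-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet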
You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.
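A hedged example for the s3 scheme, mirroring the [scheme] pattern shown elsewhere in these docs (the region value is illustrative):
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher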
You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:
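A sketch of the relevant part of an ingestion job spec (values are illustrative):
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-west-2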
Learn to build and manage Apache Pinot clusters, uncovering key components for efficient data processing and optimized analysis.
A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.
A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop:
Controller: Maintains cluster metadata and manages cluster resources.
Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.
Broker: Accepts queries from client processes and forwards them to servers for processing.
Server: Provides storage for segment files and compute for query processing.
(Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).
The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.
Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.
Helix is a cluster management solution that maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. Helix constantly monitors the cluster to ensure that the right hardware resources are allocated for the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.
Cluster configuration
For details of cluster configuration settings, see .
Cluster components
Helix divides nodes into logical components based on their responsibilities:
Participant
Participants are the nodes that host distributed, partitioned resources
Pinot servers are modeled as participants. For details about server nodes, see .
Spectator
Spectators are the nodes that observe the current state of each participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
Pinot brokers are modeled as spectators. For details about broker nodes, see .
Controller
The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.
Pinot controllers are modeled as controllers. For details about controller nodes, see .
Logical view
Another way to visualize the cluster is a logical view, where:
A cluster contains tenants,
Tenants contain tables, and
Tables contain segments.
Set up a Pinot cluster
Typically, there is only one cluster per environment/data center. There is no need to create multiple Pinot clusters because Pinot supports multi-tenancy.
To set up a cluster, see one of the following guides:
Running on Azure
This quickstart guide helps you get started running Pinot on Microsoft Azure.
Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
1.2 Install Helm
To install Helm, see .
For Mac users
Check helm version after installation.
This quickstart provides Helm scripts for Helm v3.0.0 and v2.12.1. Pick the script based on your Helm version.
1.3 Install Azure CLI
To install the Azure CLI, follow the instructions in the Azure documentation.
For Mac users
2. (Optional) Log in to your Azure account
This script will open your default browser to sign-in to your Azure Account.
3. (Optional) Create a Resource Group
Use the following script to create a resource group in the eastus location.
4. (Optional) Create a Kubernetes cluster(AKS) in Azure
This script will create a 3 node cluster named pinot-quickstart for demo purposes.
Modify the parameters in the following example command with your resource group and cluster details:
Once the command succeeds, the cluster is ready to be used.
5. Connect to an existing cluster
Run the following command to get the credential for the cluster pinot-quickstart that you just created:
To verify the connection, run the following:
6. Pinot quickstart
Follow this to deploy your Pinot demo.
7. Delete a Kubernetes Cluster
Inverted index
This page describes configuring the inverted index for Apache Pinot
We can define the forward index as a mapping from document IDs (also known as rows) to values. Similarly, an inverted index establishes a mapping from values to a set of document IDs, making it the "inverted" version of the forward index. When you frequently use a column for filtering operations like EQ (equal), IN (membership check), GT (greater than), etc., incorporating an inverted index can significantly enhance query performance.
Pinot supports two distinct types of inverted indexes: bitmap inverted indexes and sorted inverted indexes. Bitmap inverted indexes represent the actual inverted index type, whereas the sorted type is automatically available when the column is sorted. Both types of indexes necessitate the enabling of a dictionary for the respective column.
Bitmap inverted index
When a column is not sorted, and an inverted index is enabled for that column, Pinot maintains a mapping from each value to a bitmap of rows. This design ensures that value lookup operations take constant time, providing efficient querying capabilities.
When an inverted index is enabled for a column, Pinot maintains a map from each value to a bitmap of rows, which makes value lookup take constant time. If you have a column that is frequently used for filtering, adding an inverted index will improve performance greatly. You can create an inverted index on a multi-value column.
Inverted indexes are disabled by default and can be enabled for a column by specifying the index configuration within the table config:
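A hedged sketch using the fieldConfigList indexes block (the column name is hypothetical; verify against your Pinot version):
"fieldConfigList": [
  {
    "name": "theColumnName",
    "indexes": {
      "inverted": {}
    }
  }
]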
The older way to configure inverted indexes can also be used, although it is not actually recommended:
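That older form lists the columns under tableIndexConfig, for example:
"tableIndexConfig": {
  "invertedIndexColumns": ["theColumnName"]
}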
When the index is created
By default, bitmap inverted indexes are not generated when the segment is initially created; instead, they are created when the segment is loaded by Pinot. This behavior is governed by the table configuration option indexingConfig.createInvertedIndexDuringSegmentGeneration, which is set to false by default.
Sorted inverted index
As explained in the forward index section, a column that is both sorted and equipped with a dictionary is encoded in a specialized manner that serves the purpose of implementing both forward and inverted indexes. Consequently, when these conditions are met, an inverted index is effectively created without additional configuration, even if the configuration suggests otherwise. This sorted version of the forward index offers a lookup time complexity of log(n) and leverages data locality.
For instance, consider the following example: if a query includes a filter on the memberId column, Pinot will perform a binary search on memberId values to find the pair of docIds bounding the range for the corresponding filter value. If the query needs to scan values for other columns after filtering, values within that docId range will be located together, which means we can benefit from data locality.
A sorted inverted index indeed offers superior performance compared to a bitmap inverted index, but it's important to note that it can only be applied to sorted columns. In cases where query performance with a regular inverted index is unsatisfactory, especially when a large portion of queries involve filtering on the same column (e.g., memberId), using a sorted index can substantially enhance query performance.
Pinot On Kubernetes FAQ
This page has a collection of frequently asked questions about Pinot on Kubernetes with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
How to increase server disk size on AWS
The following is an example using Amazon Elastic Kubernetes Service (Amazon EKS).
1. Update Storage Class
In the Kubernetes (k8s) cluster, check the storage class: in Amazon EKS, it should be gp2.
Then update the StorageClass to ensure allowVolumeExpansion is set to true.
Once StorageClass is updated, it should look like this:
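A hedged sketch of the updated StorageClass (the provisioner and parameters depend on your EKS setup; the key point is allowVolumeExpansion):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true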
2. Update PVC
Once the storage class is updated, then we can update the PersistentVolumeClaim (PVC) for the server disk size.
Now we want to double the disk size for pinot-server-3.
The following is an example of current disks:
The following is the output of data-pinot-server-3:
Now, let's change the PVC size to 2T by editing the server PVC.
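For example (the namespace is an assumption; adjust to your deployment):
kubectl edit pvc data-pinot-server-3 -n pinot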
Once updated, the specification's PVC size is updated to 2T, but the status's PVC size is still 1T.
3. Restart pod to let it reflect
Restart the pinot-server-3 pod:
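For example (the StatefulSet controller recreates the pod automatically; the namespace is an assumption):
kubectl delete pod pinot-server-3 -n pinot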
Recheck the PVC size:
FST index
The FST index supports regex queries on text and decreases the on-disk index size by 4-6 times.
Only supports regex queries
Only supported on stored or completed Pinot segments (no consuming segments).
Only supported on dictionary-encoded columns.
Works better for prefix queries
Note: Lucene is case-sensitive, so take case into account when querying FST-indexed column(s). For example, Select * from table T where colA LIKE '%Value%', with an FST index on colA, will only return rows containing the string "Value" but not "value".
For more information on the FST construction and code, see .
Enable the FST index
To enable the FST index on a dictionary-encoded column, include the following configuration:
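A hedged sketch of a fieldConfigList entry (the column name is hypothetical; verify the field names against your Pinot version):
"fieldConfigList": [
  {
    "name": "theColumnName",
    "encodingType": "DICTIONARY",
    "indexTypes": ["FST"]
  }
]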
The FST index generates one FST index file (.lucene.fst). If the inverted index is also enabled on the column, the FST index can further take advantage of it.
For more information about enabling the FST index, see the ways to configure indexes in the table config.
Fix the bug that RealtimeToOfflineTask failed to progress with large time bucket gaps ().
The release is based on the release 0.9.1 with the following cherry-picks:
Tenant
Discover the tenant component of Apache Pinot, which facilitates efficient data isolation and resource management within Pinot clusters.
Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data in separate workloads from being stored or processed on the same physical hardware.
By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.
To support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant, which controls the nodes used by the table as servers and brokers. Multi-tenancy lets Pinot group all tables belonging to a particular use case under a single tenant name.
The concept of tenants is very important when multiple use cases are using Pinot and there is a need to provide quotas or some sort of isolation across tenants. For example, consider two tables, Table A and Table B, in the same Pinot cluster.
Running on AWS
This quickstart guide helps you get started running Pinot on Amazon Web Services (AWS).
In this quickstart guide, you will set up a Kubernetes Cluster on
1. Tooling Installation
HDFS as Deep Storage
This guide shows how to set up HDFS as deep storage for a Pinot segment.
To use HDFS as deep storage you need to include HDFS dependency jars and plugins.
Usage: StartServer
-serverHost <String> : Host name for controller. (required=false)
-serverPort <int> : Port number to start the server at. (required=false)
-serverAdminPort <int> : Port number to serve the server admin API at. (required=false)
-dataDir <string> : Path to directory containing data. (required=false)
-segmentDir <string> : Path to directory containing segments. (required=false)
-zkAddress <http> : Http address of Zookeeper. (required=false)
-clusterName <String> : Pinot cluster name. (required=false)
-configFileName <Config File Name> : Broker Starter Config file. (required=false)
-help : Print this message. (required=false)
#CONTROLLER
pinot.controller.storage.factory.class.[scheme]=className of the pinot file system
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
#SERVER
pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
And then export /opt/pinot/lib/hadoop-common-<release-version>.jar in the classpath.
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
brew update && brew install azure-cli
az login
AKS_RESOURCE_GROUP=pinot-demo
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} \
--location ${AKS_RESOURCE_GROUP_LOCATION}
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks create --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME} \
--node-count 3
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
kubectl get nodes
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
--name ${AKS_CLUSTER_NAME}
Defining tenants for tables
We can configure Table A with server tenant Tenant A and Table B with server tenant Tenant B. We can tag some of the server nodes for Tenant A and some for Tenant B. This will ensure that segments of Table A only reside on servers tagged with Tenant A, and segments of Table B only reside on servers tagged with Tenant B. The same isolation can be achieved at the broker level, by configuring broker tenants for the tables.
Table isolation using tenants
No need to create separate clusters for every table or use case!
Tenant configuration
This tenant is defined in the tenants section of the table config.
This section contains two main fields, broker and server, which determine the tenants used for the broker and server components of this table.
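As a sketch, assuming broker and server tenants named brokerTenantName and serverTenantName, the tenants section of the table config looks like this:
"tenants": {
  "broker": "brokerTenantName",
  "server": "serverTenantName"
}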
In the above example:
The table will be served by brokers that have been tagged as brokerTenantName_BROKER in Helix.
If this were an offline table, the offline segments for the table would be hosted on Pinot servers tagged in Helix as serverTenantName_OFFLINE.
If this were a real-time table, the real-time segments (both consuming and completed ones) would be hosted on Pinot servers tagged in Helix as serverTenantName_REALTIME.
Create a tenant
Broker tenant
Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant by tagging three untagged broker nodes as sampleBrokerTenant_BROKER.
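A minimal sketch of such a broker tenant config (the exact payload format may differ slightly across Pinot versions):
{
  "tenantRole": "BROKER",
  "tenantName": "sampleBrokerTenant",
  "numberOfInstances": 3
}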
To create this tenant, use the following command. The creation will fail if the number of untagged broker nodes is less than numberOfInstances.
Follow instructions in Getting Pinot to get Pinot locally, and then
Check out the table config in the Rest API to make sure it was successfully uploaded.
Server tenant
Here's a sample server tenant config. This will create a server tenant sampleServerTenant by tagging 1 untagged server node as sampleServerTenant_OFFLINE and 1 untagged server node as sampleServerTenant_REALTIME.
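A minimal sketch of such a server tenant config (again, the exact payload format may differ slightly across versions):
{
  "tenantRole": "SERVER",
  "tenantName": "sampleServerTenant",
  "offlineInstances": 1,
  "realtimeInstances": 1
}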
To create this tenant, use the following command. The creation will fail if the number of untagged server nodes is less than offlineInstances + realtimeInstances.
Follow instructions in Getting Pinot to get Pinot locally, and then
Check out the table config in the Rest API to make sure it was successfully uploaded.
Discover the controller component of Apache Pinot, enabling efficient data and query management.
The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of real-time tables and offline tables). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.
The Pinot controller is responsible for the following:
Maintaining global metadata (e.g., configs and schemas) of the system with the help of Zookeeper, which is used as the persistent metadata store.
Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions)
Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.
Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.
Serving endpoints for segment uploads, which are used in offline data pushes. The controllers are also responsible for initializing real-time consumption and for periodically coordinating the persistence of real-time segments into the segment store.
Undertaking other management activities, such as managing segment retention and running validations.
For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or .
Running the periodic task manually
The controller runs several periodic tasks in the background to perform activities such as management and validation. Each periodic task has a configurable run frequency with a default value. Each task runs on its own schedule and can also be triggered manually if needed. The task runs on the lead controller for each table.
For period task configuration details, see .
Use the GET /periodictask/names API to fetch the names of all the periodic tasks running on your Pinot cluster.
To manually run a named periodic task, use the GET /periodictask/run API:
The log request ID (api-09630c07) can be used to search the pinot-controller log file for entries related to the execution of the periodic task that was manually run.
If tableName (and its type OFFLINE or REALTIME) is not provided, the task will run against all tables.
Starting a controller
Make sure you've . If you're using Docker, make sure to . To start a controller:
Schema
Explore the Schema component in Apache Pinot, vital for defining the structure and data types of Pinot tables, enabling efficient data processing and analysis.
Each table in Pinot is associated with a schema. A schema defines:
Fields in the table with their data types.
Whether the table uses column-based or table-based null handling. For more information, see Null value support.
The schema is stored in Zookeeper along with the table configuration.
Schema naming in Pinot follows typical database table naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.
Categories
A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:
Category
Description
Pinot does not enforce strict rules on which of these categories a column belongs to; rather, the categories can be thought of as hints that let Pinot perform internal optimizations.
For example, metrics may be stored without a dictionary and can have a different default null value.
The categories are also relevant when doing segment merge and rollups. Pinot uses the dimension and time fields to identify records against which to apply merge/rollups.
Metrics aggregation is another example, where Pinot uses the dimension and time columns as the key and automatically aggregates values for the metric columns.
For configuration details, see .
Date and time fields
Since Pinot doesn't have dedicated DATETIME data type support, you need to input time in STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format, such as an epoch timestamp, to perform operations. You can refer to for more details on supported formats.
Creating a schema
First, make sure your cluster is up and running.
Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.
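A minimal sketch of what such a schema file could look like; the column names below are illustrative placeholders, not the exact columns of the official flight example:
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {"name": "origin", "dataType": "STRING"},
    {"name": "destination", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "price", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "ts", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"}
  ]
}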
For more details on constructing a schema file, see the .
Then, we can upload the sample schema provided above using either a Bash command or REST API call.
Check out the schema in the to make sure it was successfully uploaded
Pinot Data Explorer
Pinot Data Explorer is a user-friendly interface in Apache Pinot for interactive data exploration, querying, and visualization.
Once you have set up a cluster, you can start exploring the data and the APIs using the Pinot Data Explorer.
The first screen that you'll see when you open the Pinot Data Explorer is the Cluster Manager. The Cluster Manager provides a UI to operate and manage your cluster.
If you want to view the contents of a server, click on its instance name. You'll then see the following:
To view the baseballStats table, click on its name, which will show the following screen:
From this screen, we can edit or delete the table, edit or adjust its schema, and perform several other operations.
For example, if we want to add yearID to the list of inverted indexes, click on Edit Table, add the extra column, and click Save:
Query Console
Let's run some queries on the data in the Pinot cluster. Navigate to to see the querying interface.
We can see our baseballStats table listed on the left (you will see meetupRSVP or airlineStats if you used the streaming or the hybrid ). Click on the table name to display the names and data types of the table's columns.
You can also execute a sample query select * from baseballStats limit 10 by typing it in the text box and clicking the Run Query button.
Cmd + Enter can also be used to run the query when focused on the console.
Here are some sample queries you can try:
Pinot supports a subset of standard SQL. For more information, see .
Rest API
The contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management including health check, instances management, schema and table management, data segments management.
Let's check out the tables in this cluster by going to , click Try it out, and then click Execute. We can see the baseballStats table listed here. We can also see the exact cURL call made to the controller API.
You can look at the configuration of this table by going to , click Try it out, type baseballStats in the table name, and then click Execute.
Let's check out the schemas in the cluster by going to , click Try it out, and then click Execute. We can see a schema called baseballStats in this list.
Take a look at the schema by going to , click Try it out, type baseballStats in the schema name, and then click Execute.
Finally, let's check out the data segments in the cluster by going to , click Try it out, type in baseballStats in the table name, and then click Execute. There's 1 segment for this table, called baseballStats_OFFLINE_0.
To learn how to upload your own data and schema, see or .
Google Cloud Storage
This guide shows you how to import data from GCP (Google Cloud Platform).
Enable the Google Cloud Storage using the pinot-gcs plugin. In the controller or server, add the config:
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
The GCP file system provides the following options:
projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.
gcpKey - Location of the JSON file containing GCP keys. Refer to the GCP documentation to download the keys.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:
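For example, on the controller (the values shown are placeholders):
pinot.controller.storage.factory.class.gs.projectId=my-gcp-project
pinot.controller.storage.factory.class.gs.gcpKey=/path/to/gcp-keys.json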
Examples
Job spec
Controller config
Server config
Minion config
Flink
Batch ingestion of data into Apache Pinot using Apache Flink.
Pinot supports Apache Flink as a processing framework to push segment files to the database.
The Pinot distribution contains an Apache Flink SinkFunction that can be used as part of an Apache Flink application (streaming or batch) to write directly into a designated Pinot database.
Example
Flink application
Here is an example code snippet showing how to use the PinotSinkFunction in a Flink streaming application:
As shown in the example above, the only required information from the Pinot side is the table schema and the table config.
For a more detailed executable, refer to the .
Table Config
PinotSinkFunction uses mostly the TableConfig object to infer the batch ingestion configuration to start a SegmentWriter and SegmentUploader to communicate with the Pinot cluster.
Note that even though the Flink application in the example above runs in streaming mode, the data is still batched together and flushed/uploaded to Pinot once the flush threshold is reached. It is not a direct streaming write into Pinot.
Here is an example table config.
The only required configurations are:
"outputDirURI": where PinotSinkFunction should write the constructed segment file to
"push.controllerUri": which Pinot cluster (controller) URL PinotSinkFunction should communicate with.
The rest of the configurations are standard for any Pinot table.
HDFS
This guide shows you how to import data from HDFS.
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
The HDFS implementation provides the following options:
hadoop.conf.path: Absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml and core-site.xml.
hadoop.write.checksum: Create a checksum while pushing an object. Default is false.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.hdfs. where node is either controller or server depending on the config
The Kerberos configs should be used only if your Hadoop installation is secured with Kerberos. Refer to the Hadoop documentation for information on how to secure Hadoop using Kerberos.
You must provide proper Hadoop dependencies jars from your Hadoop installation to your Pinot startup scripts.
Push HDFS segment to Pinot Controller
To push HDFS segment files to Pinot controller, send the HDFS path of your newly created segment files to the Pinot Controller. The controller will download the files.
This example curl request tells the controller to download the segment files for the appropriate table:
Examples
Job spec
Standalone Job:
Hadoop Job:
Controller config
Server config
Minion config
Upload a table segment
Upload a table segment in Apache Pinot.
This procedure uploads one or more table segments that have been stored as Pinot segment binary files outside of Apache Pinot, such as if you had to close an original Pinot cluster and create a new one.
Choose one of the following:
If your data is in a location that uses HDFS, create a segment fetcher.
If your data is on a host where you have SSH access, use the Pinot Admin script.
Before you upload, do the following:
or confirm one exists that matches the segment you want to upload.
or confirm one exists that matches the segment you want to upload.
(If needed) Upload the schema and table configs.
Create a segment fetcher
If the data is in a location using HDFS, you can create a segment fetcher, which will push segment files from external systems such as those running Hadoop or Spark. It is possible to implement your own segment fetcher with an external jar by providing a class that extends this interface.
Use the Pinot Admin script to upload segments
To do this, you need to create a JobSpec configuration file. For details, see . This file defines the job, including things like the job type, the input directory or URI, and the table name that the segments will be connected to.
You can upload a Pinot segment using several methods:
Segment tar push
Segment URI push
Segment metadata push
Segment tar push
This is the original and default push mechanism. It requires the segment to be stored locally, or that the segment can be opened as an InputStream on PinotFS, so we can stream the entire segment tar file to the controller.
The push job will upload the entire segment tar file to the Pinot controller.
The Pinot controller will save the segment into the controller segment directory (Local or any PinotFS), then extract segment metadata, and add the segment to the table.
While you can create a JobSpec for this job, in simple instances you can push without one.
Upload segment files to your Pinot server from controller using the Pinot Admin script as follows:
All options should be prefixed with - (hyphen)
Option
Description
Segment URI push
This push mechanism requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
URI push is lightweight on the client side, while the controller side requires the same amount of work as the tar push.
The push job posts this segment tar URI to the Pinot controller.
The Pinot controller saves the segment into the controller segment directory (local or any PinotFS), then extracts segment metadata, and adds the segment to the table.
Upload segment files to your Pinot server using the JobSpec you create and the Pinot Admin script as follows:
Segment metadata push
This push mechanism also requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
Metadata push is lightweight on the controller side. There is no deep store download involved from the controller side.
The push job downloads the segment based on the URI, extracts the metadata, and uploads the metadata to the Pinot controller.
The Pinot controller adds the segment to the table based on the metadata.
Upload segment metadata to your Pinot server using the JobSpec you create and the Pinot Admin script as follows:
Azure Data Lake Storage
This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2)
Enable the Azure Data Lake Storage using the pinot-adls plugin. In the controller or server, add the config:
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
Azure Blob Storage provides the following options:
accountName: Name of the Azure account under which the storage is created.
accessKey: Access key required for the authentication.
fileSystemName
Each of these properties should be prefixed by pinot.[node].storage.factory.class.adl2. where node is either controller or server depending on the config, like this:
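For example, on the controller (the values shown are placeholders):
pinot.controller.storage.factory.class.adl2.accountName=my-account
pinot.controller.storage.factory.class.adl2.accessKey=my-access-key
pinot.controller.storage.factory.class.adl2.fileSystemName=my-container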
Examples
Job spec
Controller config
Server config
Minion config
Timestamp index
Use a timestamp index to speed up time-based queries at different granularities.
This feature is supported from Pinot 0.11+.
Background
The TIMESTAMP data type introduced in the stores the value as a millisecond epoch long value.
Typically, users don't need this low-level granularity for analytics queries. Scanning the data and converting time values can be costly for large datasets.
A common query pattern for timestamp columns is filtering on a time range and then grouping by different time granularities (day, month, and so on).
Typically, this requires the query executor to extract values and apply the transform functions before doing the filter and group-by, with no way to leverage the dictionary or an index.
This was the inspiration for the Pinot timestamp index, which is used to improve query performance for range and group-by queries on TIMESTAMP columns.
Supported data type
A TIMESTAMP index can only be created on the TIMESTAMP data type.
Timestamp Index
You can configure the granularity for a Timestamp data type column. Then:
Pinot will pre-generate one column per time granularity using a forward index and range index. The naming convention is $${ts_column_name}$${ts_granularity}, where the timestamp column ts with granularities DAY, MONTH will have two extra columns generated: $ts$DAY and $ts$MONTH.
Example query usage:
Some preliminary benchmarking shows the query performance across 2.7 billion records improved from 45 secs to 4.2 secs using a timestamp index and a query like this:
vs.
Usage
The timestamp index is configured on a per column basis inside the fieldConfigList section in the table configuration.
Specify the timestampConfig field. This object must contain a field called granularities, which is an array with at least one of the following values:
MILLISECOND
SECOND
MINUTE
Sample config:
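A sketch of the relevant table config fragment, assuming a timestamp column named ts with DAY and MONTH granularities; the encodingType and indexTypes fields shown here follow the usual fieldConfigList conventions:
"fieldConfigList": [
  {
    "name": "ts",
    "encodingType": "DICTIONARY",
    "indexTypes": ["TIMESTAMP"],
    "timestampConfig": {
      "granularities": ["DAY", "MONTH"]
    }
  }
]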
Ingest streaming data from Amazon Kinesis
This guide shows you how to ingest a stream of records from an Amazon Kinesis topic into a Pinot table.
To ingest events from an Amazon Kinesis stream into Pinot, add the following configs to your table config:
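A sketch of the Kinesis-specific part of streamConfigs, using only the properties described below; the stream name, region, and iterator settings are placeholders, and the usual stream ingestion settings (consumer factory, decoder class, flush thresholds) also belong in this map:
"streamConfigs": {
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "my-kinesis-stream",
  "region": "us-west-1",
  "shardIteratorType": "LATEST",
  "maxRecordsToFetch": "20"
}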
where the Kinesis specific properties are:
Property
Description
Hadoop
Batch ingestion of data into Apache Pinot using Apache Hadoop.
Segment Creation and Push
Pinot supports Apache Hadoop as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Hadoop code to process your files and convert and upload them to Pinot.
You can follow the to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
Configure indexes
Learn how to apply indexes to a Pinot table. This guide assumes that you have followed the guide.
Pinot supports a series of different indexes that can be used to optimize query performance. In this guide, we'll learn how to add indexes to the events table that we set up in the guide.
Why do we need indexes?
If no indexes are applied to the columns in a Pinot segment, the query engine needs to scan through every document, checking whether that document meets the filter criteria provided in a query. This can be a slow process if there are a lot of documents to scan.
Stream ingestion with Dedup
Deduplication support in Apache Pinot.
Pinot provides native support for deduplication (dedup) during real-time ingestion (v0.11.0+).
Prerequisites for enabling dedup
To enable dedup on a Pinot table, make the following table configuration and schema changes:
Running on GCP
This quickstart guide helps you get started running Pinot on Google Cloud Platform (GCP).
In this quickstart guide, you will set up a Kubernetes Cluster on
1. Tooling Installation
Query FAQ
This page has a collection of frequently asked questions about queries with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, .
Querying
Segment compaction on upserts
Use segment compaction on upsert-enabled real-time tables.
Overview of segment compaction
Compacting a segment replaces the completed segment with a compacted segment that only contains the latest version of records. For more information about how to use upserts on a real-time table in Pinot, see .
The Pinot upsert feature stores all versions of the record ingested into immutable segments on disk. Even though the previous versions are not queried, they continue to add to the storage overhead. To remove older records (no longer used in query results) and reclaim storage space, we need to compact Pinot segments periodically. Segment compaction is done via a new minion task. To schedule Pinot tasks periodically, see the .
Bloom filter
This page describes configuring the Bloom filter for Apache Pinot
When a column is configured to use this filter, Pinot creates one Bloom filter per segment. The Bloom filter helps prune segments that do not contain any record matching an EQUALITY or IN predicate.
Note: Support for the IN clause is limited to predicates with <= 10 values, to keep the pruning overhead minimal.
This is useful for query patterns like the one below, where a Bloom filter is defined on the playerID column of the table:
Vector index
Overview
Apache Pinot now supports a Vector Index for efficient similarity searches over high-dimensional vector embeddings. This feature introduces the capability to store and query float array columns (multi-valued) using a vector similarity algorithm.
Dimension table
Batch ingestion of data into Apache Pinot using dimension tables.
Dimension tables are a special kind of offline table from which data can be looked up via the lookup UDF, providing join-like functionality.
Dimension tables are replicated on all the hosts for a given tenant to allow faster lookups. When a table is marked as a dimension table, it will be replicated on all the hosts, which means that these tables must be small in size.
A dimension table cannot be part of a .
Configure dimension tables using the following properties in the table configuration:
Dimension columns are typically used in slice and dice operations for answering business queries. Some operations for which dimension columns are used:
GROUP BY - group by one or more dimension columns along with aggregations on one or more metric columns
Filter clauses such as WHERE
Metric
These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns.
Some operations for which metric columns are used:
Aggregation - SUM, MIN, MAX, COUNT, AVG, etc.
Filter clauses such as WHERE
DateTime
This column represents time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config.
The primary time column is used by Pinot to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH .
Common operations that can be done on time column:
: Name of the file system to use, for example, the container name (similar to the bucket name in S3).
enableChecksum: Enable MD5 checksum for verification. Default is false.
Kinesis region e.g. us-west-1
accessKey
Kinesis access key
secretKey
Kinesis secret key
shardIteratorType
Set to LATEST to consume only new records, TRIM_HORIZON to start from the earliest sequence number, or AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER to start consumption from a particular sequence number.
maxRecordsToFetch
... Default is 20.
Kinesis supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and security manager has permission to access the variable,
Instance profile credentials delivered through the Amazon EC2 metadata service
You must provide read access level permissions for Pinot to work with an AWS Kinesis data stream. See the AWS documentation for details.
Although you can also specify the accessKey and secretKey in the properties above, we don't recommend this insecure method. Use it only for non-production proof-of-concept (POC) setups. You can also specify other AWS fields, such as AWS_SESSION_TOKEN, as environment variables or in the config, and they will work.
Resharding
In Kinesis, whenever you reshard a stream, it is done via split or merge operations on shards. If you split a shard, the shard closes and creates 2 new children shards. So if you started with shard0, and then split it, it would result in shard1 and shard2. Similarly, if you merge 2 shards, both those will close and create a child shard. So in the same example, if you merge shards 1 and 2, you'll end up with shard3 as the active shard, while shard0, shard1, shard2 will remain closed forever.
You will see a period where the ideal state shows all segments ONLINE, as the parent shards have naturally completed ingesting and we're waiting for the RealtimeValidationManager to kick off ingestion from the child shards.
ShardID is of the format "shardId-000000000001". We use the numeric part as the partitionId. Our partitionId variable is an integer; if shardIds grow beyond Integer.MAX_VALUE, the partitionId will overflow.
Segment-size-based thresholds for segment completion will not work. They assume that partition "0" always exists. However, once shard 0 is split or merged, partition 0 will no longer exist.
streamType
This should be set to "kinesis"
stream.kinesis.topic.name
Kinesis stream name
region
Next, change the execution config in the job spec to the following:
You can check out the sample job spec here.
Finally, execute the Hadoop job using the following command:
Ensure the environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
Data Preprocessing before Segment Creation
We've seen requests that data be massaged (for example, partitioned, sorted, or resized) before creating and pushing segments to Pinot.
The MapReduce job called SegmentPreprocessingJob is the best fit for this use case, regardless of whether the input data is in AVRO or ORC format.
Check the below example to see how to use SegmentPreprocessingJob.
In Hadoop properties, set the following to enable this job:
In table config, specify the operations in preprocessing.operations that you'd like to enable in the MR job, and then specify the exact configs regarding those operations:
preprocessing.num.reducers
Minimum number of reducers. Optional. Used when partitioning is disabled and resizing is enabled. This parameter avoids producing too many small input files for Pinot, which would leave the Pinot server holding too many small segments and spawning too many threads.
preprocessing.max.num.records.per.file
Maximum number of records per reducer. Optional. Unlike preprocessing.num.reducers, this parameter avoids having too few large input files for Pinot, which would miss the advantage of multi-threading when querying. When not set, each reducer generates one output file. When set (e.g., to M), the original output file is split into multiple files, and each new output file contains at most M records. It does not matter whether partitioning is enabled or not.
For more details on this MR job, refer to this document.
When indexes are applied, the query engine can more quickly work out which documents satisfy the filter criteria, reducing the time it takes to execute the query.
What indexes does Pinot support?
By default, Pinot creates a forward index for every column. The forward index generally stores documents in insertion order.
However, before flushing the segment, Pinot does a single pass over every column to see whether the data is sorted. If data is sorted, Pinot creates a sorted (forward) index for that column instead of the forward index.
For real-time tables you can also explicitly tell Pinot that one of the columns should be sorted. For more details, see the [Sorted Index Documentation](https://docs.pinot.apache.org/basics/indexing/forward-index#real-time-tables).
For filtering documents within a segment, Pinot supports the following indexing techniques:
Inverted index: Used for exact lookups.
Range index: Used for range queries.
Text index: Used for phrase, term, boolean, prefix, or regex queries.
Geospatial index: Based on H3, a hexagon-based hierarchical gridding. Used for finding points that exist within a certain distance from another point.
JSON index: Used for querying columns in JSON documents.
Star-tree index: Pre-aggregates results across multiple columns.
View events table
Let's see how we can apply these indexing techniques to our data. To recap, the events table has the following fields:
Date Time Fields
Dimensions Fields
Metric Fields
ts
uuid
count
We might want to write queries that filter on the ts and uuid columns, so these are the columns on which we would want to configure indexes.
Since the data we're ingesting into the Kafka topic is implicitly ordered by timestamp, the ts column already has a sorted index, so any queries that filter on this column are already optimized.
So that leaves us with the uuid column.
Add an inverted index
We're going to add an inverted index to the uuid column so that queries filtering on that column return more quickly. We need to add the following to the table config:
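Assuming the table uses the standard tableIndexConfig block, the change is a sketch like this:
"tableIndexConfig": {
  "invertedIndexColumns": ["uuid"]
}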
Once you've done that, you'll need to click Reload All Segments and then Yes to apply the indexing change to all segments.
Check the index has been applied
We can check that the index has been applied to all our segments by querying Pinot's REST API. You can find Swagger documentation at localhost:9000/help.
The following query will return the indexes defined on the uuid column:
To be able to dedup records, a primary key is needed to uniquely identify a given record. To define a primary key, add the field primaryKeyColumns to the schema definition.
Note this field expects a list of columns, as the primary key can be composite.
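For example, in the schema (the column name is illustrative):
"primaryKeyColumns": ["id"]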
While ingesting a record, if its primary key is found to be already present, the record will be dropped.
Partition the input stream by the primary key
An important requirement for a Pinot dedup table is to partition the input stream by the primary key. For Kafka messages, this means the producer should set the key in the send API. If the original stream is not partitioned, then a stream processing job (e.g., Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
Use strictReplicaGroup for routing
The dedup Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment for the segments. Moreover, dedup poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use that, configure instanceSelectorType in Routing as follows:
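A sketch of the corresponding routing section in the table config:
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}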
instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.
Other limitations
The high-level consumer is not allowed for the input stream ingestion, which means stream.kafka.consumer.type must be lowLevel.
The incoming stream must be partitioned by the primary key such that all records with a given primary key are consumed by the same Pinot server instance.
Enable dedup in the table configurations
To enable dedup for a REALTIME table, add the following to the table config.
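A sketch of that addition, using the default hash function:
"dedupConfig": {
  "dedupEnabled": true,
  "hashFunction": "NONE"
}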
Supported values for hashFunction are NONE, MD5 and MURMUR3, with the default being NONE.
Metadata TTL
The server stores the existing primary keys in a dedup metadata map kept on the JVM heap. As the dedup metadata grows, heap memory pressure increases, which may affect the performance of ingestion and queries. You can set a positive metadata TTL to enable the TTL mechanism and keep the metadata size bounded. By default, the table's time column is used as the dedup time column. The time unit of the TTL is the same as that of the dedup time column. The TTL should be set long enough so that new records can be deduplicated before their primary keys get removed.
Enable preload for faster server restarts
When ingesting new records, the server has to read the metadata map to check for duplicates. But when a server restarts, the documents in existing segments are already unique, as ensured by the dedup logic during real-time ingestion, so the metadata map can be bootstrapped with write-only operations, which is faster.
This feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config, where N is the number of threads to use for preloading. It is 0 by default, which disables the preloading feature. This preloading thread pool is shared with the preloading of upsert tables.
Best practices
Unlike other real-time tables, a dedup table takes up more memory resources because it needs to keep the primary key and its corresponding segment reference in memory. As a result, it's important to plan the capacity beforehand and monitor resource usage. Here are some recommended practices for using dedup tables.
Create the Kafka topic with more partitions. The number of Kafka partitions determines the number of partitions of the Pinot table. The more partitions you have in the Kafka topic, the more Pinot servers you can distribute the Pinot table across, and therefore the more you can scale the table horizontally.
A dedup table maintains an in-memory map from the primary key to the segment reference, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. In addition, consider the hashFunction config in the dedup config, which can be MD5 or MURMUR3, to store the 128-bit hash code of the primary key instead. This is useful when your primary key takes more space. Keep in mind that this hash may introduce collisions, though the chance is very low.
Monitoring: Set up a dashboard over the metric pinot.server.dedupPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking its growth, which is proportional to the growth in memory usage.
Capacity planning: It's useful to plan capacity beforehand to ensure you don't run into resource constraints later. A simple way is to measure the rate of primary keys in the Kafka throughput per partition and multiply it by the primary key space cost to approximate the memory usage. A heap dump is also useful for checking the memory usage so far on a dedup table instance.
I get the following error when running a query, what does it mean?
This implies that the Pinot broker assigned to the table specified in the query was not found. A common root cause is a typo in the table name in the query. Another, less common reason is that no broker carries the required broker tenant tag for the table.
What are all the fields in the Pinot query's JSON response?
SQL Query fails with "Encountered 'timestamp' was expecting one of..."
"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.
Other commonly encountered reserved keywords are date, time, table.
Filtering on STRING column WHERE column = "foo" does not work?
For filtering on STRING columns, use single quotes:
ORDER BY using an alias doesn't work?
The fields in the ORDER BY clause must be one of the group by clauses or aggregations, BEFORE applying the alias. Therefore, this will not work:
But, this will work:
Does pagination work in GROUP BY queries?
No. Pagination only works for SELECTION queries.
How do I increase timeout for a query ?
You can add this at the end of your query: option(timeoutMs=X). The following example uses a timeout of 20 seconds for the query:
You can also use SET "timeoutMs" = 20000; SELECT COUNT(*) from myTable.
For changing the timeout on the entire cluster, set this property pinot.broker.timeoutMs in either broker configs or cluster configs (using the POST /cluster/configs API from Swagger).
How do I cancel a query?
Add these two configs for the Pinot server and broker to start tracking running queries. Query tracking entries are added and cleaned up as queries start and end, so they should not consume many resources.
Then use the Rest APIs on Pinot controller to list running queries and cancel them via the query ID and broker ID (as query ID is only local to broker), like in the following:
How do I optimize my Pinot table for doing aggregations and group-by on high cardinality columns ?
In order to speed up aggregations, you can enable metrics aggregation on the required column by adding a metric field in the corresponding schema and setting aggregateMetrics to true in the table configuration. You can also use a star-tree index config for columns like these (see here for more about star-tree).
How do I verify that an index is created on a particular column ?
There are two ways to verify this:
Log in to a server that hosts segments of this table. Inside the data directory, locate the segment directory for this table. In this directory, there is a file named index_map which lists all the indexes and other data structures created for each segment. Verify that the requested index is present here.
During a query: Use the column in the filter predicate and check the value of numEntriesScannedInFilter. If this value is 0, then the index is working as expected (this check works for the inverted index).
Does Pinot use a default value for LIMIT in queries?
Yes, Pinot uses a default value of LIMIT 10 in queries. The reason behind this default value is to avoid unintentionally submitting expensive queries that end up fetching or processing a lot of data from Pinot. Users can always overwrite this by explicitly specifying a LIMIT value.
Does Pinot cache query results?
Pinot does not cache query results. Each query is computed in its entirety. Note, though, that running the same or a similar query multiple times will naturally pull segment pages into memory, making subsequent calls faster. Also, for real-time systems, the data is changing in real time, so results cannot be cached. For offline-only systems, a caching layer can be built on top of Pinot, with an invalidation mechanism to invalidate the cache when data is pushed into Pinot.
I'm noticing that the first query is slower than subsequent queries. Why is that?
Pinot memory maps segments. It warms up during the first query, when segments are pulled into the memory by the OS. Subsequent queries will have the segment already loaded in memory, and hence will be faster. The OS is responsible for bringing the segments into memory, and also removing them in favor of other segments when other segments not already in memory are accessed.
How do I determine if the star-tree index is being used for my query?
The query execution engine will prefer to use the star-tree index for all queries where it can be used. The criteria to determine whether the star-tree index can be used is as follows:
All aggregation function + column pairs in the query must exist in the star-tree index.
All dimensions that appear in filter predicates and group-by should be star-tree dimensions.
For queries where the above conditions hold, a star-tree index is used. For other queries, the execution engine defaults to the next-best available index.
Vector Index is implemented using HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor (ANN) search.
Adds support for a predicate and function:
VECTOR_SIMILARITY(v1, v2, [optional topK]) to retrieve the topK closest vectors based on similarity.
The similarity function can be used as part of a query to filter and rank results.
Examples
Below is an example schema designed for a use case involving product reviews with vector embeddings for each review.
Schema
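A sketch of such a schema; the field list is trimmed to the columns discussed below, and the data types are assumptions based on that description:
{
  "schemaName": "fineFoodReviews",
  "dimensionFieldSpecs": [
    {"name": "ProductId", "dataType": "STRING"},
    {"name": "UserId", "dataType": "STRING"},
    {"name": "Text", "dataType": "STRING"},
    {"name": "embedding", "dataType": "FLOAT", "singleValueField": false}
  ]
}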
In this schema:
• The embedding column is a multi-valued float array designed to store high-dimensional vector embeddings (e.g., 1536 dimensions from an NLP model).
• Other fields, such as ProductId, UserId, and Text, store metadata and review text.
Table Config
To enable the Vector Index, configure the table with the appropriate fieldConfigList. The embedding column is specified to use the Vector Index with HNSW for similarity searches.
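A sketch of the fieldConfigList entry for the embedding column; the encodingType and indexTypes values are assumptions that follow the usual fieldConfigList conventions, while the properties match those explained below:
"fieldConfigList": [
  {
    "name": "embedding",
    "encodingType": "RAW",
    "indexTypes": ["VECTOR"],
    "properties": {
      "vectorIndexType": "HNSW",
      "vectorDimension": 1536,
      "vectorDistanceFunction": "COSINE",
      "version": 1
    }
  }
]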
Explanation of Properties:
vectorIndexType:
Specifies the type of vector index to use. Currently supports HNSW.
vectorDimension:
Defines the dimensionality of the vectors stored in the column. (e.g., 1536 for typical embeddings from models like OpenAI or BERT).
vectorDistanceFunction:
Specifies the distance metric for similarity computation. Options include:
INNER_PRODUCT:
• Computes the inner product (dot product) of the two vectors.
• Typically used when vectors are normalized and higher scores indicate greater similarity.
L2:
• Measures the Euclidean distance between vectors.
• Suitable for tasks where spatial closeness in high-dimensional space indicates similarity.
L1:
• Measures the Manhattan distance between vectors (sum of absolute differences of coordinates).
• Useful for some scenarios where simpler distance metrics are preferred.
COSINE:
• Measures cosine similarity, which considers the angle between vectors.
• Ideal for normalized vectors where orientation matters more than magnitude.
version:
Specifies the version of the Vector Index implementation.
Query
VECTOR_SIMILARITY:
A predicate that retrieves the top k closest vectors to the query vector.
Inputs:
embedding: The vector column.
Query vector (literal array).
Optional topK parameter (default: 10).
isDimTable: Set to true.
ingestionConfig.batchIngestionConfig.segmentIngestionType: Set to REFRESH.
dimensionTableConfig.disablePreload: By default, dimension tables are preloaded to allow for fast lookups. Set this to true to trade lookup speed for memory: only the segment reference and docID are stored instead of the whole row in the dimension table's hash map.
controller.dimTable.maxSize: Determines the maximum size quota for a dimension table in a cluster. Table creation will fail if the storage quota exceeds this maximum size.
dimensionFieldSpecs: To look up dimension values, dimension tables need a primary key. For details, see dimensionFieldSpecs.
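Putting the properties above together, the table config of a dimension table might contain a fragment like this sketch (disablePreload is shown with its default value):
"isDimTable": true,
"dimensionTableConfig": {
  "disablePreload": false
},
"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "REFRESH"
  }
}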
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
# name: execution framework name
name: 'hadoop'
# segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
# segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentMetadataPushJobRunner'
# extraConfigs: extra configs for execution framework.
extraConfigs:
# stagingDir is used on the distributed filesystem to host all the segments; this directory is then moved entirely to the output directory.
stagingDir: your/local/dir/staging
SELECT count(colA) as aliasA, colA from tableA GROUP BY colA ORDER BY aliasA
SELECT count(colA) as sumA, colA from tableA GROUP BY colA ORDER BY count(colA)
SELECT COUNT(*) from myTable option(timeoutMs=20000)
pinot.server.enable.query.cancellation=true // false by default
pinot.broker.enable.query.cancellation=true // false by default
GET /queries: to show running queries as tracked by all brokers
Response example: `{
"Broker_192.168.0.105_8000": {
"7": "select G_old from baseballStats limit 10",
"8": "select G_old from baseballStats limit 100"
}
}`
DELETE /query/{brokerId}/{queryId}[?verbose=false/true]: to cancel a running query
with queryId and brokerId. The verbose is false by default, but if set to true,
responses from servers running the query also return.
Response example: `Cancelled query: 8 with responses from servers:
{192.168.0.105:7501=404, 192.168.0.105:7502=200, 192.168.0.105:7500=200}`
SELECT ProductId,
UserId,
l2_distance(embedding, ARRAY[-0.0013143676, -0.011042999, ...]) AS l2_dist,
n_tokens,
combined
FROM fineFoodReviews
WHERE VECTOR_SIMILARITY(embedding, ARRAY[-0.0013143676, -0.011042999, ...], 5)
ORDER BY l2_dist ASC
LIMIT 10;
To compact segments on upserts, complete the following steps:
Ensure task scheduling is enabled and a minion is available.
Add the following to your table configuration. These configurations (except schedule) determine which segments to compact.
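A sketch of such a task configuration, using the thresholds described below; the cron-style schedule value is only an example:
"task": {
  "taskTypeConfigsMap": {
    "UpsertCompactionTask": {
      "schedule": "0 */10 * ? * *",
      "bufferTimePeriod": "7d",
      "invalidRecordsThresholdPercent": "30",
      "invalidRecordsThresholdCount": "100000",
      "tableMaxNumTasks": "100",
      "validDocIdsType": "SNAPSHOT"
    }
  }
}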
bufferTimePeriod: To compact segments as soon as they are complete, set this to "0d". To delay compaction (the configuration above delays it by 7 days, "7d"), specify the number of days to wait after a segment completes.
invalidRecordsThresholdPercent (Optional) Limits the older records allowed in the completed segment as a percentage of the total number of records in the segment. In the example above, the completed segment may be selected for compaction when 30% of the records in the segment are old.
invalidRecordsThresholdCount (Optional) Limits the older records allowed in the completed segment by record count. In the example above, if the segment contains more than 100K records, it may be selected for compaction.
tableMaxNumTasks (Optional) Limits the number of tasks allowed to be scheduled.
validDocIdsType (Optional) Specifies the source of validDocIds to fetch when running the data compaction. The valid types are SNAPSHOT, IN_MEMORY, IN_MEMORY_WITH_DELETE
SNAPSHOT: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot from the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.
WARNING
Using an in-memory based validDocIds type (IN_MEMORY, IN_MEMORY_WITH_DELETE) is dangerous, as it does not guarantee consistency in some edge cases (e.g., fetching the validDocIds bitmap while the server is restarting and updating validDocIds).
Because segment compaction is an expensive operation, we do not recommend setting invalidRecordsThresholdPercent and invalidRecordsThresholdCount too low (close to 1). By default, all configurations above are 0, so no thresholds are applied.
Example
The following example includes a dataset with 24M records and 240K unique keys that have each been duplicated 100 times. After ingesting the data, there are 6 segments (5 completed segments and 1 consuming segment) with a total estimated size of 22.8MB.
Example dataset
Submitting the query “set skipUpsert=true; select count(*) from transcript_upsert” before compaction produces 24,000,000 results:
Results before segment compaction
After the compaction tasks are complete, the Minion Task Manager UI reports the following.
Minion compaction task completed
Segment compaction generates a task for each segment to compact. Five tasks were generated in this case because 90% of the records (3.6–4.5M records) in the completed segments are considered ready for compaction, exceeding the configured thresholds.
If a completed segment only contains old records, Pinot immediately deletes the segment (rather than creating a task to compact it).
Submitting the query again shows the count matches the set of 240K unique keys.
Results after segment compaction
Once segment compaction has completed, the total number of segments remains the same and the total estimated size drops to 2.77MB.
To further improve query latency, merge small segments into larger ones.
A Bloom filter is a probabilistic data structure used to definitively determine if an element is not present in a dataset, but it cannot be employed to determine if an element is present in the dataset. This limitation arises because Bloom filters may produce false positives but never yield false negatives.
An intriguing aspect of these filters is the existence of a mathematical formula that establishes a relationship between their size, the cardinality of the dataset they index, and the rate of false positives.
In Pinot, this cardinality corresponds to the number of unique values expected within each segment. If necessary, the false positive rate and the index size can be configured.
Configuration
Bloom filters are disabled by default, meaning that a column will not have a Bloom filter unless it is explicitly configured in the table configuration.
There are three optional parameters to configure the Bloom filter:
fpp (default: 0.05): False positive probability of the Bloom filter (from 0 to 1).
maxSizeInBytes (default: 0, i.e. unlimited): Maximum size of the Bloom filter.
loadOnHeap (default: false): Whether to load the Bloom filter using heap memory or off-heap memory.
The lower the fpp (false positive probability), the greater the accuracy of the Bloom filter, but this reduction in fpp will also lead to an increase in the index size. It's important to note that maxSizeInBytes takes precedence over fpp. If maxSizeInBytes is set to a value greater than 0 and the calculated size of the Bloom filter, based on the specified fpp, exceeds this size limit, Pinot will adjust the fpp to ensure that the Bloom filter size remains within the specified limit.
Similar to other indexes, a Bloom filter can be explicitly deactivated by setting the special parameter disabled to true.
Example
For example the following table config enables the Bloom filter in the playerId column using the default values:
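A sketch of that table config fragment, assuming the new fieldConfigList.indexes style; an empty bloom object picks up all the defaults:
"fieldConfigList": [
  {
    "name": "playerId",
    "indexes": {
      "bloom": {}
    }
  }
]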
If some parameters need to be customized, they can be included in fieldConfigList.indexes.bloom. Remember that even though the example below customizes all parameters, you can modify just the ones you need.
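A sketch with all three parameters customized (the values are illustrative):
"fieldConfigList": [
  {
    "name": "playerId",
    "indexes": {
      "bloom": {
        "fpp": 0.01,
        "maxSizeInBytes": 1000000,
        "loadOnHeap": true
      }
    }
  }
]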
Older configuration
Use default settings
To use default values, include the name of the column in tableIndexConfig.bloomFilterColumns.
For example:
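A sketch of this older tableIndexConfig form:
"tableIndexConfig": {
  "bloomFilterColumns": ["playerId"]
}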
Customized parameters
To specify custom parameters, add a new entry in tableIndexConfig.bloomFilterConfig object. The key should be the name of the column and the value should be an object similar to the one that can be used in the Bloom section of fieldConfigList.
The Docker instructions on this page are still WIP
This example assumes you have set up your cluster using Pinot in Docker.
Data Stream
First, we need to set up a stream. Pinot has out-of-the-box real-time ingestion support for Kafka. Other streams can be plugged in; see Pluggable Streams.
Let's set up a demo Kafka cluster locally, and create a sample topic transcript-topic.
Start Kafka
Start Kafka cluster on port 9876 using the same Zookeeper from the quick-start examples.
Create a Kafka topic
Creating a schema
If you followed , you have already pushed a schema for your sample table. If not, see to learn how to create a schema for your sample data.
Creating a table configuration
If you followed , you pushed an offline table and schema. To create a real-time table configuration for the sample, use the following table configuration for the transcript table. For a more detailed overview of tables, see .
Uploading your schema and table configuration
Next, upload the table and schema to the cluster. As soon as the real-time table is created, it will begin ingesting from the Kafka topic.
Loading sample data into stream
Use the following sample JSON file for the transcript table data in the next step.
Push the sample JSON file into the Kafka topic, using the Kafka script from the Kafka download.
Ingesting streaming data
As soon as data flows into the stream, the Pinot table will consume it, and the data will be ready for querying. Browse to the Query Console running in your Pinot instance (we use localhost in this link as an example) to examine the real-time data.
Amazon S3
This guide shows you how to import data from files stored in Amazon S3.
Enable the Amazon S3 file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:
By default, Pinot loads all plugins, so you can just drop this plugin in. If you specify -Dplugins.include, you need to list all the plugins you want to use, e.g., pinot-json, pinot-avro, pinot-kafka-2.0.
You can configure the S3 file system using the following options:
Configuration
Description
Each of these properties should be prefixed by pinot.[node].storage.factory.s3. where node is either controller or server depending on the config
e.g.
The S3 file system supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:
Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.
Examples
Job spec
Controller config
Server config
Minion config
Stream ingestion with CLP
Support for encoding fields with CLP during ingestion.
This is an experimental feature. Configuration options and usage may change frequently until it is stabilized.
When performing stream ingestion of JSON records using Kafka, users can encode specific fields with CLP by using a CLP-specific StreamMessageDecoder.
CLP is a compressor designed to encode unstructured log messages in a way that makes them more compressible while retaining the ability to search them. It does this by decomposing the message into three fields:
the message's static text, called a log type;
repetitive variable values, called dictionary variables; and
non-repetitive variable values (called encoded variables since we encode them specially if possible).
Searches are similarly decomposed into queries on the individual fields.
Although CLP is designed for log messages, other unstructured text like file paths may also benefit from its encoding.
For example, consider this JSON record:
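A hypothetical record of this shape (the field values are invented for illustration; the float 0.335 is included because it is referenced in the explanation below):
{
  "timestamp": 1672531200000,
  "message": "INFO Task task_12 assigned to container: [ContainerID:container_15], operation took 0.335 seconds",
  "logPath": "/mnt/data/application_123/container_15/stdout"
}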
If the user specifies the fields message and logPath should be encoded with CLP, then the StreamMessageDecoder will output:
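Continuing the hypothetical record above, the output would look roughly like the following sketch. The _dictionaryVars suffix and the concrete placeholder bytes and encoded integer values are assumptions here; the exact values are determined by CLP's encoding:
{
  "timestamp": 1672531200000,
  "message_logtype": "INFO Task \x12 assigned to container: [ContainerID:\x12], operation took \x13 seconds",
  "message_dictionaryVars": ["task_12", "container_15"],
  "message_encodedVars": [1234567890123456789],
  "logPath_logtype": "/mnt/data/\x12/\x12/stdout",
  "logPath_dictionaryVars": ["application_123", "container_15"],
  "logPath_encodedVars": []
}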
In the fields with the _logtype suffix, \x11 is a placeholder for an integer variable, \x12 is a placeholder for a dictionary variable, and \x13 is a placeholder for a float variable. In message_encodedVars, the float variable 0.335 is encoded as an integer using CLP's custom encoding.
All remaining fields are processed in the same way as they are in org.apache.pinot.plugin.inputformat.json.JSONRecordExtractor. Specifically, fields in the table's schema are extracted from each record and any remaining fields are dropped.
Configuration
Table Index
Assuming the user wants to encode message and logPath as in the example, they should change/add the following settings to their tableIndexConfig (we omit irrelevant settings for brevity):
stream.kafka.decoder.prop.fieldsForClpEncoding is a comma-separated list of names for fields that should be encoded with CLP.
We use variable-length dictionaries for the logtype and dictionary variables since their length can vary significantly.
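A hedged sketch of those tableIndexConfig settings, assuming the decoder class name org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder (verify the class name against your Pinot version) and the fields message and logPath:

"tableIndexConfig": {
  "streamConfigs": {
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.clplog.CLPLogMessageDecoder",
    "stream.kafka.decoder.prop.fieldsForClpEncoding": "message,logPath"
  },
  "varLengthDictionaryColumns": [
    "message_logtype",
    "message_dictionaryVars",
    "logPath_logtype",
    "logPath_dictionaryVars"
  ]
}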
Schema
For the table's schema, users should configure the CLP-encoded fields as follows (we omit irrelevant settings for brevity):
We use the maximum possible length for the logtype and dictionary variable columns.
The dictionary and encoded variable columns are multi-valued columns.
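For illustration only, a hedged sketch of the field specs for the message field (the logPath field follows the same pattern; the exact specs depend on your schema):

"dimensionFieldSpecs": [
  {"name": "message_logtype", "dataType": "STRING", "maxLength": 2147483647},
  {"name": "message_dictionaryVars", "dataType": "STRING", "singleValueField": false, "maxLength": 2147483647},
  {"name": "message_encodedVars", "dataType": "LONG", "singleValueField": false}
]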
Searching and decoding CLP-encoded fields
To decode CLP-encoded fields, use the CLPDECODE transform function.
To search CLP-encoded fields, you can combine CLPDECODE with LIKE. Note, this may decrease performance when querying a large number of rows.
We are working to integrate efficient searches on CLP-encoded columns as another UDF. The development of this feature is being tracked in a separate issue.
Spark
Batch ingestion of data into Apache Pinot using Apache Spark.
Pinot supports Apache Spark (2.x and 3.x) as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Spark code needed to process your files, convert them to segments, and upload them to Pinot.
To set up Spark, do one of the following:
Use the Spark-Pinot Connector. For more information, see the .
Create and update a table configuration
Create and edit a table configuration in the Pinot UI or with the API.
In Apache Pinot, create a table by creating a JSON file, generally referred to as your table config. Update, add, or delete parameters as needed, and then reload the file.
Create a Pinot table configuration
Before you create a Pinot table configuration, you must first have a running Pinot cluster with broker and server tenants.
SELECT count(*),
       dateTrunc('WEEK', ts) AS tsWeek
FROM airlineStats
WHERE dateTrunc('WEEK', ts) > fromDateTime('2014-01-16', 'yyyy-MM-dd')
GROUP BY tsWeek
LIMIT 10
SELECT dateTrunc('YEAR', event_time) AS y,
       dateTrunc('MONTH', event_time) AS m,
       sum(pull_request_commits)
FROM githubEvents
GROUP BY y, m
LIMIT 1000
Option(timeoutMs=3000000)
serverSideEncryption
(Optional) The server-side encryption algorithm used when storing objects in Amazon S3 (currently aws:kms is supported). Set to null to disable SSE.
ssekmsKeyId
(Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.
ssekmsEncryptionContext
(Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and security manager has permission to access the variable,
Instance profile credentials delivered through the Amazon EC2 metadata service
region
The AWS Data center region in which the bucket is located
accessKey
(Optional) AWS access key required for authentication. This should only be used for testing purposes, since the key is stored in plain configuration rather than a secret store.
secretKey
(Optional) AWS secret key required for authentication. This should only be used for testing purposes, since the key is stored in plain configuration rather than a secret store.
endpoint
(Optional) Override endpoint for s3 client.
disableAcl
If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.
You can follow the wiki to build Pinot from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
If you do build Pinot from source, consider using the build-shaded-jar Maven profile by adding -Pbuild-shaded-jar. While Pinot does not bundle Spark into its jar, it does bundle certain Hadoop libraries.
Next, you need to change the execution config in the job spec to the following:
To run Spark ingestion, you need the following jars in your classpath
pinot-batch-ingestion-spark plugin jar - available in plugins-external directory in the package
pinot-all jar - available in lib directory in the package
These jars can be specified using spark.driver.extraClassPath or any other option.
For loading any other plugins that you want to use, use:
The complete spark-submit command should look like this:
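As a hedged sketch of a local run (the jar paths, plugin list, and Spark master are assumptions to adapt to your environment):

spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.include=pinot-s3,pinot-parquet" \
  --conf "spark.driver.extraClassPath=${PINOT_ROOT_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-${PINOT_VERSION}-shaded.jar:${PINOT_ROOT_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  "${PINOT_ROOT_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml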
Ensure environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
Note: You should change the master to yarn and deploy-mode to cluster for production environments.
We stopped including the spark-core dependency in our jars after the 0.10.0 release. If you run into runtime issues, use 0.11.0-SNAPSHOT or a later version of pinot-batch-ingestion-spark; you can either build from source or download the latest master build jars.
Running in Cluster Mode on YARN
If you want to run the Spark job in cluster mode on a YARN/EMR cluster, do the following:
Build Pinot from source with option -DuseProvidedHadoop
Copy Pinot binaries to S3, HDFS or any other distributed storage that is accessible from all nodes.
Copy Ingestion spec YAML file to S3, HDFS or any other distributed storage. Mention this path as part of --files argument in the command
Add --jars options that contain the s3/hdfs paths to all the required plugin and pinot-all jar
Point the classpath to the Spark working directory. Generally, just specifying the jar names without any paths works. Do the same for the main jar and the spec YAML file.
Example
For Spark 3.x, replace pinot-batch-ingestion-spark-2.4 with pinot-batch-ingestion-spark-3.2 in all places in the commands.
Also, ensure the classpath in the ingestion spec is changed from org.apache.pinot.plugin.ingestion.batch.spark. to org.apache.pinot.plugin.ingestion.batch.spark3.
FAQ
Q - I am getting the following exception - Class has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
Since the 0.8.0 release, Pinot binaries are compiled with JDK 11. If you are using Spark along with Hadoop 2.7+, you need the Java 8 version of Pinot. Currently, you need to build the JDK 8 version from source.
Q - I am not able to find pinot-batch-ingestion-spark jar.
For Pinot version prior to 0.10.0, the spark plugin is located in plugin dir of binary distribution. For 0.10.0 and later, it is located in pinot-external dir.
Q - Spark is not able to find the jars, leading to java.nio.file.NoSuchFileException
This means the classpath for the Spark job has not been configured properly. If you are running Spark in a distributed environment such as YARN or k8s, make sure both spark.driver.classpath and spark.executor.classpath are set. Also, the jars in driver.classpath should be added to the --jars argument in spark-submit so that Spark can distribute those jars to all the nodes in your cluster. You also need to provide the appropriate scheme with the file path when running the jar. In this doc we have used local://, but it can be different depending on your cluster setup.
Q - Spark job failing while pushing the segments.
It can be because of misconfigured controllerURI in job spec yaml file. If the controllerURI is correct, make sure it is accessible from all the nodes of your YARN or k8s cluster.
If the push type is already set to APPEND, this is likely due to a missing timeColumnName in your table config. If you can't provide a time column, use the segment name generation configs in the ingestion spec. Generally, using the inputFile segment name generator should fix your issue.
Q - I am getting java.lang.RuntimeException: java.io.IOException: Failed to create directory: pinot-plugins-dir-0/plugins/*
Removing -Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins from spark.driver.extraJavaOptions should fix this. As long as plugins are mentioned in classpath and jars argument it should not be an issue.
Q - Getting a Class not found exception
Check whether the extraClassPath arguments contain all the plugin jars for both driver and executors, and that all the plugin jars are mentioned in the --jars argument. If both of these are correct, check that extraClassPath contains local filesystem classpaths and not S3, HDFS, or other distributed filesystem classpaths.
To update existing data and segments, after you update and save the changes to the table config file, do the following as applicable:
When you add or modify indexes or the table schema, perform a segment reload. To reload all segments:
In the Pinot UI, from the table page, click Reload All Segments.
Using the Pinot API, send POST /segments/{tableName}/reload.
When you re-partition data, perform a segment refresh. To refresh, replace an existing segment with a new one by uploading a segment that reuses the existing filename. Using the Pinot API, send POST /segments?tableName={yourTableName}.
When you change the transform function used to populate a derived field or increase the number of partitions in an upsert-enabled table, perform a table re-bootstrap. One way to do this is to delete and recreate the table:
Using the Pinot API, first send DELETE /tables/{tableName} followed by POST /tables with the new table configuration.
When you change the stream topic or change the Kafka cluster containing the Kafka topic you want to consume from, perform a real-time ingestion pause and resume. To pause and resume real-time ingestion:
Using the Pinot API, first send POST /tables/{tableName}/pauseConsumption followed by POST /tables/{tableName}/resumeConsumption.
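For illustration, hedged curl sketches of these operations, assuming a controller at localhost:9000 and a hypothetical table named myTable:

# Reload all segments after adding or modifying indexes or the schema
curl -X POST "http://localhost:9000/segments/myTable/reload"

# Pause and resume real-time consumption after a stream or Kafka cluster change
curl -X POST "http://localhost:9000/tables/myTable/pauseConsumption"
curl -X POST "http://localhost:9000/tables/myTable/resumeConsumption"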
Update a Pinot table in the UI
To update a table configuration in the Pinot UI, do the following:
In the Cluster Manager, click the Tenant Name of the tenant that hosts the table you want to modify.
Click the Table Name in the list of tables in the tenant.
Click the Edit Table button. This opens a pop-up window containing the table configuration. Edit the contents in this window, then click Save when you are done.
Update a Pinot table using the API
To update a table configuration using the Pinot API, do the following:
Get the current table configuration with GET /tables/{tableName}.
Modify the file locally.
Upload the edited configuration with PUT /tables/{tableName}, passing the modified JSON as the request body.
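A hedged sketch of the round trip with curl, assuming a controller at localhost:9000 and a table named airlineStats (note that the GET response may wrap the config by table type, so extract the inner table config before uploading):

# Fetch the current table config and save it locally
curl -X GET "http://localhost:9000/tables/airlineStats" > airlineStats_table.json

# Edit airlineStats_table.json, then upload the modified config
curl -X PUT -H "Content-Type: application/json" \
  -d @airlineStats_table.json \
  "http://localhost:9000/tables/airlineStats"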
Example Pinot table configuration file
This example comes from the Apache Pinot Quickstart Examples. This table configuration defines a table called airlineStats_OFFLINE, which you can interact with by running the example.
# executionFrameworkSpec: Defines the framework used to run ingestion jobs.
executionFrameworkSpec:
# name: execution framework name
name: 'spark'
# segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
# segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
# segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
#segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
# extraConfigs: extra configs for execution framework.
extraConfigs:
# stagingDir is used in the distributed filesystem to host all the segments, which are then moved to the output directory.
stagingDir: your/local/dir/staging
IN_MEMORY: Indicates that the validDocIds bitmap is read from the real-time server's in-memory state.
IN_MEMORY_WITH_DELETE: Indicates that the validDocIds bitmap is read from the real-time server's in-memory state and that the valid document ids take deleted records into account. UpsertConfig's deleteRecordColumn must be provided for this type.
First, download the Pinot distribution for this tutorial. You can either download a packaged release or build a distribution from the source code.
Prerequisites
Install with JDK 11 or 21. JDK 17 should work, but it is not officially supported.
For JDK 8 support, Pinot 0.12.1 is the last version that can be compiled from the source code.
Pinot 1.0+ no longer supports JDK 8; build with JDK 11+.
Note that some installations of the JDK do not contain the JNI bindings necessary to run all tests. If you see an error like java.lang.UnsatisfiedLinkError while running tests, you might need to change your JDK.
Download the distribution or build from source by selecting one of the following tabs:
Download the latest binary release from , or use this command:
Extract the TAR file:
Navigate to the directory containing the launcher scripts:
You can also find older versions of Apache Pinot at . For example, to download Pinot 0.10.0, run the following command:
Follow these steps to check out the code and build Pinot locally.
Set up a cluster
Now that we've downloaded Pinot, it's time to set up a cluster. There are two ways to do this: through quick start or through setting up a cluster manually.
Quick start
Pinot comes with quick start commands that launch instances of Pinot components in the same process and import pre-built datasets.
For example, the following quick start command launches Pinot with a baseball dataset pre-loaded:
For a list of all the available quick start commands, see the .
Manual cluster
If you want to play with bigger datasets (more than a few megabytes), you can launch each component individually.
The video below is a step-by-step walk through for launching the individual components of Pinot and scaling them to multiple instances.
You can find the commands that are shown in this video in the .
The examples below assume that you are using Java 11+.
If you are using Java 8, add the following settings inside JAVA_OPTS. So, for example, instead of this:
Use the following:
Start Zookeeper
You can use to browse the Zookeeper instance.
Start Pinot Controller
Start Pinot Broker
Start Pinot Server
Start Pinot Minion
Start Kafka
Once your cluster is up and running, you can head over to to learn how to run queries against the data.
Setup cluster with config files
You can also customize the cluster by modifying the config files and starting each component with its own config file:
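For example, a hedged sketch of starting components with config files (the file names are assumptions):

bin/pinot-admin.sh StartController -configFileName conf/pinot-controller.conf
bin/pinot-admin.sh StartBroker -configFileName conf/pinot-broker.conf
bin/pinot-admin.sh StartServer -configFileName conf/pinot-server.conf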
Start a Pinot component in debug mode with IntelliJ
Set break points and inspect variables by starting a Pinot component with debug mode in IntelliJ.
The following example demonstrates server debugging:
First, start zookeeper, controller, and broker as described in the steps above.
Then, use the following configuration (placed under $PROJECT_DIR$/.run) to start the server, replacing the metrics-core version and cluster name as needed.
This is an example of how to use it.
Complex Type (Array, Map) Handling
Complex type handling in Apache Pinot.
Commonly, ingested data has a complex structure. For example, Avro schemas have records and arrays while JSON supports objects and arrays.
Apache Pinot's data model supports primitive data types (including int, long, float, double, BigDecimal, string, bytes) and limited multi-value types, such as an array of primitive types. Simple data types allow Pinot to build fast indexing structures for good query performance, but they require some handling of complex structures.
There are two options for complex type handling:
Convert the complex-type data into a JSON string and then build a JSON index.
Use the built-in complex-type handling rules in the ingestion configuration.
On this page, we'll show how to handle these complex-type structures with each of these two approaches. We will process some example data, consisting of the field group from the .
This object has two child fields and the child group is a nested array with elements of object type.
JSON indexing
Apache Pinot provides a powerful JSON index to accelerate value lookup and filtering on the column. To convert an object such as group with a complex type to JSON, add the following to your table configuration.
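A hedged sketch of the relevant table configuration, assuming the jsonFormat transform function and a destination column named group_json:

"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "group_json",
      "transformFunction": "jsonFormat(\"group\")"
    }
  ]
},
"tableIndexConfig": {
  "jsonIndexColumns": [
    "group_json"
  ]
}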
The config transformConfigs transforms the object group to a JSON string group_json, which then creates the JSON indexing with configuration jsonIndexColumns. To read the full spec, see .
Also, note that group is a reserved keyword in SQL and therefore needs to be quoted in transformFunction.
The columnName can't use the same name as any of the fields in the source JSON data, for example, if our source data contains the field group and we want to transform the data in that field before persisting it, the destination column name would need to be something different, like group_json.
Note that you do not need to worry about the maxLength of the field group_json on the schema, because "JSON" data type does not have a maxLength and will not be truncated. This is true even though "JSON" is stored as a string internally.
The schema will look like this:
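A hedged sketch of the relevant part of the schema (other fields omitted):

"dimensionFieldSpecs": [
  {
    "name": "group_json",
    "dataType": "JSON"
  }
]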
For the full specification, see .
With this, you can start to query the nested fields under group. For more details about the supported JSON functions, see the JSON functions documentation.
Ingestion configurations
Though JSON indexing is a handy way to process the complex types, there are some limitations:
It’s not performant to group by or order by a JSON field, because JSON_EXTRACT_SCALAR is needed to extract the values in the GROUP BY and ORDER BY clauses, which invokes the function evaluation.
It does not work with Pinot's multi-value functions, such as DISTINCTCOUNTMV.
Alternatively, from Pinot 0.8, you can use the complex-type handling in ingestion configurations to flatten and unnest the complex structure and convert them into primitive types. Then you can reduce the complex-type data into a flattened Pinot table, and query it via SQL. With the built-in processing rules, you do not need to write ETL jobs in another compute framework such as Flink or Spark.
To process this complex type, you can add the configuration complexTypeConfig to the ingestionConfig. For example:
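A hedged sketch of such a configuration, using the option names described below (the unnested field path is an assumption based on the example data):

"ingestionConfig": {
  "complexTypeConfig": {
    "fieldsToUnnest": ["group.group_topics"],
    "delimiter": ".",
    "collectionNotUnnestedToJson": "NON_PRIMITIVE"
  }
}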
With complexTypeConfig, all the map objects will be flattened to direct fields automatically. And with fieldsToUnnest, a record with a nested collection will be unnested into multiple records. For instance, the example at the beginning will transform into two rows with this configuration.
Note that:
The nested field group_id under group is flattened to group.group_id. The default delimiter is the dot character (.). You can choose another delimiter by specifying the delimiter configuration under complexTypeConfig. This flattening rule also applies to maps in the collections to be unnested.
You can find the full specifications of the table config and the table schema .
You can then query the table with primitive values using the following SQL query:
The dot (.) is a reserved character in SQL, so you need to quote the flattened columns in the query.
Infer the Pinot schema from the Avro schema and JSON data
When there are complex structures, it can be challenging and tedious to figure out the Pinot schema manually. To help with schema inference, Pinot provides utility tools to take the Avro schema or JSON data as input and output the inferred Pinot schema.
To infer the Pinot schema from Avro schema, you can use a command like this:
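A hedged sketch of that command; the file paths, schema name, time column, and fields to unnest are placeholders:

bin/pinot-admin.sh AvroSchemaToPinotSchema \
  -timeColumnName hoursSinceEpoch \
  -avroSchemaFile /tmp/test.avsc \
  -pinotSchemaName myTable \
  -outputDir /tmp/schemaOutput \
  -fieldsToUnnest entries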
Note you can input configurations like fieldsToUnnest similar to the ones in complexTypeConfig. And this will simulate the complex-type handling rules on the Avro schema and output the Pinot schema in the file specified in outputDir.
Similarly, you can use a command like the following to infer the Pinot schema from a file of JSON objects.
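A hedged sketch of the JSON variant; again, the paths and parameter values are placeholders:

bin/pinot-admin.sh JsonToPinotSchema \
  -timeColumnName hoursSinceEpoch \
  -jsonFile /tmp/test.json \
  -pinotSchemaName myTable \
  -outputDir /tmp/schemaOutput \
  -fieldsToUnnest payload.commits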
You can check out an example of this run in this .
0.4.0
0.4.0 release introduced the theta-sketch based distinct count function, an S3 filesystem plugin, a unified star-tree index implementation, migration from TimeFieldSpec to DateTimeFieldSpec, etc.
Summary
0.4.0 release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.
Notable New Features
Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
Used time column from table config instead of schema (#5320)
Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392
Major Bug Fixes
Do not release the PinotDataBuffer when closing the index (#5400)
Handled a no-arg function in query parsing and expression tree (#5375)
Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
Work in Progress
Upsert: support overriding data in the real-time table (#4261).
Add pinot upsert features to pinot common (#5175)
Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc
Backward Incompatible Changes
TableConfig no longer supports de-serialization from a JSON string of a nested JSON string (i.e., no \" inside the JSON) (#5194)
The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):
Ingest records with dynamic schemas
Storing records with dynamic schemas in a table with a fixed schema.
Some domains (e.g., logging) generate records where each record can have a different set of keys, whereas Pinot tables have a relatively static schema. For records with varying keys, it's impractical to store each field in its own table column. However, most (if not all) fields may be important, so fields should not be dropped unnecessarily.
Additionally, search patterns on such a table can be complex and change frequently. Exact match, range queries, prefix/suffix match, wildcard search, and aggregation functions may be used on any old or newly created keys or values.
SchemaConformingTransformer
0.5.0
This release includes many new features on Pinot ingestion and connectors, query capability and a revamped controller UI.
Summary
This release includes many new features on Pinot ingestion and connectors (e.g., support for filtering during ingestion which is configurable in table config; support for json during ingestion; proto buf input format support and a new Pinot JDBC client), query capability (e.g., a new GROOVY transform function UDF) and admin functions (a revamped Cluster Manager UI & Query Console UI). It also contains many key bug fixes. See details below.
The release was cut from the following commit:
and the following cherry-picks:
If you're building with JDK 8, add Maven option -Djdk.version=8.
Navigate to the directory containing the setup scripts. Note that Pinot scripts are located under pinot-distribution/target, not the target directory under root.
Pinot can also be installed on Mac OS using the Brew package manager. For instructions on installing Brew, see the Brew documentation.
PINOT_VERSION=1.1.0 # set to the Pinot version you decide to use
wget https://downloads.apache.org/pinot/apache-pinot-$PINOT_VERSION/apache-pinot-$PINOT_VERSION-bin.tar.gz
The nested array group_topics under group is unnested into the top level, converting the output into a collection of two rows. Note the handling of the nested field within group_topics and the eventual top-level field group.group_topics.urlkey. All the collections to unnest must be included in the fieldsToUnnest configuration.
Collections not specified in fieldsToUnnest will be serialized into a JSON string, except for arrays of primitive values, which are ingested as a multi-value column by default. The behavior is defined by the collectionNotUnnestedToJson config, which takes the following values:
NON_PRIMITIVE - Converts the array to a multi-value column. (default)
ALL - Converts the array of primitive values to JSON string.
The SchemaConformingTransformer is a RecordTransformer that can transform records with dynamic schemas so that they can be ingested into a table with a static schema. The transformer takes record fields that don't exist in the schema and stores them in a catchall field. Moreover, it builds a __mergedTextIndex field and takes advantage of Lucene to support text search.
For example, consider this record:
Let's say the table's schema contains the following fields:
arrayField
mapField
nestedFields
nestedFields.stringField
json_data
json_data_no_idx
__mergedTextIndex
Without this transformer, the stringField field and fields ending with _noIdx would be dropped, and the storage of mapField and nestedFields would rely on the global complex-type handling setup without granular customization. With this transformer, however, the record is transformed into the following:
Notice that there are three reserved (and configurable) fields: json_data, json_data_no_idx, and __mergedTextIndex. The transformer does the following:
Flattens nested fields all the way to the leaf nodes and:
Applies special treatment if necessary, according to the config
If the key path matches the schema, puts the data into the dedicated field
Otherwise, puts it into json_data or json_data_no_idx depending on its key suffix
For keys in dedicated columns or json_data, puts them into __mergedTextIndex in the form of "Begin Anchor + value + Separator + key + End Anchor" to power text matching.
Additional functionalities by configurations
Drop fields with fieldPathsToDrop
Preserve a subtree without flattening with fieldPathsToPreserveInput and fieldPathsToPreserveInputWithIndex
Table Configurations
SchemaConformingTransformer Configuration
To use the transformer, add the schemaConformingTransformerConfig option in the ingestionConfig section of your table configuration, as shown in the following example.
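A hedged sketch of that option, wired to the reserved field names and suffix described above; the exact option keys should be checked against the SchemaConformingTransformer reference for your Pinot version:

"ingestionConfig": {
  "schemaConformingTransformerConfig": {
    "indexableExtrasField": "json_data",
    "unindexableExtrasField": "json_data_no_idx",
    "unindexableFieldSuffix": "_noIdx",
    "mergedTextIndexField": "__mergedTextIndex",
    "fieldPathsToDrop": [],
    "fieldPathsToPreserveInput": []
  }
}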
Other index configurations for the three reserved columns can be set as follows:
Specifically, a customizable JSON index can be configured using the JSON index indexPaths option.
Power the text search
Schema Design
With the help of the SchemaConformingTransformer, all data can be kept even without specifying dedicated columns in the table schema. However, to optimize storage and common query patterns, dedicated columns should be created based on usage:
Fields with frequent exact match query, e.g. region, log_level, runtime_env
Fields with range query, e.g. timestamp
High-frequency fields from messages, to reduce JSON index size and optimize group-by queries
Text Search
After putting each key/value pair into the __mergedTextIndex field, you will need a luceneAnalyzerClass to tokenize the document and a luceneQueryParserClass to query by tokens. Some common search patterns and their queries are:
Exact key/value match TEXT_MATCH(__mergedTextIndex, '"value:key"')
Wildcard value search in a key TEXT_MATCH(__mergedTextIndex, '/.* value .*:key/')
Global value exact match TEXT_MATCH(__mergedTextIndex, '/"value"/')
Global value wildcard match TEXT_MATCH(__mergedTextIndex, '/.* value .*/')
The luceneAnalyzerClass and luceneQueryParserClass usually need to use a similar delimiter set. They also need to account for the anchor and separator characters described below.
With the given example, each key/value pair would be stored as "\u0002value\u001ekey\u0003". Prefix and suffix matches on the key or value need to be adjusted accordingly in the luceneQueryParserClass.
Allowing update on an existing instance config: PUT /instances/{instanceName} with Instance object as the pay-load (#PR4952)
Add PinotServiceManager to start Pinot components (#PR5266)
Support for protocol buffers input format. (#PR5293)
Add GenericTransformFunction wrapper for simple ScalarFunctions () — Adding support to invoke any scalar function via GenericTransformFunction
Add Support for SQL CASE Statement ()
Support distinctCountRawThetaSketch aggregation that returns serialized sketch. ()
Add multi-value support to SegmentDumpTool () — add segment dump tool as part of the pinot-tool.sh script
Add json_format function to convert json object to string during ingestion. () — Can be used to store complex objects as a json string (which can later be queries using jsonExtractScalar)
Support escaping single quote for SQL literal () — This is especially useful for DistinctCountThetaSketch because it stores expression as literal E.g. DistinctCountThetaSketch(..., 'foo=''bar''', ...)
Support expression as the left-hand side for BETWEEN and IN clause ()
Add a new field IngestionConfig in TableConfig — FilterConfig: ingestion level filtering of records, based on filter function. () — TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release ().
Allow star-tree creation during segment load () — Introduced a new boolean config enableDynamicStarTreeCreation in IndexingConfig to enable/disable star-tree creation during segment load.
Support for Pinot clients using JDBC connection ()
Support customized accuracy for distinctCountHLL, distinctCountHLLMV functions by adding log2m value as the second parameter in the function. () —Adding cluster config: default.hyperloglog.log2m to allow user set default log2m value.
Add segment encryption on Controller based on table config ()
Add a constraint to the message queue for all instances in Helix, with a large default value of 100000. ()
Support order-by aggregations not present in SELECT () — Example: "select subject from transcript group by subject order by count() desc" This is equivalent to the following query but the return response should not contain count(). "select subject, count() from transcript group by subject order by count() desc"
Add geo support for Pinot queries () — Added geo-spatial data model and geospatial functions
Cluster Manager UI & Query Console UI revamp ( and ) — updated cluster manage UI and added table details page and segment details page
Add Controller API to explore Zookeeper ()
Support BYTES type for distinctCount and group-by ( and ) —Add BYTES type support to DistinctCountAggregationFunction —Correctly handle BYTES type in DictionaryBasedAggregationOperator for DistinctCount
Support for ingestion job spec in JSON format ()
Improvements to RealtimeProvisioningHelper command () — Improved docs related to ingestion and plugins
Added GROOVY transform function UDF () — Ability to run a groovy script in the query as a UDF. e.g. string concatenation: SELECT GROOVY('{"returnType": "INT", "isSingleValue": true}', 'arg0 + " " + arg1', columnA, columnB) FROM myTable
Special notes
Changed the stream and metadata interface (PR#5542) — This PR concludes the work for the issue #5359 to extend offset support for other streams
TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
Config key enable.case.insensitive.pql in Helix cluster config is deprecated, and replaced with enable.case.insensitive. ()
Change default segment load mode to MMAP. () —The load mode for segments currently defaults to heap.
Major Bug fixes
Fix bug in distinctCountRawHLL on SQL path (#5494)
Fix backward incompatibility for existing stream implementations (#5549)
Fix backward incompatibility in StreamFactoryConsumerProvider (#5557)
Fix logic in isLiteralOnlyExpression. ()
Fix double memory allocation during operator setup ()
Allow segment download url in Zookeeper to be deep store uri instead of hardcoded controller uri ()
Fix a backward compatible issue of converting BrokerRequest to QueryContext when querying from Presto segment splits ()
Fix the issue that PinotSegmentToAvroConverter does not handle BYTES data type. ()
Backward Incompatible Changes
PQL queries with HAVING clause will no longer be accepted for the following reasons: (#PR5570) — HAVING clause does not apply to PQL GROUP-BY semantic where each aggregation column is ordered individually — The current behavior can produce inaccurate results without any notice — HAVING support will be added for SQL queries in the next release
Because of the standardization of the DistinctCountThetaSketch predicate strings, upgrade Broker before Server. The new Broker can handle both standard and non-standard predicate strings for backward-compatibility. (#PR5613)
Discover the segment component in Apache Pinot for efficient data storage and querying within Pinot clusters, enabling optimized data processing and analysis.
Pinot tables are stored in one or more independent shards called segments. A small table may be contained in a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ingestion). Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.
Pinot achieves this by breaking the data into smaller chunks known as segments (similar to shards/partitions in relational databases). Segments can be seen as time-based partitions.
A segment is a horizontal shard representing a chunk of table data with some number of rows. The segment stores data for all columns of the table. Each segment packs the data in a columnar fashion, along with the dictionaries and indices for the columns. The segment is laid out in a columnar format so that it can be directly mapped into memory for serving queries.
Columns can be single- or multi-valued, and the following types are supported: STRING, BOOLEAN, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, and BYTES. BIG_DECIMAL is supported only as a single-valued type.
Columns may be declared to be metric or dimension (or specifically a time dimension) in the schema. Columns can have default null values. For example, the default null value of an integer column can be 0. The default value for BYTES columns must be hex-encoded before it's added to the schema.
Pinot uses dictionary encoding to store values as dictionary IDs. Columns may be configured as "no-dictionary" columns, in which case raw values are stored. Dictionary IDs are encoded using the minimum number of bits for efficient storage (e.g., a column with a cardinality of 3 will use only 2 bits per dictionary ID).
A forward index is built for each column and compressed for efficient memory use. In addition, you can optionally configure inverted indices for any set of columns. Inverted indices take up more storage, but improve query performance. Specialized indexes like Star-Tree index are also supported. For more details, see .
Creating a segment
Once the table is configured, we can load some data. Loading data involves generating Pinot segments from raw data and pushing them to the Pinot cluster. Data can be loaded in batch mode or streaming mode. For more details, see the ingestion overview page.
Load data in batch
Prerequisites
Below are instructions to generate and push segments to Pinot via standalone scripts. For a production setup, you should use frameworks such as Hadoop or Spark. For more details on setting up data ingestion jobs, see
Job Spec YAML
To generate a segment, we first need to create a job spec YAML file. This file contains all the information regarding data format, input data location, and Pinot cluster coordinates. Note that this assumes the controller is RUNNING so that the table config and schema can be fetched. If not, you will have to configure the spec to point at their location. For the full configuration reference, see the ingestion job spec documentation.
Create and push segment
To create and push the segment in one go, use the following:
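A hedged sketch of that command, assuming a job spec file named ingestionJobSpec.yaml:

bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile ingestionJobSpec.yaml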
Sample Console Output
Alternately, you can separately create and then push, by changing the jobType to SegmentCreation or SegmentTarPush.
Templating Ingestion Job Spec
The Ingestion job spec supports templating with Groovy Syntax.
This is convenient if you want to generate one ingestion job template file and schedule it on a daily basis with extra parameters updated daily.
e.g. you could set inputDirURI with parameters to indicate the date, so that the ingestion job only processes the data for a particular date. Below is an example that templates the date for input and output directories.
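For instance, a hedged sketch of the templated directory settings in the job spec (the paths are placeholders):

inputDirURI: 'examples/rawdata/${year}/${month}/${day}'
outputDirURI: 'examples/segments/${year}/${month}/${day}'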
You can pass in arguments containing the values for ${year}, ${month}, and ${day} when kicking off the ingestion job: -values key1=value1 key2=value2 ...
This ingestion job only generates segments for date 2014-01-03
Load data in streaming
Prerequisites
Below is an example of how to publish sample data to your stream. As soon as data is available to the real-time stream, it starts getting consumed by the real-time servers.
Kafka
Run below command to stream JSON data into Kafka topic: flights-realtime
Dictionary index
When dealing with extensive datasets, it's common for values to be repeated multiple times. To enhance storage efficiency and reduce query latencies, we strongly recommend employing a dictionary index for repetitive data. This is the reason Pinot enables dictionary encoding by default, even though it is advisable to disable it for columns with high cardinality.
Influence on other indexes
In Pinot, dictionaries serve as both an index and actual encoding. Consequently, when dictionaries are enabled, the behavior or layout of certain other indexes undergoes modification. The relationship between dictionaries and other indexes is outlined in the following table:
Index
Conditional
Description
Configuration
Deterministically enable or disable dictionaries
Unlike many other indexes, dictionary indexes are enabled by default, under the assumption that the count of unique values will be significantly lower than the number of rows.
If this assumption does not hold true, you can deactivate the dictionary for a specific column by setting the disabled property to true within indexes.dictionary:
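A hedged sketch of that setting inside fieldConfigList (the column name is a placeholder):

"fieldConfigList": [
  {
    "name": "myHighCardinalityColumn",
    "indexes": {
      "dictionary": {
        "disabled": true
      }
    }
  }
]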
Alternatively, the encodingType property can be changed. For example:
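A hedged sketch of the equivalent change via encodingType (again, the column name is a placeholder):

"fieldConfigList": [
  {
    "name": "myHighCardinalityColumn",
    "encodingType": "RAW"
  }
]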
You may choose the option you prefer, but it's essential to maintain consistency, as Pinot will reject table configurations where the same column and index are defined in different locations.
Heuristically enable dictionaries
Most of the time, the domain expert who creates the table knows whether a dictionary will be useful. For example, a column with random values or public IPs will probably have a large cardinality, so it can immediately be targeted as raw encoded, while a column like employee id will have a small cardinality and can easily be recognized as a good dictionary candidate. But sometimes the decision may not be clear. To help in these situations, Pinot can be configured to heuristically create the dictionary depending on the actual values and a relation factor.
When this heuristic is enabled, Pinot calculates a saving factor for each candidate column: the ratio between the forward index size when encoded as raw and the size when encoded as a dictionary. If the saving factor for a candidate column is less than the configured saving ratio, the dictionary is not created.
In order to be considered as a candidate for the heuristic, a column must:
Be marked as dictionary encoded (columns marked as raw are always encoded as raw).
Be single valued (multi-valued columns are never considered by the heuristic).
Be of a fixed size type such as int, long, double, timestamp, etc. Variable size types like json, strings or bytes are never considered by the heuristic.
Optionally this feature can be applied only to metric columns, skipping dimension columns.
This functionality can be enabled within the indexingConfig object within the table configuration. The parameters that govern these heuristics are:
Parameter
Default
Description
It's important to emphasize that:
These parameters are configured for all columns within the table.
optimizeDictionary takes precedence over optimizeDictionaryForMetrics.
Parameters
Dictionaries can be configured with the following options
Parameter
Default
Description
Variable length dictionaries
The useVarLengthDictionary parameter only impacts columns whose values vary in the number of bytes they occupy. This includes column types that require a variable number of bytes, such as strings, bytes, or big decimals, and scenarios where not all values within a segment occupy the same number of bytes. For example, even though strings in general require a variable number of bytes, if a segment contains only the values "a", "b", and "c", Pinot will identify that all values in the segment can be represented with the same number of bytes.
By default, useVarLengthDictionary is set to false, which means Pinot will calculate the length of the largest value contained within the segment. This length will then be used for all values. This approach ensures that all values can be stored efficiently, resulting in faster access and a more compressed layout when the lengths of values are similar.
If your dataset includes a few very large values and a multitude of very small ones, it is advisable to instruct Pinot to utilize variable-length encoding by setting useVarLengthDictionary to true. When variable encoding is employed, Pinot is required to store the length of each entry. Consequently, the cost of storing an entry becomes its actual size plus an additional 4 bytes for the offset.
On-heap dictionaries
Dictionary data is always stored off-heap. In general, it is recommended to keep dictionaries that way. However, in cases where the cardinality is small, and the on-heap memory usage is acceptable, you can copy them into memory by setting the onHeap parameter to true.
Remember: On-heap dictionaries are not recommended.
On-heap dictionaries can slightly reduce latency but will significantly increase the heap memory used by Pinot and increase garbage collection times, which may result in out of memory issues.
When off-heap dictionaries are used, data is deserialized each time it is accessed. This isn't a problem with primitive types (such as int or long), but with complex types (like strings or bytes) it means new objects are created on each access. On-heap dictionaries solve this problem by keeping the data in memory in deserialized form, so no allocations are needed at query time.
However, on-heap dictionaries have a cost in terms of memory usage and that cost is proportional to the number of segments that are accessed concurrently. It is important to note that, as with all other indexes, the dictionary scope is limited to segments. This means that if we have a table with 1,000 segments and a dictionary for a column, we may have 1,000 dictionaries in memory. This can be a waste of memory in cases where unique values are repeated across segments. To solve this problem, Pinot can retain a cache of the dictionary values and reuse them across segments. This cache is not shared between different tables or columns and its maximum size is controlled by the dictionary.intern.capacity option.
Only string and byte columns can be interned. Pinot ignores the intern configuration when used on columns with a different data type.
Here's an example of configuring a dictionary to use on-heap dictionaries with intern mode enabled:
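A hedged sketch of such a configuration; the column name and intern capacity are placeholders:

"fieldConfigList": [
  {
    "name": "myLowCardinalityStringColumn",
    "indexes": {
      "dictionary": {
        "onHeap": true,
        "intern": {
          "capacity": 32000
        }
      }
    }
  }
]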
Running Pinot in Docker
This guide will show you how to run a Pinot cluster using Docker.
Get started setting up a Pinot cluster with Docker using the guide below.
Prerequisites:
Install Docker.
Configure Docker memory with the following minimum resources:
Quick Start Examples
This section describes quick start commands that launch all Pinot components in a single process.
Pinot ships with QuickStart commands that launch Pinot components in a single process and import pre-built datasets. These quick start examples are a good place to start if you're just getting started with Pinot. The examples begin with the Batch Processing example, after the following notes:
Prerequisites
You must have either a local Pinot installation or Docker. The examples are available for each option and work the same. The decision of which to choose depends on your installation preference and how you generally like to work. If you don't know which to choose, using Docker will make your cleanup easier after you are done with the examples.
Ingest streaming data from Apache Pulsar
This guide shows you how to ingest a stream of records from an Apache Pulsar topic into a Pinot table.
Pinot supports consuming data from Apache Pulsar via the pinot-pulsar plugin. You need to enable this plugin so that the Pulsar-specific libraries are present in the classpath.
Enable the Pulsar plugin with the following config at the time of Pinot setup: -Dplugins.include=pinot-pulsar
Geospatial
This page talks about geospatial support in Pinot.
Pinot supports SQL/MM geospatial data and is compliant with the . This includes:
Geospatial data types, such as point, line and polygon;
Geospatial functions, for querying of spatial properties and relationships.
The Docker-based examples on this page use pinot:latest, which instructs Docker to pull and use the most recent release of Apache Pinot. If you prefer to use a specific release instead, you can designate it by replacing latest with the release number, like this: pinot:0.12.1.
The local install-based examples that are run using the launcher scripts will use the Apache Pinot version you installed.
Stopping a running example
To stop a running example, enter Ctrl+C in the same terminal where you ran the docker run command to start the example.
macOS Monterey Users
By default the Airplay receiver server runs on port 7000, which is also the port used by the Pinot Server in the Quick Start. You may see the following error when running these examples:
If you disable the Airplay receiver server and try again, you shouldn't see this error message anymore.
Batch Processing
This example demonstrates how to do batch processing with Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the baseballStats table
Launches a standalone data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Batch JSON
This example demonstrates how to import and query JSON documents in Pinot. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the githubEvents table
Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Batch with complex data types
This example demonstrates how to do batch processing in Pinot where the data items have complex fields that need to be unnested. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.
Creates the githubEvents table
Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.
Issues sample queries to Pinot
Streaming
This example demonstrates how to do stream processing with Pinot. The command:
Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot
Issues sample queries to Pinot
Streaming with minion cleanup
This example demonstrates how to do stream processing in Pinot with RealtimeToOfflineSegmentsTask and MergeRollupTask minion tasks continuously optimizing segments as data gets ingested. The command:
Publishes data to a Kafka topic githubEvents that is subscribed to by Pinot.
Issues sample queries to Pinot
Streaming with complex data types
This example demonstrates how to do stream processing in Pinot where the stream contains items that have complex fields that need to be unnested. The command:
Launches a standalone data ingestion job that builds segments under a given directory of Avro files for the airlineStats table and pushes the segments to the Pinot Controller.
Launches a stream of flights stats
Publishes data to a Kafka topic airlineStatsEvents that is subscribed to by Pinot.
Issues sample queries to Pinot
Join
This example demonstrates how to do joins in Pinot using the Lookup UDF. The command:
Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server in the same container.
Creates the baseballStats table
Launches a data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.
Creates the dimBaseballTeams table
Launches a data ingestion job that builds one segment for a given CSV data file for the dimBaseballTeams table and pushes the segment to the Pinot Controller.
The quick start scripts launch Pinot with minimal resources. If you want to play with bigger datasets (more than a few MB), you can launch each of the Pinot components individually.
Note that these are sample configurations to be used as references. You will likely want to customize them to meet your needs for production use.
Docker
Create a Network
Create an isolated bridge network in docker
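For example, assuming the network name pinot-demo:

docker network create -d bridge pinot-demo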
Export Docker Image tags
Export the necessary docker image tags for Pinot, Zookeeper, and Kafka.
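A hedged sketch using the image versions shown in the sample console output below; the variable names are assumptions:

export PINOT_IMAGE=apachepinot/pinot:1.2.0
export ZOOKEEPER_IMAGE=zookeeper:3.9.2
export KAFKA_IMAGE=bitnami/kafka:3.6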
Start Zookeeper
Start Zookeeper in daemon mode. This is a single node zookeeper setup. Zookeeper is the central metadata store for Pinot and should be set up with replication for production use. For more information, see Running Replicated Zookeeper.
Start Pinot Controller
Start Pinot Controller in daemon mode and connect to Zookeeper.
The command below expects a 4GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.
Start Pinot Broker
Start Pinot Broker in daemon mode and connect to Zookeeper.
The command below expects a 4GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.
Start Pinot Server
Start Pinot Server in daemon mode and connect to Zookeeper.
The command below expects a 16GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.
Start Kafka
Optionally, you can also start Kafka for setting up real-time streams. This brings up the Kafka broker on port 9092.
Now all Pinot-related components are started as an empty cluster.
Run the below command to check container status:
Sample Console Output
Docker Compose
Export Docker Image tags
Optionally, export the necessary docker image tags for Pinot, Zookeeper, and Kafka.
Create docker-compose.yml file
Create a file called docker-compose.yml that contains the following:
Launch the components
Run the following command to launch all the required components:
OR, optionally, run the following command to launch all the components, including kafka:
Run the below command to check the container status:
Sample Console Output
Once your cluster is up and running, see Exploring Pinot to learn how to run queries against the data.
Failed to start a Pinot [SERVER]
java.lang.RuntimeException: java.net.BindException: Address already in use
at org.apache.pinot.core.transport.QueryServer.start(QueryServer.java:103) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
at org.apache.pinot.server.starter.ServerInstance.start(ServerInstance.java:158) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:110) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da2113
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
accc70bc7f07 bitnami/kafka:3.6 "/opt/bitnami/script…" About a minute ago Up About a minute 0.0.0.0:9092->9092/tcp kafka
1b8b80395959 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" About a minute ago Up About a minute 8096-8097/tcp, 8099/tcp, 9000/tcp, 0.0.0.0:8098->8098/tcp pinot-server
134a67eec957 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" About a minute ago Up About a minute 8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp pinot-broker
4fcc72cb7302 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" About a minute ago Up About a minute 8096-8099/tcp, 0.0.0.0:9000->9000/tcp pinot-controller
144304524f6c zookeeper:3.9.2 "/docker-entrypoint.…" About a minute ago Up About a minute 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp pinot-zookeeper
export KAFKA_REPLICAS=1
docker compose --project-name pinot-demo up
docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f34a046ac69f bitnami/kafka:3.6 "/opt/bitnami/script…" 9 minutes ago Up About a minute (healthy) 0.0.0.0:9092->9092/tcp kafka
f28021bd5b1d apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" 18 minutes ago Up About a minute (healthy) 8096-8097/tcp, 8099/tcp, 9000/tcp, 0.0.0.0:8098->8098/tcp pinot-server
e938453054b0 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" 18 minutes ago Up About a minute (healthy) 8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp pinot-broker
e0d0c71303a8 apachepinot/pinot:1.2.0 "./bin/pinot-admin.s…" 18 minutes ago Up About a minute (healthy) 8096-8099/tcp, 0.0.0.0:9000->9000/tcp pinot-controller
4be5f168f252 zookeeper:3.9.2 "/docker-entrypoint.…" 18 minutes ago Up About a minute (healthy) 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp pinot-zookeeper
FST
Requires dictionary.
Incompatible with dictionary.
Not indexed by text index or JSON index (as they are only useful when cardinality is very large).
optimizeDictionary
false
Enables the heuristic for all columns and activates some extra rules.
optimizeDictionaryForMetrics
false
Enables the heuristic for metric columns.
noDictionarySizeRatioThreshold
0.85
The saving ratio used in the heuristics.
onHeap
false
Specifies whether the index should be loaded on heap or off heap.
useVarLengthDictionary
false
Determines how to store variable-length values.
intern
empty object
Configuration for interning. Only for on-heap dictionaries. Read about that below.
intern.capacity
Disables dictionary.
null
The pinot-pulsar plugin is not part of the official 0.10.0 binary. You can download the plugin separately and add it to the libs or plugins directory in Pinot.
Set up Pulsar table
Here is a sample Pulsar stream config. You can use the streamConfigs section from this sample and make changes for your corresponding table.
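A hedged sketch of such a streamConfigs section; the topic name, broker URL, and consumer factory class are placeholders to verify against your setup:

"streamConfigs": {
  "streamType": "pulsar",
  "stream.pulsar.topic.name": "my-topic",
  "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
  "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
  "stream.pulsar.metadata.populate": "true"
}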
Pulsar configuration options
You can change the following Pulsar-specific configurations for your tables:
Property
Description
streamType
This should be set to "pulsar"
stream.pulsar.topic.name
Your pulsar topic name
stream.pulsar.bootstrap.servers
Comma-separated broker list for Apache Pulsar
stream.pulsar.metadata.populate
Set to true to populate metadata.
stream.pulsar.metadata.fields
Set to a comma-separated list of metadata fields to extract.
Authentication
The Pinot-Pulsar connector supports authentication using security tokens. To generate a token, follow the instructions in Pulsar documentation. Once generated, add the following property to streamConfigs to add an authentication token for each request:
OAuth2 Authentication
The Pinot-Pulsar connector supports authentication using OAuth2, for example, if connecting to a StreamNative Pulsar cluster. For more information, see how to Configure OAuth2 authentication in Pulsar clients. Once configured, you can add the following properties to streamConfigs:
TLS support
The Pinot-pulsar connector also supports TLS for encrypted connections. You can follow the official pulsar documentation to enable TLS on your pulsar cluster. Once done, you can enable TLS in pulsar connector by providing the trust certificate file location generated in the previous step.
Also, make sure to change the broker URL from pulsar://localhost:6650 to pulsar+ssl://localhost:6650 so that secure connections are used.
Pinot currently relies on Pulsar client version 2.7.2. Make sure the Pulsar broker is compatible with this client version.
Extract record headers as Pinot table columns
Pinot's Pulsar connector supports automatically extracting record headers and metadata into the Pinot table columns. Pulsar supports a large amount of per-record metadata. Reference the official Pulsar documentation for the meaning of the metadata fields.
The following table shows the mapping for record header/metadata to Pinot table column names:
Pulsar Message field → Pinot table column (comments; availability):
key (String) → __key (String). Available by default.
properties (Map<String, String>) → each header key is listed as a separate column: __header$HeaderKeyName (String). Available by default.
publishTime (Long) → __metadata$publishTime (String). Available by default.
To enable metadata extraction in a Pulsar table, set the stream config metadata.populate to true. The fields eventTime, publishTime, brokerPublishTime, and key are populated by default. If you would like to extract additional fields from the Pulsar Message, populate the metadataFields config with a comma-separated list of fields. The fields are referenced by the field name in the Pulsar Message. For example, setting:
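(a sketch; the exact field names accepted by the metadataFields config may vary by release)
"stream.pulsar.metadata.populate": "true",
"stream.pulsar.metadata.fields": "messageId,messageBytes,eventTime,topicName"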
will make the __metadata$messageId, __metadata$messageBytes, __metadata$eventTime, and __metadata$topicName fields available for mapping to columns in the Pinot schema.
In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.
For example, to add only the message key and one metadata field as dimension columns in your Pinot table, list them in the schema as follows:
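A partial sketch of the schema's dimensionFieldSpecs (illustrated here with the message key and the message ID column from the mapping above):
"dimensionFieldSpecs": [
  {"name": "__key", "dataType": "STRING"},
  {"name": "__metadata$messageId", "dataType": "STRING"}
]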
Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.
Geospatial indexing, used for efficient processing of spatial operations
Geospatial data types
Geospatial data types abstract and encapsulate spatial structures such as boundary and dimension. In many respects, spatial data types can be understood simply as shapes. Pinot supports the Well-Known Text (WKT) and Well-Known Binary (WKB) forms of geospatial objects, for example:
It is common to have data in which the coordinates are geographic, that is, latitude/longitude. Unlike coordinates in Mercator or UTM, geographic coordinates are not Cartesian coordinates.
Geographic coordinates do not represent a linear distance from an origin as plotted on a plane. Rather, these spherical coordinates describe angular coordinates on a globe.
Spherical coordinates specify a point by the angle of rotation from a reference meridian (longitude), and the angle from the equator (latitude).
You can treat geographic coordinates as approximate Cartesian coordinates and continue to do spatial calculations. However, measurements of distance, length and area will be nonsensical. Since spherical coordinates measure angular distance, the units are in degrees.
Pinot supports both geometry and geography types, which can be constructed with the corresponding constructor functions. For geography types, measurement functions such as ST_Distance and ST_Area calculate the spherical distance and area on Earth, respectively.
Geospatial functions
For manipulating geospatial data, Pinot provides a set of functions for analyzing geometric components, determining spatial relationships, and manipulating geometries. In particular, geospatial functions that begin with the ST_ prefix support the SQL/MM specification.
The following geospatial functions are available out of the box in Pinot:
Aggregations
ST_Union(geometry[] g1_array) → Geometry: This aggregate function returns a MULTI geometry or NON-MULTI geometry from a set of geometries. It ignores NULL geometries.
ST_Area(Geometry/Geography g) → double For geometry type, it returns the 2D Euclidean area of a geometry. For geography, returns the area of a polygon or multi-polygon in square meters using a spherical model for Earth.
ST_Distance(Geometry/Geography g1, Geometry/Geography g2) → double For geometry type, returns the 2-dimensional Cartesian minimum distance (based on spatial ref) between two geometries in projected units. For geography, returns the great-circle distance in meters between two SphericalGeography points. Note that g1 and g2 must have the same type.
ST_Contains(Geometry/Geography, Geometry/Geography) → boolean Returns true if and only if no points of the second geometry/geography lie in the exterior of the first geometry/geography, and at least one point of the interior of the first geometry lies in the interior of the second geometry. Warning: ST_Contains on Geography only gives a close approximation.
ST_Equals(Geometry, Geometry) → boolean Returns true if the given geometries represent the same geometry/geography.
ST_Within(Geometry, Geometry) → boolean Returns true if the first geometry is completely inside the second geometry.
Geospatial index
Geospatial functions are typically expensive to evaluate, and using geoindex can greatly accelerate the query evaluation. Geoindexing in Pinot is based on Uber’s H3, a hexagon-based hierarchical gridding.
A given geospatial location (longitude, latitude) maps to one hexagon (represented as an H3Index), and its neighbors in H3 can be approximated by a ring of hexagons. To quickly identify the distance between any two geospatial locations, we can convert the two locations to their H3 indexes and then check the H3 distance between them. H3 distance is measured as the number of hexagons.
For example, in the diagram below, the red hexagons are within a distance of 1 from the central hexagon. The size of the hexagons is determined by the resolution of the indexing. Check this table for the levels of resolution and the corresponding precision (measured in km).
Hexagonal grid in H3
How to use geoindex
To use the geoindex, first declare the geolocation field as bytes in the schema, as in the QuickStart example.
Note the use of a transformFunction that converts the created point into SphericalGeography format, which is needed by the ST_Distance function.
Next, declare the H3 index in the table configuration. It is recommended to do this using the indexes section:
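A sketch of the recommended form, using the unified indexes block inside fieldConfigList (the resolutions shown are illustrative):
"fieldConfigList": [
  {
    "name": "location_st_point",
    "encodingType": "RAW",
    "indexes": {
      "h3": {
        "resolutions": [13, 5, 6]
      }
    }
  }
]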
Alternatively, the older way to configure H3 indexes is still supported:
The query below will use the geoindex to filter the Starbucks stores within 5km of the given point in the Bay Area.
How geoindex works
The Pinot geoindex accelerates query evaluation while maintaining accuracy. Currently, geoindex supports the ST_Distance function in the WHERE clause.
At a high level, the geoindex is used to retrieve the records within the nearby hexagons of the given location, and then ST_Distance is used to accurately filter the matched results.
Geoindex example
As in the example diagram above, if we want to find all relevant points within a given distance around San Francisco (area within the red circle), then the algorithm with geoindex will:
First, find the H3 distance x that covers the query range (the red circle).
Then, for the points within that H3 distance (those covered by hexagons completely inside kRing(x)), directly accept those points without filtering.
Finally, for the points contained in the hexagons of kRing(x) that intersect the edge of the red circle, filter them by evaluating the condition ST_Distance(loc1, loc2) < x to keep only those that are within the circle.
Understand how the components of Apache Pinot™ work together to create a scalable OLAP database that can deliver low-latency, high-concurrency queries at scale.
Apache Pinot™ is a distributed OLAP database designed to serve real-time, user-facing use cases, which means handling large volumes of data and many concurrent queries with very low query latencies. Pinot supports the following requirements:
Ultra low-latency queries (as low as 10ms P95)
High query concurrency (as many as 100,000 queries per second)
High data freshness (streaming data available for query immediately upon ingestion)
Large data volume (up to petabytes)
Distributed design principles
To accommodate large data volumes with stringent latency and concurrency requirements, Pinot is designed as a distributed database that supports the following requirements:
Highly available: Pinot has no single point of failure. When tables are configured for replication and a node goes down, the cluster is able to continue processing queries.
Horizontally scalable: Operators can scale a Pinot cluster by adding new nodes when the workload increases. There are even two node types (servers and brokers) to scale query volume, query complexity, and data size independently.
Immutable data
Core components
As described in the Pinot architecture, Pinot has four node types: controller, broker, server, and minion.
Apache Helix and ZooKeeper
Distributed systems do not maintain themselves, and in fact require sophisticated scheduling and resource management to function. Pinot uses Apache Helix for this purpose. Helix exists as an independent project, but it was designed by the original creators of Pinot for Pinot's own cluster management purposes, so the architectures of the two systems are well-aligned. Helix takes the form of a process on the controller, plus embedded agents on the brokers and servers. It uses Apache ZooKeeper as a fault-tolerant, strongly consistent, durable state store.
Helix maintains a picture of the intended state of the cluster, including the number of servers and brokers, the configuration and schema of all tables, connections to streaming ingest sources, currently executing batch ingestion jobs, the assignment of table segments to the servers in the cluster, and more. All of these configuration items are potentially mutable quantities, since operators routinely change table schemas, add or remove streaming ingest sources, begin new batch ingestion jobs, and so on. Additionally, physical cluster state may change as servers and brokers fail or suffer network partition. Helix works constantly to drive the actual state of the cluster to match the intended state, pushing configuration changes to brokers and servers as needed.
There are three physical node types in a Helix cluster:
Participant: These nodes do things, like store data or perform computation. Participants host resources, which are Helix's fundamental storage abstraction. Because Pinot servers store segment data, they are participants.
Spectator: These nodes see things, observing the evolving state of the participants through events pushed to the spectator. Because Pinot brokers need to know which servers host which segments, they are spectators.
In addition, Helix defines two logical components to express its storage abstraction:
Partition. A unit of data storage that lives on at least one participant. Partitions may be replicated across multiple participants. A Pinot segment is a partition.
Resource. A logical collection of partitions, providing a single view over a potentially large set of data stored across a distributed system. A Pinot table is a resource.
In summary, the Pinot architecture maps onto Helix components as follows:
Pinot Component
Helix Component
Helix uses ZooKeeper to maintain cluster state. ZooKeeper sends Helix spectators notifications of changes in cluster state (which correspond to changes in ZNodes). ZooKeeper stores the following information about the cluster:
Resource
Stored Properties
Because ZooKeeper is a first-class citizen of a Pinot cluster, you may use its well-known ZNode structure for operations and troubleshooting purposes. Be advised that this structure can change in future Pinot releases.
Controller
The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time and offline data). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.
Fault tolerance
Only one controller can be active at a time, so when multiple controllers are present in a cluster, they elect a leader. When that controller instance becomes unavailable, the remaining instances automatically elect a new leader. Leader election is achieved using Apache Helix. A Pinot cluster can serve queries without an active controller, but it can't perform any metadata-modifying operations, like adding a table or consuming a new segment.
Controller REST interface
The controller provides a REST interface that allows read and write access to all logical storage resources (e.g., servers, brokers, tables, and segments). See for more information on the web-based admin tool.
Broker
The broker's responsibility is to route queries to the appropriate server instances, or in the case of multi-stage queries, to compute a complete query plan and distribute it to the servers required to execute it. The broker collects and merges the responses from all servers into a final result, then sends the result back to the requesting client. The broker exposes an HTTP endpoint that accepts SQL queries in JSON format and returns the response in JSON.
Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured strategy for a table. The default strategy is to balance the query load across all available servers.
Advanced routing strategies are available, such as replica-aware routing, partition-based routing, and minimal server selection routing.
Query processing
Every query processed by a broker uses either the single-stage engine or the multi-stage engine. For single-stage queries, the broker does the following:
Computes query routes based on the routing strategy defined in the configuration.
Computes the list of segments to query on each server (see the segment routing discussion for further details on this process).
Sends the query to each of those servers for local execution against their segments.
For multi-stage queries, the broker performs the following:
Computes a query plan that runs on multiple sets of servers. The servers selected for the first stage are selected based on the segments required to execute the query, which are determined in a process similar to single-stage queries.
Sends the relevant portions of the query plan to one or more servers in the cluster for each stage of the query plan.
The servers that received query plans each execute their part of the query. For more details on this process, read about the multi-stage query engine.
Server
Servers host segments on locally attached storage and process queries on those segments. By convention, operators speak of "real-time" and "offline" servers, although there is no difference in the server process itself, or even its configuration, that distinguishes between the two. This is merely a convention reflected in the segment assignment strategy to confine the two different kinds of workloads to two groups of physical instances, since the performance-limiting factors differ between the two kinds of workloads. For example, offline servers might optimize for larger storage capacity, whereas real-time servers might optimize for memory and CPU cores.
Offline servers
Offline servers host segments created by ingesting batch data. The controller assigns these segments to the offline servers according to the table's replication factor and segment assignment strategy. Typically, the controller writes new segments to the deep store, and affected servers download the segment from the deep store. The controller then notifies brokers that a new segment exists and is available to participate in queries.
Because offline tables tend to have long retention periods, offline servers tend to scale based on the size of the data they store.
Real-time servers
Real-time servers ingest data from streaming sources, like Apache Kafka®, Apache Pulsar®, or AWS Kinesis. Streaming data ends up in conventional segment files just like batch data, but is first accumulated in an in-memory data structure known as a consuming segment. Each message consumed from a streaming source is written immediately to the relevant consuming segment, and is available for query processing from the consuming segment immediately, since consuming segments participate in query processing as first-class citizens. Consuming segments get flushed to disk periodically based on a completion threshold, which can be calculated by row count, ingestion time, or segment size. A flushed segment on a real-time table is called a completed segment, and is functionally equivalent to a segment created during offline ingest.
Real-time servers tend to be scaled based on the rate at which they ingest streaming data.
Minion
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function without minions, they are typically present to support routine tasks like ingesting batch data.
Data ingestion overview
Pinot tables exist in two varieties: offline (or batch) and real-time. Offline tables contain data from batch sources like CSV, Avro, or Parquet files, and real-time tables contain data from streaming sources like Apache Kafka®, Apache Pulsar®, or AWS Kinesis.
Offline (batch) ingest
Pinot ingests batch data using an ingestion job, which follows a process like this:
The job transforms a raw data source (such as a CSV file) into segments. This is a potentially complex process resulting in a file that is typically several hundred megabytes in size.
The job then transfers the segment file to the cluster's deep store and notifies the controller that a new segment exists.
The controller (in its capacity as a Helix controller) updates the ideal state of the cluster in its cluster metadata map.
Real-time ingest
Ingestion is established at the time a real-time table is created, and continues as long as the table exists. When the controller receives the metadata update to create a new real-time table, the table configuration specifies the source of the streaming input data—often a topic in a Kafka cluster. This kicks off a process like this:
The controller picks one or more servers to act as direct consumers of the streaming input source.
The controller creates consuming segments for the new table. It does this by creating an entry in the global metadata map for a new consuming segment for each of the real-time servers selected in step 1.
Through Helix functionality on the controller and the relevant servers, the servers proceed to create consuming segments in memory and establish a connection to the streaming input source. When this input source is Kafka, each server acts as a Kafka consumer directly, with no other components involved in the integration.
Ingestion FAQ
This page has a collection of frequently asked questions about ingestion with answers from the community.
This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.
Data processing
What is a good segment size?
While Apache Pinot can work with segments of various sizes, for optimal use of Pinot, you want to get your segments sized in the 100MB to 500MB (un-tarred/uncompressed) range. Having too many (thousands or more) tiny segments for a single table creates overhead in terms of the metadata storage in Zookeeper as well as in the Pinot servers' heap. At the same time, having too few really large (GBs) segments reduces parallelism of query execution, as on the server side, the thread parallelism of query execution is at segment level.
Can multiple Pinot tables consume from the same Kafka topic?
Yes. Each table can be independently configured to consume from any given Kafka topic, regardless of whether there are other tables that are also consuming from the same Kafka topic.
If I add a partition to a Kafka topic, will Pinot automatically ingest data from this partition?
Pinot automatically detects new partitions in Kafka topics. It checks for new partitions whenever the RealtimeSegmentValidationManager periodic job runs and starts consumers for new partitions.
You can configure the interval for this job using the controller.realtime.segment.validation.frequencyPeriod property in the controller configuration.
Does Pinot support partition pruning on multiple partition columns?
Pinot supports multi-column partitioning for offline tables. Map multiple columns under tableIndexConfig.segmentPartitionConfig.columnPartitionMap. Pinot assigns the input data to each partition according to the partition configuration individually for each column.
The following example partitions the segment based on two columns, memberID and caseNumber. Note that each partition column is handled separately, so in this case the segment is partitioned on memberID (partition ID 1) and also partitioned on caseNumber (partition ID 2).
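A sketch of what the corresponding segmentPartitionConfig could look like (the partition counts are illustrative):
"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "memberID": {
        "functionName": "Murmur",
        "numPartitions": 2
      },
      "caseNumber": {
        "functionName": "Murmur",
        "numPartitions": 4
      }
    }
  }
}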
For multi-column partitioning to work, you must also set routing.segmentPrunerTypes as follows:
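A sketch of the routing section:
"routing": {
  "segmentPrunerTypes": ["partition"]
}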
How do I enable partitioning in Pinot when using Kafka stream?
Set up partitioner in the Kafka producer:
The partitioning logic in the stream should match the partitioning config in Pinot. Kafka uses murmur2, and the equivalent in Pinot is the Murmur function.
Set the partitioning configuration using the same column used in Kafka (see the segmentPartitionConfig example in the previous answer), and also set the partition-based segment pruner in the routing section.
To learn how partition works, see .
How do I store BYTES column in JSON data?
For JSON, you can use a hex encoded string to ingest BYTES.
How do I flatten my JSON Kafka stream?
See the function that can store a top-level JSON field as a STRING in Pinot.
Then you can use JSON functions at query time to extract fields from the JSON string.
NOTE
This works well if some of your fields are nested JSON, but most of your fields are top-level JSON keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as one column, which is not ideal.
How do I escape Unicode in my Job Spec YAML file?
To use explicit code points, you must double-quote (not single-quote) the string, and escape the code point via "\uHHHH", where HHHH is the four digit hex code for the character. See for more details.
Is there a limit on the maximum length of a string column in Pinot?
By default, Pinot limits the length of a String column to 512 bytes. If you want to overwrite this value, you can set the maxLength attribute in the schema as follows:
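For example, a field spec with an increased limit could look like this (the column name is illustrative):
{
  "name": "textColumn",
  "dataType": "STRING",
  "maxLength": 1000
}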
When are new events queryable when getting ingested into a real-time table?
Events are available to queries as soon as they are ingested. This is because events are instantly indexed in memory upon ingestion.
The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.
However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all its replicas are guaranteed to be consistent.
How to reset a CONSUMING segment stuck on an offset which has expired from the stream?
This typically happens if:
The consumer is lagging a lot.
The consumer was down (server down, cluster down), and the stream moved on, resulting in the offset not being found when the consumer comes back up.
In case of Kafka, to recover, set property "auto.offset.reset":"earliest" in the streamConfigs section and reset the CONSUMING segment. See for more details about the configuration.
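For example, within the table's streamConfigs (depending on your Pinot and Kafka plugin versions, the property may need the stream.kafka.consumer.prop. prefix):
"streamConfigs": {
  "auto.offset.reset": "earliest"
}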
You can also use the "Resume Consumption" endpoint with the "resumeFrom" parameter set to "smallest" (or "largest" if you want). See for more details.
Indexing
How to set inverted indexes?
Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. For more info on table configuration, see . For an example showing how to configure an inverted index, see .
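A minimal sketch of the relevant table config section (column names are illustrative):
"tableIndexConfig": {
  "invertedIndexColumns": ["purchasedProductId", "userId"]
}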
Applying inverted indexes to a table configuration will generate an inverted index for all new segments. To apply the inverted indexes to all existing segments, see the next question.
How to apply an inverted index to existing segments?
Add the columns you want to index to the tableIndexConfig-> invertedIndexColumns list. To update the table configuration use the Pinot Swagger API: .
Invoke the reload API: .
Once you've done that, you can check whether the index has been applied by querying the segment metadata API at . Don't forget to include the names of the columns on which you have applied the index.
The output from this API should look something like the following:
Can I retrospectively add an index to any segment?
Not all indexes can be retrospectively applied to existing segments.
If you want to add or change the sorted index column, or adjust the dictionary encoding of the default forward index, you will need to manually re-load any existing segments.
How to create star-tree indexes?
Star-tree indexes are configured in the table config under the tableIndexConfig -> starTreeIndexConfigs (list) and enableDefaultStarTree (boolean). See here for more about how to configure star-tree indexes:
The new segments will have star-tree indexes generated after applying the star-tree index configurations to the table configuration.
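A sketch of what such a configuration could look like (dimension and metric names are illustrative):
"tableIndexConfig": {
  "enableDefaultStarTree": false,
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "browser"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__impressions"],
      "maxLeafRecords": 10000
    }
  ]
}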
Handling time in Pinot
How does Pinot’s real-time ingestion handle out-of-order events?
Pinot does not require ordering of event timestamps. Out-of-order events are still consumed and indexed into the "currently consuming" segment. In a pathological case, if you have a two-day-old event come in "now", it will still be stored in the segment that is open for consumption "now". There is no strict time-based partitioning for segments, but star-tree indexes and hybrid tables will handle this as appropriate.
See the hybrid table documentation for more details about how hybrid tables handle this. Specifically, the time boundary is computed as max(OfflineTime) - 1 unit of granularity. Pinot does store the min-max time for each segment and uses it for pruning segments, so segments with multiple time intervals may not be perfectly pruned.
When generating star-tree indexes, the time column will be part of the star-tree, so the tree can still be efficiently queried for segments with multiple time intervals.
Why does a hybrid table use an offset, instead of max(OfflineTime), to determine the time boundary?
This lets you have a late event come in without building complex offline pipelines that perfectly partition your events by event timestamps. With this offset, even if your offline data pipeline produces segments with a maximum timestamp, Pinot will not use the offline dataset for that last chunk of segments. The expectation is that when you process the next time range of data offline, your data pipeline will include any late events.
Why are segments not strictly time-partitioned?
It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time intervals for the performance of hybrid tables and time-filtered data.
When generating offline segments, the segments are generated such that each segment only contains one time interval and is well partitioned by the time column.
Batch import example
Step-by-step guide for pushing your own data into the Pinot cluster
This example assumes you have set up your cluster using .
Preparing your data
Let's gather our data files and put them in pinot-quick-start/rawdata.
Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, ORC. If you don't have sample data, you can use this sample CSV.
Complex Type Examples
Additional examples that demonstrate handling of complex types.
Unnest Root Level Collection
In this example, we look at unnesting JSON records that are batched together under a single key at the root level. We make use of the complex-type handling configs to persist the individual student records as separate rows in Pinot.
0.6.0
This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals
Summary
This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals in GROUP BY and ORDER BY clause, array transform functions, adding push job type of segment metadata only mode, and some new APIs like updating instance tags, new health check endpoint. It also contains many key bug fixes. See details below.
The release was cut from the following commit:
and the following cherry-picks:
{
"fieldConfigList": [{
"name": "location_st_point",
"encodingType":"RAW", // this actually disables the dictionary
"indexTypes":["H3"],
"properties": {
"resolutions": "13, 5, 6" // Here resolutions must be a string with ints separated by commas
}
}],
...
}
SELECT address, ST_DISTANCE(location_st_point, ST_Point(-122, 37, 1))
FROM starbucksStores
WHERE ST_DISTANCE(location_st_point, ST_Point(-122, 37, 1)) < 5000
limit 1000
Additional Pulsar message metadata mappings (continuing the table above):
__metadata$publishTime (String): publish time as determined by the producer. Available by default.
brokerPublishTime (Optional) → __metadata$brokerPublishTime (String): publish time as determined by the broker. Available by default.
eventTime (Long) → __metadata$eventTime (String). Available by default.
messageId (MessageId, as String) → __metadata$messageId (String): string representation of the MessageId field, in the format ledgerId:entryId:partitionIndex.
messageId (MessageId, as bytes) → __metadata$messageBytes (String): Base64-encoded version of the bytes returned from calling MessageId.toByteArray().
Immutable data: Pinot assumes all stored data is immutable, which helps simplify the parts of the system that handle data storage and replication. However, Pinot still supports upserts on streaming entity data and background purges of data to comply with data privacy regulations.
Dynamic configuration changes: Operations like adding new tables, expanding a cluster, ingesting data, modifying an existing table, and adding indexes do not impact query availability or performance.
Controller: This node observes and manages the state of participant nodes. The controller is responsible for coordinating all state transitions in the cluster and ensures that state constraints are satisfied while maintaining cluster stability.
Receives the results from each server and merges them.
Sends the query result to the client.
The broker receives a complete result set from the final stage of the query, which is always a single server.
The broker sends the query result to the client.
The controller then assigns the segment to one or more "offline" servers (depending on replication factor) and notifies them that new segments are available.
The servers then download the newly created segments directly from the deep store.
The cluster's brokers, which watch for state changes as Helix spectators, detect the new segments and update their segment routing tables accordingly. The cluster is now able to query the new offline segments.
Through Helix functionality on the controller and all of the cluster's brokers, the brokers become aware of the consuming segments, and begin including them in query routing immediately.
The consuming servers simultaneously begin consuming messages from the streaming input source, storing them in the consuming segment.
When a server decides its consuming segment is complete, it commits the in-memory consuming segment to a conventional segment file, uploads it to the deep store, and notifies the controller.
The controller and the server create a new consuming segment to continue real-time ingestion.
The controller marks the newly committed segment as online. Brokers then discover the new segment through the Helix notification mechanism, allowing them to route queries to it in the usual fashion.
Pinot component → Helix component:
Segment → Helix Partition
Table → Helix Resource
Controller → Helix Controller, or Helix agent that drives the overall state of the cluster
Server → Helix Participant
Broker → A Helix Spectator that observes the cluster for changes in the state of segments and servers. To support multi-tenancy, brokers are also modeled as Helix Participants.
Minion → Helix Participant that performs computation rather than storing data
ZooKeeper stores the following properties for each resource:
Controller: the controller that is assigned as the current leader.
Servers and Brokers: the list of servers and brokers, the configuration of all current servers and brokers, and the health status of all current servers and brokers.
Tables: the list of tables, table configurations, table schema, and the list of each table's segments.
Segment: the exact server locations of a segment and the state of each segment (online/offline/error/consuming).
Schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.
Columns are categorized into three types:
Dimensions: Typically used in filters and group by, for slicing and dicing into data.
Metrics: Typically used in aggregations; represents the quantitative data.
Time: Optional column; represents the timestamp associated with each row.
In our example transcript schema, the studentID, firstName, lastName, gender, and subject columns are the dimensions, the score column is the metric, and timestampInEpoch is the time column.
Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the following reference.
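Based on the columns above, the transcript schema could look roughly like this (a sketch; the exact quickstart file may differ slightly):
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "STRING"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}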
Creating a table configuration
A table configuration is used to define the configuration related to the Pinot table. A detailed overview of the table can be found in Table.
Here's the table configuration for the sample CSV file. You can use this as a reference to build your own table configuration. Edit the tableName and schemaName.
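A minimal sketch of such an offline table configuration (assuming the transcript table and schema names; adjust replication and tenants to your cluster):
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "metadata": {}
}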
Uploading your table configuration and schema
Review the directory structure so far.
Upload the table configuration using the following command.
Use the Rest API that is running on your Pinot instance to review the table configuration and schema and make sure it was successfully uploaded. This link uses localhost as an example.
Creating a segment
Pinot table data is stored as Pinot segments. A detailed overview of segments can be found in Segment.
To generate a segment, first create a job specification (JobSpec) YAML file. A JobSpec YAML file contains all the information regarding data format, input data location, and Pinot cluster coordinates. Copy the following job specification file (example from the Pinot quickstart). If you're using your own data, be sure to do the following:
Replace transcript with your table name
Set the correct recordReaderSpec
Depending if you're using Docker or a launcher script, choose one of the following commands to generate a segment to upload to Pinot:
Here is some sample output.
Querying your data
If everything worked, find your table in the Query Console to run queries against it.
Allow modifying/removing existing star-trees during segment reload ()
Implement off-heap bloom filter reader ()
Support for multi-threaded Group By reducer for SQL. ()
Add OnHeapGuavaBloomFilterReader ()
Support using ordinals in GROUP BY and ORDER BY clause ()
Merge common APIs for Dictionary ()
Add table level lock for segment upload ([#6165])
Added recursive functions validation check for group by ()
Add StrictReplicaGroupInstanceSelector ()
Add IN_SUBQUERY support ()
Add IN_PARTITIONED_SUBQUERY support ()
Some UI features (, , , )
Special notes
Brokers should be upgraded before servers in order to maintain backward compatibility:
Change group key delimiter from '\t' to '\0' (#5858)
Support for exact distinct count for non int data types (#5872)
Pinot Components have to be deployed in the following order:
(PinotServiceManager -> Bootstrap services in role ServiceRole.CONTROLLER -> All remaining bootstrap services in parallel)
Starts Broker and Server in parallel when using ServiceManager ()
New settings introduced and old ones deprecated:
This aggregation function is still in beta version. This PR involves change on the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version:
$ ls /tmp/pinot-quick-start
rawdata transcript-schema.json transcript-table-offline.json
$ ls /tmp/pinot-quick-start/rawdata
transcript.csv
SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**\/*.csv
inputDirURI: /tmp/pinot-quick-start/rawdata/
jobType: SegmentCreationAndTarPush
outputDirURI: /tmp/pinot-quick-start/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader,
configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig,
configs: null, dataFormat: csv}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/transcript/schema', tableConfigURI: 'http://localhost:9000/tables/transcript',
tableName: transcript}
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed bytes value dictionary for column: studentID, size: 9
Created dictionary for STRING column: studentID with cardinality: 3, max length in bytes: 3, range: 200 to 202
Using fixed bytes value dictionary for column: firstName, size: 12
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Using fixed bytes value dictionary for column: lastName, size: 15
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Using fixed bytes value dictionary for column: gender, size: 12
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Using fixed bytes value dictionary for column: subject, size: 21
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Created dictionary for LONG column: timestampInEpoch with cardinality: 4, range: 1570863600000 to 1572418800000
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to v3 format
v3 segment location for segment: transcript_OFFLINE_1570863600000_1572418800000_0 is /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3
Deleting files in v1 segment directory: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]
Generated 3 star-tree records from 4 segment records
Finished constructing star-tree, got 9 tree nodes and 4 records under star-node
Finished creating aggregated documents, got 6 aggregated records
Finished building star-tree in 10ms
Finished building 1 star-trees in 27ms
Computed crc = 3454627653, based on files [/var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/columns.psf, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/index_map, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/metadata.properties, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index_map]
Driver, record read time : 0
Driver, stats collector time : 0
Driver, indexing time : 0
Tarring segment from: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz
Size for segment: transcript_OFFLINE_1570863600000_1572418800000_0, uncompressed: 6.73KB, compressed: 1.89KB
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/tmp/pinot-quick-start/segments/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz]... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@243c4f91] for table transcript
Pushing segment: transcript_OFFLINE_1570863600000_1572418800000_0 to location: http://localhost:9000 for table transcript
Sending request: http://localhost:9000/v2/segments?tableName=transcript to controller: nehas-mbp.hsd1.ca.comcast.net, version: Unknown
Response for pushing table transcript segment transcript_OFFLINE_1570863600000_1572418800000_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: transcript_OFFLINE_1570863600000_1572418800000_0 of table: transcript"}
How much heap should I allocate for my Pinot instances?
Typically, Apache Pinot components try to use as much off-heap (MMAP/DirectMemory) wherever possible. For example, Pinot servers load segments in memory-mapped files in MMAP mode (recommended), or direct memory in HEAP mode. Heap memory is used mostly for query execution and storing some metadata. We have seen production deployments with high throughput and low-latency work well with just 16 GB of heap for Pinot servers and brokers. The Pinot controller may also cache some metadata (table configurations etc) in heap, so if there are just a few tables in the Pinot cluster, a few GB of heap should suffice.
DR
Does Pinot provide any backup/restore mechanism?
Pinot relies on deep-storage for storing a backup copy of segments (offline as well as real-time). It relies on Zookeeper to store metadata (table configurations, schema, cluster state, and so on). It does not explicitly provide tools to take backups or restore these data, but relies on the deep-storage (ADLS/S3/GCP/etc), and ZK to persist these data/metadata.
Alter Table
Can I change a column name in my table, without losing data?
Changing a column name or data type is considered a backward-incompatible change. While Pinot does support schema evolution for backward-compatible changes, it does not support backward-incompatible changes like changing the name or data type of a column.
How to change number of replicas of a table?
You can change the number of replicas by updating the table configuration's segmentsConfig section. Make sure you have at least as many servers as the replication.
Note that if you are using replica groups, it's expected that this replication setting equals numReplicaGroups. If they do not match, Pinot will use numReplicaGroups.
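For example (the value is illustrative):
"segmentsConfig": {
  "replication": "3"
}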
How to set or change table retention?
By default, there is no retention set for a table in Apache Pinot. You may, however, set retention by setting the following properties in the segmentsConfig section inside the table config (see the example after the list):
retentionTimeUnit
retentionTimeValue
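A sketch of the retention settings (values are illustrative):
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30"
}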
Updating the retention value in the table config should be good enough; there is no need to rebalance the table or reload its segments.
Why does my real-time table not use the new nodes I added to the cluster?
Likely explanation: num partitions * num replicas < num servers.
In real-time tables, segments of the same partition always remain on the same node. This sticky assignment is needed for replica groups and is critical if using upserts. For instance, if you have 3 partitions, 1 replica, and 4 nodes, only 3 of the 4 nodes will be used: all of p0's segments will be on one node, all of p1's on another, and all of p2's on another. One server will be unused, and will remain unused through rebalances.
There's nothing we can do about CONSUMING segments; they will continue to use only 3 nodes if you have 3 partitions. But we can rebalance such that completed segments use all nodes. If you want to force the completed segments of the table to use the new server, use this config:
Segments
How to control the number of segments generated?
The number of segments generated depends on the number of input files. If you provide only one input file, you will get one segment. If you break up the input file into multiple files, you will get as many segments as there are input files.
What are the common reasons my segment is in a BAD state?
This typically happens when the server is unable to load the segment. Possible causes: out of memory, no disk space, inability to download the segment from the deep store, and other similar errors. Check the server logs for more information.
How to reset a segment when it runs into a BAD state?
Use the segment reset controller REST API to reset the segment:
What's the difference between Reset, Refresh, and Reload?
Reset: Gets a segment in ERROR state back to ONLINE or CONSUMING state. Behind the scenes, the Pinot controller takes the segment to the OFFLINE state, waits for External View to stabilize, and then moves it back to ONLINE or CONSUMING state, thus effectively resetting segments or consumers in error states.
Refresh: Replaces the segment with a new one, with the same name but often different data. Under the hood, the Pinot controller sets new segment metadata in ZooKeeper and notifies brokers and servers to check their local states about this segment and update accordingly. Servers also download the new segment to replace the old one when the two have different checksums. There is no separate REST API for refreshing; it is done as part of the SegmentUpload API.
Reload: Loads the segment again, often to generate a new index as updated in the table configuration. Under the hood, the Pinot server gets the new table configuration from ZooKeeper and uses it to guide the segment reloading. In fact, the last step of REFRESH as explained above is to load the segment into memory to serve queries. There is a dedicated REST API for reloading. By default, it doesn't download segments, but an option is provided to force the server to download the segment to replace the local one cleanly.
In addition, RESET brings the segment OFFLINE temporarily; while REFRESH and RELOAD swap the segment on server atomically without bringing down the segment or affecting ongoing queries.
Tenants
How can I make brokers/servers join the cluster without the DefaultTenant tag?
Set this property in your controller.conf file:
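A sketch of the controller.conf entry (this is the tenant isolation flag; verify the exact property name against your Pinot version):
cluster.tenant.isolation.enable=false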
Now your brokers and servers should join the cluster as broker_untagged and server_untagged. You can then directly use the POST /tenants API to create the tenants you want, as in the following:
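For example, a broker tenant creation payload could look like this (the tenant name and instance count are illustrative):
{
  "tenantRole": "BROKER",
  "tenantName": "sampleBrokerTenant",
  "numberOfInstances": 3
}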
Minion
How do I tune minion task timeout and parallelism on each worker?
There are two task configurations, but they are set as part of cluster configurations, like in the following example. One controls the task's overall timeout (1hr by default) and one sets how many tasks to run on a single minion worker (1 by default). The <taskType> is the task to tune, such as MergeRollupTask or RealtimeToOfflineSegmentsTask etc.
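A sketch of such cluster configs, assuming the property name patterns <taskType>.timeoutMs and <taskType>.numConcurrentTasksPerInstance (values are illustrative):
{
  "RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
  "RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
}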
Yes, replica groups work for real-time tables. There are two parts to enabling replica groups:
Replica groups segment assignment.
Replica group query routing.
Replica group segment assignment
Replica group segment assignment is achieved in real-time tables if the number of servers is a multiple of the number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups.
For example, consider we have 6 partitions, 2 replicas, and 4 servers. The assignment looks like this (partition: replica r1, replica r2):
p1: S0, S1
p2: S2, S3
p3: S0, S1
p4: S2, S3
As you can see, the set (S0, S2) contains r1 of every partition, and (S1, S3) contains r2 of every partition. The query will only be routed to one of the sets, and not span every server.
If you are adding or removing servers from an existing table setup, you have to run a rebalance for segment assignment changes to take effect.
Replica group query routing
Once replica group segment assignment is in effect, query routing can take advantage of it. For replica-group-based query routing, set the following in the table config's routing section, and then restart the brokers:
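A sketch of the routing section:
"routing": {
  "instanceSelectorType": "replicaGroup"
}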
Overwrite index configs at tier level
When using tiered storage, you may want to use different encoding and indexing types for a column in different tiers to balance query latency and cost savings more flexibly. For example, segments in the hot tier can use dictionary encoding, bloom filters, and all kinds of relevant index types for very fast query execution. But for segments in the cold tier, where cost savings matter more than low query latency, you may want to use raw values and bloom filters only.
The following two examples show how to overwrite encoding type and index configs for tiers. Similar changes are also demonstrated in the MultiDirQuickStart example.
Overwriting single-column index configs using fieldConfigList. All top level fields in FieldConfig class can be overwritten, and fields not overwritten are kept intact.
Overwriting star-tree index configurations using tableIndexConfig. The StarTreeIndexConfigs is overwritten as a whole. In fact, all top level fields defined in IndexingConfig class can be overwritten, so single-column index configs defined in tableIndexConfig can also be overwritten but it's less clear than using fieldConfigList.
Credential
How do I update credentials for real-time upstream without downtime?
Explore the table component in Apache Pinot, a fundamental building block for organizing and managing data in Pinot clusters, enabling effective data processing and analysis.
Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Every row has the same columns, whose names and data types are defined in the table's schema.
Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.
Pinot table types include:
real-time: Ingests data from a streaming source like Apache Kafka®
offline: Loads data from a batch source
hybrid: Loads data from both a batch source and a streaming source
Pinot breaks a table into multiple segments and stores these segments in a deep store such as Hadoop Distributed File System (HDFS) as well as on Pinot servers.
In the Pinot cluster, a table is modeled as a Helix resource, and each segment of a table is modeled as a Helix partition.
Table naming in Pinot follows typical naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.
The user querying the database does not need to know the type of the table. They only need to specify the table name in the query.
For example, regardless of whether we have an offline table myTable_OFFLINE, a real-time table myTable_REALTIME, or a hybrid table containing both of these, the query simply references the table as myTable.
A table configuration is used to define the table properties, such as name, type, indexing, routing, and retention. It is written in JSON format and is stored in ZooKeeper, along with the table schema.
Use the following properties to make your tables faster or leaner:
Segment
Indexing
Tenants
Segments
A table is comprised of small chunks of data known as segments. Learn more about how Pinot creates and manages segments .
For offline tables, segments are built outside of Pinot and uploaded using a distributed executor such as Spark or Hadoop. For details, see .
For real-time tables, segments are built at specific intervals inside Pinot. You can tune the following for real-time segments.
Flush
The Pinot real-time consumer ingests the data, creates the segment, and then flushes the in-memory segment to disk. Pinot allows you to configure when to flush the segment in the following ways (a sample configuration follows the list):
Number of consumed rows: After consuming the specified number of rows from the stream, Pinot will persist the segment to disk.
Number of rows per segment: Pinot learns and then estimates the number of rows that need to be consumed. The learning phase starts by setting the number of rows to 100,000 (this value can be changed) and adjusts it to reach the appropriate segment size. Because Pinot corrects the estimate as it goes along, the segment size might go significantly over the correct size during the learning phase. You should set this value to optimize the performance of queries.
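These thresholds are typically set in the table's streamConfigs; a sketch (values are illustrative):
"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "0",           // 0 lets Pinot estimate the row count to reach the target segment size
  "realtime.segment.flush.threshold.segment.size": "200M",
  "realtime.segment.flush.threshold.time": "6h"
}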
Replicas
A segment can have multiple replicas to provide higher availability. You can configure the number of replicas for a table segment .
Completion Mode
By default, if the in-memory segment on the non-winner server is equivalent to the committed segment, then the non-winner server builds and replaces the segment. If the available segment is not equivalent to the committed segment, the server just downloads the committed segment from the controller.
However, in certain scenarios, the segment build can get very memory-intensive. In these cases, you might want to enforce the non-committer servers to just download the segment from the controller instead of building it again. You can do this by setting completionMode: "DOWNLOAD" in the table configuration.
For details, see .
Download Scheme
A Pinot server might fail to download segments from the deep store, such as HDFS, after segment completion. However, you can configure servers to download these segments from peer servers instead of the deep store. Currently, only HTTP and HTTPS download schemes are supported. More methods, such as gRPC/Thrift, are planned to be added in the future.
For more details about peer segment download during real-time ingestion, refer to this design doc on
Indexing
You can create multiple indices on a table to increase the performance of the queries. The following types of indices are supported:
Dictionary-encoded forward index with bit compression
Raw value forward index
For more details on each indexing mechanism and corresponding configurations, see .
Set up Bloom filters on columns to make queries faster. You can also keep segments in off-heap instead of on-heap memory for faster queries.
Pre-aggregation
Aggregate the real-time stream data as it is consumed to reduce segment sizes. We add the metric column values of all rows that have the same values for all dimension and time columns and create a single row in the segment. This feature is only available on REALTIME tables.
The only supported aggregation is SUM. The columns to pre-aggregate need to satisfy the following requirements:
All metrics should be listed in noDictionaryColumns.
No multi-value dimensions
All dimension columns are treated to have a dictionary, even if they appear as noDictionaryColumns in the config.
The following table config snippet shows an example of enabling pre-aggregation during real-time ingestion:
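A sketch of such a snippet (metric column names are illustrative):
"tableIndexConfig": {
  "noDictionaryColumns": ["metricColA", "metricColB"],
  "aggregateMetrics": true
}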
Tenants
Each table is associated with a tenant. A segment resides on the server, which has the same tenant as itself. For details, see .
Optionally, override if a table should move to a server with different tenant based on segment status. The example below adds a tagOverrideConfig under the tenants section for real-time tables to override tags for consuming and completed segments.
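A sketch of such a tenants section (tenant names are illustrative):
"tenants": {
  "broker": "brokerTenantName",
  "server": "serverTenantName",
  "tagOverrideConfig": {
    "realtimeConsuming": "serverTenantName_REALTIME",
    "realtimeCompleted": "serverTenantName_OFFLINE"
  }
}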
In the above example, the consuming segments will still be assigned to serverTenantName_REALTIME hosts, but once they are completed, the segments will be moved to serverTenantName_OFFLINE.
You can specify the full name of any tag in this section. For example, you could decide that completed segments for this table should be on Pinot servers tagged as allTables_COMPLETED. To learn more, see the section.
Hybrid table
A hybrid table is a table composed of two tables, one offline and one real-time, that share the same name. In a hybrid table, offline segments can be pushed periodically. The retention on the offline table can be set to a high value because segments are coming in on a periodic basis, whereas the retention on the real-time part can be small.
Once an offline segment is pushed to cover a recent time period, the brokers automatically switch to using the offline table for segments for that time period and use the real-time table only for data not available in the offline table.
To learn how time boundaries work for hybrid tables, see .
A typical use case for hybrid tables is pushing deduplicated, cleaned-up data into an offline table every day while consuming real-time data as it arrives. Data can remain in offline tables for as long as a few years, while the real-time data would be cleaned every few days.
Examples
Create a table config for your data, or see for all possible batch/streaming tables.
Prerequisites
Offline table creation
Sample console output
Check out the table config in the Rest API to make sure it was successfully uploaded.
Streaming table creation
Start Kafka
Create a Kafka topic
Create a streaming table
Sample output
Start Kafka-Zookeeper
Start Kafka
Check out the table config in the Rest API to make sure it was successfully uploaded.
Hybrid table creation
To create a hybrid table, you have to create the offline and real-time tables individually. You don't need to create a separate hybrid table.
0.7.1
This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations.
Summary
This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations and improvements.
It also adds several new APIs to better manage the segments and upload data to the offline table. It also contains many key bug fixes. See details below.
The release was cut from the following commit:
and the following cherry-picks:
Notable New Features
Add a server metric, queriesDisabled, to check whether queries are disabled. ()
Optimization on GroupKey to save the overhead of ser/de the group keys () ()
Support validation for jsonExtractKey
Special notes
Pinot controller metrics prefix is fixed to add a missing dot (). This is a backward-incompatible change; JMX queries on controller metrics must be updated.
Legacy group key delimiter (\t) was removed to be backward-compatible with release 0.5.0 ()
Upgrade zookeeper version to 3.5.8 to fix ZOOKEEPER-2184: Zookeeper Client should re-resolve hosts when connection attempts fail. ()
Major Bug fixes
Fix the SIGSEGV for large index ()
Handle creation of segments with 0 rows so segment creation does not fail if data source has 0 rows. ()
Fix QueryRunner tool for multiple runs ()
Running in Kubernetes
Pinot quick start in Kubernetes
Get started running Pinot in Kubernetes.
Note: The examples in this guide are sample configurations to be used as reference. For a production setup, you may want to customize them to your needs.
Ingest streaming data from Apache Kafka
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.
Learn how to ingest data from Kafka, a stream processing platform. You should have a local cluster up and running, following the instructions in .
Install and Launch Kafka
Let's start by downloading Kafka to our local machine.
To pull down the latest Docker image, run the following command:
Indexing
This page describes the indexing techniques available in Apache Pinot
Apache Pinot™ supports the following indexing techniques:
Input formats
This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.
Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent doing serialization-deserialization and speed up the ingestion.
Configuring input formats
To change the input format, adjust the recordReaderSpec config in the ingestion job specification.
Use the POST /cluster/configs API on the CLUSTER tab in Swagger, with this payload:
{
"<taskType>.timeoutMs": "600000",
"<taskType>.numConcurrentTasksPerInstance": "4"
}
Real Time Provisioning Helper tool improvement to take data characteristics as input instead of an actual segment (#6546)
Add the isolation level config isolation.level to Kafka consumer (2.0) to ingest transactionally committed messages only (#6580)
Enhance StarTreeIndexViewer to support multiple trees (#6569)
Improves ADLSGen2PinotFS with service principal based auth and auto-creation of the container on the initial run. It's backward compatible with key based auth. (#6531)
Add api for cluster manager to get table state (#6211)
Perf optimization for SQL GROUP BY ORDER BY (#6225)
Add support using environment variables in the format of ${VAR_NAME:DEFAULT_VALUE} in Pinot table configs. (#6271)
Add TLS-support for client-pinot and pinot-internode connections (#6418) Upgrades to a TLS-enabled cluster can be performed safely and without downtime. To achieve a live-upgrade, go through the following steps:
First, configure alternate ingress ports for https/netty-tls on brokers, controllers, and servers. Restart the components with a rolling strategy to avoid cluster downtime.
Second, verify manually that https access to controllers and brokers is live. Then, configure all components to prefer TLS-enabled connections (while still allowing unsecured access). Restart the individual components.
Third, disable insecure connections via configuration. You may also have to set controller.vip.protocol and controller.vip.port and update the configuration files of any ingestion jobs. Restart components a final time and verify that insecure ingress via http is not available anymore.
Apache Pinot has adopted SQL syntax and semantics. Legacy PQL (Pinot Query Language) is deprecated and no longer supported. Use SQL syntax to query Pinot on the broker endpoint /query/sql and the controller endpoint /sql.
Use URL encoding for the generated segment tar name to handle characters that cannot be parsed to URI. (#6571)
Fix a bug of miscounting the top nodes in StarTreeIndexViewer (#6569)
Fix the raw bytes column in real-time segment (#6574)
Fixes a bug to allow using JSON_MATCH predicate in SQL queries (#6535)
Fix the overflow issue when loading the large dictionary into the buffer (#6476)
The Pinot repository has pre-packaged Helm charts for Pinot and Presto. The Helm repository index file is here.
Note: Specify StorageClass based on your cloud vendor. Don't mount a blob store (such as AzureFile, GoogleCloudStorage, or S3) as the data serving file system. Use only Amazon EBS/GCP Persistent Disk/Azure Disk-style disks.
For AWS: "gp2"
For GCP: "pd-ssd" or "standard"
For Azure: "AzureDisk"
For Docker-Desktop: "hostpath"
1.1.1 Update Helm dependency
1.1.2 Start Pinot with Helm
Check Pinot deployment status
Load data into Pinot using Kafka
Bring up a Kafka cluster for real-time data ingestion
Check Kafka deployment status
Ensure the Kafka deployment is ready before executing the scripts in the following steps. Run the following command:
Below is an example output showing the deployment is ready:
Create Kafka topics
Run the scripts below to create two Kafka topics for data ingestion:
Load data into Kafka and create Pinot schema/tables
The script below does the following:
Ingests 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec
Ingests 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec
Uploads Pinot schema airlineStats
Creates Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime
Creates Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro
Query with the Pinot Data Explorer
Pinot Data Explorer
The following script (located at ./pinot/helm/pinot) performs local port forwarding, and opens the Pinot query console in your default web browser.
Query Pinot with Superset
Bring up Superset using Helm
Install the SuperSet Helm repository:
Get the Helm values configuration file:
For Superset to install Pinot dependencies, edit the /tmp/superset-values.yaml file to add a pinotdb pip dependency into the bootstrapScript field.
You can also build your own image with this dependency or use the image apachepinot/pinot-superset:latest instead.
Replace the default admin credentials inside the init section with a meaningful user profile and stronger password.
Install Superset using Helm:
Ensure your cluster is up by running:
Access the Superset UI
Run the below command to port forward Superset to your localhost:18088.
Navigate to Superset in your browser with the admin credentials you set in the previous section.
Create a new database connection with the following URI: pinot+http://pinot-broker.pinot-quickstart:8099/query?controller=http://pinot-controller.pinot-quickstart:9000/
Once the database is added, you can add more data sets and explore the dashboard options.
Access Pinot with Trino
Deploy Trino
Deploy Trino with the Pinot plugin installed:
See the charts in the Trino Helm chart repository:
In order to connect Trino to Pinot, you'll need to add the Pinot catalog, which requires extra configurations. Run the below command to get all the configurable values.
To add the Pinot catalog, edit the additionalCatalogs section by adding:
Pinot is deployed at namespace pinot-quickstart, so the controller serviceURL is pinot-controller.pinot-quickstart:9000
After modifying the /tmp/trino-values.yaml file, deploy Trino with:
Once you've deployed Trino, check the deployment status:
Query Pinot with the Trino CLI
Once Trino is deployed, run the below command to get a runnable Trino CLI.
Download the Trino CLI:
Port forward Trino service to your local if it's not already exposed:
Use the Trino console client to connect to the Trino service:
Query Pinot data using the Trino CLI, like in the sample queries below.
Sample queries to execute
List all catalogs
List all tables
Show schema
Count total documents
Access Pinot with Presto
Deploy Presto with the Pinot plugin
First, deploy Presto with default configurations:
To customize your deployment, run the below command to get all the configurable values.
After modifying the /tmp/presto-values.yaml file, deploy Presto:
Once you've deployed the Presto instance, check the deployment status:
Sample Output of K8s Deployment Status
Query Presto using the Presto CLI
Once Presto is deployed, you can run the below command from here, or follow the steps below.
Download the Presto CLI:
Port forward presto-coordinator port 8080 to localhost port 18080:
Start the Presto CLI with the Pinot catalog:
Query Pinot data with the Presto CLI, like in the sample queries below.
Sample queries to execute
List all catalogs
List all tables
Show schema
Count total documents
Delete a Pinot cluster in Kubernetes
To delete your Pinot cluster in Kubernetes, run the following command:
Note: The --network pinot-demo flag is optional and assumes that you have a Docker network named pinot-demo that you want to connect the Kafka container to.
We're going to generate some JSON messages from the terminal using the following script:
datagen.py
If you run this script (python datagen.py), you'll see the following output:
Ingesting Data into Kafka
Let's now pipe that stream of messages into Kafka, by running the following command:
We can check how many messages have been ingested by running the following command:
Output
And we can print out the messages themselves by running the following command:
Output
Schema
A schema defines what fields are present in the table along with their data types in JSON format.
Create a file called /tmp/pinot/schema-stream.json and add the following content to it.
Table Config
A table is a logical abstraction that represents a collection of related data. It is composed of columns and rows (known as documents in Pinot). The table config defines the table's properties in JSON format.
Create a file called /tmp/pinot/table-config-stream.json and add the following content to it.
Create schema and table
Create the table and schema by running the appropriate command below:
Querying
Navigate to localhost:9000/#/query and click on the events table to run a query that shows the first 10 rows in this table.
Querying the events table
Kafka ingestion guidelines
Kafka versions in Pinot
Pinot supports two versions of the Kafka library: kafka-0.9 and kafka-2.x for low level consumers.
Post release 0.10.0, we have started shading kafka packages inside Pinot. If you are using our latest tagged docker images or master build, you should replace org.apache.kafka with shaded.org.apache.kafka in your table config.
Upgrade from Kafka 0.9 connector to Kafka 2.x connector
Update table config for low level consumer: stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.
Pinot does not support using high-level Kafka consumers (HLC). Pinot uses low-level consumers to ensure accurate results, reduce operational complexity, improve scalability, and minimize storage overhead.
How to consume from a Kafka version > 2.0.0
This connector is also suitable for Kafka lib versions higher than 2.0.0. In the Kafka 2.0 connector pom.xml, changing kafka.lib.version from 2.0.0 to 2.1.1 will make this connector work with Kafka 2.1.1.
Kafka configurations in Pinot
Use Kafka partition (low) level consumer with SSL
Here is an example config which uses SSL-based authentication to talk with Kafka and the schema registry. Notice there are two sets of SSL options: the ones starting with ssl. are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
Consume transactionally-committed messages
The connector with Kafka library 2.0+ supports Kafka transactions. The transaction support is controlled by config kafka.isolation.level in Kafka stream config, which can be read_committed or read_uncommitted (default). Setting it to read_committed will ingest transactionally committed messages in Kafka stream only.
For example,
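A sketch of the relevant streamConfigs entries, assuming the stream.kafka. prefix used by the Kafka 2.x connector (the topic name and broker list are placeholders):

"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "myTopic",
  "stream.kafka.broker.list": "localhost:9092",
  "stream.kafka.isolation.level": "read_committed"
}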
Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.
Use Kafka partition (low) level consumer with SASL_SSL
Here is an example config which uses SASL_SSL-based authentication to talk with Kafka and the schema registry. Notice there are two sets of SSL options: some are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
Extract record headers as Pinot table columns
Pinot's Kafka connector supports automatically extracting record headers and metadata into the Pinot table columns. The following table shows the mapping for record header/metadata to Pinot table column names:
| Kafka Record | Pinot Table Column | Description |
| --- | --- | --- |
| Record key: any type <K> | __key : String | For simplicity of design, we assume that the record key is always a UTF-8 encoded String |
| Record Headers: Map<String, String> | Each header key is listed as a separate column: __header$HeaderKeyName : String | For simplicity of design, we directly map the string headers from the Kafka record to Pinot table columns |
| Record metadata - offset : long | __metadata$offset : String | |
| Record metadata - partition : int | __metadata$partition : String | |
In order to enable the metadata extraction in a Kafka table, you can set the stream config metadata.populate to true.
In addition to this, if you want to use any of these columns in your table, you have to list them explicitly in your table's schema.
For example, if you want to add only the offset and key as dimension columns in your Pinot table, they can be listed in the schema as follows:
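A minimal sketch of the relevant dimensionFieldSpecs entries (the rest of the schema is omitted; both columns are strings, per the mapping table above):

"dimensionFieldSpecs": [
  {
    "name": "__key",
    "dataType": "STRING"
  },
  {
    "name": "__metadata$offset",
    "dataType": "STRING"
  }
]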
Once the schema is updated, these columns behave like any other Pinot column. You can apply ingestion transforms and/or define indexes on them.
To avoid errors like The Avro schema must be provided, designate the location of the schema in your streamConfigs section. For example, if your current section contains the following:
Then add the key "stream.kafka.decoder.prop.schema" followed by a value that denotes the location of your schema.
By default, Pinot creates a dictionary-encoded forward index for each column.
Enabling indexes
There are two ways to enable indexes for a Pinot table.
As part of ingestion, during Pinot segment generation
Indexing is enabled by specifying the column names in the table configuration. More details about how to configure each type of index can be found in the respective index's section linked above or in the table configuration reference.
Dynamically added or removed
Indexes can also be dynamically added to or removed from segments at any point. Update your table configuration with the latest set of indexes you want to have.
For example, if you have an inverted index on the foo field and now want to also include the bar field, you would update your table configuration from this:
To this:
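A minimal sketch of that change, assuming the inverted index is declared via invertedIndexColumns under tableIndexConfig:

// before: inverted index only on foo
"tableIndexConfig": {
  "invertedIndexColumns": ["foo"]
}

// after: inverted index on foo and bar
"tableIndexConfig": {
  "invertedIndexColumns": ["foo", "bar"]
}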
The updated index configuration won't be picked up unless you invoke the reload API. This API sends reload messages via Helix to all servers, as part of which indexes are added or removed from the local segments. This happens without any downtime and is completely transparent to the queries.
When adding an index, only the new index is created and appended to the existing segment. When removing an index, its related states are cleaned up from Pinot servers. You can find this API under the Segments tab on Swagger:
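As a sketch, the reload can also be triggered from the command line (the controller address and table name below are placeholders):

# Reload all segments of the table so the updated index config takes effect
curl -X POST "http://localhost:9000/segments/myTable_OFFLINE/reload"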
className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.
configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configs.
configs: Key-value pairs for format-specific configurations. This field is optional. A sketch of a complete recordReaderSpec is shown below.
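A sketch of a recordReaderSpec for CSV inside the ingestion job spec (the class names follow the pinot-csv input-format plugin; adjust them for your format):

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ','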
Supported input formats
Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.
CSV
CSV Record Reader supports the following configs:
fileFormat: default, rfc4180, excel, tdf, mysql
header: Header of the file. The columnNames should be separated by the delimiter mentioned in the configuration.
delimiter: The character separating the columns.
multiValueDelimiter: The character separating multiple values in a single column. This can be used to split a column into a list.
skipHeader: Skip header record in the file. Boolean.
ignoreEmptyLines: Ignore empty lines (instead of filling them with default values). Boolean.
ignoreSurroundingSpaces: Ignore spaces around column names and values. Boolean.
quoteCharacter: Single character used for quotes in CSV files.
recordSeparator: Character used to separate records in the input file. Default is \n or \r\n depending on the platform.
nullStringValue: String value that represents null in CSV files. Default is empty string.
skipUnParseableLines : Skip lines that cannot be parsed. Note that this would result in data loss. Boolean.
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config.
multiValueDelimiter: ''
Avro
The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, the Avro record reader only supports primitive types. To enable support for the rest of the Avro data types, set enableLogicalTypes to true.
We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in org.apache.avro.Conversions.
| Avro Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT | INT | |
| LONG | LONG | |
| FLOAT | FLOAT | |
JSON
Thrift
Thrift requires the class generated from the .thrift file to parse the data. The .class file should be available in Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.
Parquet
Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the parquet file footer and, if present, uses the Avro reader.
You can change the record reader manually in case of a misconfiguration.
For the support of DECIMAL and other parquet native data types, always use ParquetNativeRecordReader.
| Parquet Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT96 | LONG | Parquet INT96 type converts nanoseconds to Pinot INT64 type of milliseconds |
| INT64 | LONG | |
| INT32 | INT | |
| FLOAT | FLOAT | |
| DOUBLE | | |
For ParquetAvroRecordReader, you can refer to the Avro section above for the type conversions.
ORC
ORC record reader supports the following data types -
| ORC Data Type | Java Data Type |
| --- | --- |
| BOOLEAN | String |
| SHORT | Integer |
| INT | Integer |
| LONG | Integer |
| FLOAT | Float |
In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.
Protocol Buffers
The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using the command -
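For example, a sketch using protoc (paths are placeholders):

# Generate a descriptor file, including any imported .proto files
protoc --include_imports --descriptor_set_out=/tmp/sample.desc sample.proto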
Batch Ingestion
Batch ingestion of data into Apache Pinot.
With batch ingestion you create a table using data already present in a file system such as S3. This is particularly useful when you want to use Pinot to query across large data with minimal latency or to test out new features using a simple data file.
To ingest data from a filesystem, perform the following steps, which are described in more detail in this page:
Create schema configuration
Create table configuration
Upload schema and table configs
Upload data
Batch ingestion currently supports the following mechanisms to upload the data:
Standalone
Here's an example using standalone local processing.
First, create a table using the following CSV data.
Create schema configuration
In our data, the only column on which aggregations can be performed is score. Secondly, timestampInEpoch is the only timestamp column. So, in our schema, we mark score as a metric and timestampInEpoch as the timestamp column.
Here, we have also defined two extra fields: format and granularity. The format specifies the formatting of our timestamp column in the data source. Currently, it's in milliseconds, so we've specified 1:MILLISECONDS:EPOCH.
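A sketch of such a schema, assuming the CSV contains a score metric and a timestampInEpoch time column (the dimension field names are illustrative):

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}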
Create table configuration
We define a table transcript and map the schema created in the previous step to the table. For batch data, we keep the tableType as OFFLINE.
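A sketch of such an offline table config (replication and the other settings are illustrative):

{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "metadata": {}
}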
Upload schema and table configs
Now that we have both the configs, upload them and create a table by running the following command:
Check out the table config and schema in the Rest API to make sure they were successfully uploaded.
Upload data
We now have an empty table in Pinot. Next, upload the CSV file to this empty table.
A table is composed of multiple segments. The segments can be created in the following three ways:
Minion based ingestion
Upload API
Ingestion jobs
Minion-based ingestion
Refer to
Upload API
There are 2 controller APIs that can be used for a quick ingestion test using a small file.
When these APIs are invoked, the controller has to download the file and build the segment locally.
Hence, these APIs are NOT meant for production environments or for large input files.
/ingestFromFile
This API creates a segment using the given file and pushes it to Pinot. All steps happen on the controller.
Example usage:
To upload a JSON file data.json to a table called foo_OFFLINE, use the command below.
Note that query params need to be URLEncoded. For example, {"inputFormat":"json"} in the command below needs to be converted to %7B%22inputFormat%22%3A%22json%22%7D.
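A hedged sketch of such a call (the controller address and multipart file field are assumptions; adjust to your setup):

curl -X POST -F file=@data.json \
  -H "Content-Type: multipart/form-data" \
  "http://localhost:9000/ingestFromFile?tableNameWithType=foo_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"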
The batchConfigMapStr can be used to pass in additional properties needed for decoding the file. For example, in the case of CSV, you may need to provide the delimiter.
/ingestFromURI
This API creates a segment using file at the given URI and pushes it to Pinot. Properties to access the FS need to be provided in the batchConfigMap. All steps happen on the controller.
Example usage:
Ingestion jobs
Segments can be created and uploaded using tasks known as DataIngestionJobs. A job also needs a config of its own. We call this config the JobSpec.
For our CSV file and table, the JobSpec should look like this:
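A sketch of a standalone JobSpec for this CSV ingestion (the directories, file pattern, and controller address are placeholders):

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'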
For more detail, refer to .
Now that we have the job spec for our table transcript, we can trigger the job using the following command:
Once the job successfully finishes, head over to the \[query console] and start playing with the data.
Segment push job type
There are 3 ways to upload a Pinot segment:
Segment tar push
Segment URI push
Segment metadata push
Segment tar push
This is the original and default push mechanism.
Tar push requires the segment to be stored locally or to be openable as an InputStream on PinotFS, so the entire segment tar file can be streamed to the controller.
The push job will:
Upload the entire segment tar file to the Pinot controller.
Pinot controller will:
Save the segment into the controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
Segment URI push
This push mechanism requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
URI push is lightweight on the client side, while the controller side requires the same amount of work as the tar push.
The push job will:
POST this segment tar URI to the Pinot controller.
Pinot controller will:
Download segment from the URI and save it to controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
Segment metadata push
This push mechanism also requires the segment tar file stored on a deep store with a globally accessible segment tar URI.
Metadata push is lightweight on the controller side; no deep store download is involved on the controller side.
The push job will:
Download the segment based on URI.
Extract metadata.
Upload metadata to the Pinot Controller.
Pinot Controller will:
Add the segment to the table based on the metadata.
Segment metadata push with copyToDeepStore
This extends the original segment metadata push for cases where the segments are pushed to a location not used as the deep store. The ingestion job can still do a metadata push but ask the Pinot controller to copy the segments into the deep store. These use cases usually happen when the ingestion jobs don't have direct access to the deep store but still want to use metadata push for its efficiency, so they use a staging location to keep the segments temporarily.
NOTE: the staging location and the deep store have to use the same storage scheme, for example both on S3. This is because the copy is done via the PinotFS.copyDir interface, which assumes so; it also means the copy happens on the storage system side, so segments don't need to go through the Pinot controller at all.
To make this work, grant Pinot controllers access to the staging location. For example on AWS, this may require adding an access policy like this example for the controller EC2 instances:
Then use metadata push to add one extra config like this one:
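A hedged sketch of that extra config, assuming the copyToDeepStoreForMetadataPush flag in the job spec's pushJobSpec:

pushJobSpec:
  # Ask the controller to copy segments from the staging location into the deep store
  copyToDeepStoreForMetadataPush: true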
Consistent data push and rollback
Pinot supports atomic updates at the segment level, which means that when data consisting of multiple segments is pushed to a table, as segments are replaced one at a time, queries to the broker during this upload phase may produce inconsistent results due to interleaving of old and new data.
See for how to enable this feature.
Segment fetchers
When Pinot segment files are created in external systems (Hadoop/Spark/etc.), there are several ways to push that data to the Pinot controller and server:
Push segment to shared NFS and let pinot pull segment files from the location of that NFS. See .
Push segment to a Web server and let pinot pull segment files from the Web server with HTTP/HTTPS link. See .
Push segment to PinotFS(HDFS/S3/GCS/ADLS) and let pinot pull segment files from PinotFS URI. See and .
The first three options are supported out of the box within the Pinot package. As long as your remote jobs send the Pinot controller the corresponding URI to the files, it will pick up the files and allocate them to the proper Pinot servers and brokers. To enable Pinot support for PinotFS, you'll need to provide configuration and the proper Hadoop dependencies.
Persistence
By default, Pinot does not come with a storage layer, so the data sent won't be retained in case of a system crash. In order to persistently store the generated segments, you will need to change the controller and server configs to add deep storage. Check out for all the info and related configs.
Tuning
Standalone
Since Pinot is written in Java, you can set the following basic Java configurations to tune the segment runner job:
Log4j2 file location with -Dlog4j2.configurationFile
Plugin directory location with -Dplugins.dir=/opt/pinot/plugins
JVM props, like -Xmx8g -Xms4G
If you are using Docker, you can set the following under the JAVA_OPTS variable.
Hadoop
You can set -D mapreduce.map.memory.mb=8192 to set the mapper memory size when submitting the Hadoop job.
Spark
You can add config spark.executor.memory to tune the memory usage for segment creation when submitting the Spark job.
0.8.0
This release introduced several new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins.
Summary
This release introduced several awesome new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins (AWS Kinesis, Apache Pulsar). It contains a lot of query enhancements such as new timestamp and boolean type support and flexible numerical column comparison. It also includes many key bug fixes. See details below.
The release was cut from the following commit: fe83e95aa9124ee59787c580846793ff7456eaa5
and the following cherry-picks:
Notable New Features
Extract time handling for SegmentProcessorFramework ()
Add Apache Pulsar low level and high level connector ()
Enable parallel builds for compat checker ()
Special notes
After the 0.8.0 release, we will officially support jdk 11, and can now safely start to use jdk 11 features. Code is still compilable with jdk 8 ()
RealtimeToOfflineSegmentsTask config has some backward incompatible changes ()
— timeColumnTransformFunction is removed (backward-incompatible, but rollup is not supported anyway)
— Deprecate collectorType and replace it with mergeType
— Add roundBucketTimePeriod and partitionBucketTimePeriod to config the time bucket for round and partition
Major Bug fixes
Fix race condition in MinionInstancesCleanupTask ()
Fix custom instance id for controller/broker/minion ()
Fix UpsertConfig JSON deserialization. ()
Forward index
The forward index is the mechanism Pinot employs to store the values of each column. At a conceptual level, the forward index can be thought of as a mapping from document IDs (also known as row indices) to the actual column values of each row.
Forward indexes are enabled by default, meaning that columns will have a forward index unless explicitly disabled. Disabling the forward index can save storage space when other indexes sufficiently cover the required data patterns. For information on how to disable the forward index and its implications, refer to .
Pinot Minion SegmentGenerationAndPush task: PinotFS configs inside taskSpec are always temporary and have higher priority than the default PinotFS created by the minion server configs (#6744)
DataTable V3 implementation and measure data table serialization cost on server (#6710)
add uploadLLCSegment endpoint in TableResource (#6653)
Recover the segment from controller when LLC table cannot load it (#6647)
Adding a new API for validating specified TableConfig and Schema (#6620)
Introduce a metric for query/response size on broker. (#6590)
Adding a controller periodic task to clean up dead minion instances (#6543)
Adding new validation for Json, TEXT indexing (#6541)
Always return a response from query execution. (#6596)
Regex path for pluggable MinionEventObserverFactory is changed from org.apache.pinot.*.event.* to org.apache.pinot.*.plugin.minion.tasks.* (#6980)
Moved all pinot built-in minion tasks to the pinot-minion-builtin-tasks module and package them into a shaded jar (#6618)
Reloading consuming segment flag pinot.server.instance.reload.consumingSegment will be true by default (#7078)
Move JSON decoder from pinot-kafka to pinot-json package. (#7021)
Backward incompatible schema change through controller rest API PUT /schemas/{schemaName} will be blocked. (#6737)
Deprecated /tables/validateTableAndSchema in favor of the new configs/validate API and introduced new APIs for /tableConfigs to operate on the real-time table config, offline table config and schema in one shot. (#6840)
Fix the memory issue for selection query with large limit (#7112)
Fix the deleted segments directory not exist warning (#7097)
Fixing docker build scripts by providing JDK_VERSION as parameter (#7095)
How forward indexes are implemented depends on the index encoding and whether the column is sorted.
When the encoding is set to RAW, the forward index is implemented as an array, where the indices correspond to document IDs and the values represent the actual row values. For more details, refer to the raw value forward index section.
In the case of DICTIONARY encoding, the forward index doesn't store the actual row values but instead stores dictionary IDs. This introduces an additional level of indirection when reading values, but it allows for more efficient physical layouts when the number of unique values in the column is significantly smaller than the number of rows.
The DICTIONARY encoding can be even more efficient if the segment is sorted by the indexed column. You can learn more about the dictionary encoded forward index and the sorted forward index in their respective sections.
When working out whether a column should use dictionary encoded or raw value encoding, the following comparison table may help:
| Dictionary | Raw Value |
| --- | --- |
| Provides compression when low to medium cardinality. | Eliminates padding overhead |
| Allows for indexing (esp. inverted index). | No inverted index (only JSON/Text/FST index) |
| Adds one level of dereferencing, so can increase disk seeks | Eliminates additional dereferencing, so good when all docs of interest are contiguous |
| For Strings, adds padding to make all values equal length in the dictionary | Chunk de-compression overhead when the docs selected don't have spatial locality |
Dictionary-encoded forward index with bit compression (default)
In this approach, each unique value in a column is assigned an ID, and a dictionary is constructed to map these IDs back to their corresponding values. Instead of storing the actual values, the default forward index stores these bit-compressed IDs. This method is particularly effective when dealing with columns containing few unique values, as it significantly improves space efficiency.
The diagram below illustrates dictionary encoding for two columns with different data types (integer and string). For colA, dictionary encoding leads to significant space savings due to duplicated values. However, for colB, which contains mostly unique values, the compression effect is limited, and padding overhead may be high.
When using the dictionary-encoded forward index for a multi-value column, to further compress the forward index for repeated multi-value entries, enable the MV_ENTRY_DICT compression type, which adds another level of dictionary encoding on the multi-value entries. This may be useful, for example, in cases where you pre-join a fact table with a dimension table, where the multi-value entries in the dimension table are repeated after joining with the fact table.
It can be enabled with parameter:
| Parameter | Default | Description |
| --- | --- | --- |
| dictIdCompressionType | null | The compression that will be used for the dictionary-encoded forward index |
Sorted forward index with run-length encoding
When a column is physically sorted, Pinot employs a sorted forward index with run-length encoding, which builds upon dictionary encoding. Instead of storing dictionary IDs for each document ID, this approach stores pairs of start and end document IDs for each unique value.
Sorted forward index
(For simplicity, this diagram does not include the dictionary encoding layer.)
Sorted forward indexes offer the benefits of efficient compression and data locality and can also serve as an inverted index. They are active when two conditions are met: the segment is sorted by the column, and the dictionary is enabled for that column. Refer to the dictionary documentation for details on enabling the dictionary.
When dealing with multiple segments, it's crucial to ensure that data is sorted within each segment. Sorting across segments is not necessary.
To guarantee that a segment is sorted by a particular column, follow these steps:
For real-time tables, use the tableIndexConfig.sortedColumn property. If there is exactly one column specified in that array, Pinot will sort the segment by that column upon committing.
For offline tables, you must pre-sort the data by the specified column before ingesting it into Pinot.
It's crucial to note that for offline tables, the tableIndexConfig.sortedColumn property is indeed ignored.
Additionally, for real-time tables, even though this property is specified as a JSON array, at most one column should be included. Using an array with more than one column is incorrect and will not result in segments being sorted by all the columns listed in the array.
When a real-time segment is committed, rows will be sorted by the sorting column and it will be transformed into an offline segment.
During the creation of an offline segment, which also applies when a real-time segment is committed, Pinot scans the data in each column. If it detects that all values within a column are sorted in ascending order, Pinot concludes that the segment is sorted based on that particular column. In case this happens on more than one column, all of them are considered as sorting columns. Consequently, whether a segment is sorted by a column or not solely depends on the actual data distribution within the segment and entirely disregards the value of the sortedColumn property. This approach also implies that two segments belonging to the same table may have a different number of sorting columns. In the extreme scenario where a segment contains only one row, Pinot will consider all columns within that segment as sorting columns.
Here is an example of a table configuration that illustrates these concepts:
Checking sort status
You can check the sorted status of a column in a segment by running the following:
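As a hedged sketch, one way is to inspect the per-column isSorted flag recorded in the segment's metadata.properties (the column name and segment path below are placeholders):

grep "column.myColumn.isSorted" /path/to/table/segmentName/v3/metadata.properties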
Alternatively, for offline tables and for committed segments in real-time tables, you can retrieve the sorted status from the getServerMetadata endpoint. The following example is based on the Batch Quick Start:
Raw value forward index
The raw value forward index stores actual values instead of IDs. This means that it eliminates the need for dictionary lookups when fetching values, which can result in improved query performance. Raw forward index is particularly effective for columns with a large number of unique values, where dictionary encoding doesn't provide significant compression benefits.
As shown in the diagram below, dictionary encoding can lead to numerous random memory accesses for dictionary lookups. In contrast, the raw value forward index allows for sequential value scanning, which can enhance query performance when applied appropriately.
Note: The raw value forward index currently does not support an inverted index (all others, such as JSON/TEXT/range indexes, are supported). Also, since reading a value from this index requires reading the entire chunk into memory and decompressing it, it is not suitable for heavy random reads.
When using the raw format, you can configure the following parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| chunkCompressionType | null | The compression that will be used. Replaced by compressionCodec since release 1.2.0 |
| compressionCodec | null | The compression that will be used. Introduced in release 1.2.0 |
| deriveNumDocsPerChunk | false | Modifies the behavior when storing variable length values (like string or bytes) |
| rawIndexWriterVersion | 2 | |
The compressionCodec parameter has the following valid values:
PASS_THROUGH
SNAPPY
ZSTANDARD
LZ4
GZIP (Introduced in release 1.2.0)
null (the JSON null value, not "null"), which is the default. In this case, PASS_THROUGH will be used for metrics and LZ4 for other columns.
deriveNumDocsPerChunk is only used when the datatype may have a variable length, such as with string, big decimal, bytes, etc. By default, Pinot uses a fixed number of elements that was chosen empirically. If changed to true, Pinot will use a heuristic value that depends on the column data.
rawIndexWriterVersion changes the algorithm used to create the index. This changes the actual data layout, but modern versions of Pinot can read indexes written in older versions. The latest version right now is 4.
targetDocsPerChunk changes the target number of docs to store in a chunk. For rawIndexWriterVersion versions 2 and 3, this will store exactly targetDocsPerChunk per chunk. For rawIndexWriterVersion version 4, this config is used in conjunction with targetMaxChunkSize and chunk size is determined with the formula min(lengthOfLongestDocumentInSegment * targetDocsPerChunk, targetMaxChunkSize). A negative value will disable dynamic chunk sizing and use the static targetMaxChunkSize.
targetMaxChunkSize changes the target max chunk size. For rawIndexWriterVersion versions 2 and 3, this can only be used with deriveNumDocsPerChunk. For rawIndexWriterVersion version 4, this sets the upper bound for a dynamically calculated chunk size. Documents larger than the targetMaxChunkSize will be given their own 'huge' chunk, therefore, it is recommended to size this such that huge chunks are avoided.
Raw forward index configuration
The recommended way to configure the forward index using raw format is by including the parameters explained above in the indexes.forward object. For example:
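A sketch of such a configuration (the column name is a placeholder and the parameter values are illustrative):

{
  "fieldConfigList": [
    {
      "name": "theColumnName",
      "encodingType": "RAW",
      "indexes": {
        "forward": {
          "compressionCodec": "LZ4",
          "deriveNumDocsPerChunk": false,
          "rawIndexWriterVersion": 4
        }
      }
    }
  ]
}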
Deprecated
An alternative method to configure the raw format parameters is available. This older approach can still be used, although it is not recommended. Here are the details of this older method:
chunkCompressionType: This parameter can be defined as a sibling of name and encodingType in the fieldConfigList section.
deriveNumDocsPerChunk: You can configure this parameter with the property deriveNumDocsPerChunkForRawIndex. Note that in properties, all values must be strings, so valid values for this property are "true" and "false".
rawIndexWriterVersion: This parameter can be configured using the property rawIndexWriterVersion. Again, in properties, all values must be strings, so valid values for this property are "2", "3", and so on.
For example:
While this older method is still supported, it is not the recommended way to configure these parameters. There are no plans to remove support for this older method, but keep in mind that any new parameters added in the future may only be configurable in the forward JSON object.
Disabling the forward index
Traditionally the forward index has been a mandatory index for all columns in the on-disk segment file format.
However, certain columns may only be used as a filter in the WHERE clause for all queries. In such scenarios, the forward index is not necessary, as other indexes and structures in the segments can provide the required SQL query functionality. The forward index then just takes up extra storage space and can ideally be freed up.
Thus, to provide users an option to save storage space, a knob to disable the forward index is now available.
The forward index on one or more column(s) in your Pinot table can be disabled, with the following limitations:
Only supported for immutable (offline) segments.
If the column has a range index then the column must be of single-value type and use range index version 2.
MV columns with duplicates within a row will lose the duplicated entries on forward index regeneration. The ordering of data with an MV row may also change on regeneration. A backfill is required in such scenarios (to preserve duplicates or ordering).
If forward index regeneration support on reload (i.e. re-enabling the forward index for a forward index disabled column) is required then the dictionary and inverted index must be enabled on that particular column.
Sorted columns will allow the forward index to be disabled, but this operation will be treated as a no-op and the index (which acts as both a forward index and inverted index) will be created.
To disable the forward index, in table config under fieldConfigList, set the disabled property to true as shown below:
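A sketch of such a fieldConfigList entry, assuming the disabled flag lives under the forward entry of the indexes object (the column name is a placeholder):

"fieldConfigList": [
  {
    "name": "columnA",
    "indexes": {
      "forward": {
        "disabled": true
      }
    }
  }
]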
The older way to do so is still supported, but not recommended.
A table reload operation must be performed for the above config to take effect. Enabling / disabling other indexes on the column can be done via the usual table config options.
The forward index can also be regenerated for a column where it is disabled by enabling the index and reloading the segment. The forward index can only be regenerated if the dictionary and inverted index have been enabled for the column. If either has been disabled, the only way to get the forward index back is to regenerate the segments via the offline jobs and re-push / refresh the data.
Warning:
For multi-value (MV) columns the following invariants cannot be maintained after regenerating the forward index for a forward index disabled column:
Ordering guarantees of the MV values within a row
If entries within an MV row are duplicated, the duplicates will be lost. Regenerate the segments via your offline jobs and re-push / refresh the data to get back the original MV data with duplicates.
We will work on removing the second invariant in the future.
Examples of queries which will fail after disabling the forward index for an example column, columnA, can be found below:
Select
Forward index disabled columns cannot be present in the SELECT clause even if filters are added on it.
Group By Order By
Forward index disabled columns cannot be present in the GROUP BY and ORDER BY clauses. They also cannot be part of the HAVING clause.
Aggregation Queries
A subset of the aggregation functions work when the forward index is disabled, such as MIN, MAX, DISTINCTCOUNT, DISTINCTCOUNTHLL, and more. Some other aggregation functions will not work, such as the ones below:
Distinct
Forward index disabled columns cannot be present in the SELECT DISTINCT clause.
Range Queries
To run queries on single-value columns where the filter clause contains operators such as >, <, >=, <=, a version 2 range index must be present. Without the range index, such queries will fail as shown below:
Explore the minion component in Apache Pinot, empowering efficient data movement and segment generation within Pinot clusters.
A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.
Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.
Starting a minion
Make sure you've . If you're using Docker, make sure to . To start a minion:
Interfaces
Pinot task generator
The Pinot task generator interface defines the APIs for the controller to generate tasks for minions to execute.
PinotTaskExecutorFactory
Factory for PinotTaskExecutor which defines the APIs for Minion to execute the tasks.
MinionEventObserverFactory
Factory for MinionEventObserver which defines the APIs for task event callbacks on minion.
Built-in tasks
SegmentGenerationAndPushTask
The PushTask can fetch files from an input folder (e.g., an S3 bucket) and convert them into segments. The PushTask converts one file into one segment and keeps the file name in the segment metadata to avoid duplicate ingestion. Below is an example task config to put in the TableConfig to enable this task. The task is scheduled every 10 minutes to keep ingesting remaining files, with at most 10 parallel tasks and 1 file per task.
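A sketch of such a task config (the schedule below is a Quartz cron expression for every 10 minutes; the input location is a placeholder, and only the properties discussed in this section are shown):

"task": {
  "taskTypeConfigsMap": {
    "SegmentGenerationAndPushTask": {
      "schedule": "0 */10 * * * ?",
      "tableMaxNumTasks": "10",
      "inputDirURI": "s3://my-bucket/my-input-folder/"
    }
  }
}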
NOTE: You may want to simply omit "tableMaxNumTasks" due to this caveat: the task generates one segment per file, and derives segment name based on the time column of the file. If two files happen to have same time range and are ingested by tasks from different schedules, there might be segment name conflict. To overcome this issue for now, you can omit “tableMaxNumTasks” and by default it’s Integer.MAX_VALUE, meaning to schedule as many tasks as possible to ingest all input files in a single batch. Within one batch, a sequence number suffix is used to ensure no segment name conflict. Because the sequence number suffix is scoped within one batch, tasks from different batches might encounter segment name conflict issue said above.
When performing ingestion at scale remember that Pinot will list all of the files contained in the `inputDirURI` every time a `SegmentGenerationAndPushTask` job gets scheduled. This could become a bottleneck when fetching files from a cloud bucket like GCS. To prevent this make `inputDirURI` point to the least number of files possible.
RealtimeToOfflineSegmentsTask
See for details.
MergeRollupTask
See for details.
Enable tasks
Tasks are enabled on a per-table basis. To enable a certain task type (e.g. myTask) on a table, update the table config to include the task type:
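A sketch, using the taskTypeConfigsMap section of the table config (the property names are placeholders):

"task": {
  "taskTypeConfigsMap": {
    "myTask": {
      "myProperty1": "value1",
      "myProperty2": "value2"
    }
  }
}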
Under each enabled task type, custom properties can be configured for the task type.
There are also two task configs to be set as part of cluster configs, like below. One controls a task's overall timeout (1 hour by default) and the other controls how many tasks can run on a single minion worker (1 by default).
Schedule tasks
Auto-schedule
There are 2 ways to enable task scheduling:
Controller level schedule for all minion tasks
Tasks can be scheduled periodically for all task types on all enabled tables. Enable auto task scheduling by configuring the schedule frequency in the controller config with the key controller.task.frequencyPeriod. This takes period strings as values, e.g. 2h, 30m, 1d.
Per table and task level schedule
Tasks can also be scheduled based on cron expressions. The cron expression is set in the schedule config for each task type separately. The controller config controller.task.scheduler.enabled should be set to true to enable cron scheduling.
As shown below, the RealtimeToOfflineSegmentsTask will be scheduled at the first second of every minute (following Quartz cron syntax).
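A sketch of the corresponding table config entry:

"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "schedule": "0 * * * * ?"
    }
  }
}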
Manual schedule
Tasks can be manually scheduled using the following controller rest APIs:
Rest API
Description
Schedule task on specific instances
Tasks can be scheduled on specific instances using the following config at task level:
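A hedged sketch, assuming the minionInstanceTag property at the task level (the tag name is a placeholder):

"task": {
  "taskTypeConfigsMap": {
    "myTask": {
      "minionInstanceTag": "tag1_MINION"
    }
  }
}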
By default, the value is minion_untagged for backward compatibility. This allows users to schedule tasks on specific nodes and isolate tasks among tables / task types.
Rest API
Description
Task level advanced configs
allowDownloadFromServer
When a task is executed on a segment, the minion node fetches the segment from deepstore. If the deepstore is not accessible, the minion node can download the segment from the server node. This is controlled by the allowDownloadFromServer config in the task config. By default, this is set to false.
We can also set this config at the minion instance level with pinot.minion.task.allow.download.from.server (default is false). This instance-level config helps enforce this behavior when the number of tables / tasks is high and we want to enable it for all of them. Note: the task-level config will override the instance-level config value.
Plug-in custom tasks
To plug in a custom task, implement PinotTaskGenerator, PinotTaskExecutorFactory and MinionEventObserverFactory (optional) for the task type (all of them should return the same string for getTaskType()), and annotate them with the following annotations:
Implementation
Annotation
After annotating the classes, put them under the package of name org.apache.pinot.*.plugin.minion.tasks.*, then they will be auto-registered by the controller and minion.
Example
See where the TestTask is plugged-in.
Task Manager UI
In the Pinot UI, there is a Minion Task Manager tab under the Cluster Manager page. From that tab, you can find a lot of task-related info for troubleshooting. That info is mainly collected from the Pinot controller that schedules tasks, or from Helix, which tracks task runtime status. There are also buttons to schedule tasks in an ad hoc way. Below are brief introductions to some pages under the Minion Task Manager tab.
This page shows which Minion task types have been used, that is, which task types have created their task queues in Helix.
Clicking into a task type, you can see the tables using that task, along with a few buttons to stop the task queue, clean up ended tasks, and so on.
Then, clicking into any table in this list, you can see how the task is configured for that table, and the task metadata if there is any in ZK. For example, MergeRollupTask tracks a watermark in ZK. If the task is cron scheduled, the current and next schedules are also shown on this page, like below.
At the bottom of this page is a list of tasks generated for this table for this specific task type. Here, one MergeRollup task has been generated and completed.
Clicking into a task from that list, we can see its start and end times, and the subtasks generated for that task (as context, one minion task can have multiple subtasks to process data in parallel). In this example, there happened to be one subtask, and the page shows when it started and stopped and which minion worker it ran on.
Clicking into this subtask, one can see more details about it like the input task configs and error info if the task failed.
Task-related metrics
There is a controller job that runs every 5 minutes by default and emits metrics about Minion tasks scheduled in Pinot. The following metrics are emitted for each task type:
NumMinionTasksInProgress: Number of running tasks
NumMinionSubtasksRunning: Number of running sub-tasks
NumMinionSubtasksWaiting: Number of waiting sub-tasks (unassigned to a minion as yet)
The controller also emits metrics about how tasks are cron scheduled:
cronSchedulerJobScheduled: Number of current cron schedules registered to be triggered regularly according to their cron expressions. It's a Gauge.
cronSchedulerJobTrigger: Number of cron schedules triggered, as a Meter.
cronSchedulerJobSkipped: Number of late cron schedules skipped, as a Meter.
For each task, the minion will emit these metrics:
TASK_QUEUEING: Task queueing time (task_dequeue_time - task_inqueue_time), assuming the time drift between the Helix controller and the Pinot minion is minor; otherwise the value may be negative.
TASK_EXECUTION: Task execution time, which is the time spent on executing the task
NUMBER_OF_TASKS: Number of tasks in progress on that minion. Whenever a minion starts a task, the gauge is increased by 1; whenever a minion completes a task (either succeeded or failed), it is decreased by 1.
{
"tableName": "somePinotTable",
"fieldConfigList": [
{
"name": "playerID",
"encodingType": "RAW",
"chunkCompressionType": "PASS_THROUGH", // it can also be defined here
"properties": {
"deriveNumDocsPerChunkForRawIndex": "false", // here the string value has to be used
"rawIndexWriterVersion": "2" // here the string value has to be used
}
},
...
],
...
}
NumMinionSubtasksError: Number of error sub-tasks (completed with an error/exception)
PercentMinionSubtasksInQueue: Percent of sub-tasks in waiting or running states
PercentMinionSubtasksInError: Percent of sub-tasks in error
cronSchedulerJobExecutionTimeMs: Time used to complete task generation, as a Timer.
NUMBER_TASKS_EXECUTED: Number of tasks executed, as a Meter.
NUMBER_TASKS_COMPLETED: Number of tasks completed, as a Meter.
NUMBER_TASKS_CANCELLED: Number of tasks cancelled, as a Meter.
NUMBER_TASKS_FAILED: Number of tasks failed, as a Meter. Unlike a fatal failure, the task encountered an error that cannot be recovered from in this run, but it may still succeed if the task is retried.
NUMBER_TASKS_FATAL_FAILED: Number of tasks fatally failed, as a Meter. Unlike a failure, the task encountered an error that will not be recoverable even if the task is retried.
POST /tasks/schedule
Schedule tasks for all task types on all enabled tables
POST /tasks/schedule?taskType=myTask
Schedule tasks for the given task type on all enabled tables
POST /tasks/schedule?tableName=myTable_OFFLINE
Schedule tasks for all task types on the given table
POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE
Schedule tasks for the given task type on the given table
POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE&minionInstanceTag=tag1_MINION
Schedule tasks for the given task type of the given table on the minion nodes tagged as tag1_MINION.
This guide shows you how to ingest a stream of records into a Pinot table.
Apache Pinot lets users consume data from streams and push it directly into the database. This process is called stream ingestion. Stream ingestion makes it possible to query data within seconds of publication.
Stream ingestion supports checkpoints to prevent data loss.
To set up stream ingestion, perform the following steps, which are described in more detail later on this page:
Create schema configuration
Create table configuration
Create ingestion configuration
Upload table and schema spec
Here's an example where we assume the data to be ingested is in the following format:
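For instance, a record might look like this (the field names below are illustrative assumptions, matching the transcript example used later on this page):
{
  "studentID": 205,            // illustrative field
  "firstName": "Natalie",
  "lastName": "Jones",
  "gender": "Female",
  "subject": "Maths",
  "score": 3.8,
  "timestampInEpoch": 1571900400000
}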
Create schema configuration
The schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamps. For more details, see the schema configuration reference.
For our sample data, the schema configuration looks like this:
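A minimal sketch of such a schema, using the illustrative fields assumed above (adjust names and types to your actual data):
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "INT"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}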
Create table configuration with ingestion configuration
The next step is to create a table where all the ingested data will flow and can be queried. For details about each table component, see the reference.
The table configuration contains an ingestion configuration (ingestionConfig), which specifies how to ingest streaming data into Pinot. For details, see the reference.
Example table config with ingestionConfig
For our sample data and schema, the table config will look like this:
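A sketch of a REALTIME table config with an ingestionConfig for a Kafka stream; the broker address, topic name, and flush thresholds below are placeholders to adapt to your environment:
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "schemaName": "transcript",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "tenants": {},
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic",   // placeholder topic
          "stream.kafka.broker.list": "localhost:9092",    // placeholder broker list
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "24h",
          "realtime.segment.flush.threshold.segment.size": "100M"
        }
      ]
    }
  },
  "metadata": {}
}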
Example ingestionConfig for multi-topic ingestion
In recent releases, Pinot supports ingesting data from multiple stream topics into a single table. (This feature is currently in beta and only supports multiple Kafka topics; other stream types will be supported in the near future.) For our sample data and schema, assume the data is duplicated to two topics, transcript-topic1 and transcript-topic2. If we want to ingest from both topics, the table config will look like this:
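As a minimal sketch (the exact shape depends on your Pinot version's multi-topic support), the ingestionConfig carries one stream config map per topic, with everything else in the table config unchanged:
{
  ...
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic1",  // first topic
          "stream.kafka.broker.list": "localhost:9092",    // placeholder broker list
          ...
        },
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "transcript-topic2",  // second topic
          "stream.kafka.broker.list": "localhost:9092",
          ...
        }
      ]
    }
  },
  ...
}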
With multi-topic ingestion (refer to the multi-topic ingestion documentation for details):
All transform functions apply to the ingestion from both topics.
The existing instance assignment strategies work as usual.
Other existing behaviors are handled in the same way.
Upload schema and table config
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.
Tune the stream config
Throttle stream consumption
There are some scenarios where the message rate in the input stream comes in bursts, which can lead to long GC pauses on the Pinot servers or affect the ingestion rate of other real-time tables on the same server. If this happens to you, throttle the consumption rate during stream ingestion to better manage overall performance.
Stream consumption throttling can be tuned using the stream config topic.consumption.rate.limit which indicates the upper bound on the message rate for the entire topic.
Here is the sample configuration on how to configure the consumption throttling:
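A minimal sketch of the relevant stream config entries (1000 is an arbitrary example value, in messages per second for the whole topic):
{
  "streamType": "kafka",
  "stream.kafka.topic.name": "transcript-topic",  // placeholder topic
  "topic.consumption.rate.limit": "1000"          // upper bound on messages/sec for the entire topic
}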
Some things to keep in mind while tuning this config are:
Since this configuration applies to the entire topic, internally the rate is divided by the number of partitions in the topic and applied to each partition's consumer.
In a multi-tenant deployment (where you have more than one table on the same server instance), make sure that the rate limit on one table doesn't starve the rate limiting of another table. So, when there is more than one table on the same server (which is likely), you may need to re-tune the throttling threshold for all the streaming tables.
Once throttling is enabled for a table, you can verify by searching for a log that looks similar to:
In addition, you can monitor the consumption rate utilization with the metric CONSUMPTION_QUOTA_UTILIZATION.
Note that any configuration change for topic.consumption.rate.limit in the stream config will NOT take effect immediately. The new configuration will be picked up from the next consuming segment. In order to enforce the new configuration, you need to trigger the forceCommit API. Refer to the "Pause stream ingestion" section below for more details.
Custom ingestion support
You can also write an ingestion plugin if the platform you are using is not supported out of the box. For a walkthrough, see the documentation on writing a stream ingestion plugin.
Pause stream ingestion
There are some scenarios in which you may want to pause real-time ingestion while your table remains available for queries. For example, if there is a problem with the stream ingestion, you may want queries to keep executing on the already ingested data while you troubleshoot the issue. For these scenarios, first issue a Pause request to a controller host. After troubleshooting the stream is done, issue another request to the controller to resume consumption.
When a Pause request is issued, the controller instructs the real-time servers hosting your table to commit their consuming segments immediately. However, the commit process may take some time to complete. Note that Pause and Resume requests are async. An OK response means that the instructions for pausing or resuming have been successfully sent to the real-time server. If you want to know whether consumption has actually stopped or resumed, issue a pause status request.
It's worth noting that consuming segments on real-time servers are stored in volatile memory, and their resources are allocated when the consuming segments are first created. These resources cannot be altered if consumption parameters are changed midway through consumption. It may take hours before these changes take effect. Furthermore, if the parameters are changed in an incompatible way (for example, changing the underlying stream with a completely new set of offsets, or changing the stream endpoint from which to consume messages), it will result in the table getting into an error state.
The pause and resume feature is helpful in these instances. When a pause request is issued by the operator, consuming segments are committed without starting new mutable segments. Instead, new mutable segments are started only when the resume request is issued. This mechanism provides the operators as well as developers with more flexibility. It also enables Pinot to be more resilient to the operational and functional constraints imposed by underlying streams.
There is another feature called Force Commit which uses the primitives of the pause and resume feature. When the operator issues a force commit request, the current mutable segments are committed and new ones started right away. Operators can use this feature to make all compatible table config parameter changes take effect immediately.
(v 0.12.0+) Once submitted, the forceCommit API returns a jobId that can be used to get the current progress of the forceCommit operation. A sample response and status API call:
The forceCommit request simply triggers a regular commit before the consuming segments reach their end criteria, so it follows the same mechanism as a regular commit. It is a one-shot request and is not retried automatically upon failure. However, it is idempotent, so you can keep issuing it until it succeeds if needed.
This API is async in that it doesn't wait for the segment commit to complete. However, a status entry is written to ZK to track when the request was issued and which consuming segments were included. The consuming segments tracked in the status entry are compared with the latest IdealState to indicate the progress of the forceCommit. This status is not updated or deleted upon commit success or failure, so it can become stale. Currently, the most recent 100 status entries are kept in ZK; the oldest entries are only deleted when the total number is about to exceed 100.
For incompatible parameter changes, an option is added to the resume request to handle the case of a completely new set of offsets. Operators can now follow a three-step process: First, issue a pause request. Second, change the consumption parameters. Finally, issue the resume request with the appropriate option. These steps will preserve the old data and allow the new data to be consumed immediately. All through the operation, queries will continue to be served.
Handle partition changes in streams
If a Pinot table is configured to consume using a (partition-based) stream type, then it is possible that the partitions of the table change over time. In Kafka, for example, the number of partitions may increase. In Kinesis, the number of partitions may increase or decrease -- some partitions could be merged to create a new one, or existing partitions split to create new ones.
Pinot runs a periodic task called RealtimeSegmentValidationManager that monitors such changes and starts consumption on new partitions (or stops consumption from old ones) as necessary. Since this is a periodic task that runs on the controller, it may take some time for Pinot to recognize new partitions and start consuming from them. This may delay the data in new partitions from appearing in the results that Pinot returns.
If you want to recognize the new partitions sooner, manually trigger the RealtimeSegmentValidationManager periodic task so that the new partitions are picked up immediately.
Infer ingestion status of real-time tables
Often, it is important to understand the rate of ingestion of data into your real-time table. This is commonly done by looking at the consumption lag of the consumer. The lag itself can be observed in many dimensions. Pinot supports observing consumption lag along the offset dimension and time dimension, whenever applicable (as it depends on the specifics of the connector).
The ingestion status of a connector can be observed by querying either the /consumingSegmentsInfo API or the table's /debug API, as shown below:
A sample response from a Kafka-based real-time table is shown below. The ingestion status is displayed for each of the CONSUMING segments in the table.
Monitor real-time ingestion
Real-time ingestion includes 3 stages of message processing: Decode, Transform, and Index.
In each of these stages, a failure can happen which may or may not result in an ingestion failure. The following metrics are available to investigate ingestion issues:
Decode stage -> an error here is recorded as INVALID_REALTIME_ROWS_DROPPED
Transform stage -> possible errors here are:
When a message gets dropped due to the transform, it is recorded as REALTIME_ROWS_FILTERED
There is yet another metric called ROWS_WITH_ERROR which is the sum of all error counts in the 3 stages above.
Furthermore, the metric REALTIME_CONSUMPTION_EXCEPTIONS gets incremented whenever there is a transient/permanent stream exception seen during consumption.
These metrics can be used to understand why ingestion failed for a particular table partition before diving into the server logs.
Usage: StartMinion
-help : Print this message. (required=false)
-minionHost <String> : Host name for minion. (required=false)
-minionPort <int> : Port number to start the minion at. (required=false)
-zkAddress <http> : HTTP address of Zookeeper. (required=false)
-clusterName <String> : Pinot cluster name. (required=false)
-configFileName <Config File Name> : Minion Starter Config file. (required=false)
public interface PinotTaskGenerator {
/**
* Initializes the task generator.
*/
void init(ClusterInfoAccessor clusterInfoAccessor);
/**
* Returns the task type of the generator.
*/
String getTaskType();
/**
* Generates a list of tasks to schedule based on the given table configs.
*/
List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs);
/**
* Returns the timeout in milliseconds for each task, 3600000 (1 hour) by default.
*/
default long getTaskTimeoutMs() {
return JobConfig.DEFAULT_TIMEOUT_PER_TASK;
}
/**
* Returns the maximum number of concurrent tasks allowed per instance, 1 by default.
*/
default int getNumConcurrentTasksPerInstance() {
return JobConfig.DEFAULT_NUM_CONCURRENT_TASKS_PER_INSTANCE;
}
/**
* Performs necessary cleanups (e.g. remove metrics) when the controller leadership changes.
*/
default void nonLeaderCleanUp() {
}
}
public interface PinotTaskExecutorFactory {
/**
* Initializes the task executor factory.
*/
void init(MinionTaskZkMetadataManager zkMetadataManager);
/**
* Returns the task type of the executor.
*/
String getTaskType();
/**
* Creates a new task executor.
*/
PinotTaskExecutor create();
}
public interface PinotTaskExecutor {
/**
* Executes the task based on the given task config and returns the execution result.
*/
Object executeTask(PinotTaskConfig pinotTaskConfig)
throws Exception;
/**
* Tries to cancel the task.
*/
void cancel();
}
public interface MinionEventObserverFactory {
/**
* Initializes the event observer factory.
*/
void init(MinionTaskZkMetadataManager zkMetadataManager);
/**
* Returns the task type of the event observer.
*/
String getTaskType();
/**
* Creates a new task event observer.
*/
MinionEventObserver create();
}
public interface MinionEventObserver {
/**
* Invoked when a minion task starts.
*
* @param pinotTaskConfig Pinot task config
*/
void notifyTaskStart(PinotTaskConfig pinotTaskConfig);
/**
* Invoked when a minion task succeeds.
*
* @param pinotTaskConfig Pinot task config
* @param executionResult Execution result
*/
void notifyTaskSuccess(PinotTaskConfig pinotTaskConfig, @Nullable Object executionResult);
/**
* Invoked when a minion task gets cancelled.
*
* @param pinotTaskConfig Pinot task config
*/
void notifyTaskCancelled(PinotTaskConfig pinotTaskConfig);
/**
* Invoked when a minion task encounters exception.
*
* @param pinotTaskConfig Pinot task config
* @param exception Exception encountered during execution
*/
void notifyTaskError(PinotTaskConfig pinotTaskConfig, Exception exception);
}
Using "POST /cluster/configs" API on CLUSTER tab in Swagger, with this payload
{
"RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
"RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
}
The underlying ingestion still works in LOWLEVEL mode, where:
transcript-topic1 segments would be named like transcript__0__0__20250101T0000Z
transcript-topic2 segments would be named like transcript__10000__0__20250101T0000Z
When the transform pipeline sets the $INCOMPLETE_RECORD_KEY$ key in the message, it is recorded as INCOMPLETE_REALTIME_ROWS_CONSUMED, but only when the continueOnError configuration is enabled. If continueOnError is not enabled, the ingestion fails.
Index stage -> When there is a failure at this stage, the ingestion typically stops and marks the partition as ERROR.
currentOffsetsMap: Current consuming offset position per partition.
latestUpstreamOffsetMap: (Wherever applicable) Latest offset found in the upstream topic partition.
recordsLagMap: (Wherever applicable) How far behind the current record's offset/pointer is from the latest upstream record. This is calculated as the difference between the latestUpstreamOffset and currentOffset for the partition when the lag computation request is made.
recordsAvailabilityLagMap: (Wherever applicable) How soon after being ingested upstream a record was consumed by Pinot. This is calculated as the difference between the time the record was consumed and the time at which the record was ingested upstream.
A consumption rate limiter is set up for topic <topic_name> in table <tableName> with rate limit: <rate_limit> (topic rate limit: <topic_rate_limit>, partition count: <partition_count>)
$ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
$ curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption
$ curl -X POST {controllerHost}/tables/{tableName}/pauseStatus
$ curl -X POST {controllerHost}/tables/{tableName}/forceCommit
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=smallest
$ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=largest
# GET /tables/{tableName}/consumingSegmentsInfo
curl -X GET "http://<controller_url:controller_admin_port>/tables/meetupRsvp/consumingSegmentsInfo" -H "accept: application/json"
# GET /debug/tables/{tableName}
curl -X GET "http://localhost:9000/debug/tables/meetupRsvp?type=REALTIME&verbosity=1" -H "accept: application/json"
{
"_segmentToConsumingInfoMap": {
"meetupRsvp__0__0__20221019T0639Z": [
{
"serverName": "Server_192.168.0.103_7000",
"consumerState": "CONSUMING",
"lastConsumedTimestamp": 1666161593904,
"partitionToOffsetMap": { // <<-- Deprecated. See currentOffsetsMap for same info
"0": "6"
},
"partitionOffsetInfo": {
"currentOffsetsMap": {
"0": "6" // <-- Current consumer position
},
"latestUpstreamOffsetMap": {
"0": "6" // <-- Upstream latest position
},
"recordsLagMap": {
"0": "0" // <-- Lag, in terms of #records behind latest
},
"recordsAvailabilityLagMap": {
"0": "2" // <-- Lag, in terms of time
}
}
}
],
Text search support
This page talks about support for text search in Pinot.
This text index method is recommended over the experimental native text index.
Pinot supports super-fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index.
This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:
For arbitrary text data that falls into the BLOB/CLOB territory, we need more than exact matches. This often involves using regex, phrase, fuzzy queries on BLOB like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large BLOB of text using the TEXT_MATCH function, like this:
where <column_name> is the column the text index is created on and <search_expression> conforms to one of the following:
Current restrictions
Pinot supports text search with the following requirements:
The column type should be STRING.
The column should be single-valued.
Using a text index in coexistence with other Pinot indexes is not supported.
Sample Datasets
Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.
Apache Access Log
Consider the following snippet from an Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.
Let's say the following snippet of data is stored in the ACCESS_LOG_COL column in a Pinot table.
Here are some examples of search queries on this data:
Count the number of GET requests.
Count the number of POST requests that have administrator in the URL (administrator/index)
Count the number of POST requests that have a particular URL and handled by Firefox browser
Resume text
Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.
This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.
Here are some examples of search queries on this data:
Count the number of candidates that have "machine learning" and "gpu processing": This is a phrase search (more on this further in the document) where we are looking for exact match of phrases "machine learning" and "gpu processing", not necessarily in the same order in the original data.
Count the number of candidates that have "distributed systems" and either 'Java' or 'C++': This is a combination of searching for exact phrase "distributed systems" along with other terms.
Query Log
Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY_LOG_COL column in a Pinot table.
Here are some examples of search queries on this data:
Count the number of queries that have GROUP BY
Count the number of queries that have the SELECT count... pattern
Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY
Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.
A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.
The text index is an addition to the types of per-column indexes users can create in Pinot. However, a text index can only be created on a RAW column, not on a dictionary-encoded column.
Enable a text index
Enable a text index on a column in the table configuration by adding a new section with the name "fieldConfigList".
Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:
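A minimal sketch of both sections, assuming a text index on a column named SKILLS_COL:
{
  "fieldConfigList": [
    {
      "name": "SKILLS_COL",
      "encodingType": "RAW",
      "indexTypes": ["TEXT"]
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "SKILLS_COL"
    ]
  }
}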
You can configure text indexes in the following scenarios:
Adding a new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling a text index on an existing column.
When you're using a text index, add the indexed column to the noDictionaryColumns columns list to reduce unnecessary storage overhead.
For instructions on that configuration property, see the documentation.
Text index creation
Once the text index is enabled on one or more columns through the table configuration, segment generation code will automatically create the text index (per column).
Text index is supported for both offline and real-time segments.
Text parsing and tokenization
The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.
Pinot's text index is built on top of Lucene. Lucene's standard English text tokenizer generally works well for most classes of text. To suit particular user requirements, a custom text parser and tokenizer can be made configurable on a per-column text-index basis.
There is a default set of "stop words" built into Pinot's text index. This is a set of high-frequency English words that are excluded for search efficiency and index size, including:
Any occurrence of these words will be ignored by the tokenizer during index creation and search.
In some cases, users might want to customize this set. A good example is when IT (Information Technology) appears in the text and collides with the stop word "it", or when context-specific words are not informative for search. To do this, configure the words to include in, or exclude from, the default stop words in the fieldConfig:
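A sketch of such a fieldConfigList entry; the property names stopWordInclude and stopWordExclude are assumptions to verify against your Pinot version's text index reference:
{
  "fieldConfigList": [
    {
      "name": "text_col_1",          // illustrative column name
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "stopWordInclude": "incl1, incl2, incl3",  // extra words to treat as stop words
        "stopWordExclude": "it"                    // words to remove from the default stop-word set
      }
    }
  ]
}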
The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.
Writing text search queries
The TEXT_MATCH function enables using text search in SQL/PQL.
TEXT_MATCH(text_column_name, search_expression)
text_column_name - name of the column to do text search on.
search_expression - search query
You can use TEXT_MATCH function as part of queries in the WHERE clause, like this:
You can also use the TEXT_MATCH filter clause with other filter operators. For example:
You can combine multiple TEXT_MATCH filter clauses:
TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot.
Selection query which projects one or more columns
Users can also include the text column name in the select list.
Aggregation query
The search expression (the second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.
Phrase query
This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.
The following example reuses the earlier resume text data containing 14 documents to walk through queries. In this context, "document" means the column value. The data is stored in the SKILLS_COL column and we have created a text index on this column.
This example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "Distributed systems":
The search expression is '\"Distributed systems\"'
The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
'\"<your phrase>\"'
The above query will match the following documents:
But it won't match the following document:
This is because the phrase query looks for the phrase occurring in the original document "as is". The terms specified by the user in the phrase must appear in exactly the same order in the original document for it to be considered a match.
NOTE: Matching is always done in a case-insensitive manner.
The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "query processing":
The above query will match the following documents:
Term query
Term queries are used to search for individual terms.
This example will query the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.
As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.
Composite query using Boolean operators
The Boolean operators AND and OR are supported and can be used to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner.
This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrases "machine learning" and "tensor flow". This combines two phrases using the AND Boolean operator.
The above query will match the following documents:
This example queries the SKILLS_COL column to look for documents where each document MUST contain the phrase "machine learning" and the terms 'gpu' and 'python'. This combines a phrase and two terms using Boolean operators.
The above query will match the following documents:
When using Boolean operators to combine term(s) and phrase(s) or both, note that:
The matching document can contain the terms and phrases in any order.
The matching document may not have the terms adjacent to each other (if this is needed, use appropriate phrase query).
Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using AND operator in the search expression, the OR operator is used by default:
This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:
phrase "distributed systems" OR
term 'java' OR
term 'C++'.
Grouping using parentheses is supported:
This example queries the SKILLS_COL column to look for documents where each document MUST contain
phrase "distributed systems" AND
at least one of the terms Java or C++
Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND is used to combine this with phrase "distributed systems"
Prefix query
Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.
This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams etc
The above query will match the following documents:
Regular Expression Query
Phrase and term queries work on the fundamental logic of looking up the terms in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized, and individual "indexable" terms are extracted. These terms are inserted into the index.
Based on the nature of the original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.
Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.
Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).
The above query will match any text document containing "exception".
Phrase search with wildcard term matching
Phrase search with wildcard and prefix term matching can match a pattern like "pache pino" against the text "Apache Pinot" directly. This kind of query is very common in use cases like log search, where users need to search for substrings across term boundaries in long text. To enable such searches (which can be more costly because, by default, Lucene does not allow * at the start of a pattern in order to avoid expensive term matching), add a new config key to the column's text index config:
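A sketch of the field config property expected to enable this; the key name enablePrefixSuffixMatchingInPhraseQueries is an assumption to verify against your Pinot version:
{
  "fieldConfigList": [
    {
      "name": "SKILLS_COL",
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "enablePrefixSuffixMatchingInPhraseQueries": "true"  // assumed key name
      }
    }
  ]
}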
With this config enabled, you can perform a phrase wildcard search using syntax like the following
to match the string "Apache Pinot" in SKILLS_COL. Boolean expressions like 'pache pino AND apche luce' are also supported.
Deciding Query Types
Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.
The key thing to remember is that phrases should be used when the order of terms in the document is important and when separating the phrase into individual terms doesn't make sense from end user's perspective.
An example would be phrase "machine learning".
However, if we are searching for documents matching the terms Java and C++, using the phrase query "Java C++" will actually return partial results (possibly empty), since we are now relying on the user having specified these skills in exactly the same order (adjacent to each other) in the resume text.
A term query using the Boolean AND operator is more appropriate for such cases.
Text Index Tuning
To improve Lucene index creation time, some configs have been provided. The field config properties luceneUseCompoundFile and luceneMaxBufferSizeMB can provide faster index writing but may increase file descriptor usage and/or memory pressure.
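For example, these properties might be set on the text-indexed column's field config roughly as follows (the values shown are illustrative, not recommendations):
{
  "fieldConfigList": [
    {
      "name": "text_col_1",          // illustrative column name
      "encodingType": "RAW",
      "indexTypes": ["TEXT"],
      "properties": {
        "luceneUseCompoundFile": "false",  // skip compound-file packing for faster writes
        "luceneMaxBufferSizeMB": "128"     // larger in-memory buffer before flushing
      }
    }
  ]
}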
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'GET')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index AND firefox')
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "gpu processing"')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1545436800000 AND 1553212800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1537228800000 AND 1537660800000 GROUP BY dimensionCol3 TOP 2500
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1561366800000 AND 1561370399999 AND dimensionCol3 = 2019062409 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563807600000 AND 1563811199999 AND dimensionCol3 = 2019072215 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563811200000 AND 1563814799999 AND dimensionCol3 = 2019072216 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1566327600000 AND 1566329400000 AND dimensionCol3 = 2019082019 LIMIT 10000
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560834000000 AND 1560837599999 AND dimensionCol3 = 2019061805 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560870000000 AND 1560871800000 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560871800001 AND 1560873599999 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560873600000 AND 1560877199999 AND dimensionCol3 = 2019061816 LIMIT 0
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(QUERY_LOG_COL, '"group by"')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(QUERY_LOG_COL, '"select count"')
SELECT COUNT(*)
FROM MyTable
WHERE TEXT_MATCH(QUERY_LOG_COL, '"timestamp between" AND "group by"')
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...)
SELECT * FROM Foo WHERE TEXT_MATCH(...)
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000 AND some_other_column_2 < 100000
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(text_col_1, ....) AND TEXT_MATCH(text_col_2, ...)
Java, C++, worked on open source projects, coursera machine learning
Machine learning, Tensor flow, Java, Stanford university,
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Distributed systems"')
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
Distributed data processing, systems design experience
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"query processing"')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, 'Java')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "Tensor Flow"')
Machine learning, Tensor flow, Java, Stanford university,
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND gpu AND python')
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" Java C++')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')
SELECT SKILLS_COL
FROM MyTable
WHERE TEXT_MATCH(SKILLS_COL, 'stream*')
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
SELECT SKILLS_COL
FROM MyTable
WHERE text_match(SKILLS_COL, '/.*Exception/')
SELECT SKILLS_COL
FROM MyTable
WHERE text_match(SKILLS_COL, '*pache pino*')
TEXT_MATCH(column, '"machine learning"')
TEXT_MATCH(column, '"Java C++"')
TEXT_MATCH(column, 'Java AND C++')
JSON index
This page describes configuring the JSON index for Apache Pinot.
The JSON index can be applied to JSON string columns to accelerate value lookups and filtering for the column.
When to use JSON index
JSON strings can be used to represent arrays, maps, and nested fields without forcing a fixed schema. While JSON strings are flexible, filtering on JSON string columns is expensive, so consider the use case.
Suppose we have some JSON records similar to the following sample record stored in the person column:
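An illustrative record consistent with the queries used later on this page (only the field paths matter here; the specific values are assumptions):
{
  "name": "adam",
  "age": 30,
  "phone": "111-111-1111",
  "skills": ["english", "programming"],
  "addresses": [
    {"number": 112, "street": "main st", "country": "us"},
    {"number": 2, "street": "second st", "country": "ca"}
  ]
}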
Without an index, to look up the key and filter records based on the value, Pinot must scan and reconstruct the JSON object from the JSON string for every record, look up the key and then compare the value.
For example, in order to find all persons whose name is "adam", the query will look like:
The JSON index is designed to accelerate the filtering on JSON string columns without scanning and reconstructing all the JSON objects.
Enable and configure a JSON index
To enable the JSON index, you can configure the following options in the table configuration:
Recommended way to configure
The recommended way to configure a JSON index is in the fieldConfigList.indexes object, within the json key.
All options are optional, so the following is a valid configuration that uses the default parameter values:
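A minimal sketch, assuming the JSON column is named person; the empty json object keeps every parameter at its default:
{
  "fieldConfigList": [
    {
      "name": "person",        // JSON string column to index
      "indexes": {
        "json": {}             // all JSON index options left at their defaults
      }
    }
  ],
  ...
}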
Deprecated ways to configure JSON indexes
There are two older, deprecated ways to configure the index, both in the tableIndexConfig section of the table config.
The first uses the same JSON options explained above, but defined under tableIndexConfig.jsonIndexConfigs.<column name>; as in the previous case, all parameters are optional.
The second does not support configuring any parameters: simply add the name of the column to tableIndexConfig.jsonIndexColumns, as in this example:
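Sketches of the two deprecated forms for the same person column; the first nests JSON index options under jsonIndexConfigs, the second simply lists the column:
{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "person": {
        "maxLevels": 2           // illustrative option; any of the options above may be set
      }
    }
  }
}
{
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "person"
    ]
  }
}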
Example:
With the following JSON document:
Using the default setting, we will flatten the document into the following records:
With maxValueLength set to 9:
With maxLevels set to 1:
With maxLevels set to 2:
With excludeArray set to true:
With disableCrossArrayUnnest set to true:
When cross-array un-nesting is disabled, the number of documents produced during JSON flattening is the sum of all array sizes, e.g. 2+2 = 4 in the example above.
With disableCrossArrayUnnest set to false:
When cross-array un-nesting is enabled, the number of documents produced during JSON flattening is the product of all array sizes, e.g. 2*2 = 4 in the example above. If the JSON contains multiple large nested arrays, it might be necessary to disable cross-array un-nesting (disableCrossArrayUnnest=true) to avoid hitting the 100k flattened-document limit and triggering the 'Got too many combinations' error.
With includePaths set to ["$.name", "$.addresses[*].country"]:
With excludePaths set to ["$.age", "$.addresses[*].number"]:
With excludeFields set to ["age", "street"]:
With indexPaths set to ["*", "address..country"]:
With skipInvalidJson set to true, if we corrupt the original JSON, e.g. to
then flattening will produce:
Note that the JSON index can only be applied to STRING/JSON columns whose values are JSON strings.
To reduce unnecessary storage overhead when using a JSON index, we recommend that you add the indexed column to the noDictionaryColumns columns list.
For instructions on that configuration property, see the documentation.
How to use the JSON index
The JSON index can be used via the JSON_MATCH predicate for filtering: JSON_MATCH(<column>, '<filterExpression>'). For example, to find every entry with the name "adam":
Note that the quotes within the filter expression need to be escaped.
The JSON index can also be used via the JSON_EXTRACT_INDEX function for value extraction (optionally with filtering): JSON_EXTRACT_INDEX(<column>, '<jsonPath>', ['resultsType'], ['filter']). For example, to extract every value for path $.name when the path $.id is less than 10:
More in-depth examples can be found in the function reference.
Supported filter expressions
Simple key lookup
Find all persons whose name is "adam":
Chained key lookup
Find all persons who have an address (one of the addresses) with number 112:
Find all persons who have at least one address that is not in the US:
Regex based lookup
Find all persons who have an address (one of the addresses) where the street contains the term 'st':
Range lookup
Find all persons whose age is greater than 18:
Nested filter expression
Find all persons whose name is "adam" and also have an address (one of the addresses) with number 112:
NOT IN and != can't be used in nested filter expressions in Pinot versions older than 1.2.0. Note that IS NULL cannot be used in nested filter expressions currently.
Array access
Find all persons whose first address has number 112:
Since the JSON index works on flattened JSON documents, if cross-array un-nesting is disabled (disableCrossArrayUnnest = true), then querying more than one array in a single JSON_MATCH function call returns an empty result, e.g.
In such cases, the expression should be split into multiple JSON_MATCH calls, e.g.
Existence check
Find all persons who have a phone field within the JSON:
Find all persons whose first address does not contain floor field within the JSON:
JSON context is maintained
The JSON context is maintained for object elements within an array, meaning the filter won't cross-match different objects in the array.
To find all persons who live on "main st" in "ca":
This query won't match "adam" because none of his addresses matches both the street and the country.
If you don't want JSON context, use multiple separate JSON_MATCH predicates. For example, to find all persons who have addresses on "main st" and have addresses in "ca" (matches need not have the same address):
This query will match "adam" because one of his addresses matches the street and another one matches the country.
The array index is maintained as a separate entry within the element, so in order to query different elements within an array, multiple JSON_MATCH predicates are required. For example, to find all persons who have first address on "main st" and second address on "second st":
Supported JSON values
Object
See examples above.
Array
To find the records with array element "item1" in "arrayCol":
To find the records with second array element "item2" in "arrayCol":
Value
To find the records with value 123 in "valueCol":
Null
To find the records with null in "nullableCol":
Limitations
The key (left-hand side) of the filter expression must be the leaf level of the JSON object, for example, "$.addresses[*]"='main st' won't work.
0.9.0
Summary
This release introduces a new feature, Segment Merge and Rollup, to simplify users' day-to-day operational work. A new metrics plugin is added to support Dropwizard. As usual, there is new functionality and many UI and performance improvements.
The release was cut from the following commit: 13c9ee9 and the following cherry-picks: 668b5e0, ee887b9
Support Segment Merge and Roll-up
LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This was leading to slow operations in Helix/ZooKeeper, long-running queries due to having too many tasks to process, and greater space usage because of a lack of compression.
To solve this problem, they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.
At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.
Major Changes:
Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor ()
Merge/Rollup task scheduler for offline tables. ()
Fix MergeRollupTask uploading segments not updating their metadata ()
UI Improvement
This release also sees improvements to Pinot’s query console UI.
Cmd+Enter shortcut to run query in query console ()
Showing tooltip in SQL Editor ()
Make the SQL Editor box expandable ()
SQL Improvements
There have also been improvements and additions to Pinot’s SQL implementation.
New functions:
IN ()
LASTWITHTIME ()
ID_SET on MV columns ()
New predicates are supported:
LIKE()
REGEXP_EXTRACT()
FILTER()
Query compatibility improvements:
Infer data type for Literal ()
Support logical identifier in predicate ()
Support JSON queries with top-level array path expression. ()
Performance Improvements
This release contains many performance improvements that you may notice in your day-to-day queries. Thanks to all the great contributions listed below:
Reduce the disk usage for segment conversion task ()
Simplify association between Java Class and PinotDataType for faster mapping ()
Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation ()
Other Notable New Features and Changes
Human Readable Controller Configs ()
Add the support of geoToH3 function ()
Add Apache Pulsar as Pinot Plugin () ()
Major Bug fixes
Fix null pointer exception for non-existent metric columns in schema for JDBC driver ()
Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD ()
includePaths (Set<String>, default: null, i.e. include all paths): Only include the given paths, e.g. "$.a.b", "$.a.c[*]" (mutually exclusive with excludePaths). Paths under the included paths will be included, e.g. "$.a.b.c" will be included when "$.a.b" is configured to be included.
excludePaths (Set<String>, default: null, i.e. include all paths): Exclude the given paths, e.g. "$.a.b", "$.a.c[*]" (mutually exclusive with includePaths). Paths under the excluded paths will also be excluded, e.g. "$.a.b.c" will be excluded when "$.a.b" is configured to be excluded.
excludeFields (Set<String>, default: null, i.e. include all fields): Exclude the given fields, e.g. "b", "c", even if they are under the included paths.
indexPaths (Set<String>, default: null, which is equivalent to **, i.e. index all fields): Index the given paths, e.g. *.*, a.**. Paths matching the indexed paths will be indexed, e.g. a.** will index everything whose first layer is "a", and *.* will index everything with maxLevels=2. This config can work together with other configs, e.g. includePaths, excludePaths, and maxLevels, but usually does not have to, because it should be flexible enough to cover any scenario.
maxValueLength (int, default: 0, i.e. disabled): If the value of a JSON node (not the whole document) is longer than the given value, replace it with $SKIPPED$ before indexing.
skipInvalidJson (boolean, default: false, i.e. disabled): If set, while adding JSON to the index, instead of throwing an exception, replace ill-formed JSON with an empty key/path and a $SKIPPED$ value.
maxLevels (int, default: -1, i.e. unlimited): Max levels to flatten the JSON object (an array is also counted as one level).
excludeArray (boolean, default: false, i.e. include arrays): Whether to exclude arrays when flattening the object.
disableCrossArrayUnnest (boolean): Whether to not unnest multiple arrays (unique combinations of all elements in those arrays). If the document contains two arrays holding, respectively, M and N elements, then flattening produces M*N documents. If the number of such combinations reaches 100k, an error with the "Got too many combinations" message is thrown.
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')
SELECT jsonextractindex(repo, '$.name', 'STRING', 'dummyValue', '"$.id" < 10')
FROM mytable
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].number"=112')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country" != ''us''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, 'REGEXP_LIKE("$.addresses[*].street", ''.*st.*'')')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.age" > 18')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam'' AND "$.addresses[*].number"=112')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].number"=112')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us'' AND "$.skills[*]"=''english''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us''')
AND JSON_MATCH(person, '"$.skills[*]"=''english''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.phone" IS NOT NULL')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].floor" IS NULL')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st'' AND "$.addresses[*].country"=''ca''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st''')
AND JSON_MATCH(person, '"$.addresses[*].country"=''ca''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].street"=''main st''')
AND JSON_MATCH(person, '"$.addresses[1].street"=''second st''')
["item1", "item2", "item3"]
SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[*]"=''item1''')
SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[1]"=''item2''')
123
1.23
"Hello World"
SELECT ...
FROM mytable
WHERE JSON_MATCH(valueCol, '"$"=123')
null
SELECT ...
FROM mytable
WHERE JSON_MATCH(nullableCol, '"$" IS NULL')
Star-tree index
This page describes the indexing techniques available in Apache Pinot.
In this page you will learn what a star-tree index is and gain a conceptual understanding of how one works.
Unlike other index techniques which work on a single column, the star-tree index is built on multiple columns and utilizes pre-aggregated results to significantly reduce the number of values to be processed, resulting in improved query performance.
One of the biggest challenges in real-time OLAP systems is achieving and maintaining tight SLAs on latency and throughput on large data sets. Existing techniques such as sorted indexes or inverted indexes help improve query latencies, but speed-ups are still limited by the number of documents that need to be processed to compute results. On the other hand, pre-aggregating the results ensures a constant upper bound on query latencies, but can lead to storage space explosion.
Use the star-tree index to utilize pre-aggregated documents to achieve both low query latencies and efficient use of storage space for aggregation and group-by queries.
Existing solutions
Consider the following data set, which is used here as an example to discuss these indexes:
Country   Browser   Locale   Impressions
CA        Chrome    en       400
CA        Firefox   fr       200
MX        Safari    es
Sorted index
In this approach, data is sorted on a primary key, which is likely to appear as filter in most queries in the query set.
This reduces the time to search the documents for a given primary key value from linear scan O(n) to binary search O(logn), and also keeps good locality for the documents selected.
While this is a significant improvement over linear scan, there are still a few issues with this approach:
While sorting on one column does not require additional space, sorting on additional columns requires additional storage space to re-index the records for the various sort orders.
While search time is reduced from O(n) to O(logn), overall latency is still a function of the total number of documents that need to be processed to answer a query.
Inverted index
In this approach, for each value of a given column, we maintain a list of document id’s where this value appears.
Below are the inverted indexes for columns ‘Browser’ and ‘Locale’ for our example data set:
| Browser | Doc Id |
| --- | --- |
| Firefox | 1,5,6 |
| Chrome | 0,4 |
| Safari | 2,3 |

| Locale | Doc Id |
| --- | --- |
| en | 0,3,4,6 |
| es | 2,5 |
| fr | 1 |
For example, if we want to get all the documents where ‘Browser’ is ‘Firefox’, we can look up the inverted index for ‘Browser’ and identify that it appears in documents [1, 5, 6].
Using an inverted index, we can reduce the search time to constant time O(1). The query latency, however, is still a function of the selectivity of the query: it increases with the number of documents that need to be processed to answer the query.
Pre-aggregation
In this technique, we pre-compute the answer for a given query set upfront.
In the example below, we have pre-aggregated the total impressions for each country:
| Country | Impressions |
| --- | --- |
| CA | 600 |
| MX | 400 |
| USA | 1200 |
With this approach, answering queries about total impressions for a country becomes a simple value lookup, because we have eliminated the need to process a large number of documents. However, answering queries with multiple predicates requires pre-aggregating for various combinations of different dimensions, which leads to an exponential increase in storage space.
Star-tree solution
On one end of the spectrum we have indexing techniques that improve search times with a limited increase in space, but don't guarantee a hard upper bound on query latencies. On the other end of the spectrum, we have pre-aggregation techniques that offer a hard upper bound on query latencies, but suffer from an exponential explosion of storage space.
The star-tree data structure offers a configurable trade-off between space and time and lets us achieve a hard upper bound for query latencies for a given use case. The following sections cover the star-tree data structure, and explain how Pinot uses this structure to achieve low latencies with high throughput.
Definitions
Tree structure
The star-tree index stores data in a structure that consists of the following properties:
Star-tree index structure
Root node (Orange): Single root node, from which the rest of the tree can be traversed.
Leaf node (Blue): A leaf node can contain at most T records, where T is configurable.
Non-leaf node (Green): Nodes with more than T records are further split into children nodes.
Star node (Yellow): Non-leaf nodes can also have a special child node called the star node. This node contains the pre-aggregated records after removing the dimension on which the data was split for this level.
Dimensions split order ([D1, D2]): Nodes at a given level in the tree are split into children nodes on all values of a particular dimension. The dimensions split order is an ordered list of dimensions that is used to determine the dimension to split on for a given level in the tree.
Node properties
The properties stored in each node are as follows:
Dimension: The dimension that the node is split on
Start/End Document Id: The range of documents this node points to
Aggregated Document Id: One single document that is the aggregation result of all documents pointed by this node
Index generation
The star-tree index is generated in the following steps:
The data is first projected as per the dimensionsSplitOrder. Only the dimensions from the split order are retained; the others are dropped. For each unique combination of retained dimensions, metrics are aggregated per the configuration. The aggregated documents are written to a file and serve as the initial star-tree documents (separate from the original documents).
Sort the star-tree documents based on the dimensionsSplitOrder. It is primary-sorted on the first dimension in this list, and then secondary sorted on the rest of the dimensions based on their order in the list. Each node in the tree points to a range in the sorted documents.
The tree structure can be created recursively (starting at root node) as follows:
If a node has more than T records, it is split into multiple children nodes, one for each value of the dimension in the split order corresponding to current level in the tree.
A star node can be created (per configuration) for the current node, by dropping the dimension being split on, and aggregating the metrics for rows containing dimensions with identical values. These aggregated documents are appended to the end of the star-tree documents.
If there is only one value for the current dimension, a star node won’t be created because the documents under the star node would be identical to those under the single child node.
The above step is repeated recursively until there are no more nodes to split.
Multiple star-trees can be generated based on different configurations (dimensionsSplitOrder, aggregations, T)
Aggregation
Aggregation is configured as a pair of aggregation functions and the column to apply the aggregation.
All aggregation functions that have a bounded-size intermediate result are supported.
Supported functions
COUNT
MIN
MAX
SUM
SUM_PRECISION
The maximum precision can be optionally configured in functionParameters using the key precision. For example: {"precision": 20}.
AVG
MIN_MAX_RANGE
PERCENTILE_EST
PERCENTILE_RAW_EST
PERCENTILE_TDIGEST
The compression factor for the TDigest histogram can be optionally configured in functionParameters using the key compressionFactor. For example: {"compressionFactor": 200}. If not configured, the default value of 100 will be used.
PERCENTILE_RAW_TDIGEST
The compression factor for the TDigest histogram can be optionally configured in functionParameters using the key compressionFactor. For example: {"compressionFactor": 200}. If not configured, the default value of 100 will be used.
DISTINCT_COUNT_BITMAP
NOTE: The intermediate result RoaringBitmap is not bounded in size; use carefully on high-cardinality columns.
DISTINCT_COUNT_HLL
The log2m value for the HyperLogLog structure can be optionally configured in functionParameters , for example: {"log2m": 16}. If not configured, the default value of 8 will be used. Remember that a larger log2m value leads to better accuracy but also a larger memory footprint.
DISTINCT_COUNT_RAW_HLL
The log2m value for the HyperLogLog structure can be optionally configured in functionParameters , for example: {"log2m": 16}. If not configured, the default value of 8 will be used. Remember that a larger log2m value leads to better accuracy but also a larger memory footprint.
DISTINCT_COUNT_HLL_PLUS
The p (precision value of normal set) and sp (precision value of sparse set) values for the HyperLogLogPlus structure can be optionally configured in functionParameters, for example: {"p": 16, "sp": 32}. If not configured, p will have the default value of 14
DISTINCT_COUNT_RAW_HLL_PLUS
The p (precision value of normal set) and sp (precision value of sparse set) values for the HyperLogLogPlus structure can be optionally configured in functionParameters, for example: {"p": 16, "sp": 32}. If not configured, p will have the default value of 14
DISTINCT_COUNT_THETA_SKETCH
The nominalEntries value for the Theta Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_RAW_THETA_SKETCH
The nominalEntries value for the Theta Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
SUM_VALUES_INTEGER_SUM_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
AVG_VALUE_INTEGER_SUM_TUPLE_SKETCH
The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index built with {"nominalEntries": 8192} can serve queries with nominalEntries of 8192 or less.
DISTINCT_COUNT_CPC_SKETCH
The lgK value for the CPC Sketch can be optionally configured in functionParameters, for example: {"lgK": 13}. If not configured, the default value of 12 will be used. Note that the nominalEntries provided at query time should be 2 ^ lgK in order for a star-tree index to be used. For instance, a star-tree index built with {"lgK": 13} can only serve queries specifying nominalEntries of 8192 (2^13).
DISTINCT_COUNT_RAW_CPC_SKETCH
DISTINCT_COUNT_ULL
The p value (precision parameter) for the UltraLogLog structure can be optionally configured in functionParameters, for example: {"p": 20}. If not configured, the default value of 12 will be used.
DISTINCT_COUNT_RAW_ULL
The p value (precision parameter) for the UltraLogLog structure can be optionally configured in functionParameters, for example: {"p": 20}. If not configured, the default value of 12 will be used.
Unsupported functions
DISTINCT_COUNT
Intermediate result Set is unbounded.
SEGMENT_PARTITIONED_DISTINCT_COUNT
Intermediate result Set is unbounded.
PERCENTILE
Intermediate result List is unbounded.
Functions to be supported
ST_UNION
Index generation configuration
Multiple index generation configurations can be provided to generate multiple star-trees. Each configuration should contain the following properties:
Property
Description
dimensionsSplitOrder
An ordered list of dimension names can be specified to configure the split order. Only the dimensions in this list are reserved in the aggregated documents. The nodes will be split based on the order of this list. For example, split at level i is performed on the values of dimension at index i in the list.
- The star-tree dimension does not have to be a dimension column in the table, it can also be time column, date-time column, or metric column if necessary.
- The star-tree dimension column should be dictionary encoded in order to generate the star-tree index.
- All columns in the filter and group-by clause of a query should be included in this list in order to use the star-tree index.
skipStarNodeCreationForDimensions
(Optional, default empty): A list of dimension names for which to not create the Star-Node.
functionColumnPairs
A list of aggregation function and column pairs (split by double underscore “__”). E.g. SUM__Impressions (SUM of column Impressions) or COUNT__*.
aggregationConfigs
See the AggregationConfigs section below.
maxLeafRecords
(Optional, default 10000): The threshold T to determine whether to further split each node.
`functionColumnPairs` and `aggregationConfigs` are interchangeable. Consider using `aggregationConfigs` since it supports additional parameters like compression.
AggregationConfigs
All aggregations of a query should be included in `aggregationConfigs` or in `functionColumnPairs` in order to use the star-tree index.
| Property | Description |
| --- | --- |
| columnName | (Required) Name of the column to aggregate. The column can be either dictionary encoded or raw. |
| aggregationFunction | (Required) Name of the aggregation function to use. |
| compressionCodec | (Optional, default PASS_THROUGH, introduced in release 1.1.0) Used to configure the compression enabled on the star-tree index. Useful when aggregating on columns that contain big values, for example a BYTES column containing serialized HLL counters used to calculate DISTINCTCOUNTHLL. In this case setting "compressionCodec": "LZ4" can significantly reduce the space used by the index. Equivalent to compressionCodec in the forward index config. |
| deriveNumDocsPerChunk | (Optional, introduced in release 1.2.0) Equivalent to deriveNumDocsPerChunk in the forward index config. |
| indexVersion | (Optional, introduced in release 1.2.0) Equivalent to rawIndexWriterVersion in the forward index config. |
| targetMaxChunkSize | (Optional, introduced in release 1.2.0) Equivalent to targetMaxChunkSize in the forward index config. |
| targetDocsPerChunk | (Optional, introduced in release 1.2.0) Equivalent to targetDocsPerChunk in the forward index config. |
| functionParameters | (Optional) A configuration map used to pass in additional configurations to the aggregation function. For example, on DISTINCTCOUNTHLL, this could look like {"log2m": 16} in order to build the star-tree index using DISTINCTCOUNTHLL with a non-default value for log2m. Note that the index will only be used for queries using the same value for log2m with DISTINCTCOUNTHLL. |
Default index generation configuration
A default star-tree index can be added to a segment by using the boolean config enableDefaultStarTree under the tableIndexConfig.
A default star-tree will have the following configuration:
All dictionary-encoded single-value dimensions with cardinality smaller or equal to a threshold (10000) will be included in the dimensionsSplitOrder, sorted by their cardinality in descending order.
All dictionary-encoded Time/DateTime columns will be appended to the dimensionsSplitOrder following the dimensions, sorted by their cardinality in descending order. Here we assume that time columns will be included in most queries as the range filter column and/or the group by column, so for better performance, we always include them as the last elements in the dimensionsSplitOrder.
Include COUNT(*) and SUM for all numeric metrics in the functionColumnPairs.
Use default maxLeafRecords (10000).
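As a minimal sketch, enabling the default star-tree in the table config might look like the following; the defaults described above are then derived automatically:

```json
{
  "tableIndexConfig": {
    "enableDefaultStarTree": true
  }
}
```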
Example
For our example data set, in order to efficiently answer aggregation queries that filter or group on Country, Browser, and Locale and sum up Impressions, we may configure a star-tree index on those columns.
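A sketch of such a configuration, listed under tableIndexConfig.starTreeIndexConfigs and using the properties introduced above; the dimension names come from the example data set, so treat it as illustrative rather than the exact original example:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["Country", "Browser", "Locale"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["SUM__Impressions"],
        "maxLeafRecords": 1
      }
    ]
  }
}
```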
Alternatively, you can use aggregationConfigs instead of functionColumnPairs and enable compression on the aggregation:
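A corresponding sketch with LZ4 compression enabled on the aggregated column (again illustrative, using the aggregationConfigs properties described above):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["Country", "Browser", "Locale"],
        "skipStarNodeCreationForDimensions": [],
        "aggregationConfigs": [
          {
            "columnName": "Impressions",
            "aggregationFunction": "SUM",
            "compressionCodec": "LZ4"
          }
        ],
        "maxLeafRecords": 1
      }
    ]
  }
}
```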
Note: In the example configs above, maxLeafRecords is set to 1 so that all of the dimension combinations are pre-aggregated, for clarity in the visual below.
The resulting star-tree and documents will look something like the following:
Tree structure
The values in the parentheses are the aggregated sum of Impressions for all the documents under the node.
Star-tree documents
| Country | Browser | Locale | SUM__Impressions |
| --- | --- | --- | --- |
| CA | Chrome | en | 400 |
| CA | Firefox | fr | 200 |
| MX | Safari | en | |
Query execution
For query execution, the idea is to first check metadata to determine whether the query can be solved with the star-tree documents, then traverse the Star-Tree to identify documents that satisfy all the predicates. After applying any remaining predicates that were missed while traversing the star-tree to the identified documents, apply aggregation/group-by on the qualified documents.
The algorithm to traverse the tree can be described as follows:
Start from root node.
For each level, what child node(s) to select depends on whether there are any predicates/group-by on the split dimension for the level in the query.
If there is no predicate or group-by on the split dimension, select the Star-Node if one exists, or all child nodes to traverse further.
If there are predicate(s) on the split dimension, select the child node(s) that satisfy the predicate(s).
If there is no predicate, but there is a group-by on the split dimension, select all child nodes except Star-Node.
Recursively repeat the previous step until all leaf nodes are reached, or all predicates are satisfied.
Collect all the documents pointed by the selected nodes.
If all predicates and group-by's are satisfied, pick the single aggregated document from each selected node.
Otherwise, collect all the documents in the document range from each selected node.
Predicates
Supported Predicates
EQ (=)
NOT EQ (!=)
IN
NOT IN
RANGE (>, >=, <, <=, BETWEEN)
AND
Unsupported Predicates
REGEXP_LIKE: It is intentionally left unsupported because it requires scanning the entire dictionary.
IS NULL: Currently NULL value info is not stored in star-tree index, and the dimension will be indexed as default value. A workaround is to do col = <default> instead.
IS NOT NULL: Same as IS NULL. A workaround is to do col != <default>.
Limited Support Predicates
OR
It can be applied to predicates on the same dimension, e.g. WHERE d1 < 10 OR d1 > 50
It CANNOT be applied to predicates on multiple dimensions, because the star-tree index would double count with the pre-aggregated results.
NOT (Added since 1.2.0)
It can be applied to a simple predicate, or to another NOT.
It CANNOT be applied on top of AND/OR, because the star-tree index would double count with the pre-aggregated results.
If a transform is applied to a column that is in the dimension split order (which should include all columns used as a predicate or group-by column in the target queries) and the transformed expression is used in a group-by, then the star-tree index is applied automatically. If the transform is applied to a column used in a predicate (WHERE clause), then the star-tree index won't apply.
For example, if a query contains round(colA, 600) AS roundedValue ... GROUP BY roundedValue and colA is included in dimensionsSplitOrder, then Pinot will use the pre-aggregated records to first scan the matching records and then apply the round() transform to derive roundedValue.
Apache Pinot 0.11.0 introduces many new features that extend its query capabilities: the multi-stage query engine enables Pinot to do distributed joins, and more SQL syntax (DML support), query functions, and indexes (text index, timestamp index) are supported for new use cases. As always, there are more integrations with other systems (e.g., Spark 3, Flink).
Note: there is a major upgrade of Apache Helix to 1.0.4, so make sure you upgrade the system in the following order:
The new multi-stage query engine (a.k.a V2 query engine) is designed to support more complex SQL semantics such as JOIN, OVER window, MATCH_RECOGNIZE and eventually, make Pinot support closer to full ANSI SQL semantics.
More to read:
Pause Stream Consumption on Apache Pinot
Pinot operators can pause real-time consumption of events while queries are being executed, and then resume consumption when ready to do so again.
More to read:
Gap-filling function
The gapfilling functions allow users to interpolate data and perform powerful aggregations and data processing over time series data.
More to read:
Add support for Spark 3.x ()
A long-awaited feature: segment generation on Spark 3.x.
Add Flink Pinot connector ()
Similar to the Spark Pinot connector, this allows Flink users to dump data from the Flink application to Pinot.
Show running queries and cancel query by id ()
This feature allows finer-grained control over Pinot queries.
Timestamp Index ()
This allows users to have better query performance on the timestamp column for lower granularity. See:
Native Text Indices ()
Want to search text in real time? The new text indexing engine in Pinot supports the following capabilities:
New operator: LIKE
New operator: CONTAINS
Native text index, built from the ground up, focusing on Pinot’s time series use cases and utilizing existing Pinot indices and structures (inverted index, bitmap storage).
Real Time Text Index
Read more:
Adding DML definition and parse SQL InsertFile ()
Now you can use INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]* to load data into Pinot from a file using Minion. See:
Deduplication ()
This feature supports enabling deduplication for real-time tables, via a top-level table config. At a high level, primaryKey (as defined in the table schema) hashes are stored into in-memory data structures, and each incoming row is validated against it. Duplicate rows are dropped.
The expectation while using this feature is for the stream to be partitioned by the primary key, strictReplicaGroup routing to be enabled, and the configured stream consumer type to be low level. These requirements are therefore mandated via table config API's input validations.
Functions support and changes:
Add support for functions arrayConcatLong, arrayConcatFloat, arrayConcatDouble ()
Add support for regexpReplace scalar function ()
Add support for Base64 Encode/Decode Scalar Functions ()
The full list of features introduced in this release
add query cancel APIs on controller backed by those on brokers ()
Add an option to search input files recursively in ingestion job. The default is set to true to be backward compatible. ()
Adding endpoint to download local log files for each component ()
Vulnerability fixes
Pinot has resolved all the high-severity vulnerability issues:
Add a new workflow to check vulnerabilities using trivy ()
Disable Groovy function by default ()
Upgrade netty due to security vulnerability ()
Bug fixes
Nested arrays and map not handled correctly for complex types ()
Fix empty data block not returning schema ()
Allow mvn build with development webpack; fix instances default value ()
Optimize LIKE to regexp conversion to not include unnecessary ^.* and .*$ (#8893)
Stream ingestion with Upsert
Upsert support in Apache Pinot.
Pinot provides native support of upserts during real-time ingestion. There are scenarios where records need modifications, such as correcting a ride fare or updating a delivery status.
Partial upserts are convenient as you only need to specify the columns where values change, and you ignore the rest.
Overview of upserts in Pinot
See an overview of how upserts work in Pinot 1.0.
To update a record, you need a primary key to uniquely identify the record. To define a primary key, add the field primaryKeyColumns to the schema definition. For example, the schema definition of UpsertMeetupRSVP in the quick start example has this definition.
Note this field expects a list of columns, as the primary key can be a composite.
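The schema snippet itself is not reproduced above; as a minimal sketch (assuming the quick start's event_id column serves as the primary key), the relevant part of the schema looks like:

```json
{
  "primaryKeyColumns": ["event_id"]
}
```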
When two records of the same primary key are ingested, the record with the greater comparison value (timeColumn by default) is used. When records have the same primary key and event time, then the order is not determined. In most cases, the later ingested record will be used, but this may not be true in cases where the table has a column to sort by.
Partition the input stream by the primary key
An important requirement for the Pinot upsert table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the send API. If the original stream is not partitioned, then a streaming processing job (such as with Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
Additionally, if using segmentPartitionConfig to leverage broker segment pruning, it's important to ensure that the partition function used matches on both the Kafka producer side and the Pinot side. In Kafka, the default for the Java client is a 32-bit murmur2 hash, while for all other languages, such as Python, it is CRC32 (Cyclic Redundancy Check, 32-bit).
Enable upsert in the table configurations
To enable upsert, add the following settings to the table config.
Upsert modes
Full upsert
The upsert mode defaults to FULL. FULL upsert means that a new record will completely replace the older record if they have the same primary key. Example config:
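A minimal sketch of the relevant upsertConfig section:

```json
{
  "upsertConfig": {
    "mode": "FULL"
  }
}
```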
Partial upserts
Partial upsert lets you choose to update only specific columns and ignore the rest.
To enable the partial upsert, set the mode to PARTIAL and specify partialUpsertStrategies for partial upsert columns. Since release-0.10.0, OVERWRITE is used as the default strategy for columns without a specified strategy. defaultPartialUpsertStrategy is also introduced to change the default strategy for all columns.
Note that null handling must be enabled for partial upsert to work.
For example:
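The snippet below is an illustrative sketch; the column names (rsvp_count, group_name, venue_name) are placeholders in the style of the meetup quick start data and not prescriptive:

```json
{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies": {
      "rsvp_count": "INCREMENT",
      "group_name": "UNION",
      "venue_name": "APPEND"
    }
  }
}
```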
Pinot supports the following partial upsert strategies:
| Strategy | Description |
| --- | --- |
| OVERWRITE | Overwrite the column of the last record |
| INCREMENT | Add the new value to the existing values |
| APPEND | Add the new item to the Pinot unordered set |
| UNION | Add the new item to the Pinot unordered set if it does not exist |
| IGNORE | Ignore the new value, keep the existing value (v0.10.0+) |
| MAX | Keep the maximum value between the existing value and the new value (v0.12.0+) |
With partial upsert, if the value is null in either the existing record or the new coming record, Pinot will ignore the upsert strategy and the null value:
(null, newValue) -> newValue
(oldValue, null) -> oldValue
(null, null) -> null
None upserts
If the mode is set to NONE, upsert is disabled.
Comparison column
By default, Pinot uses the value in the time column (timeColumn in the table config) to determine the latest record. That means, for two records with the same primary key, the record with the larger value of the time column is picked as the latest update. However, there are cases when users need to use another column to determine the order. In such cases, you can use the comparisonColumn option to override the column used for comparison. For example:
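A sketch of the override, where anotherTimeColumn is a placeholder for whatever column should drive the ordering (note the plural comparisonColumns form recommended below):

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["anotherTimeColumn"]
  }
}
```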
For a partial upsert table, out-of-order events won't be consumed or indexed. For example, for two records with the same primary key, if the record with the smaller value of the comparison column arrives later than the other record, it will be skipped.
NOTE: Use comparisonColumns even for a single comparison column; comparisonColumn is deprecated. You may see unrecognizedProperties when using the old config, but it is converted to comparisonColumns automatically when the table is added.
Multiple comparison columns
In some cases, especially where partial upsert might be employed, there may be multiple producers of data each writing to a mutually exclusive set of columns, sharing only the primary key. In such a case, it may be helpful to use one comparison column per producer group so that each group can manage its own specific versioning semantics without the need to coordinate versioning across other producer groups.
Documents written to Pinot are expected to have exactly 1 non-null value out of the set of comparisonColumns; if more than 1 of the columns contains a value, the document will be rejected. When new documents are written, whichever comparison column is non-null will be compared against only that same comparison column seen in prior documents with the same primary key. Consider the following examples, where the documents are assumed to arrive in the order specified in the array.
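The original example documents are not reproduced here; the following sketch (with assumed event_id, description, and epoch values) is consistent with the outcomes listed below:

```json
[
  { "event_id": "aa", "orderReceived": 1, "description": "first",  "secondsSinceEpoch": 1567205394 },
  { "event_id": "aa", "orderReceived": 2, "description": "update", "secondsSinceEpoch": 1567205397 },
  { "event_id": "aa", "orderReceived": 3, "description": "update", "secondsSinceEpoch": 1567205396 },
  { "event_id": "aa", "orderReceived": 4, "description": "update", "otherComparisonColumn": 1567205395 },
  { "event_id": "aa", "orderReceived": 5, "description": "update", "otherComparisonColumn": 1567205392 },
  { "event_id": "aa", "orderReceived": 6, "description": "update", "otherComparisonColumn": 1567205398 }
]
```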
The following would occur:
orderReceived: 1
Result: persisted
Reason: first doc seen for primary key "aa"
orderReceived: 2
Result: persisted (replacing orderReceived: 1)
Reason: comparison column (secondsSinceEpoch) larger than that previously seen
orderReceived: 3
Result: rejected
Reason: comparison column (secondsSinceEpoch) smaller than that previously seen
orderReceived: 4
Result: persisted (replacing orderReceived: 2)
Reason: comparison column (otherComparisonColumn) larger than previously seen (never seen previously), despite the value being smaller than that seen for secondsSinceEpoch
orderReceived: 5
Result: rejected
Reason: comparison column (otherComparisonColumn) smaller than that previously seen
orderReceived: 6
Result: persist (replacing orderReceived: 4)
Reason: comparison column (otherComparisonColumn) larger than that previously seen
Metadata time-to-live (TTL)
In Pinot, the metadata map is stored in heap memory. To decrease in-memory data and improve performance, minimize the time primary key entries are stored in the metadata map (metadata time-to-live (TTL)). Limiting the TTL is especially useful for primary keys with high cardinality and frequent updates.
Since the metadata TTL is applied on the first comparison column, the time unit of upsert TTL is the same as the first comparison column.
Configure how long primary keys are stored in metadata
To configure how long primary keys are stored in metadata, specify the length of time in upsertTTL. For example:
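The example config is not shown above. As an illustrative sketch, assuming the field exposed in upsertConfig is metadataTTL (what the prose above calls the upsert TTL; verify the exact name for your release) and the first comparison column is in epoch seconds, one day corresponds to 86400:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumns": ["secondsSinceEpoch"],
    "enableSnapshot": true,
    "metadataTTL": 86400
  }
}
```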
In this example, Pinot will retain primary keys in metadata for 1 day.
Note that enabling upsert snapshot is required for metadata TTL for in-memory validDocsIDs recovery.
Delete column
An upsert Pinot table can support soft deletes of primary keys. This requires the incoming record to contain a dedicated single-value boolean column that serves as a delete marker for a primary key. Once the real-time engine encounters a record with the delete column set to true, the primary key will no longer be part of the queryable set of documents. This means the primary key will not be visible in queries, unless explicitly requested via the query option skipUpsert=true.
Note that the delete column has to be a single-value boolean column.
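A sketch of the config, where deleted is a placeholder name for the boolean delete-marker column defined in the schema:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted"
  }
}
```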
Note that when deleteRecordColumn is added to an existing table, it will require a server restart to actually pick up the upsert config changes.
A deleted primary key can be revived by ingesting a record with the same primary key, but with a higher comparison column value(s).
Note that when reviving a primary key in a partial upsert table, the revived record will be treated as the source of truth for all columns. This means any previous updates to the columns will be ignored and overwritten with the new record's values.
Deleted Keys time-to-live (TTL)
The above config deleteRecordColumn only soft-deletes the primary key. To decrease in-memory data and improve performance, minimize the time deleted-primary-key entries are stored in the metadata map (deletedKeys time-to-live (TTL)). Limiting the TTL is especially useful for deleted-primary-keys where there are no future updates foreseen.
Configure how long deleted-primary-keys are stored in metadata
To configure how long deleted primary keys are stored in metadata, specify the length of time in deletedKeysTTL. For example:
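An illustrative sketch, again assuming an epoch-seconds comparison column so that one day is 86400, and reusing the placeholder deleted column from above:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "deleteRecordColumn": "deleted",
    "deletedKeysTTL": 86400
  }
}
```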
In this example, Pinot will retain the deleted-primary-keys in metadata for 1 day.
Note that the value of deletedKeysTTL uses the same unit as the comparison column. If your comparison column holds values in seconds, this config should also be specified in seconds (see the example above).
Data consistency with deletes and compaction together
When using deletedKeysTTL together with UpsertCompactionTask, there can be a scenario where a segment containing a deleted record (where deleteRecordColumn = true was set for the primary key) gets compacted first while a previous old record is not yet compacted. During a server restart, the old record is then added to the metadata manager map and treated as non-deleted. To prevent data inconsistencies in this scenario, a new config enableDeletedKeysCompactionConsistency has been added; when set to true, it ensures that deleted records are not compacted until all the previous records from all other segments have been compacted for the deleted primary key.
Data consistency when queries and upserts happen concurrently
Upserts in Pinot enable real-time updates and ensure that queries always retrieve the latest version of a record, making them a powerful feature for managing mutable data efficiently. However, in applications with extremely high QPS and high ingestion rates, queries and upserts happening concurrently can sometimes lead to inconsistencies in query results.
For example, consider a table with 1 million primary keys. A distinct count query should always return 1 million, regardless of how new records are ingested and older records are invalidated. However, at high ingestion and query rates, the query may occasionally return a count slightly above or below 1 million. This happens because queries determine valid records by acquiring validDocIds bitmaps from multiple segments, which indicate which documents are currently valid. Since acquiring these bitmaps is not atomic with respect to ongoing upserts, a query may capture an inconsistent view of the data, leading to overcounting or undercounting of valid records.
This is a classic concurrency issue where reads and writes happen simultaneously, leading to temporary inconsistencies. Typically, such issues are resolved using locks or snapshots to maintain a stable view of the data during query execution. To address this, two new consistency modes - SYNC and SNAPSHOT - have been introduced for upsert enabled tables to ensure consistent query results even when queries and upserts occur concurrently and at very high throughput.
By default, the consistency mode is NONE, meaning the system operates as before. The SYNC mode ensures consistency by blocking upserts while queries execute, guaranteeing that queries always see a stable upserted data view. However, this can introduce write latency. Alternatively, the SNAPSHOT mode creates a consistent snapshot of validDocIds bitmaps for queries to use. This allows upserts to continue without blocking queries, making it more suitable for workloads with both high query and write rates.
These new consistency modes provide flexibility, allowing applications to balance consistency guarantees against performance trade-offs based on their specific requirements.
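The exact field name is not spelled out above; assuming it is consistencyMode under upsertConfig (verify against the config reference for your release), a sketch looks like:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "consistencyMode": "SNAPSHOT"
  }
}
```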
Use strictReplicaGroup for routing
The upsert Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment implicitly for the segments. Moreover, upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use it, configure instanceSelectorType in routing as follows:
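A sketch of the routing section of the table config:

```json
{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}
```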
Using the implicit partitioned replica-group assignment from the low-level consumer won't persist the instance assignment (the mapping from partition to servers) to ZooKeeper, and newly added servers will be automatically included without explicitly reassigning instances (usually through a rebalance). This can cause new segments of the same partition to be assigned to a different server and break the upsert requirement.
To prevent this, we recommend using explicit partitioned replica-group instance assignment to ensure the instance assignment is persisted. Note that numInstancesPerPartition should always be 1 in replicaGroupPartitionConfig.
Enable validDocIds snapshots for upsert metadata recovery
Upsert snapshot support is also added in release-0.12.0. To enable the snapshot, set the enableSnapshot to true. For example:
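A minimal sketch of the snapshot setting:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "enableSnapshot": true
  }
}
```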
Upsert maintains metadata in memory containing which docIds are valid in a particular segment (ValidDocIndexes). This metadata gets lost during server restarts and needs to be recreated again.
ValidDocIndexes cannot be recovered easily after out-of-TTL primary keys get removed. Enabling snapshots addresses this problem by adding functions to store and recover validDocIds snapshots for immutable segments.
The snapshots are taken on every segment commit to ensure that they are consistent with the persisted data in case of abrupt shutdown.
We recommend that you enable this feature so as to speed up server boot times during restarts.
The lifecycle of validDocIds snapshots is as follows:
If snapshot is enabled, snapshots for existing segments are taken or refreshed when the next consuming segment gets started.
The snapshot files are kept on disk until the segments get removed, e.g. due to data retention or manual deletion.
If snapshot is disabled, the existing snapshot for a segment is cleaned up when the segment gets loaded by the server, e.g. when the server restarts.
Enable preload for faster server restarts
The upsert preload feature can make it faster to restore the upsert state when a server restarts. To enable the preload feature, set enablePreload to true. To enable preloading, enableSnapshot: true should also be set in the table config. For example:
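A minimal sketch combining both flags:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "enableSnapshot": true,
    "enablePreload": true
  }
}
```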
Under the hood, it uses the validDocIds snapshots to identify the valid docs and restore their upsert metadata quickly instead of performing a whole upsert comparison flow. The flow is triggered before the server is marked as ready, after which the server starts to load the remaining segments without snapshots (hence the name preload).
The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. It's 0 by default to disable the preloading feature.
A bug was introduced in v1.2.0: when the enablePreload and enableSnapshot flags are set to true but max.segment.preload.threads is left at 0, the preloading mechanism is still enabled but segments fail to load because there are no threads for preloading. This was fixed in newer versions, but for v1.2.0, if enablePreload and enableSnapshot are set to true, remember to also set max.segment.preload.threads to a positive value. A server restart is needed for the max.segment.preload.threads config change to take effect.
Handle out-of-order events
There are 2 configs added related to handling out-of-order events.
dropOutOfOrderRecord
To enable dropping of out-of-order records, set dropOutOfOrderRecord to true. For example:
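A minimal sketch:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "dropOutOfOrderRecord": true
  }
}
```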
This feature doesn't persist any out-of-order event to the consuming segment. If not specified, the default value is false.
When false, the out-of-order record gets persisted to the consuming segment, but the MetadataManager mapping is not updated thus this record is not referenced in query or in any future updates. You can still see the records when using skipUpsert query option.
When true, the out-of-order record doesn't get persisted at all and the MetadataManager mapping is not updated so this record is not referenced in query or in any future updates. You cannot see the records when using skipUpsert query option.
outOfOrderRecordColumn
This config identifies out-of-order events programmatically. To enable it, add a boolean field to your table schema (say, isOutOfOrder) and reference it via this config. For example:
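A sketch, assuming the schema defines a boolean column named isOutOfOrder:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "outOfOrderRecordColumn": "isOutOfOrder"
  }
}
```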
This feature persists a true / false value to the isOutOfOrder field based on the orderness of the event. You can filter out out-of-order events while using skipUpsert to avoid any confusion. For example:
Use custom metadata manager
Pinot supports custom PartitionUpsertMetadataManager implementations that handle record and segment updates.
Adding custom upsert managers
You can add custom PartitionUpsertMetadataManager as follows:
Create a new java project. Make sure you keep the package name as org.apache.pinot.segment.local.upsert.xxx
In your java project include the dependency
Add your custom partition manager that implements PartitionUpsertMetadataManager interface
Add your custom TableUpsertMetadataManager that implements BaseTableUpsertMetadataManager interface
Place the compiled JAR in the /plugins directory in pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the custom upsert manager in table configs as follows:
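The exact config key is not shown above; assuming it is metadataManagerClass under upsertConfig (verify against your Pinot version), and with a purely hypothetical class name, a sketch looks like:

```json
{
  "upsertConfig": {
    "mode": "FULL",
    "metadataManagerClass": "org.apache.pinot.segment.local.upsert.custom.CustomTableUpsertMetadataManager"
  }
}
```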
⚠️ The upsert manager class name is case-insensitive as well.
Upsert table limitations
There are some limitations for the upsert Pinot tables.
The upsert feature is supported for Real-time tables only, and not for Hybrid or Offline tables.
The high-level consumer is not allowed for the input stream ingestion, which means stream.[consumerName].consumer.type must always be lowLevel.
The star-tree index cannot be used for indexing, as the star-tree index performs pre-aggregation during the ingestion.
Unlike append-only tables, out-of-order events (where the comparison value in the incoming record is less than the latest available value) won't be consumed or indexed by a Pinot partial upsert table; these late events will be skipped.
Best practices
Unlike other real-time tables, an upsert table takes up more memory because it needs to bookkeep the record locations in memory. As a result, it's important to plan the capacity beforehand and monitor the resource usage. Here are some recommended practices for using upsert tables.
Create the topic/stream with more partitions.
The number of partitions in the input stream determines the partition count of the Pinot table. The more partitions you have in the input topic/stream, the more Pinot servers you can distribute the Pinot table to, and therefore the more you can scale the table horizontally. Note that you can't increase the number of partitions later for upsert-enabled tables, so start with enough partitions (at least 2-3x the number of Pinot servers).
Memory usage
An upsert table maintains an in-memory map from the primary key to the record location, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. Beware of using a JSON column as the primary key: the same key-values in a different order are considered different primary keys. In addition, consider the hashFunction config in the upsert config, which can be MD5 or MURMUR3, to store the 128-bit hash code of the primary key instead. This is useful when your primary key takes more space. But keep in mind that this hash may introduce collisions, though the chance is very low.
Monitoring
Set up a dashboard over the metric pinot.server.upsertPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking growth, which is proportional to the growth in memory usage. The total memory usage by upsert is roughly (primaryKeysCount * (sizeOfKeyInBytes + 24)).
Capacity planning
It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the rate of the primary keys in the input stream per partition and extrapolate the data to a specific time period (based on table retention) to approximate the memory usage. A heap dump is also useful to check the memory usage so far on an upsert table instance.
Example
Putting these together, you can find the table configurations of the quick start examples as the following:
Pinot server maintains a primary key to record location map across all the segments served in an upsert-enabled table. As a result, when updating the config for an existing upsert table (e.g. change the columns in the primary key, change the comparison column), servers need to be restarted in order to apply the changes and rebuild the map.
Quick Start
To illustrate how full upsert works, the Pinot binary comes with a quick start example. Use the following command to create a real-time upsert table meetupRSVP.
You can also run the partial upsert demo with the following command.
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.
Query the upsert table
For partial upsert, you can see that only the values of the configured columns change, based on the specified partial upsert strategy.
Query the partial upsert table
An example for partial upsert is shown below: each event_id remains unique during ingestion, while the value of rsvp_count is incremented.
Explain partial upsert table
To see the difference from the non-upsert table, you can use a query option skipUpsert to skip the upsert effect in the query result.
Disable the upsert during query via query option
FAQ
Can I change primary key columns in existing upsert table?
Yes, you can add or delete primary key columns as long as the input stream is partitioned on one of the primary key columns. However, you need to restart all Pinot servers so they can rebuild the primary-key-to-record-location map with the new columns.
0.10.0
Summary
This release introduces some great new features, performance enhancements, UI improvements, and bug fixes, which are described in detail in the following sections.
The release was cut from this commit fd9c58a.
Dependency Graph
The dependency graph for plug-and-play architecture that was introduced in release has been extended and now it contains new nodes for Pinot Segment SPI.
SQL Improvements
Implement NOT Operator
Add DistinctCountSmartHLLAggregationFunction which automatically store distinct values in Set or HyperLogLog based on cardinality
Add LEAST and GREATEST functions
UI Enhancements
Show Reported Size and Estimated Size in human readable format in UI
Make query console state URL based
Improve query console to not show query result when multiple columns have the same name
Performance Improvements
Reuse regex matcher in dictionary based LIKE queries
Early terminate orderby when columns already sorted
Do not do another pass of Query Automaton Minimization
Other Notable Features
Adding NoopPinotMetricFactory and corresponding changes
Allow to specify fixed segment name for SegmentProcessorFramework
Move all prestodb dependencies into a separated module
This release brings significant improvements, including enhancements to the multistage query engine and the introduction of an experimental time series query engine for efficient analysis. Key features include database query quotas, cursor-based pagination for large result sets, multi-stream ingestion, and new function support for URL and GeoJson. Security vulnerabilities and several bug fixes and performance enhancements have been addressed, ensuring a more robust and versatile platform.
Multistage Engine Improvements
Reuse common expressions in a query (spool) ,
Refines query plan reuse in Apache Pinot by allowing reuse across stages instead of subtrees. Stages are natural boundaries in the query plan, divided into pull-based operators. To execute queries, Pinot introduces stages connected by MailboxSendOperator and MailboxReceiveOperator. The proposal modifies MailboxSendOperator to send data to multiple stages, transforming stage connections into a Directed Acyclic Graph (DAG) for greater efficiency and flexibility.
Segment Plan for MultiStage Queries ,
It focuses on providing comprehensive execution plans, including physical operator details. The new explain mode aligns with Calcite terminology and uses a broker-server communication flow to analyze and transform query plans into explained physical plans without executing them. A new ExplainedPlanNode is introduced to enrich query execution plans with physical details, ensuring better transparency and debugging capabilities for users.
DataBlock Serde Performance Improvements ,
Improve the performance of DataBlock building, serialization, and deserialization by reducing memory allocation and copies without altering the binary format. Benchmarks show 1x to 3x throughput gains, with significant reductions in memory allocation, minimizing GC-related latency issues in production. The improvement is achieved by changes to the buffers and the addition of a couple of stream classes.
Notable Improvements and Bug Fixes
Allow adding and subtracting timestamp types.
Remove PinotAggregateToSemiJoinRule to avoid mistakenly removing DISTINCT from the IN clause.
Support the use of timestamp indexes.
Timeseries Engine Support in Pinot
Introduction of a Generic Time Series Query Engine in Apache Pinot, enabling native support for various time-series query languages (e.g., PromQL, M3QL) through a pluggable framework. This enhancement addresses limitations in Pinot’s current SQL-based query engines for time-series analysis, providing optimized performance and usability for observability use cases, especially those requiring high-cardinality metrics.
NOTE: Timeseries Engine support in Pinot is currently in an Experimental state.
Key Features
Pluggable Time-Series Query Language:
Pinot will support multiple time-series query languages, such as PromQL and Uber’s M3QL, via plugins like pinot-m3ql.
Example queries:
Plot hourly order counts for specific merchants.
Pluggable Time-Series Operators:
Custom operators specific to each query language (e.g., nonNegativeDerivative or holt_winters) can be implemented within language-specific plugins without modifying Pinot’s core code.
Extensible operator abstractions will allow stakeholders to define unique time-series analysis functions.
Advantages of the New Engine:
Optimized for Time-Series Data: Processes data in series rather than rows, improving performance and simplifying the addition of complex analysis functions.
Reduced Complexity in Pinot Core: The engine reuses existing components like the Multi-Stage Engine (MSE) Query Scheduler, Query Dispatcher, and Mailbox. At the same time, language parsers and planners remain modular in plugins.
Improved Usability: Users can run concise and powerful time-series queries in their preferred language, avoiding the verbosity and limitations of SQL.
Impact on Observability Use Cases:
This new engine significantly enhances Pinot’s ability to handle complex time-series analyses efficiently, making it an ideal database for high-cardinality metrics and observability workloads.
The improvement is a step forward in transforming Pinot into a robust and versatile platform for time-series analytics, enabling seamless integration of diverse query languages and custom operators.
Here are some of the key PRs that have been merged as part of this feature:
Pinot time series engine SPI.
Add combine and segment level operators for time series.
Working E2E quickstart for time series engine.
Database Query Quota
Introduces the ability to impose query rate limits at the database level, covering all queries made to tables within a database. A database-level rate limiter is implemented, and a new method, acquireDatabase(databaseName), is added to the QueryQuotaManager interface to check database query quotas.
Database Query Quota Configuration
Query and storage quotas are now provisioned similarly to table quotas but managed separately in a DatabaseConfig znode.
Details about the DatabaseConfig znode:
It does not represent a logical database entity.
Default and Override Quotas
A default query quota (databaseMaxQueriesPerSecond: 1000) is provided in ClusterConfig.
Overrides for specific databases can be configured via znodes (e.g., PROPERTYSTORE/CONFIGS/DATABASE/).
APIs for Configuration
Method
Path
Description
Dynamic Quota Updates
Quotas are determined by a combination of default cluster-level quotas and database-specific overrides.
Per-broker quotas are adjusted dynamically based on the number of live brokers.
Updates are handled via:
This feature provides fine-grained control over query rate limits, ensuring scalability and efficient resource management for databases within Pinot.
Binary Workload Scheduler for Constrained Execution
Introduction of the BinaryWorkloadScheduler, which categorizes queries into two distinct workloads to ensure cluster stability and prioritize critical operations:
Workload Categories:
1. Primary Workload:
Default category for all production traffic.
Queries are executed using an unbounded FCFS (First-Come, First-Served) scheduler.
Designed for high-priority, critical queries to maintain consistent availability and performance.
2. Secondary Workload:
Reserved for ad-hoc queries, debugging tools, dashboards/notebooks, development environments, and one-off tests.
Imposes several constraints to minimize impact on the primary workload:
Limited concurrent queries: Caps the number of in-progress queries, with excess queries queued.
Key Benefits:
Prioritization: Guarantees the primary workload remains unaffected by resource-intensive or long-running secondary queries.
Stability: Protects cluster availability by preventing incidents caused by poorly optimized or excessive ad-hoc queries.
Scalability: Efficiently manages traffic in multi-tenant clusters, maintaining service reliability across workloads.
Cursors Support ,
Cursor support allows Pinot clients to consume query results in smaller chunks. This lets clients work with fewer resources, especially memory, and makes application logic more straightforward; for example, an app UI can paginate through results in a table or a graph. Cursor support is implemented using APIs.
API
Method
Path
Description
SPI
The feature provides two SPIs to extend the feature to support other implementations:
ResponseSerde: Serialize/Deserialize the response.
ResponseStore: Store responses in a storage system. Both SPIs use Java SPI and the default ServiceLoader to find implementations of the SPIs. All implementations should be annotated with AutoService to help generate the files needed to discover the implementations.
URL Functions Support
Implemented various URL functions to handle multiple aspects of URL processing, including extraction, encoding/decoding, and manipulation, making them useful for tasks involving URL parsing and modification
URL Extraction Methods
urlProtocol(String url): Extracts the protocol (scheme) from the URL.
urlDomain(String url): Extracts the domain from the URL.
urlDomainWithoutWWW(String url): Extracts the domain without the leading "www." if present.
URL Manipulation Methods
urlEncode(String url): Encodes a string into a URL-safe format.
urlDecode(String url) Decodes a URL-encoded string.
urlEncodeFormComponent(String url): Encodes the URL string following RFC-1866 standards, with spaces encoded as +.
Multi Stream Ingestion Support ,
Add support to ingest from multiple sources into a single table
Use existing interface (TableConfig) to define multiple streams
Separate the partition id definition between Stream and Pinot segment
New Scalar Functions Support.
intDiv and intDivOrZero: Perform integer division, with intDivOrZero returning zero for division by zero or when dividing a minimal negative number by minus one.
isFinite, isInfinite, and isNaN: Check if a double value is finite, infinite, or NaN, respectively.
GeoJSON Support
Add support for GeoJSON Scalar functions:
Supported data types:
Point
LineString
Polygon
MultiPoint
Improved Implementation of Distinct Operators.
Main optimizations:
Add per data type DistinctTable and utilize primitive type if possible
Specialize single-column case to reduce overhead
Allow processing null values with dictionary based operators
Upsert Improvements
Features and Improvements
Track New Segments for Upsert Tables
Improvement for addressing a race condition where newly uploaded segments may be processed by the server before brokers add them to the routing table, potentially causing queries to miss valid documents.
Introduce a configurable newSegmentTrackingTimeMs (default 10s) to track new segments on the server side, allowing them to be accessed as optional segments until brokers update their routing tables.
Ensure Upsert Deletion Consistency with Compaction Flow Enabled
Enhancement addresses inconsistencies in upsert-compaction by introducing a mechanism to track the distinct segment count for primary keys. By ensuring a record exists in only one segment before compacting deleted records, it prevents older non-deleted records from being incorrectly revived during server restarts, ensuring consistent table state.
Consistent Segments Tracking for Consistent Upsert View
This improves consistent upsert view handling by addressing segment tracking and query inconsistencies. Key changes include:
Complete and Consistent Segment Tracking: Introduced a new Set to track segments before registration to the table manager, ensuring synchronized segment membership and validDocIds access.
Improved Segment Replacement: Added DuoSegmentDataManager to register both mutable and immutable segments during replacement, allowing queries to access a complete data view without blocking ingestion.
Query Handling Enhancements: Queries now acquire the latest consuming segments to avoid missing newly ingested data if the broker's routing table isn't updated.
Other Notable Improvements and Bug Fixes
Config for max output segment size in UpsertCompactMerge task.
Add config for ignoreCrcMismatch for upsert-compaction task.
Upsert small segment merger task in minions.
Lucene and Text Search Improvements
Store index metadata file for Lucene text indexes.
Runtime configurability for Lucene analyzers and query parsers, enabling dynamic text tokenization and advanced log search capabilities like case-sensitive/insensitive searches.
Security Improvements and Vulnerability Fixes
Force SSL cert reload daily using the scheduled thread.
Allow configuring TLS between brokers and servers for the multi-stage engine.
Strip Matrix parameter from BasePath checking.
Miscellaneous Improvements
Allow setting ForwardIndexConfig default settings via cluster config.
Extend Merge Rollup Capabilities for Datasketches.
Skip task validation during table creation with schema.
Bug Fixes
Fix typo in RefreshSegmentTaskExecutor logger.
Fix to avoid handling JSON_ARRAY as multi-value JSON during transformation.
Fix for partition-enabled instance assignment with minimized movement.
Apache Pinot 1.0 Upserts overview
Support for polymorphic scalar comparison functions (=, !=, >, >=, <, <=). #13711
Optimized MergeEqInFilterOptimizer by reducing the hash computation of expressions. #14732
Add support for is_enable_group_trim aggregate option. #14664
Add support for is_leaf_return_final_result aggregate option. #14645
Override the return type from NOW to TIMESTAMP. #14614
Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG). #14689
Add cluster configuration to limit the number of multi-stage queries running concurrently. #14574
The absence of a database config does not prevent table creation under a database.
Deleting a database config does not remove the tables within that database.
A custom DatabaseConfigRefreshMessage is sent to brokers upon database config changes.
A ClusterConfigChangeListener in ClusterChangeMediator to process updates in cluster configs.
Adjustments to per-broker quotas upon broker resource changes.
Creation of database rate limiters during the OFFLINE -> ONLINE state transition of tables in BrokerResourceOnlineOfflineStateModel.
Thread restrictions: Limits the number of worker threads per query and across all queries in the secondary workload.
Queue pruning: Queries stuck in the queue too long are pruned based on time or queue length.
DELETE /resultStore/{requestId}/ - Delete the results of a query.
urlTopLevelDomain(String url): Extracts the top-level domain (TLD) from the URL.
urlFirstSignificantSubdomain(String url): Extracts the first significant subdomain from the URL.
cutToFirstSignificantSubdomain(String url): Extracts the first significant subdomain and the top-level domain from the URL.
cutToFirstSignificantSubdomainWithWWW(String url): Returns the part of the domain that includes top-level subdomains up to the "first significant subdomain", without stripping "www.".
urlPort(String url): Extracts the port from the URL.
urlPath(String url): Extracts the path from the URL without the query string.
urlPathWithQuery(String url): Extracts the path from the URL with the query string.
urlQuery(String url): Extracts the query string without the initial question mark (?) and excludes the fragment (#) and everything after it.
urlFragment(String url): Extracts the fragment identifier (without the hash symbol) from the URL.
urlQueryStringAndFragment(String url): Extracts the query string and fragment identifier from the URL.
extractURLParameter(String url, String name): Extracts the value of a specific query parameter from the URL.
extractURLParameters(String url): Extracts all query parameters from the URL as an array of name=value pairs.
extractURLParameterNames(String url): Extracts all parameter names from the URL query string.
urlHierarchy(String url): Generates a hierarchy of URLs truncated at path and query separators.
urlPathHierarchy(String url): Generates a hierarchy of path elements from the URL, excluding the protocol and host.
urlDecodeFormComponent(String url): Decodes the URL string following RFC-1866 standards, with + decoded as a space.
urlNetloc(String url): Extracts the network locality (username:password@host:port) from the URL.
cutWWW(String url): Removes the leading "www." from a URL’s domain.
cutQueryString(String url): Removes the query string, including the question mark.
cutFragment(String url): Removes the fragment identifier, including the number sign.
cutQueryStringAndFragment(String url): Removes both the query string and fragment identifier.
cutURLParameter(String url, String name): Removes a specific query parameter from a URL.
cutURLParameters(String url, String[] names): Removes multiple specific query parameters from a URL.
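For illustration, here is a minimal query sketch combining a few of the URL functions above; the table and column names (clickEvents, pageUrl) are hypothetical:
SELECT
  urlDomainWithoutWWW(pageUrl) AS domain,             -- domain without the leading "www."
  urlPath(pageUrl) AS path,                           -- path without the query string
  extractURLParameter(pageUrl, 'utm_source') AS utmSource,
  cutQueryStringAndFragment(pageUrl) AS canonicalUrl  -- strip the ?query and #fragment parts
FROM clickEvents
WHERE urlTopLevelDomain(pageUrl) = 'com'
LIMIT 10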
Compatible with existing stream partition auto-expansion logic. The feature does not change any existing interfaces. Users can define the table config in the same way and combine it with any other transform functions or instance assignment strategies.
ifNotFinite: Returns a default value if the given value is not finite.
moduloOrZero and positiveModulo: Variants of the modulo operation, with moduloOrZero returning zero for division by zero or when dividing the minimal negative number by minus one.
negate: Returns the negation of a double value.
gcd and lcm: Calculate the greatest common divisor and least common multiple of two long values, respectively.
hypot: Computes the hypotenuse of a right-angled triangle given the lengths of the other two sides.
byteswapInt and byteswapLong: Perform byte swapping on integer and long values.
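As a hedged usage sketch of the arithmetic helpers above (the table and columns are hypothetical):
SELECT
  intDivOrZero(bytesSent, requestCount) AS avgBytesPerRequest,  -- returns 0 instead of failing on divide-by-zero
  gcd(colA, colB) AS gcdValue,
  lcm(colA, colB) AS lcmValue,
  hypot(deltaX, deltaY) AS distance
FROM metrics
WHERE isFinite(score)
LIMIT 10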
MultiLineString
MultiPolygon
GeometryCollection
Feature
FeatureCollection
Specialize unlimited LIMIT case
Do not create priority queue before collecting LIMIT values
Add support for null ordering
Misc Fixes: Addressed edge cases, such as updating _numDocsIndexed before metadata updates, returning empty bitmaps instead of null, and preventing bitmap re-acquisition outside locking logic. These changes, gated by the new feature flag upsertConfig.consistencyMode, are tested with unit and stress tests in a staging environment to ensure reliability.
Fix to acquire segmentLock before taking segment snapshot. #14179
Update upsert TTL watermark in replaceSegment. #14147
Fix checks on largest comparison value for upsert ttl and allow to add segments out of ttl. #14094
More observability and metrics to track the upsert rate of deletion. #13838
Disable replacing environment variables and system properties in get table configs REST API. #14002
Upgrade the hadoop version to 3.3.6 to fix vulnerabilities. #12561
Fix vulnerabilities for msopenjdk 11 pinot-base-runtime image. #14030
Add capability to configure sketch precision / accuracy for different rollup buckets. Helpful for saving space in use cases where historical data does not require high accuracy. #14373
Add support for application-level query quota. #14226
Improvement to allow setting ForwardIndexConfig default settings via cluster config. #14773
Enhanced the mutable index class to be pluggable. #14609
Improvement to allow configurable initial capacity for IndexedTable. #14620
Add a new segment reload API for flexible control, allowing specific segments to be reloaded on designated servers and enabling workload management through batch processing and replica group targeting. #14544
Add a server API to list segments that need to be refreshed for a table. #14544
Introduced the ability to erase dimension values before rollup in merged segments, reducing cardinality and optimizing space for less critical historical data. #14355
Add support for immutable CLPForwardIndex creator and related classes. #14288
Add support for Minion Task to support automatic Segment Refresh. #14300
Add support for hex decimal to long scalar functions. #14435
Remove emitting null value fields during data transformation for SchemaConformingTransformer. #14351
Improved CSV record reader to skip unparseable lines. #14396
Add the ability to specify a target instance for segment reloading and improve API response messages when segments are not found on the target instances. #14393
Improvement for MSQ explain and stageStats when dealing with empty tables. #14374
Improvement for dynamically adjusting GroupByResultHolder's initial capacity based on filter predicates to optimize resource allocation and improve performance for filtered group-by queries. #14001
Improvement to ensure consistent index configuration by constructing IndexLoadingConfig and SegmentGeneratorConfig from table config and schema, fixing inconsistencies and honouring FieldConfig.EncodingType. #14258
Add usage of CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency. #14241
Add null handling support for aggregations grouped by MV columns. #14071
Add support to enable the capability to specify zstd and lz4 segment compression via config. #14008
Improvement for allowing usage of star-tree index with null handling enabled when no null values in segment columns. #14177
Improvement for avoiding using a setter in IndexLoadingConfig for consuming segments. #14190
Implement consistent data push for Spark3 segment generation and metadata push jobs. #14139
Improvement in addressing ingestion delays in real-time tables with many partitions by mitigating simultaneous segment commits across consumers. #14170
Improve query options validation and error handling. #14158
Add support for an arbitrary number of WHEN THEN clauses in the scalar CASE function. #14125
Add support for configuring Theta and Tuple aggregation functions. #14167
Add support for Map type in complex schema. #13906
Add TTL watermark storage/loading for the dedup feature to prevent stale metadata from being added to the store when loading segments. #14137
Polymorphic scalar function implementation for BETWEEN. #14113
Allow the building of an index on the preserved field in SchemaConformingTransformer. #13993
Add support to differentiate null and emptyLists for multi-value columns in avro decoder. #13572
Broker config to set default query null handling behavior. #13977
Moves the untarring method to BaseTaskExecutor to enable downloading and untarring from a peer server if deepstore untarring fails and allows DownloadFromServer to be enabled. #13964
New SPI to support custom executor services, providing default implementations for cached and fixed thread pools. #13921
Introduction of shared IdealStateUpdaterLock for PinotLLCRealtimeSegmentManager to prevent race conditions and timeouts during large segment updates. #13947
Support for configuring aggregation function parameters in the star-tree index. #13835
Write support for creating Pinot segments in the Pinot Spark connector. #13748
Array flattening support in SchemaConformingTransformer. #13890
Allow table names in TableConfigs with or without database name when database context is passed. #13934
Improvement in null handling performance for nullable single input aggregation functions. #13791
Improvement in column-based null handling by refining method naming, adding documentation and updating validation and constructor logic to support column-specific null strategies. #13839
Enhanced the noRawDataForTextIndex config to skip writing raw data when re-using the mutable index is enabled, fixing a global disable issue and improving ingestion performance. #13776
Improvements to polymorphic scalar comparison functions for better backward compatibility. #13870
Add TablePauseStatus to track the pause details. #13803
Check stale dedup metadata when adding new records/segments. #13848
Improve error messages with star-tree indexes creation. #13818
Adds support for ZStandard and LZ4 compression in tar archives, enhancing efficiency and reducing CPU bottlenecks for large-scale data operations. #13782
Fix for using PropertiesWriter to escape index_map keys properly. #12018
Fix query option validation for group-by queries. #14618
Fix for making RecordExtractor preserve empty array/map and map entries with empty values. #14547
Fix CRC mismatch during deep store upload retry task. #14506
Fix for allowing reload for UploadedRealtimeSegmentName segments. #14494
Fix default value handling in REGEXP_EXTRACT transform function. #14489
Fix for Spark upsert table backfill support. #14443
Fix long value parsing in jsonextractscalar. #14337
Fix deep store upload retry for infinite retention tables. #14406
Fix to ensure deterministic index processing order across server replicas and runs to prevent inconsistent segment data file layouts and unnecessary synchronization. #14391
Fix for real-time validation NPE when stream partition is no longer available. #14392
Fix for handling NULL values encountered in CLPDecodeTransformFunction. #14364
Fix for TextMatchFilterOptimizer grouping for the inner compound query. #14299
Fix for removing redundant API calls on the home page. #14295
Fix the missing precondition check for the V5 writer version in BaseChunkForwardIndexWriter. #14265
Fix for computing all groups for the group by queries with only filtered aggregations. #14211
Fix for race condition in IdealStateGroupCommit. #14237
Fix default column handling when the forward index is disabled. #14215
Fix bug with server return final aggregation result when null handling is enabled. #14181
Fix Kubernetes Routing Issue in Helm chart. #13450
Fix implementing a table-level lock to prevent parallel updates to the SegmentLineage ZK record and align real-time table ideal state updates with minion task locking for consistency. #13735
Fix INT overflow issue for FixedByteSVMutableForwardIndex with large segment size. #13717
Fix preload enablement checks to consider the preload executor and refine numMissedSegments logging to exclude unchanged segments, preventing incorrect missing segment reports. #13747
Fix a bug in resource status evaluation during service startup, ensuring resources return GOOD when servers have no assigned segments, addressing issues with small tables and segment redistribution. #13541
Fix RealtimeProvisioningHelperCommand to allow using just schemaFile along with sampleCompletedSegmentDir. #13727
This page covers the latest changes included in the Apache Pinot™ 1.0.0 release, including new features, enhancements, and bug fixes.
1.0.0 (2023-09-19)
This release includes several new features, enhancements, and bug fixes, including the following highlights:
Multi-stage query engine: new features, enhancements, and bug fixes (detailed below). Learn more about how the multi-stage query engine works.
Multi-stage query engine new features
Support for
Initial (phase 1) Query runtime for window functions with ORDER BY within the OVER() clause (#10449)
Multi-stage query engine enhancements
Turn on v2 engine by default ()
Introduced the ability to stream leaf stage blocks for more efficient data processing ().
Early terminate SortOperator if there is a limit ()
Multi-stage query engine bug fixes
Fix Predicate Pushdown by Using Rule Collection ()
Try fixing mailbox cancel race condition ()
Catch Throwable to Propagate Proper Error Message ()
Index SPI
Add the ability to include new index types at runtime in Apache Pinot. This opens up the ability to add third-party indexes, including proprietary indexes. More details
Null value support for pinot queries
NULL support for ORDER BY, DISTINCT, GROUP BY, value transform functions and filtering.
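A minimal sketch of null-aware querying, assuming the enableNullHandling query option is used to turn the behavior on (table and columns are hypothetical):
SET enableNullHandling = true;
SELECT userId, lastLogin
FROM users
WHERE lastLogin IS NOT NULL
ORDER BY lastLogin DESC
LIMIT 20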
Upsert enhancements
Delete support in upsert enabled tables ()
Support added to extend upserts and allow deleting records from a realtime table. The design details can be found .
Preload segments with upsert snapshots to speedup table loading ()
Adds a feature to preload segments from a table that uses the upsert snapshot feature. The segments with validDocIds snapshots can be preloaded in a more efficient manner to speed up the table loading (thus server restarts).
TTL configs for upsert primary keys ()
Adds support for specifying expiry TTL for upsert primary key metadata cleanup.
Segment compaction for upsert real-time tables ()
Adds a new minion task to compact segments belonging to a real-time table with upserts.
Pinot Spark Connector for Spark3 ()
Added spark3 support for Pinot Spark Connector ()
Also added support to pass pinot query options to spark connector ()
PinotDataBufferFactory and new PinotDataBuffer implementations ()
Adds new implementations of PinotDataBuffer that uses Unsafe java APIs and foreign memory APIs. Also added support for PinotDataBufferFactory to allow plugging in custom PinotDataBuffer implementations.
Query functions enhancements
Add PercentileKLL aggregation function ()
Support for ARG_MIN and ARG_MAX Functions ()
Refactor argmin/max to exprmin/max and make it Calcite compliant ()
JSON and CLP encoded message ingestion and querying
Add clpDecode transform function for decoding CLP-encoded fields. ()
Add CLPDecodeRewriter to make it easier to call clpDecode with a column-group name rather than the individual columns. ()
Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. ()
Tier level index config override ()
Allows overriding index configs at tier level, allowing for more flexible index configurations for different tiers.
Ingestion connectors and features
Kinesis stream header extraction ()
Extract record keys, headers and metadata from Pulsar sources ()
Realtime pre-aggregation for Distinct Count HLL & Big Decimal ()
UI enhancements
Adds persistence of authentication details in the browser session. This means that even if you refresh the app, you will still be logged in until the authentication session expires ()
AuthProvider logic updated to decode the access token and extract user name and email. This information will now be available in the app for features to consume. ()
Pinot docker image improvements and enhancements
Make Pinot base build and runtime images support Amazon Corretto and MS OpenJDK ()
Support multi-arch pinot docker image ()
Update dockerfile with recent jdk distro changes ()
Operational improvements
Rebalance
Rebalance status API ()
Tenant level rebalance API: tenant rebalance and status tracking APIs ()
Config to use customized broker query thread pool ()
Added new configuration options that allow use of a bounded thread pool and allocation of capacity for it.
This feature allows better management of broker resources.
Drop results support ()
Adds a parameter to queryOptions to drop the resultTable from the response. This mode can be used to troubleshoot a query (which may have sensitive data in the result) using metadata only.
Make column order deterministic in segment ()
In segment metadata and index map, store columns in alphabetical order so that the result is deterministic. Segments generated before/after this PR will have different CRC, so during the upgrade, we might get segments with different CRC from old and new consuming servers. For the segment consumed during the upgrade, some downloads might be needed.
Allow configuring helix timeouts for EV dropped in Instance manager ()
Adds options to configure Helix timeouts:
external.view.dropped.max.wait.ms - The duration in milliseconds to wait for the external view to be dropped. Default: 20 minutes.
external.view.check.interval.ms - The period in milliseconds at which to poll ZK for the latest EV state.
Enable case insensitivity by default ()
This PR makes Pinot case-insensitive by default and removes the deprecated property enable.case.insensitive.pql.
Newly added APIs and client methods
Add Server API to get tenant pools ()
Add new broker query point for querying multi-stage engine ()
Add a new controller endpoint for segment deletion with a time window ()
Cleanup and backward incompatible changes
High level consumers are no longer supported
Cleanup HLC code ()
Remove support for High level consumers in Apache Pinot ()
Type information preservation of query literals
[feature] [backward-incompat] [null support # 2] Preserve null literal information in literal context and literal transform ()
String versions of numerical values are no longer accepted. For example, "123" won't be treated as a number anymore.
Controller job status ZNode path update
Moving Zk updates for reload, force_commit to their own Znodes which … ()
The status of previously completed reload jobs will not be available after this change is deployed.
Metric names for mutable indexes to change
Implement mutable index using index SPI ()
Due to a change in the IndexType enum used for some logs and metrics in mutable indexes, the metric names may change slightly.
Update in controller API to enable / disable / drop instances
Update getTenantInstances call for controller and separate POST operations on it ()
Change in substring query function definition
Change substring to comply with standard sql definition ()
Full list of features added
Allow queries on multiple tables of same tenant to be executed from controller UI
Encapsulate changes in IndexLoadingConfig and SegmentGeneratorConfig
[Index SPI] IndexType ()
Vulnerability fixes, bugfixes, cleanups and deprecations
Remove support for High level consumers in Apache Pinot ()
Fix JDBC driver check for username ()
[Clean up] Remove getColumnName() from AggregationFunction interface ()
Support for the ranking ROW_NUMBER() window function (#10527, #10587)
Set operations support:
Support SetOperations (UNION, INTERSECT, MINUS) compilation in query planner (#10535)
Timestamp and Date Operations
Support TIMESTAMP type and date ops functions (#11350)
This release comes with several improvements and bug fixes for the Multistage Engine, Upserts, and Compaction, along with many other small features and general bug fixes.
Multistage Engine Improvements
Features
New Window Functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE
LEAD allows you to access values after the current row in a frame.
LAG allows you to access values before the current row in a frame.
FIRST_VALUE and LAST_VALUE return the respective extremal values in the frame.
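A hedged sketch of the new window functions; clickEvents, userId, eventTime, and pageUrl are hypothetical names:
SELECT
  userId,
  eventTime,
  LAG(eventTime) OVER (PARTITION BY userId ORDER BY eventTime) AS previousEventTime,   -- value from the prior row in the frame
  LEAD(eventTime) OVER (PARTITION BY userId ORDER BY eventTime) AS nextEventTime,      -- value from the following row in the frame
  FIRST_VALUE(pageUrl) OVER (PARTITION BY userId ORDER BY eventTime) AS firstPageVisited
FROM clickEvents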
Support for Logical Database in V2 Engine
V2 Engine now supports a "database" construct, enabling table namespace isolation within the same Pinot cluster.
Improves user experience when multiple users are using the same Pinot Cluster.
Access control policies can be set at the database level.
Improved Multi-Value (MV) and Array Function Support
Added array sum aggregation functions for point-wise array operations .
Added support for valueIn MV transform function .
Fixed bug in numeric casts for MV columns in filters .
Support for WITHIN GROUP Clause and ListAgg
WITHIN GROUP Clause can be used to process rows in a given order within a group.
One of the most common use-cases for this is the ListAgg function, which when combined with WITHIN GROUP can be used to concatenate strings in a given order.
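For example, assuming the standard LISTAGG ... WITHIN GROUP syntax (table and columns are hypothetical), ordered string concatenation per group could look like:
SELECT
  userId,
  LISTAGG(pageUrl, ',') WITHIN GROUP (ORDER BY eventTime) AS visitedPagesInOrder  -- concatenate in visit order
FROM clickEvents
GROUP BY userId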
Scalar/Transform Function and Set Operation Improvements
Added Geospatial Scalar Function support for use in intermediate stage in the v2 query engine .
Fix 'WEEK' transform function .
Support EXTRACT as a scalar function .
Improved Literal Handling Support
Fixed bug in handling literal arguments in aggregation functions like Percentile .
Allow INT and FLOAT literals .
Fixed literal handling for all types .
Metrics Improvements
Added new metrics for tracking queries executed globally and at the table level .
New metrics to track join counts and window function counts .
Multiple meters and timers to track Multistage Engine Internals .
Notable Improvements and Bug Fixes
Improved Window operators resiliency, with new checks to make sure the window doesn't grow too large .
Optimized Group Key generation .
Fixed SortedMailboxReceiveOperator to honor convention of pulling at most 1 EOS block .
Upsert Compaction and Minion Improvements
Features and Improvements
Minion Resource Isolation
Minions now support resource isolation based on an instance tag.
Instance tag is configured at table level, and can be set for each task on a table.
This enables you to implement arbitrary resource isolation strategies, e.g., you can dedicate a set of Minion nodes to running any set of tasks across any set of tables.
Greedy Upsert Compaction Scheduling
Upsert compaction now schedules segments for compaction based on the number of invalid docs.
This helps the compaction task to handle arbitrary temporal distribution of invalid docs.
Notable Improvements
Minions can now download segments from servers when deepstore copy is missing. This feature is enabled via a cluster level config allowDownloadFromServer .
Added support for TLS Port in Minions .
New metrics added for Minions to track segment/record processing information .
Bug Fixes
Minions can now handle invalid instance tags in Task Configs gracefully. Prior to this change, Minions would be stuck in IN_PROGRESS state until task timeout .
Fix bug to return validDocIDsMetadata from all servers .
Fixed upsert compaction not retaining maxLength information and trimming string fields .
Upsert Improvements
Features and Improvements
Consistent Table View for Upsert Tables
Adds different modes of consistency guarantees for Upsert tables.
Adds a new UpsertConfig called consistencyMode which can be set to NONE, SYNC, SNAPSHOT.
SYNC is optimized for data freshness but can lead to elevated query latencies and is best for low-qps use-cases. In this mode, the ingestion threads will take a WLock when updating validDocID bitmaps.
Pluggable Partial Upsert Merger
Partial Upsert merges the old record and the new incoming record to generate the final ingested record.
Pinot now allows users to customize how this merge of an old row and the new row is computed.
This allows a column value in the new row to be an arbitrary function of the old and the new row.
Support for Uploading Externally Partitioned Segments for Upsert Backfill
Segments uploaded for Upsert Backfill can now explicitly specify the Kafka partition they belong to.
This enables backfilling an Upsert table where the externally generated segments are partitioned using an arbitrary hash function on an arbitrary primary key.
Misc Improvements and Bug Fixes
Fixed a Bug in Handling Equal Comparison Column Values in Upsert, which could lead to data inconsistency ()
Upsert snapshot will now snapshot only those segments which have updates. .
Notable Features
JSON Support Improvements
JSON Index can now be used for evaluating Regex and Range Predicates.
jsonExtractIndex now supports contextual array filters. .
JSON column type now supports filter predicates such as =, !=, IN, and NOT IN. This is convenient for scenarios where the JSON values are very small.
Lucene and Text Search Improvements
Improved Segment Build Time for Lucene Text Index by 40-60%. This improvement is realized when a consuming segment commits and is converted to an ImmutableSegment. This significantly helps in lowering ingestion lag at commit time due to a large text index .
Phrase Search can run 3x faster when the Lucene Index Config enablePrefixSuffixMatchingInPhraseQueries is set to true. This is achieved by rewriting phrase search query to a wildcard and prefix matching query .
New Funnel Functions
Added funnelMaxStep function which can be used to calculate max funnel steps for a given sliding window .
Added funnelCompleteCount to calculate the number of completed funnels, and funnelMatchStep to get the funnel match array.
Support for Interning for OnHeapByteDictionary
This can reduce the heap usage of a dictionary encoded byte column, for a certain distribution of duplicate values. See for details.
Column Major Builder On By Default for New Tables
Prior to this feature, on a segment commit, Pinot would convert all the columnar data from the Mutable Segment to row-major, and then re-build column major Immutable Segments.
This feature skips the row-major conversion and is expected to be both space and time efficient.
It can help lower ingestion lag from segment commits, especially helpful when your segments are large.
Support for SQL Formatting in Query Editor
You can now prettify SQL right in the Controller UI!
Hash Function for UUID Primary Keys
Added a new lossless hash-function for Upsert Primary Keys optimized for UUIDs.
The hash function can reduce Old Gen by up to 30%.
It maps a UUID to a 16-byte array, versus encoding it as a UTF string, which would take 36 bytes.
Column Level Index Skip Query Option
Convenient for debugging impact of indexes on query performance or results.
You can add the skipIndexes option to your query to skip any number of indexes. e.g. SET skipIndexes=inverted,range;
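For example, to check how results or latency change when specific indexes are bypassed (table and column names are hypothetical):
SET skipIndexes=inverted,range;
SELECT COUNT(*)
FROM clickEvents
WHERE statusCode BETWEEN 400 AND 499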
New UDFs and Scalar Functions
New GeoHash functions: encodeGeoHash, decodeGeoHash, decodeGeoHashLatitude and decodeGeoHashLongitude.
dateBin can be used to align a timestamp to the nearest time bucket.
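A hedged sketch of the GeoHash helpers, assuming encodeGeoHash takes (latitude, longitude, precision); the table and columns are hypothetical:
SELECT
  encodeGeoHash(latitude, longitude, 7) AS geoCell,  -- 7-character geohash cell
  COUNT(*) AS pickups
FROM rideEvents
GROUP BY encodeGeoHash(latitude, longitude, 7)
ORDER BY pickups DESC
LIMIT 10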
CLP Compression Codec in Forward Indexes
CLP is a compressed log processor with a very high compression ratio for certain log types.
To enable this, you can set the compressionCodec in the fieldConfigList of the column you want to target.
Misc. Improvements
Enable segment preloading at partition level .
Use Temurin instead of AdoptOpenJdk
Adding record reader config/context param to record transformer
Bug Fixes
Use gte(lte) to replace between() which has a bug
Fix the ConcurrentModificationException for And/Or DocIdSet
Upgrade RoaringBitmap to 1.0.5 to pick up the fix for RangeBitmap.between()
Database can be selected in a query using a SET statement, such as SET database=my_db;.
Fixed NPE in ArrayAgg when a column contains no data .
Fixed array literal handling .
Added support for ALL modifier for INTERSECT and EXCEPT Set Operations .
Fixed null literal handling for null intolerant functions .
Improvement in how execution stats are handled .
Use Protobuf instead of Reflection for Plan Serialization .
SNAPSHOT mode can handle high-qps/high-ingestion use-cases by getting the list of valid docs from a snapshot of validDocID. The snapshot can be refreshed every few seconds and the tolerance can be set via a query option upsertViewFreshnessMs.
JSON_MATCH now supports exclusive predicates correctly. For instance, you can use predicates such as JSON_MATCH(person, '"$.addresses[*].country" != ''us''' to find all people who have at least one address that is not in the US. .
jsonExtractIndex supports extracting Multi-Value JSON Fields, and also supports providing any default value when the key doesn't exist. .
Added isJson UDF which increases your options to handle invalid JSONs. This can be used in queries and for filtering invalid json column values in ingestion. .
Fix ArrayIndexOutOfBoundsException in jsonExtractIndex. .
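As a small sketch, isJson can be used to filter out malformed payloads at query time (table and column names are hypothetical); the same function can also back an ingestion-time filter:
SELECT payload
FROM rawEvents
WHERE isJson(payload) = true  -- keep only rows whose payload parses as valid JSON
LIMIT 10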
Fixed bug in TextMatchFilterOptimizer that was not applying precedence to the filter expressions properly, which could lead to incorrect results. .
Fixed bug in handling NOT text_match which could have returned incorrect results. .
Added SchemaConformingTransformerV2 to enhance text search abilities. .
Added metrics to track Lucene NRT Refresh Delay .
Switched to NRTCachingDirectory for Realtime segments and prevented duplicates in the Realtime Lucene Index to avoid IndexOutOfBounds query time exceptions. .
Lucene Version is upgraded to 9.11.1. .
Added prefixes, suffixes, and uniqueNgrams UDFs for generating the respective string subsequences from a string input.
splitPart UDF has minor improvements. .
Removing legacy commons-lang dependency
12508: Feature add segment rows flush config
ADSS Race Condition and update to client error codes
Add ExceptionMapper to convert Exception to Response Object for Broker REST API's
Add FunnelMaxStepAggregationFunction and FunnelCompleteCountAggregationFunction
Add GZIP Compression Codec (#11434)
Add PodDisruptionBudgets to the Pinot Helm chart
Add Postgres compliant name aliasing for String Functions.
Add SchemaConformingTransformerV2 to enhance text search abilities
Add a benchmark to measure multi-stage block serde cost
Add a plan version field to QueryRequest Protobuf Message
Add a post-validator visitor that verifies there are no cast to bytes
Add a safe version of CLStaticHttpHandler that disallows path traversal.
Add ability to track filtered messages offset
Add back 'numRowsResultSet' to BrokerResponse, and retain it when result table id hidden
Add back profile for shade
Add back some exclude deps from hadoop-mapreduce-client-core
Add backward compatibility regression test suite for multi-stage query engine
Add base class for custom object accumulator
Add clickstream example table for funnel analysis
Add config option for timezone
Add config to skip record ingestion on string column length exceeding configured max schema length
Add controller API to get allLiveInstances
Add isJson UDF
Add list of collaborators to asf.yaml
Add locking logic to get consistent table view for upsert tables
Add metric to track number of segments missed in upsert-snapshot
Add metrics for SEGMENTS_WITH_LESS_REPLICAS monitoring
Add mode to allow adding dummy events for non-matching steps
Add offset based lag metrics
Add protobuf codegen decoder
Add retry policy to wait for job id to persist during rebalancing
Add round-robin logic during downloadSegmentFromPeer
Add schema as input to the decoder.
Add splitPartWithLimit and splitPartFromEnd UDFs
Add support for creating raw derived columns during segment reload
Add support for raw JSON filter predicates
Add the possibility of configuring ForwardIndexes with compressionCodec
Add upsert-snapshot timer metric
Add validation check for forward index disabled if it's a REALTIME table
Added PR compatability test against release 1.1.0
Added kafka partition number to metadata.
Added pinot-error-code header in query response
Added tests for additional data types in SegmentPreProcessorTest.java
Adding a cluster config to enable instance pool and replica group configuration in table config
Adding batch api support for WindowFunction
Adding bytes string data type integration tests
Adding registerExtraComponents to allow registering additional components in various services
Adding support of insecure TLS
Adding support to insecure TLS when creating SSLFactory
Adds AGGREGATE_CASE_TO_FILTER rule
Adds per-column, query-time index skip option
Allow Aggregations in Case Expressions
Allow PinotHelixResourceManager subclasses to be used in the controller starter by providing an overridable PinotHelixResourceManager object creator function
Allow RequestContext to consider http-headers case-insensitivity
Allow Server throttling just before executing queries on server to allow max CPU and disk utilization
Allow all raw index config in star-tree index
Allow apply both environment variables and system properties to user and table configs, Environment variables take precedence over system properties
Allow configurable queryWorkerThreads in Pinot server side GrpcQueryServer
Allow dynamically setting the log level even for loggers that aren't already explicitly configured
Allow passing custom record reader to be inited/closed in SegmentProcessorFramework
Allow passing database context through database http header
Allow stop to interrupt the consumer thread and safely release the resource
Allow user configurable regex library for queries
Allow using 'serverReturnFinalResult' to optimize server partitioned table
Assign default value to newly added derived column upon reload
Avoid port conflict in integration tests
Better handling of null tableNames
CLP as a compressionCodec
Change helm app version to 1.0.0 for Apache Pinot latest release version
Clean Google Dependencies
Clean up BrokerRequestHandler and BrokerResponse
Clean up arbitrary sleep in /GrpcBrokerClusterIntegrationTest
Cleaning up vector index comments and exceptions
Cleanup HTTP components dependencies and upgrade Thrift
Cleanup Javax and Jakarta dependencies
Cleanup deprecated query options
Cleanup the consumer interfaces and legacy code
Cleanup unnecessary dependencies under pinot-s3
Cleanup unused aggregate internal hint
Consistency in API response for live broker
Consolidate bouncycastle libraries
Consolidate nimbus-jose-jwt version to 9.37.3
ControllerRequestClient accepts headers. Useful for authN tests
Custom configuration property reader for segment metadata files
Delete database API
Deprecate PinotHelixResourceManager#getAllTables() in favour of getAllTables(String databaseName)
Detect expired messages in Kafka. Log and set a gauge.
Do not hard code resource class in BaseClusterIntegrationTest
Do not pause ingestion when upsert snapshot flow errors out
Don't drop original field during flatten
Don't enforce -realTimeInstanceCount and -offlineInstanceCount options when creating broker tenants
Egalpin/skip indexes minor changes
Emit Metrics for Broker Adaptive Server Selector type
Emit table size related metrics only in lead controller
Enable complexType handling in SegmentProcessFramework
Enable more integration tests to run on the v2 multi-stage query engine
Enabling avroParquet to read Int96 as bytes
Enhance Kinesis consumer
Enhance Parquet Test
Enhance ProtoSerializationUtils to handle class move
Enhance Pulsar consumer
Enhance PulsarConsumerTest
Enhance commit threshold to accept size threshold without setting rows to 0
Enhance json index to support regexp and range predicate evaluation
Enhancement: Sketch value aggregator performance
Ensure FieldConfig.getEncodingType() is never null
Ensure all the lists used in PinotQuery are ArrayList
Ensure brokerId and requestId are always set in BrokerResponse
Enter segment preloading at partition level
Exclude dimensions from star-tree index stored type check
Expose more helper API in TableDataManager
Extend compatibility verifier operation timeout from 1m to 2m to reduce flakiness
Extract json individual array elements from json index for the transform function jsonExtractIndex
Fetch query quota capacity utilization rate metric in a callback function
First with time
GitHub Actions checkout v4
Gzip compression, ensure uncompressed size can be calculated from compressed buffer
Handle errors gracefully during multi-stage stats collection in the broker
Handle shaded classes in all methods of kafka factory
Hash Function for UUID Primary Keys
Ignore case when checking for Direct Memory OOM
Improve Retention Manager Segment Lineage Clean Up
Improve error message for max rows in join limit breach
Improve exception logging when we fail to index / transform message
Improve logging in range index handler for index updates
Improve upsert compaction threshold validations
Improve warn logs for requesting validDocID snapshots
Improved metrics for server grpc query
Improved null check for varargs
Improved segment build time for Lucene text index realtime to offline conversion
In ClusterTest, make start port higher to avoid potential conflict with Kafka
Introduce PinotLogicalAggregate and remove internal hint
Introduce retries while creating stream message decoder for more robustness
Isolate bad server configs during broker startup phase
Issue #12367
Json extract index filter support
Json extract index mv
Keep get tables API with and without database
Lint failure
Logging a warn message instead of throwing exception
Made the error message around dimension table size clearer
Make Helix state transition handling idempotent
Make KafkaConsumerFactory method less restrictive to avoid incompatibility
Make task manager APIs database aware
Metric for count of tables configured with various tier backends
Metric for upsert tables count
Metrics for Realtime Rows Fetched and Stream Consumer Create Exceptions
Minmaxrange null
Modify consumingSegmentsInfo endpoint to indicate how many servers failed
Move offset validation logic to consumer classes
Move package org.apache.calcite to org.apache.pinot.calcite
Move resolveComparisonTies from addOrReplaceSegment to base class
Move some mispositioned tests under pinot-core
Move wildfly-openssl dependency management to root pom
Moving deleteSegment call from POST to DELETE call
Optimize unnecessary extra array allocation and conversion for raw derived column during segment reload
Pass explicit TypeRef when evaluating MV jsonPath
Percentile operations supporting null
Prepare for next development iteration
Propagate Disable User Agent Config to Http Client
Properly handle complex type transformer in segment processor framework
Properly return response if SegmentCompletion is aborted
Publish helm 0.2.8
Publish helm 0.2.9
Pull janino dependency to root pom
Pull pulsar version definition into root POM
Query response opt
Re-enable the Spotless plugin for Java 21
Readme - How to setup Pinot UI for development
Record enricher
Refactor PinotTaskManager class
Refactored CommonsConfigurationUtils for loading properties configuration.
Refactored compatibility-verifier module
Refactoring removeSegment flow in upsert
Refine PeerServerSegmentFinder
Refine SegmentFetcherFactory
Replace custom fmpp plugin with fmpp-maven-plugin
Reposition query submission spot for adaptive server selection
Reset controller port when stopping the controller in ControllerTest
Rest Endpoint to Create ZNode
Return clear error message when no common broker found for multi-stage query with tables from different tenants
Returning tables names failing authorization in Exception of Multi State Engine Queries
Revert " Adding record reader config/context param to record transformer (#12520)"
Revert "Using local copy of segment instead of downloading from remote (#12863)"
Short circuit SubPlanFragmenter because we don't support multiple sub-plans yet
Simplify Google dependencies by importing BOM
Specify version for commons-validator
Support NOT in StarTree Index
Support empty strings as json nodes
Supporting human-readable format when configuring broker response size
Use ArrayList instead of LinkedList in SortOperator
Use a two server setup for multi-stage query engine backward compatibility regression test suite
Use more efficient variants of URLEncoder::encode and URLDecoder::decode
Use parameterized log messages instead of string concatenation
Use separate action for /tasks/scheduler/jobDetails API
Use try-with-resources to close file walk stream in LocalPinotFS
Using local copy of segment instead of downloading from remote
[Adaptive Server Selector] Add metrics for Stats Manager Queue Size
[Cleanup] Move classes in pinot-common to the correct package
[Feature] Add Support for SQL Formatting in Query Editor
[HELM]: Added additional probes options and startup probe.
[HELM]: Added checksum config annotation in stateful set for broker, controller and server
[HELM]: Added namespace support in K8s deployment.
[HELM]: zookeeper chart upgrade to version 13.2.0
[Minor] Add Nullable annotation to HttpHeaders in BrokerRequestHandler
[Minor] Small refactor of raw index creator constructor to be more clear
[Multi-stage] Clean up RelNode to Operator handling
[null-aggr] Add null handling support in mode aggregation
[partial-upsert] configure early release of _partitionGroupConsumerSemaphore in RealtimeSegmentDataManager
[spark-connector] Add option to fail read when there are invalid segments
add Netty arm64 dependencies
add Netty unit test
add SegmentContext to collect validDocIds bitmaps for many segments together
add skipUnavailableServers query option
add insecure mode when Pinot uses TLS connections
add instrumentation to json index getMatchingFlattenedDocsMap()
add jmx to promethues metric exporting rule for realtimeRowsFiltered
add metrics for IdeaState update
add some metrics for upsert table preloading
add some tests on jsonPathString
add test cases in RequestUtilsTest
add unit test for JsonAsyncHttpPinotClientTransport
add unit test for QueryServer
add unit test for ServerChannels
add unit test for StringFunctions encodeUrl
add unit tests for pinot-jdbc-client
add url assertion to SegmentCompletionProtocolTest
adjust the llc partition consuming metric reporting logic
allow passing null http headers object to translateTableName
allow to set segment when use SegmentProcessorFramework
auto renew jvm default sslconext when it's loaded from files
avoid useless intermediate byte array allocation for VarChunkV4Reader's getStringMV
aws sdk 2.25.3
build-helper-maven-plugin 3.5.0
cache ssl contexts and reuse them
clean up jetbrain nullable annotation
cleanup: maven no transfer progress
close JDBC connections
do not fail on duplicate relaxed vars (#13214)
dropwizard metrics 4.2.25
dynamic chunk sizing for v4 raw forward index
enable Netty leak detection
enable parallel Maven in pinot linter script
ensure inverse And/OrFilterOperator implementations match the query
exclude .mvn directory from source assembly
extend CompactedPinotSegmentRecordReader so that it can skip deleteRecord
get startTime outside the executor task to avoid flaky time checks
handle absent segments so that catchup checker doesn't get stuck on them
handle overflow for MutableOffHeapByteArrayStore buffer starting size
handle segments not tracked by partition mgr and add skipUpsertView query option
handle table name translation on missed api resources
hash4j version upgrade to 0.17.0
including the underlying exception in the logging output
int96 parity with native parquet reader
jsonExtractIndex support array of default values
log the log rate limiter rate for dropped broker logs
make http listener ssl config swappable
make reflection calls compatible with 0.9.11 (#12958)
maven: no transfer progress
missed to delete the temp dir
move shouldReplaceOnComparisonTie to base class to be more reusable
reduce Java enum .values() usage in TimerContext
reduce logging for SpecialValueTransformer
reduce regex pattern compilation in Pinot jdbc
refactor TlsUtils class
refine when to registerSegment while doing addSegment and replaceSegment for upsert tables for better data consistency
reformat AdminConsoleIntegrationTest.java
reformat ClusterTest.java
release segment mgrs more reliably
replaced getServer with getServers
report rebalance job status for the early returns like noops
require noDictionaryColumns with aggregationConfigs
share the same table config object
track segments for snapshotting even if they lost all comparisons
untrack the segment out of TTL
update ControllerJobType from enum to string
update RewriterConstants so that expr min max would not collide with columns start with "parent"
update access control check error handling to catch throwable and log errors
bugfix: do not move src ByteBuffer position for LZ4 length prefixed decompress
Bug Fix createDictionaryForColumn does not take into account inverted index
fix Cluster Manager error
fix for quick start Cluster Manager issue
Adding config for having suffix for client ID for realtime consumer
Addressed comments and fixed tests from pull request 12389. /uptime and /start-time endpoints working all components
Bugfix. Added missing paramName
Bug fix: Do not ignore scheme property
Bug fix: Handle missing shade config overwrites for Kafka
BugFix: Fix merge result from more than one server
Bugfix. Allow tenant rebalance with downtime as true
Bugfix. Avoid passing null table name input to translation util
Bugfix. Correct wrong method call from scheduleTask() to scheduleTaskForDatabase()
Bugfix. Maintain literal data type during function evaluation
Cleanup: Fix grammar in error message, also improve readability.
Fix Bug in Handling Equal Comparison Column Values in Upsert
Fix ColumnMinMaxValueGenerator
Fix JavaEE related dependencies
Fix Logging Location for CPU-Based Query Killing
Fix PulsarUtils to not share buffer
Fix URI construction so that AddSchema command line tool works when override flag is set to true
Fix [Type]ArrayList elements() method usage
Fix a typo when calculating query freshness
Fix an overflow in PinotDataBuffer.readFrom
Fix bug in logging in UpsertCompaction task
Fix bug to return validDocIDsMetadata from all servers
Fix connection issues if using JDBC and Hikari (#12267)
Fix controller host / port / protocol CLI option description for admin commands
Fix environment variables not applied when creating table
Fix error message for insufficient number of untagged brokers during tenant creation
Fix few metric rules which were affected by the database prefix handling
Fix file handle leaks in Pinot Driver (apache#12263)
Fix flakiness of ControllerPeriodicTasksIntegrationTest
Fix issue with startree index metadata loading for columns with '__' in name
Fix metric rule pattern regex
Fix pinot-parquet NoClassFound issue
Fix segment size check in OfflineClusterIntegrationTest
Fix some resource leak in tests
Fix the NPE from IS update metrics
Fix the NPE when metadataTTL is enabled without delete column
Fix the ServletConfig loading issue with swagger.
Fix the issue that map flatten shouldn't remove the map field from the record
Fix the race condition for H3InclusionIndexFilterOperator
Fix the time segment pruner on TIMESTAMP data type
Fix time stats in SegmentIndexCreationDriverImpl
Fixed infer logical type name from avro union schema
Fixing instance type to resolve and
Helm: bug fix for chart rendering issue.
Try to amend kafka common package with pinot shaded package prefix
Update getValidDocIdsMetadataFromServer to make call in batches to servers and other bug fixes
Upgrade com.microsoft.azure:msal4j from 1.3.5 to 1.3.10 for CVE fixing
[bugfix] Handling null value for kafka client id suffix
bugfix: fixing jdbc client sql feature not supported exception
bugfix: re-add support for not text_match
bugfix: reduce enum array allocation in QueryLogger
bugfix: use consumerDir during lucene realtime segment conversion
cleanup: fix apache rat violation
fix GuavaRateLimiter acquire method
fix fieldsToRead class not in decoder
fix flakey test, avoid early finalization
fix merging null multi value in partial upsert
fix race condition in ScalingThreadPoolExecutor
fix shared buffer, tests
fix(build): update node version to 16
fixing CVE critical issues by resolving kerby/jline and wildfly libraries
This release comes with several features, including SQL, UI, and performance enhancements. Also included are bug fixes across multiple features such as the V2 multi-stage query engine, ingestion, storage format, and SQL support.
Multi-stage query engine
Features
Support RelDistribution-based trait planning (,)
Adds support for RelDistribution optimization for more accurate leaf-stage direct exchange/shuffle. Also extends partition optimization beyond leaf stage to entire query plan.
Applies optimization based on distribution trait in the mailbox/worker assignment stage
Fixes the previous direct exchange, which was decided based on the table partition hint. Now direct exchange is decided via the distribution trait: it will be applied if and only if the propagated trait matches the exchange requirement.
Leaf stage planning with multi-semi join support ()
Solves the limitation of pinotQuery, which supports only a limited set of PlanNodes.
Float type column is treated as Double in the multistage engine, so FLOAT type is not supported.
Supports the data types BOOLEAN, INT, and LONG
Enhancements
Canonicalize SqlKind.OTHERS and SqlKind.OTHER_FUNCTIONS and support concat as the || operator ()
Capability for constant filter in QueryContext, with support for servers to handle it ()
Bugfixes, refactoring, cleanups, tests
Bugfix for evaluation of chained literal functions ()
Fixes to sort copy rule ( and )
Fixes duplicate results for literal queries ()
Notable features
Server-level throttling for realtime consumption ()
Use server config pinot.server.consumption.rate.limit to enable this feature
Server rate limiter is disabled by default (default value 0)
Reduce segment generation disk footprint for Minion Tasks ()
Supported in MergeRollupTask and RealtimeToOfflineSegmentsTask minion tasks
Use taskConfig segmentMapperFileSizeThresholdInBytes to specify the threshold size
Support for swapping of TLS keystore/truststore (, )
Security feature that makes the keystore/truststore swappable.
Auto-reloads keystore/truststore (without need for a restart) if they are local files
Sticky query routing ()
Adds support for deterministic and sticky routing for a query / table / broker. This setting would lead to same server / set of servers (for MultiStageReplicaGroupSelector) being used for all queries of a given table.
Query option (takes precedence over fixed routing setting at table / broker config level)
SET "useFixedReplica"=true;
Table config (takes precedence over fixed routing setting at broker config level)
Table Config to disallow duplicate primary key for dimension tables ()
Use tableConfig dimensionTableConfig.errorOnDuplicatePrimaryKey=true to enable this behavior
Disabled by default
Partition-level ForceCommit for realtime tables ()
Support to force-commit specific partitions of a realtime table.
Partitions can be specified to the forceCommit API as a comma separated list of partition names or consuming segment names
Support initializing broker tags from config ()
Support to give the broker initial tags on startup.
Automatically updates brokerResource when broker joins the cluster for the first time
Broker tags are provided as comma-separated values in pinot.broker.instance.tags
Support for StreamNative OAuth2 authentication for Pulsar ()
StreamNative (the cloud SAAS offering of Pulsar) uses OAuth2 to authenticate clients to their Pulsar clusters.
For more information, see how to
Can be configured by adding the following properties to streamConfigs:
Introduce low disk mode to table rebalance ()
Introduces a new table rebalance boolean config lowDiskMode.Default value is false.
Applicable for rebalance with downtime=false.
When enabled, segments will first be offloaded from servers, then added to servers after offload is done. It may increase the total time of the rebalance, but can be useful when servers are low on disk space, and we want to scale up the cluster and rebalance the table to more servers.
Support vector index and hierarchical navigable small worlds (HNSW) ()
Supports Vector Index on float array/multi-value columns
Add predicate and function to retrieve topK closest vector. Example query
The function l2_distance will return a double value where the first parameter is the embedding column and the second parameter is the search term embedding literal.
Since VectorSimilarity is a predicate, once the topK is configured, the predicate returns the top-K rows per segment. If you use this index together with other predicates, you may not get the expected number of rows, since records matching the other predicates might not be in the top-K rows.
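Putting the description above together, a hedged example query could look like the following; the table, column, and embedding literal are hypothetical, and the exact predicate spelling may differ:
SELECT
  productId,
  l2_distance(embedding, ARRAY[0.12, 0.34, 0.56, 0.78]) AS l2Dist   -- distance between the embedding column and the search embedding
FROM products
WHERE VectorSimilarity(embedding, ARRAY[0.12, 0.34, 0.56, 0.78], 10)  -- topK = 10 rows per segment
ORDER BY l2Dist
LIMIT 10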
Support for retention on deleted keys of upsert tables ()
Adds an upsert config deletedKeysTTL which will remove deleted keys from in-memory hashmap and mark the validDocID as invalid after the deletedKeysTTL threshold period.
Disabled by default. Enabled only if a valid value for deletedKeysTTL is set.
Configurable Lucene analyzer ()
Introduces the capability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis.
Sample usage
Default Behavior falls back to using the standardAnalyzer unless the luceneAnalyzerClass property is specified.
Support for murmur3 as a partition function ()
Murmur3 support with optional fields seed and variant for the hash in the functionConfig field of columnPartitionMap. Default value for seed is 0.
Added support for 2 variants of Murmur3: x86_32
New optimized MV forward index to only store unique MV values
Adds new MV dictionary encoded forward index format that only stores the unique MV entries.
This new index format can significantly reduce the index size when the MV entries repeat a lot
The new index format can be enabled during index creation, derived column creation, and segment reload
Support for explicit null handling modes ()
Adds support for 2 possible ways to handle null:
Table mode - which already exists
Column mode, which means that each column specifies its own nullability in the FieldSpec
Support tracking out of order events in Upsert ()
Adds a new upsert config outOfOrderRecordColumn
When set to a non-null value, we check whether an event is OOO or not and then accordingly update the corresponding column value to true / false.
This will help in tracking which event is out-of-order while using skipUpsert
Compression configuration support for aggregationConfigs to StartreeIndexConfigs ()
Can be used to save space, for example, when a functionColumnPairs entry has an output type of bytes, such as when you use distinctcountrawhll.
Sample config
Preconfiguration based mirror instance assignment ()
Supports instance assignment based pre-configured instance assignment map.
The assignment will always respect the mirrored servers in the pre-configured map
More details
Support for listing dimension tables ()
Adds dimension as a valid option to table "type" in the /tables controller API
Support in upsert for dropping out of order events ()
This patch adds a new config for upsert: dropOutOfOrderRecord
If set to true, Pinot doesn't persist out-of-order events in the segment.
This feature is useful to
Support to retry failed table rebalance tasks ()
New configs for the RebalanceChecker periodic task:
controller.rebalance.checker.frequencyPeriod: 5 minutes by default; set to -1 to disable
controller.rebalanceChecker.initialDelayInSeconds
Support for UltraLogLog ()
UltraLogLog aggregations for Count Distinct (distinctCountULL and distinctCountRawULL)
UltraLogLog creation via Transform Function
UltraLogLog merging in MergeRollup
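For instance, an approximate distinct count using the new UltraLogLog aggregation (table and column names are hypothetical):
SELECT distinctCountULL(userId) AS approxDistinctUsers  -- UltraLogLog-based estimate
FROM clickEvents
WHERE country = 'US'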
Support for Apache Datasketches CPC sketch ()
Ingestion via transformation function
Extracting estimates via query aggregation functions
Segment rollup aggregation
Support to reduce DirectMemory OOM chances on broker ()
Broadly there are two configs that will enable this feature:
maxServerResponseSizeBytes: Maximum serialized response size across all servers for a query. This value is equally divided across all servers processing the query.
maxQueryResponseSizeBytes: Maximum length of the serialized response per server for a query
UI support to allow schema to be created with JSON config ()
This is helpful when the user has the entire JSON handy.
The UI still keeps the form-based way to add a schema, along with the JSON view.
Support in JSON index for ignoring values longer than a given length ()
Use the maxValueLength option in jsonIndexConfig to restrict the length of indexed values.
A value of 0 (or when the key is omitted) means there is no restriction
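A hedged sketch, assuming a JSON column named payload configured under tableIndexConfig.jsonIndexConfigs with an illustrative 1000-character limit:
"jsonIndexConfigs": {
  "payload": {
    "maxValueLength": 1000
  }
}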
Support for MultiValue VarByte V4 index writer ()
Supports serializing and writing MV columns in VarByteChunkForwardIndexWriterV4
Supports V4 reader that can be used to read SV var length, MV fixed length and MV var length buffers encoded with V4 writer
Improved scalar function support for multi-value columns (, )
Support for FrequentStringsSketch and FrequentLongsSketch aggregation functions ()
Approximation aggregation functions for estimating the frequencies of items in a dataset in a memory-efficient way. More details in the library documentation.
Controller API for table index ()
Table index API to get the aggregate index details of all segments for a table.
URL: /tables/{tableName}/indexes
Response format
Support for configurable rebalance delay at lead controller ()
The lead controller rebalance delay is now configurable with controller.resource.rebalance.delay_ms
Changing rebalance configurations will now update the lead controller resource
Support for configuration through environment variables ()
Adds support for Pinot configuration through environment variables with dynamic mapping.
More details are in the related issue.
Sample configs through ENV
Add hyperLogLogPlus aggregation function for distinct count ()
HLL++ has higher accuracy than HLL when the cardinality of the dimension is in the 10k-100k range.
More details
Support for clpMatch
Adds query rewriting logic to transform a "virtual" UDF, clpMatch, into a boolean expression on the columns of a CLP-encoded field.
To use the rewriter, modify broker config to add org.apache.pinot.sql.parsers.rewriter.ClpRewriter to pinot.broker.query.rewriter.class.names.
Support for DATETIMECONVERTWINDOWHOP function ()
Support for JSON_EXTRACT_INDEX transform function to leverage json index for json value extraction ()
Support for ArrayAgg aggregation function ()
GenerateData command support for generating data in JSON format ()
Enhancements
SQL
Support ARRAY function as a literal evaluation ()
Support for ARRAY literal transform functions ()
Theta Sketch Aggregation enhancements ()
UI
Async rendering of UI elements, resulting in faster page loads ()
Make the table name link clickable in task details ()
Swagger UI enhancements to resumeConsumption API call ()
Misc
Enhancement to reduce the heap usage of String Dictionaries that are loaded on-heap ()
Wire soft upsert delete for Compaction task ()
Upsert compaction debuggability APIs for validDocId metadata ()
Bug fixes, refactoring, cleanups, deprecations
Upsert bugfix in "rewind()" for CompactedPinotSegmentRecordReader ()
Fix error message format for Preconditions.checks failures ()
Bugfix to distribute Pinot as a multi-release JAR (, )
Backward incompatible Changes
Fix a race condition for upsert compaction (). Notes on backward incompatibility below:
This PR introduces backward incompatibility for UpsertCompactionTask. Previously, the compaction task could be configured without the snapshot enabled. We found that using in-memory validDocIds is risky because it does not guarantee consistency (e.g., fetching the validDocIds bitmap while the server is restarting and updating validDocIds).
We now enforce enableSnapshot=true for UpsertCompactionTask; advanced users who want to run the compaction with the in-memory validDocIds bitmap should see the invalidDocIdsType options described later in these notes.
Library upgrades and dependencies
Update maven-jar-plugin and maven-enforcer-plugin version (#11637)
Update testng as the test provider explicitly instead of relying on the classpath. ()
Update compatibility verifier version ()
As a side effect, the is_colocated_by_join_keys query option is reintroduced to ensure dynamic broadcast, which can also benefit from the direct exchange optimization.
Allows propagation of partition distribution trait info across the tree to be used during the physical planning phase. It can be used in several scenarios (to be followed up in separate PRs).
Note on backward incompatibility:
The is_colocated_by_join_keys hint is now required for colocated joins.
This should only affect semi-joins, because they are the only joins utilizing broadcast exchange that were pulled to act as direct exchange.
Inner/left/right/full joins automatically apply colocation, so the backward incompatibility should not affect them.
Any remaining nodes that cannot be planned into PinotQuery will be run locally, together with the LeafStageTransferrableBlockOperator as the input.
Bugfix for IN and NOT IN filters within case statements (#12305)
Broker conf - pinot.broker.use.fixed.replica=true
#12112 adds the UI capability to toggle this option
The variant is configurable using the variant field in functionConfig. If no variant is provided, the x86_32 variant is kept, as it was part of the original implementation.
Examples of functionConfig:
If no functionConfig is configured, the seed value will be 0 and the variant will be x86_32.
If the seed is configured as 9001 but no variant is provided, x86_32 will be picked.
If the variant is specified, Murmur3 will use the x64_32 variant with 9001 as the seed.
Note for users using Debezium with Murmur3 as the partitioning function:
The partitioning key should be set up on a byte[], String, or long[] column.
On the Pinot side, the variant should be set to x64_32 and the seed to 9001.
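A hedged sketch of the third case above (explicit seed and variant) inside segmentPartitionConfig; the column name and partition count are illustrative:
"segmentPartitionConfig": {
  "columnPartitionMap": {
    "memberId": {
      "functionName": "Murmur3",
      "numPartitions": 4,
      "functionConfig": {
        "seed": "9001",
        "variant": "x64_32"
      }
    }
  }
}
Omitting functionConfig entirely gives the first case (seed 0, x86_32), and providing only the seed gives the second case (x86_32 with the configured seed).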
To enable the new MV forward index format described above, set the compression codec in the FieldConfig, or configure it through the new index JSON.
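A hedged sketch of the FieldConfig route; the column name is illustrative, and the codec name (MV_ENTRY_DICT) is an assumption that should be checked against the forward index docs:
"fieldConfigList": [
  {
    "name": "tags",
    "encodingType": "DICTIONARY",
    "compressionCodec": "MV_ENTRY_DICT"
  }
]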
Column mode can be enabled with the config below.
The default value for enableColumnBasedNullHandling is false. When set to true, Pinot ignores TableConfig.IndexingConfig.nullHandlingEnabled, and columns are nullable if and only if FieldSpec.notNull is false, which is also the default value.
Sample config
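A hedged sketch of column mode, assuming enableColumnBasedNullHandling is set at the schema level alongside per-column notNull flags (the schema and column names are illustrative):
{
  "schemaName": "transactions",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    { "name": "description", "dataType": "STRING", "notNull": false },
    { "name": "transactionId", "dataType": "LONG", "notNull": true }
  ]
}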
The dropOutOfOrderRecord feature described earlier is useful to:
Save disk usage.
Avoid confusion when using skipUpsert for partial-upsert tables, where nulls start showing up for columns that previously held non-null values and it is not clear whether the event was out-of-order.
New configs added for RebalanceConfig (for the table rebalance retry support above):
heartbeatIntervalInMs: 300_000 i.e. 5min
heartbeatTimeoutInMs: 3600_000 i.e. 1hr
maxAttempts: 3 by default, i.e. the original run plus two retries
retryInitialDelayInMs: 300_000 i.e. 5min, for exponential backoff w/ jitters
New metrics to monitor rebalance and its retries:
TABLE_REBALANCE_FAILURE("TableRebalanceFailure", false), emit from TableRebalancer.rebalanceTable()
TABLE_REBALANCE_EXECUTION_TIME_MS("tableRebalanceExecutionTimeMs", false), emit from TableRebalancer.rebalanceTable()
TABLE_REBALANCE_FAILURE_DETECTED("TableRebalanceFailureDetected", false), emit from RebalanceChecker
TABLE_REBALANCE_RETRY("TableRebalanceRetry", false), emit from RebalanceChecker
New REST API
DELETE /tables/{tableName}/rebalance to stop a rebalance. In comparison, POST /tables/{tableName}/rebalance starts one.
Support for UltraLogLog in Star-Tree indexes (StarTree aggregation)
The configs are available as a query option, table config, and broker config, with a defined priority of enforcement among them.
Adds configuration options for DistinctCountThetaSketchAggregationFunction
Respects ordering for existing Theta sketches to use "early-stop" optimisations for unions
Add query option override for Broker MinGroupTrimSize (#11984)
Support for 2 new scalar functions for bytes: toUUIDBytes and fromUUIDBytes (#11988)
Config option to make groupBy trim size configurable at Broker (#11958)
Pre-aggregation support for distinct count hll++ (#11747)
Add float type into literal thrift to preserve literal type conforming to SQL standards (#11697)
Enhancement to add query function override for Aggregate functions of multi valued columns (#11307)
Perf optimization in IN clause evaluation (#11557)
Add TextMatchFilterOptimizer to maximally push down text_match filters to Lucene (#12339)
Adds support for CTRL key as a modifier for Query shortcuts (#12087)
UI enhancement to show partial index in reload (#11913)
UI improvement to add Links to Instance in Table and Segment View (#11807)
Fixes reload to use the right indexes API instead of fetching all segment metadata (#11793)
Enhancement to add toggle to hide/show query exceptions (#11611)
Make server resource classes configurable (#12324)
Shared aggregations for Startree index - mapping from aggregation used in the query to aggregation used to store pre-aggregated values (#12164)
Increased fetch timeout for Kinesis to prevent stuck Kinesis consumers
Allow users to pass custom RecordTransformers to SegmentProcessorFramework (#11887)
Add isPartialResult flag to broker response (#11592)
Add new configs to Google Cloud Storage (GCS) connector: jsonKey (#11890)
jsonKey is the GCP credential key in string format (either a plain string or a base64-encoded string). Refer to Creating and managing service account keys to download the keys.
Performance enhancement to build segments in column orientation (#11776)
Disabled by default. Can be enabled by setting the table config columnMajorSegmentBuilderEnabled.
Observability enhancements to emit metrics for grpc request and multi-stage leaf stage (#11838)
pinot.server.query.log.maxRatePerSecond: query log max rate (QPS, default 10K)
pinot.server.query.log.droppedReportMaxRatePerSecond: dropped query log report max rate (QPS, default 1)
Observability improvement to expose GRPC metrics (#11842)
Improvements to response format for reload API to be pretty printed (#11608)
Add more information in RequestContext class (#11708)
Support to read exact buffer byte ranges corresponding to a given forward index doc id (#11729)
Enhance Broker reducer to handle expression format change (#11762)
Capture build scans on ge.apache.org to benefit from deep build insights (#11767)
Performance enhancement in multiple places by updating initial capacity of HashMap (#11709)
Support for building indexes post segment file creation, allowing indexes that may depend on a completed segment to be built as part of the segment creation process (#11711)
Support excluding time values in SimpleSegmentNameGenerator (#11650)
Perf enhancement to reduce cpu usage by avoiding throwing an exception during query execution (#11715)
Added framework for supporting nulls in ScalarTransformFunctionWrapper in the future (#11653)
Observability change to metrics to export netty direct memory used and max (#11575)
Observability change to add a metric to measure total thread cpu time for a table (#11713)
Observability change to use SlidingTimeWindowArrayReservoir in dropwizard metrics (#11695)
Fix the bug of using push time to identify a newly created segment (#11599)
Bugfix in CSVRecordReader when using line iterator (#11581)
Remove split commit and some deprecated config for real-time protocol on controller (#11663)
Improved validation for single argument aggregation functions (#11556)
Fix to not emit lag once the TableDataManager is shut down (#11534)
Bugfix to fail reload if derived columns can't be created (#11559)
Fix the double unescape of property value (#12405)
Fix for the backward compatible issue that existing metadata may contain unescaped characters (#12393)
Skip invalid json string rather than throwing error during json indexing (#12238)
Fixing the multiple files concurrent write issue when reloading SSLFactory (#12384)
Fix memory leaking issue by making thread local variable static (#12242)
Simplify kafka build and remove old kafka 0.9 files (#11638)
Add comments for docker image tags and make a hyperlink to helmChart from the root directory (#11646)
Improve the error response on the controller (#11624)
Simplify authorization for table config get (#11640)
Bugfix to remove segments with empty download url in UpsertCompactionTask (#12320)
Test changes to make taskManager resources protected for derived classes to override in their setUp() method. (#12335)
Also, invalidDocIdsType can now be configured for UpsertCompactionTask for advanced users. The supported types are:
snapshot: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot from the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.
onHeap: the validDocIds bitmap will be fetched from the server.
onHeapWithDelete: the validDocIds bitmap will be fetched from the server. This will also take into account the deleted documents. UpsertConfig's deleteRecordColumn must be provided for this type.
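A hedged sketch of the task config with the new option; the schedule is illustrative:
"task": {
  "taskTypeConfigsMap": {
    "UpsertCompactionTask": {
      "schedule": "0 */10 * * * ?",
      "invalidDocIdsType": "snapshot"
    }
  }
}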
Removal of the feature flag allow.table.name.with.database (#12402)
Error handling to throw exception when schema name doesn't match table name during table creation (#11591)
Fix type cast issue with dateTimeConvert scalar function (#11839, #11971)
Incompatible API fix to remove table state update operation in GET call (#11621)
Use string to represent BigDecimal datatype in JSON response (#11716)
Single quoted literal will not have its type auto-derived to maintain SQL compatibility (#11763)
Changes to always use split commit on server and disables the option to disable it (#11680, #11687)
Change to not allow NaN as default value for Float and Double in Schemas (#11661)
Code cleanup and refactor that removes TableDataManagerConfig (#12189)
Fix partition handling for consistency of values between query and segment (#12115)
Changes for migration to commons-configuration2 (#11985)
Cleanup to simplify the upsert metadata manager constructor (#12120)
Example query for the vector index and VECTOR_SIMILARITY predicate described at the top of these notes:
SELECT ProductId, UserId, l2_distance(embedding, ARRAY[-0.0013143676,-0.011042999,...]) AS l2_dist, n_tokens, combined
FROM fineFoodReviews
WHERE VECTOR_SIMILARITY(embedding, ARRAY[-0.0013143676,-0.011042999,...], 5)
ORDER BY l2_dist ASC
LIMIT 10