release-1.0.0

Introduction

Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics.

We'd love to hear from you! Join us in our Slack channel to ask questions, troubleshoot, and share feedback.

Apache Pinot is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch data sources, including Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.

Apache Pinot includes the following:

Ultra low-latency analytics even at extremely high throughput.
Columnar data store with several smart indexing and pre-aggregation techniques.
Scaling up and out with no upper bound.
Consistent performance based on the size of your cluster and an expected query per second (QPS) threshold.

It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.

User-facing real-time analytics

User-facing analytics refers to the analytical tools exposed to the end users of your product. In a user-facing analytics application, all users receive personalized analytics on their devices, resulting in hundreds of thousands of queries per second. Queries triggered by apps may grow quickly in proportion to the number of active users on the app, as many as millions of events per second. Data generated in Pinot is immediately available for analytics in latencies under one second.

User-facing real-time analytics requires the following:

Fresh data. The system needs to be able to ingest data in real time and make it available for querying, also in real time.
Support for high-velocity, highly dimensional event data from a wide range of actions and from multiple sources.
Low latency. Queries are triggered by end users interacting with apps, resulting in hundreds of thousands of queries per second with arbitrary patterns.
Reliability and high availability.
Scalability.
Low cost to serve.

Why Pinot?

Pinot is designed to execute OLAP queries with low latency. It works well where you need fast analytics, such as aggregations, on both mutable and immutable data.

User-facing, real-time analytics

Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications, such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a user-facing analytics app built with Pinot.

Real-time dashboards for business metrics

Pinot can perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large-scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. You can connect business intelligence (BI) tools such as Superset, Tableau, or PowerBI to visualize data in Pinot.

Enterprise business intelligence

For analysts and data scientists, Pinot works well as a highly scalable data platform for business intelligence. Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.

Enterprise application development

For application developers, Pinot works well as an aggregate store that sources events from streaming data sources, such as Kafka, and makes them available for querying using SQL. You can also use Pinot to aggregate data across a microservice architecture into one easily queryable view of the domain.

Pinot tenants prevent any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent.

Get started

If you're new to Pinot, take a look at our Getting Started guide:

To start importing data into Pinot, see how to import batch and stream data:

To start querying data in Pinot, check out our Query guide:

Learn

For a conceptual overview that explains how Pinot works, check out the Concepts guide:

To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:

Basics

Concepts

Explore the fundamental concepts of Apache Pinot for efficient data processing and analysis. Gain insights into the core principles and foundational ideas behind Pinot's capabilities.

Pinot is designed to deliver low-latency queries on large datasets. To achieve this performance, Pinot stores data in a columnar format and adds indices to perform fast filtering, aggregation, and group by.

Raw data is broken into small data shards. Each shard is converted into a unit called a segment. One or more segments together form a table, which is the logical container for querying Pinot using SQL.

Pinot storage model

Pinot's storage model and infrastructure components include segments, tables, tenants, and clusters.

Segment

Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. To support this, all data needs to be distributed across multiple nodes. Pinot achieves this by breaking data into smaller chunks known as segments (similar to shards/partitions in high-availability (HA) relational databases). Another way to describe segments is as time-based partitions.

Table

Similar to traditional databases, Pinot has the concept of a table, a logical abstraction that refers to a collection of related data.

As is the case with relational database management systems (RDBMS), a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a schema that defines the columns in a table as well as their data types.

In contrast to RDBMS schemas, multiple tables in Pinot can share a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, or replication.

Tenant

Pinot supports multi-tenancy. Every Pinot table is associated with a tenant. This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation between tenants provides different namespaces for applications and teams to prevent sharing tables or schemas. Development teams building applications will never have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.

By default, all tables belong to a default tenant named "default". The concept of tenants is very important, as it satisfies the architectural principle of a "database per service/application" without having to operate many independent data stores. Further, tenants schedule resources so that a table's segments (shards) reside only on a specified set of nodes. Similar to the kind of isolation that is ubiquitously used in Linux containers, compute resources in Pinot can be scheduled to prevent resource contention between tenants.

Cluster

Logically, a cluster is simply a group of tenants. As with the classical definition of a cluster, it is also a grouping of a set of compute nodes. Typically, there is only one cluster per environment/data center. There is no need to create multiple clusters since Pinot supports the concept of tenants. At LinkedIn, the largest Pinot cluster consists of 1000+ nodes distributed across a data center. Nodes can be added to a cluster in a way that linearly increases the performance and availability of queries. The number of nodes and the compute resources per node reliably predict the QPS for a Pinot cluster, and as such, capacity planning can be easily achieved using SLAs that assert performance expectations for end-user applications.

Auto-scaling is also achievable. However, we recommend a fixed number of nodes to keep QPS consistent when query loads vary in sudden, unpredictable end-user usage scenarios.

Pinot components

A Pinot cluster consists of multiple distributed system components. These components are useful to understand for operators that are monitoring system usage or are debugging an issue with a cluster deployment.

Controller
Broker
Server
Minion (optional)

Pinot's integration with Apache Zookeeper and Apache Helix allows it to be linearly scalable for an unbounded number of nodes.

Helix is a cluster management solution designed and created by the authors of Pinot at LinkedIn. Helix drives the state of a Pinot cluster from a transient state to an ideal state, acting as the fault-tolerant distributed state store that guarantees consistency. Helix is embedded as agents that operate within a controller, broker, and server, and does not exist as an independent and horizontally scaled component.

Pinot Controller

A controller is the core orchestrator that drives the consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and have visibility of the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent that is a participant responsible for observing and driving state changes that are subscribed to by other components.

In addition to cluster management, resource allocation, and scheduling, the controller is also the HTTP gateway for REST API administration of a Pinot deployment. A web-based query console is also provided for operators to quickly and easily run SQL/PQL queries.

Pinot Broker

A broker receives queries from a client and routes their execution to one or more Pinot servers before returning a consolidated response.

Pinot Server

Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can either be a real-time server or an offline server.

Real-time and offline servers have very different resource usage requirements. Real-time servers continually consume new messages from external systems (such as Kafka topics), which are ingested and allocated on segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.

Pinot Minion

Pinot Minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation). As Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a solution for this purpose that complies with GDPR while optimizing Pinot segments and building additional indices that guarantee performance even when data may be deleted. You can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (Minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.

Components

Discover the core components of Apache Pinot, enabling efficient data processing and analytics. Unleash the power of Pinot's building blocks for high-performance data-driven applications.

Pages in this section define and describe the major components and logical abstractions used in Pinot.

For a general overview that ties all these components together, see Basic Concepts.

Operator reference

Developer reference

Cluster

Learn to build and manage Apache Pinot clusters, uncovering key components for efficient data processing and optimized analysis.

A cluster is a set of nodes comprising servers, brokers, controllers, and minions.

Pinot uses Apache Helix for cluster management. Helix is a cluster management framework that manages replicated, partitioned resources in a distributed system. Helix uses Zookeeper to store cluster state and metadata.

Cluster configuration

For details of cluster configuration settings, see Cluster configuration reference.

Cluster components

Helix divides nodes into logical components based on their responsibilities:

Participant

Participants are the nodes that host distributed, partitioned resources.

Pinot servers are modeled as participants. For details about server nodes, see Server.

Spectator

Spectators are the nodes that observe the current state of each participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).

Pinot brokers are modeled as spectators. For details about broker nodes, see Broker.

Controller

The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.

Pinot controllers are modeled as controllers. For details about controller nodes, see Controller.

Logical view

Another way to visualize the cluster is a logical view, where:

A cluster contains tenants
Tenants contain tables
Tables contain segments
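The containment hierarchy above can be pictured as nested data structures. The following is a minimal sketch for illustration only, not how Pinot stores its metadata; the class names and the example cluster, tenant, table, and segment names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Table:
    name: str
    # Segment names held by this table (hypothetical names in the example).
    segments: List[str] = field(default_factory=list)


@dataclass
class Tenant:
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)


@dataclass
class Cluster:
    name: str
    tenants: Dict[str, Tenant] = field(default_factory=dict)

    def segments_of(self, tenant: str, table: str) -> List[str]:
        # Walk cluster -> tenant -> table -> segments.
        return self.tenants[tenant].tables[table].segments


def build_example() -> Cluster:
    # Hypothetical names, chosen only to show the nesting.
    table = Table("airlineStats", ["airlineStats_0", "airlineStats_1"])
    tenant = Tenant("exampleTenant", {table.name: table})
    return Cluster("exampleCluster", {tenant.name: tenant})
```

Calling `build_example().segments_of("exampleTenant", "airlineStats")` walks the three levels in the same order as the list above.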

Set up a Pinot cluster

Typically, there is only one cluster per environment/data center. There is no need to create multiple Pinot clusters because Pinot supports tenants.

To set up a cluster, see one of the following guides:

Running Pinot in Docker
Running Pinot locally

Tenant

Discover the tenant component of Apache Pinot, which facilitates efficient data isolation and resource management within Pinot clusters.

A tenant is a logical component defined as a group of server/broker nodes with the same Helix tag.

In order to support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant. This controls the nodes that will be used by this table as servers and brokers. This allows all tables belonging to a particular use case to be grouped under a single tenant name.

The concept of tenants is very important when multiple use cases are using Pinot and there is a need to provide quotas or some sort of isolation across tenants. For example, consider two tables, Table A and Table B, in the same Pinot cluster.

We can configure Table A with server tenant Tenant A and Table B with server tenant Tenant B. We can tag some of the server nodes for Tenant A and some for Tenant B. This will ensure that segments of Table A only reside on servers tagged with Tenant A, and segments of Table B only reside on servers tagged with Tenant B. The same isolation can be achieved at the broker level, by configuring broker tenants for the tables.

No need to create separate clusters for every table or use case!

Tenant configuration

The tenant is defined in the tenants section of the table config.

This section contains two main fields, broker and server, which decide the tenants used for the broker and server components of this table.

In the above example:

The table will be served by brokers that have been tagged as brokerTenantName_BROKER in Helix.
If this were an offline table, the offline segments for the table will be hosted in Pinot servers tagged in Helix as serverTenantName_OFFLINE.
If this were a real-time table, the real-time segments (both consuming as well as completed ones) will be hosted in Pinot servers tagged in Helix as serverTenantName_REALTIME.
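The tag naming rules in the list above can be sketched in a few lines. This is an illustration under stated assumptions: the config fragment below is a hypothetical tenants section using the names brokerTenantName and serverTenantName from the tags above, and `helix_tags` is a helper invented for this sketch, not a Pinot API.

```python
# Hypothetical table config fragment: only the tenants section is shown.
example_table_config = {
    "tenants": {
        "broker": "brokerTenantName",
        "server": "serverTenantName",
    }
}


def helix_tags(table_config: dict, table_type: str) -> dict:
    """Derive the Helix tags that brokers and servers must carry for a table.

    table_type is "OFFLINE" or "REALTIME".
    """
    if table_type not in ("OFFLINE", "REALTIME"):
        raise ValueError(f"unknown table type: {table_type}")
    tenants = table_config["tenants"]
    return {
        # Brokers are tagged <brokerTenant>_BROKER.
        "broker": f"{tenants['broker']}_BROKER",
        # Servers are tagged <serverTenant>_OFFLINE or <serverTenant>_REALTIME.
        "server": f"{tenants['server']}_{table_type}",
    }
```

For an offline table, `helix_tags(example_table_config, "OFFLINE")` yields the broker tag brokerTenantName_BROKER and the server tag serverTenantName_OFFLINE.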

Create a tenant

Broker tenant

Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant by tagging three untagged broker nodes as sampleBrokerTenant_BROKER.

To create this tenant, use the following command. The creation will fail if the number of untagged broker nodes is less than numberOfInstances.
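The broker tenant creation rule can be sketched as follows. This is a simplified model, not Pinot's implementation: the field names tenantRole and tenantName are assumptions about the shape of the config (only numberOfInstances and the tag suffix come from the text above), and the node names are hypothetical.

```python
def broker_tenant_request(name: str, number_of_instances: int) -> dict:
    # tenantRole and tenantName are assumed field names for illustration.
    return {
        "tenantRole": "BROKER",
        "tenantName": name,
        "numberOfInstances": number_of_instances,
    }


def tag_brokers(request: dict, untagged_brokers: list) -> dict:
    """Return a node -> tag mapping, or fail if too few untagged brokers exist."""
    needed = request["numberOfInstances"]
    if len(untagged_brokers) < needed:
        raise ValueError(
            f"need {needed} untagged brokers, found {len(untagged_brokers)}"
        )
    tag = f"{request['tenantName']}_BROKER"
    # Tag only as many nodes as requested; the rest stay untagged.
    return {node: tag for node in untagged_brokers[:needed]}
```

With four untagged brokers and a request for three, three nodes receive the tag sampleBrokerTenant_BROKER and one stays untagged. With only two untagged brokers, the call raises an error, mirroring the failure described above.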

Follow the instructions to get Pinot locally, and then run the command.

Check out the table config in the REST API to make sure it was successfully uploaded.

Server tenant

Here's a sample server tenant config. This will create a server tenant sampleServerTenant by tagging 1 untagged server node as sampleServerTenant_OFFLINE and 1 untagged server node as sampleServerTenant_REALTIME.

To create this tenant, use the following command. The creation will fail if the number of untagged server nodes is less than offlineInstances + realtimeInstances.
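The server tenant rule differs from the broker case because the requested nodes are split between two tags. Below is a minimal sketch of that split, assuming nodes are tagged in list order; the function and node names are hypothetical and this is not Pinot's implementation.

```python
def tag_servers(
    tenant_name: str,
    offline_instances: int,
    realtime_instances: int,
    untagged_servers: list,
) -> dict:
    """Return a node -> tag mapping for a server tenant."""
    needed = offline_instances + realtime_instances
    if len(untagged_servers) < needed:
        raise ValueError(
            f"need {needed} untagged servers, found {len(untagged_servers)}"
        )
    tags = {}
    # The first offline_instances nodes become offline servers.
    for node in untagged_servers[:offline_instances]:
        tags[node] = f"{tenant_name}_OFFLINE"
    # The next realtime_instances nodes become real-time servers.
    for node in untagged_servers[offline_instances:needed]:
        tags[node] = f"{tenant_name}_REALTIME"
    return tags
```

For the sample config above (one offline and one real-time instance), two untagged servers are enough; with a single untagged server the call fails.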

Follow the instructions to get Pinot locally, and then run the command.

Check out the table config in the REST API to make sure it was successfully uploaded.

Server

Uncover the efficient data processing and storage capabilities of Apache Pinot's server component, optimizing performance for data-driven applications.

Servers host the data segments and serve queries off the data they host. There are two types of servers:

Offline: Offline servers are responsible for downloading segments from the segment store, to host and serve queries off. When a new segment is uploaded to the controller, the controller decides the servers (as many as replication) that will host the new segment and notifies them to download the segment from the segment store. On receiving this notification, the servers download the segment file and load the segment onto the server, to serve queries off them.

Real-time: Real-time servers directly ingest from a real-time stream (such as Kafka or EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.

Pinot servers are modeled as Helix participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more Helix partitions of one or more Helix resources (i.e. one or more segments of one or more tables).

Starting a server

Make sure you've . If you're using Docker, make sure to . To start a server:

Controller

Discover the controller component of Apache Pinot, enabling efficient data and query management.

The Pinot controller is responsible for the following:

Maintaining global metadata (e.g., configs and schemas) of the system with the help of Zookeeper which is used as the persistent metadata store.
Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions).
Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.
Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.
Serving endpoints for segment uploads, which are used in offline data pushes. The controller is also responsible for initializing real-time consumption and coordinating the periodic persistence of real-time segments into the segment store.
Undertaking other management activities, such as segment retention and validations.

For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can also use other storage systems, such as HDFS.

Running the periodic task manually

The controller runs several periodic tasks in the background, to perform activities such as management and validation. Each periodic task has to define the run frequency and default frequency. Each task runs at its own schedule or can also be triggered manually if needed. The task runs on the lead controller for each table.

For periodic task configuration details, see .

Use the GET /periodictask/names API to fetch the names of all the periodic tasks running on your Pinot cluster.

To manually run a named periodic task, use the GET /periodictask/run API:

The Log Request Id (api-09630c07) can be used to search through the pinot-controller log file for entries related to the execution of the periodic task that was manually run.

If tableName (and its type OFFLINE or REALTIME) is not provided, the task will run against all tables.
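As an illustration of the manual run call, the sketch below only builds the request URL. The query parameter names taskname, tableName, and type are assumptions, as are the controller address, the task name SegmentStatusChecker, and the table name; confirm the exact parameters in the Swagger UI of your controller before relying on them.

```python
from typing import Optional
from urllib.parse import urlencode


def periodic_task_run_url(
    controller: str,
    task_name: str,
    table_name: Optional[str] = None,
    table_type: Optional[str] = None,
) -> str:
    """Build the URL for GET /periodictask/run (parameter names are assumed)."""
    params = {"taskname": task_name}
    # Omitting the table means the task runs against all tables.
    if table_name is not None:
        params["tableName"] = table_name
    if table_type is not None:
        if table_type not in ("OFFLINE", "REALTIME"):
            raise ValueError(f"unknown table type: {table_type}")
        params["type"] = table_type
    return f"{controller}/periodictask/run?{urlencode(params)}"


# Example with hypothetical values.
url = periodic_task_run_url(
    "http://localhost:9000", "SegmentStatusChecker", "airlineStats", "OFFLINE"
)
```

The resulting URL can be passed to any HTTP client, such as curl, to issue the GET request.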

Starting a controller

Make sure you've . If you're using Docker, make sure to . To start a controller:

Broker

Discover how Apache Pinot's broker component optimizes query processing, data retrieval, and enhances data-driven applications.

Brokers handle Pinot queries. They accept queries from clients and forward them to the right servers. They collect results back from the servers and consolidate them into a single response, to send back to the client.

Pinot brokers are modeled as Helix spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.

The broker ensures that all the rows of the table are queried exactly once so as to return correct, consistent results for a query. The brokers may optimize to prune some of the segments as long as accuracy is not sacrificed.

Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.

In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.

For example, suppose we have real-time data for 5 days, March 23 to March 27, and offline data has been pushed until March 25, which is 2 days behind real-time. The brokers maintain this time boundary.

Suppose we get a query to this table: select sum(metric) from table. The broker splits the query into two queries based on this time boundary, one for offline and one for real-time. The query becomes select sum(metric) from table_REALTIME where date >= Mar 25 and select sum(metric) from table_OFFLINE where date < Mar 25.

The broker merges results from both these queries before returning the result to the client.
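The split-and-merge described above can be modeled in a few lines. This is a simplified sketch, not broker code: rows are hypothetical (day of March, metric) pairs, and the boundary follows the example, with the offline side filtered on date < 25 and the real-time side on date >= 25.

```python
def federated_sum(offline_rows, realtime_rows, boundary_day):
    """Sum a metric across a hybrid table without double counting.

    Each row is a (day, metric) pair. Offline rows are used strictly before
    the boundary, real-time rows from the boundary onward.
    """
    offline_part = sum(m for day, m in offline_rows if day < boundary_day)
    realtime_part = sum(m for day, m in realtime_rows if day >= boundary_day)
    # The broker merges both partial results into one response.
    return offline_part + realtime_part


# Hypothetical data: one row per day, same metric values in both tables
# for the overlapping days (March 23 to March 25).
offline_rows = [(23, 10), (24, 20), (25, 30)]
realtime_rows = [(23, 10), (24, 20), (25, 30), (26, 40), (27, 50)]

total = federated_sum(offline_rows, realtime_rows, boundary_day=25)
```

Here `total` equals 150, the sum over the five distinct days, even though three of those days exist in both tables.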

Starting a broker

Make sure you've . If you're using Docker, make sure to . To start a broker:

Deep Store

Leverage Apache Pinot's deep store component for efficient large-scale data storage and management, enabling impactful data processing and analysis.

The deep store (or deep storage) is the permanent store for segment files.

It is used for backup and restore operations. New server nodes in a cluster will pull down a copy of segment files from the deep store. If the local segment files on a server get damaged in some way (or accidentally deleted), a new copy will be pulled down from the deep store on server restart.

The deep store stores a compressed version of the segment files and it typically won't include any indexes. These compressed files can be stored on a local file system or on a variety of other file systems. For more details on supported file systems, see File Systems.

Note: Deep store by itself is not sufficient for restore operations. Pinot stores metadata such as table config, schema, segment metadata in Zookeeper. For restore operations, both Deep Store as well as Zookeeper metadata are required.

How do segments get into the deep store?

There are several different ways that segments are persisted in the deep store.

For offline tables, the batch ingestion job writes the segment directly into the deep store, as shown in the diagram below:

The ingestion job then sends a notification about the new segment to the controller, which in turn notifies the appropriate server to pull down that segment.

For real-time tables, by default, a segment is first built in memory by the server. It is then uploaded to the lead controller (as part of the Segment Completion Protocol sequence), which writes the segment into the deep store, as shown in the diagram below:

Having all segments go through the controller can become a system bottleneck under heavy load, in which case you can use the peer download policy, as described in Decoupling Controller from the Data Path.

When using this configuration, the server will directly write a completed segment to the deep store, as shown in the diagram below:

Configuring the deep store

For hands-on examples of how to configure the deep store, see the following tutorials:

Getting Started

This section contains quick start guides to help you get up and running with Pinot.

Running Pinot

To simplify the getting started experience, Pinot ships with quick start guides that launch Pinot components in a single process and import pre-built datasets.

For a full list of these guides, see .

Deploy to a public cloud

Data import examples

Getting data into Pinot is easy. Take a look at these two quick start guides, which will help you get up and running with sample data for offline and real-time tables.

Running on public clouds

This page links to multiple quick start guides for deploying Pinot to different public cloud providers.

These quickstart guides show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.

Running on Azure

This quickstart guide helps you get started running Pinot on Microsoft Azure.

In this quickstart guide, you will set up a Kubernetes cluster on Azure Kubernetes Service (AKS).

1. Tooling Installation

1.1 Install Kubectl

Follow the official installation instructions to install kubectl.

For Mac users

Check kubectl version after installation.

Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12.

1.2 Install Helm

To install Helm, see .

For Mac users

Check helm version after installation.

This quickstart supports Helm v3.0.0 and v2.12.1. Pick the script based on your Helm version.

1.3 Install Azure CLI

Follow the official installation instructions to install Azure CLI.

For Mac users

2. (Optional) Log in to your Azure account

This script will open your default browser so you can sign in to your Azure account.

3. (Optional) Create a Resource Group

Use the following script to create a resource group in location eastus.

4. (Optional) Create a Kubernetes cluster (AKS) in Azure

This script will create a 3 node cluster named pinot-quickstart for demo purposes.

Modify the parameters in the following example command with your resource group and cluster details:

Once the command succeeds, the cluster is ready to be used.

5. Connect to an existing cluster

Run the following command to get the credential for the cluster pinot-quickstart that you just created:

To verify the connection, run the following:

6. Pinot quickstart

Follow this to deploy your Pinot demo.

7. Delete a Kubernetes Cluster

Running on GCP

This quickstart guide helps you get started running Pinot on Google Cloud Platform (GCP).

In this quickstart guide, you will set up a Kubernetes cluster on Google Kubernetes Engine (GKE).

1. Tooling Installation

1.1 Install Kubectl

Follow the official installation instructions to install kubectl.

For Mac users

Check kubectl version after installation.

Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12.

1.2 Install Helm

Follow the official installation instructions to install Helm.

For Mac users

Check helm version after installation.

This quickstart supports Helm v3.0.0 and v2.12.1. Choose the script based on your Helm version.

1.3 Install Google Cloud SDK

To install Google Cloud SDK, see

1.3.1 For Mac users

Install Google Cloud SDK

Restart your shell

2. (Optional) Initialize Google Cloud Environment

3. (Optional) Create a Kubernetes cluster (GKE) in Google Cloud

This script will create a 3 node cluster named pinot-quickstart in us-west1-b with n1-standard-2 machines for demo purposes.

Modify the parameters in the following example command with your gcloud details:

Use the following command to monitor cluster status:

Once the cluster is in RUNNING status, it's ready to be used.

4. Connect to an existing cluster

Run the following command to get the credential for the cluster pinot-quickstart that you just created:

To verify the connection, run the following:

5. Pinot quickstart

Follow this to deploy your Pinot demo.

6. Delete a Kubernetes Cluster

Troubleshooting Pinot

Find debug information in Pinot

Pinot offers various ways to assist with troubleshooting and debugging problems that might happen.

Start with the debug API, which surfaces many of the commonly occurring problems. The debug API provides information such as tableSize, ingestion status, and error messages related to state transitions in the server.

The table debug api can be invoked via the Swagger UI, as in the following image:

It can also be invoked directly by accessing the URL as follows. The API requires the tableName, and can optionally take tableType (offline|realtime) and verbosity level.

curl -X GET "http://localhost:9000/debug/tables/airlineStats?verbosity=0" -H "accept: application/json"
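The same call can be scripted. The sketch below mirrors the curl command above using only the Python standard library. The controller address and table name come from that command; the name of the optional table type query parameter (`type`) is an assumption, so check it against the Swagger UI before use.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def table_debug_url(
    controller: str,
    table_name: str,
    table_type: Optional[str] = None,
    verbosity: int = 0,
) -> str:
    """Build the table debug URL. The `type` parameter name is assumed."""
    params = {}
    if table_type is not None:
        params["type"] = table_type
    params["verbosity"] = verbosity
    return f"{controller}/debug/tables/{quote(table_name)}?{urlencode(params)}"


def fetch_table_debug(url: str):
    """Issue the GET request and parse the JSON body (needs a running controller)."""
    request = Request(url, headers={"accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


debug_url = table_debug_url("http://localhost:9000", "airlineStats")
```

`debug_url` matches the URL in the curl command. Calling `fetch_table_debug(debug_url)` against a running controller returns the parsed debug information.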

Pinot also provides a variety of operational metrics that can be used for creating dashboards, alerting and monitoring.

Finally, all Pinot components log debug information related to error conditions.

Debug a slow query or a query which keeps timing out

Use the following steps:

1. If the query executes, look at the query result. Specifically, look at numEntriesScannedInFilter and numDocsScanned.
   a. If numEntriesScannedInFilter is very high, consider adding indexes for the corresponding columns being used in the filter predicates. You should also think about partitioning the incoming data based on the dimension most heavily used in your filter queries.
   b. If numDocsScanned is very high, that means the selectivity for the query is low and lots of documents need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.
2. If the query is not executing, you can extend the query timeout by appending a timeoutMs parameter to the query, for example, select * from mytable limit 10 option(timeoutMs=60000). Then repeat step 1, as needed.
3. Look at garbage collection (GC) stats for the corresponding Pinot servers. If a particular server seems to be running full GC all the time, you can do a couple of things, such as:
   a. Increase Java Virtual Machine (JVM) heap (java -Xmx<size>).
   b. Consider using off-heap memory for segments.
   c. Decrease the total number of segments per server (by partitioning the data in a more efficient way).
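The first two checks in the steps above can be turned into a small helper. This is a sketch under stated assumptions: the thresholds for "very high" are arbitrary placeholders to tune for your data, the response is assumed to be the query result parsed into a dict, and the function names are hypothetical.

```python
def with_timeout(query: str, timeout_ms: int) -> str:
    """Append the timeoutMs query option to a SQL query string."""
    return f"{query} option(timeoutMs={timeout_ms})"


def diagnose(response: dict, max_filter_entries: int = 1_000_000,
             max_docs: int = 1_000_000) -> list:
    """Return tuning hints based on the scan counters in a query response.

    Both thresholds are placeholder values, not Pinot recommendations.
    """
    hints = []
    if response.get("numEntriesScannedInFilter", 0) > max_filter_entries:
        hints.append("add indexes on filter columns or partition the data")
    if response.get("numDocsScanned", 0) > max_docs:
        hints.append("refine the filter to increase selectivity")
    return hints


slow_query = with_timeout("select * from mytable limit 10", 60000)
```

`slow_query` reproduces the timeout example from the steps above. `diagnose` returns an empty list when both counters are below the chosen thresholds.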

Frequently Asked Questions (FAQs)

This page lists pages with frequently asked questions with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, .

General

This page has a collection of frequently asked questions of a general nature with answers from the community.

How does Apache Pinot use deep storage?

When data is pushed to Apache Pinot, Pinot makes a backup copy of the data and stores it on the configured deep storage (S3/GCP/ADLS/NFS/etc). This copy is stored as tar.gz Pinot segments. Note that Pinot servers also keep an (untarred) copy of the segments on their local disk. This is done for performance reasons.

How does Pinot use Zookeeper?

Pinot uses Apache Helix for cluster management, which in turn is built on top of Zookeeper. Helix uses Zookeeper to store the cluster state, including Ideal State, External View, Participants, and so on. Pinot also uses Zookeeper to store information such as Table configurations, schemas, Segment Metadata, and so on.

Why am I getting "Could not find or load class" error when running Quickstart using 0.8.0 release?

Check the JDK version you are using. You may get this error if your JDK is older than the one the current Pinot binary release was built with. If so, you have two options: switch to the same JDK release that Pinot was built with, or download the source code for the Pinot release and build it locally.

Pinot On Kubernetes FAQ

This page has a collection of frequently asked questions about Pinot on Kubernetes with answers from the community.


How to increase server disk size on AWS

The following is an example using Amazon Elastic Kubernetes Service (Amazon EKS).

1. Update Storage Class

In the Kubernetes (k8s) cluster, check the storage class: in Amazon EKS, it should be gp2.

Then update StorageClass to ensure:

Once StorageClass is updated, it should look like this:

2. Update PVC

Once the storage class is updated, then we can update the PersistentVolumeClaim (PVC) for the server disk size.

Now we want to double the disk size for pinot-server-3.

The following is an example of current disks:

The following is the output of data-pinot-server-3:

Now, let's change the PVC size to 2T by editing the server PVC.

Once updated, the specification's PVC size is updated to 2T, but the status's PVC size is still 1T.

3. Restart the pod to apply the change

Restart the pinot-server-3 pod:

Recheck the PVC size:
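
The PVC edit in step 2 can also be expressed as a patch. The following is a minimal sketch, not the procedure used by the original guide: it only builds the JSON patch body and the matching kubectl command as strings. The namespace name pinot is an assumption for illustration; the PVC name and the 2T size come from the steps above.

```python
import json


def pvc_resize_patch(new_size: str) -> str:
    """Return a JSON patch body that sets the requested PVC storage size."""
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return json.dumps(patch)


def kubectl_patch_command(pvc_name: str, namespace: str, new_size: str) -> str:
    """Build (but do not run) a kubectl command that applies the patch."""
    return (
        f"kubectl -n {namespace} patch pvc {pvc_name} "
        f"-p '{pvc_resize_patch(new_size)}'"
    )


# "pinot" is a hypothetical namespace; replace it with your own.
print(kubectl_patch_command("data-pinot-server-3", "pinot", "2T"))
```

The patch only changes the requested size in the PVC specification. As described above, the reported status size changes after the pod is restarted.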

Import Data

This page lists options for importing data into Pinot with links to detailed instructions with examples.

There are multiple options for importing data into Pinot. The pages in this section provide step-by-step instructions for importing records into Pinot, supported by our plugin architecture. The intent is to get you up and running with imported data as quickly as possible.

Pinot supports multiple file input formats without needing to change anything other than the file name. Each example imports a ready-made dataset so you can see how things work without needing to find or create your own dataset.

Pinot Batch Ingestion

These guides show you how to import data from popular big data platforms.

Pinot Stream Ingestion

This guide shows you how to import data using stream ingestion from Apache Kafka topics.

This guide shows you how to import data using stream ingestion with upsert.

This guide shows you how to import data using stream ingestion with deduplication.

This guide shows you how to import data using stream ingestion with CLP.

Pinot file systems

By default, Pinot does not come with a storage layer, so the data sent to it won't be preserved if the system crashes. To store the generated segments persistently, change the controller and server configs to add a deep store. See File systems for details and related configs.

These guides show you how to import data and persist it in these file systems.

Pinot input formats

This guide shows you how to import data from various Pinot-supported input formats.

This guide shows you how to handle the complex type in the ingested data, such as map and array.

Reloading and uploading existing Pinot segments

This guide shows you how to reload Pinot segments from your deep store.

This guide shows you how to upload Pinot segments from an old, closed Pinot instance.

From Query Console

Insert a file into Pinot from Query Console

This feature is supported after the 0.11.0 release. Reference PR:

Prerequisite

Ensure you have available Pinot Minion instances deployed within the cluster.
Pinot version is 0.11.0 or above

How it works

Parse the query with the table name and directory URI along with a list of options for the ingestion job.
Call the controller minion task execution API endpoint to schedule the task on a minion.
The response contains the table name and the task job ID.

Usage Syntax

INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]*

Example

Screenshot

Insert Rows into Pinot

We are actively developing this feature...

The details will be revealed soon.

Flink

Batch ingestion of data into Apache Pinot using Apache Flink.

Pinot supports Apache Flink as a processing framework to push segment files to the database.

The Pinot distribution contains an Apache Flink sink function (PinotSinkFunction) that can be used as part of an Apache Flink application (streaming or batch) to write directly into a designated Pinot database.

Example

Flink application

Here is an example code snippet that shows how to use PinotSinkFunction in a Flink streaming application:

As in the example shown above, the only required information from the Pinot side is the table and the table .

For a more detailed executable, refer to the .

Table Config

PinotSinkFunction uses mostly the TableConfig object to infer the batch ingestion configuration to start a SegmentWriter and SegmentUploader to communicate with the Pinot cluster.

Note that even though the Flink application in the above example runs in streaming mode, the data is still batched together and then flushed and uploaded to Pinot once the flush threshold is reached. It is not a direct streaming write into Pinot.
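
The batch-then-flush behavior described above can be illustrated with a small sketch. This is not the PinotSinkFunction implementation. The class name BatchingSink, its methods, and the upload callback are hypothetical and exist only to show how records arriving one at a time are still delivered in batches.

```python
from typing import Callable, Dict, List


class BatchingSink:
    """Hypothetical sink that buffers records and uploads them in batches."""

    def __init__(self, flush_threshold: int,
                 upload: Callable[[List[Dict]], None]) -> None:
        self.flush_threshold = flush_threshold
        self.upload = upload
        self.buffer: List[Dict] = []

    def invoke(self, record: Dict) -> None:
        """Called once per streamed record; nothing is uploaded yet."""
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Upload the buffered records as one batch and reset the buffer."""
        if self.buffer:
            self.upload(list(self.buffer))
            self.buffer.clear()


uploaded_batches: List[List[Dict]] = []
sink = BatchingSink(flush_threshold=3, upload=uploaded_batches.append)
for i in range(7):
    sink.invoke({"id": i})
sink.flush()  # flush the remainder, for example at the end of the job
print([len(batch) for batch in uploaded_batches])
```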

Here is an example table config:

The only required configurations are:

"outputDirURI": where PinotSinkFunction should write the constructed segment file to
"push.controllerUri": which Pinot cluster (controller) URL PinotSinkFunction should communicate with.

The rest of the configurations are standard for any Pinot table.

Backfill Data

Batch ingestion of backfill data into Apache Pinot.

Introduction

Pinot batch ingestion involves two parts: a routine ingestion job (hourly or daily) and backfill. The following example shows how routine batch ingestion works for a Pinot offline table:

High-level description

Organize raw data into buckets (e.g., /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (e.g., /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro).
Run a Pinot batch ingestion job that points to a specific date folder, such as '/var/pinot/airlineStats/rawdata/2014/01/01'. The segment generation job converts each Avro file into a Pinot segment for that day and gives it a unique name.
Run a Pinot segment push job to upload those segments, with their unique names, via a Controller API.

IMPORTANT: The segment name uniquely identifies a segment in Pinot. If the controller gets an upload request for a segment with the same name, it will attempt to replace the existing segment with the new one.

This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.

How to backfill data in Pinot

Pinot supports data modification only at the segment level, which means you must replace entire segments to backfill data. The high-level idea is to repeat steps 2 (segment generation) and 3 (segment upload) mentioned above:

Backfill jobs must run at the same granularity as the daily job. For example, if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g., '/var/pinot/airlineStats/rawdata/2014/01/01').
The backfill job then generates segments with the same names as the original job (with the new data).
When those segments are uploaded to Pinot, the controller replaces the old segments with the new ones one by one (segment names act like primary keys within Pinot).

Edge case example

Backfill jobs expect the same number of (or more) data files on the backfill date, so that the segment generation job creates at least as many segments as the original run.

For example, assume table airlineStats has 2 segments (airlineStats_2014-01-01_2014-01-01_0, airlineStats_2014-01-01_2014-01-01_1) for date 2014/01/01 and the backfill input directory contains only 1 input file. The segment generation job then creates just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only segment airlineStats_2014-01-01_2014-01-01_0 is replaced, and the stale data in segment airlineStats_2014-01-01_2014-01-01_1 is still there.

If the raw data is modified in such a way that the original time bucket has fewer input files than the first ingestion run, backfill will fail.
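
The edge case above follows directly from replacement by segment name. The sketch below models a table as a dictionary keyed by segment name. This is a simplification for illustration, not how the controller stores segments, and the function name push_segments is hypothetical.

```python
from typing import Dict


def push_segments(table: Dict[str, str], new_segments: Dict[str, str]) -> None:
    """Replace or add segments by name, mimicking name-based replacement."""
    table.update(new_segments)


# Original run: two input files produced two segments for 2014/01/01.
table = {
    "airlineStats_2014-01-01_2014-01-01_0": "original data, file 0",
    "airlineStats_2014-01-01_2014-01-01_1": "original data, file 1",
}

# Backfill run: only one input file, so only one segment is generated.
backfill = {
    "airlineStats_2014-01-01_2014-01-01_0": "corrected data, file 0",
}

push_segments(table, backfill)

# Segments that the backfill did not touch still hold the old data.
stale = sorted(name for name in table if name not in backfill)
print(stale)
```

Because nothing with the name of segment 1 is uploaded, nothing replaces it, which is why the backfill input needs at least as many files as the original run.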

Dimension table

Batch ingestion of data into Apache Pinot using dimension tables.

Dimension tables are a special kind of offline tables from which data can be looked up via the lookup UDF, providing join-like functionality.

Dimension tables are replicated on all the hosts for a given tenant to allow faster lookups.

To mark an offline table as a dimension table, set isDimTable to true and segmentsConfig.segmentPushType to REFRESH in the table config, like this:

{
  "OFFLINE": {
    "tableName": "dimBaseballTeams_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "schemaName": "dimBaseballTeams",
      "segmentPushType": "REFRESH"
    },
    "metadata": {},
    "quota": {
      "storage": "200M"
    },
    "isDimTable": true
  }
}

As dimension tables are used to perform lookups of dimension values, they are required to have a primary key (can be a composite key).

{
  "dimensionFieldSpecs": [
    {
      "dataType": "STRING",
      "name": "teamID"
    },
    {
      "dataType": "STRING",
      "name": "teamName"
    }
  ],
  "schemaName": "dimBaseballTeams",
  "primaryKeyColumns": ["teamID"]
}

When a table is marked as a dimension table, it will be replicated on all the hosts, which means that these tables must be small in size.

The maximum size quota for a dimension table in a cluster is controlled by the controller.dimTable.maxSize controller property. Table creation will fail if the storage quota exceeds this maximum size.

A dimension table cannot be part of a hybrid table.
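
The requirements in this section can be checked before a table is created. The following is a minimal sketch that inspects a table config and schema shaped like the examples above. The function name dim_table_problems is hypothetical, and the check is not part of Pinot; it covers only the rules stated on this page.

```python
from typing import Dict, List


def dim_table_problems(table_config: Dict, schema: Dict) -> List[str]:
    """Return a list of violations of the dimension table rules above."""
    problems = []
    if table_config.get("tableType") != "OFFLINE":
        problems.append("dimension tables must be OFFLINE tables")
    if table_config.get("isDimTable") is not True:
        problems.append("isDimTable must be set to true")
    push_type = table_config.get("segmentsConfig", {}).get("segmentPushType")
    if push_type != "REFRESH":
        problems.append("segmentsConfig.segmentPushType must be REFRESH")
    if not schema.get("primaryKeyColumns"):
        problems.append("the schema must define primaryKeyColumns")
    return problems


table_config = {
    "tableName": "dimBaseballTeams_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "schemaName": "dimBaseballTeams",
        "segmentPushType": "REFRESH",
    },
    "isDimTable": True,
}
schema = {"schemaName": "dimBaseballTeams", "primaryKeyColumns": ["teamID"]}

print(dim_table_problems(table_config, schema))
```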

Amazon Kinesis

This guide shows you how to ingest a stream of records from an Amazon Kinesis stream into a Pinot table.

To ingest events from an Amazon Kinesis stream into Pinot, set the following configs into the table config:

{
  "tableName": "kinesisTable",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kinesis",
      "stream.kinesis.topic.name": "<your kinesis stream name>",
      "region": "<your region>",
      "accessKey": "<your access key>",
      "secretKey": "<your secret key>",
      "shardIteratorType": "AFTER_SEQUENCE_NUMBER",
      "stream.kinesis.consumer.type": "lowlevel",
      "stream.kinesis.fetch.timeout.millis": "30000",
      "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
      "realtime.segment.flush.threshold.rows": "1000000",
      "realtime.segment.flush.threshold.time": "6h"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

where the Kinesis-specific properties are:

streamType: This should be set to "kinesis".
stream.kinesis.topic.name: Kinesis stream name.
region: Kinesis region, e.g. us-west-1.
accessKey: Kinesis access key.
secretKey: Kinesis secret key.
shardIteratorType: Set to LATEST to consume only new records, TRIM_HORIZON to start from the earliest sequence number, or AT_SEQUENCE_NUMBER or AFTER_SEQUENCE_NUMBER to start consumption from a particular sequence number.
maxRecordsToFetch: ... Default is 20.

Kinesis supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:

Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
Java System Properties - aws.accessKeyId and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable
Instance profile credentials delivered through the Amazon EC2 metadata service

Although you can also specify the accessKey and secretKey in the properties above, we don't recommend this insecure method. Use it only for non-production proof-of-concept (POC) setups. You can also specify other AWS fields, such as AWS_SESSION_TOKEN, as environment variables and config.

Limitations

The shard ID has the format "shardId-000000000001". Pinot uses the numeric part as the partitionId, which is an integer variable. If shard IDs grow beyond Integer.MAX_VALUE, the partitionId will overflow.
Segment size-based thresholds for segment completion will not work, because this logic assumes that partition "0" always exists. However, once shard 0 is split or merged, partition 0 no longer exists.
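
The first limitation can be made concrete with a short sketch of the shard ID to partition ID mapping. This is an illustration of the rule stated above, not Pinot's code. The function name is hypothetical, and the sketch raises an error at the point where a 32-bit Java int would overflow.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


def shard_id_to_partition_id(shard_id: str) -> int:
    """Use the numeric suffix of a Kinesis shard ID as the partition ID."""
    prefix = "shardId-"
    if not shard_id.startswith(prefix):
        raise ValueError(f"unexpected shard ID format: {shard_id!r}")
    number = int(shard_id[len(prefix):])
    if number > INT_MAX:
        # A Java int cannot represent this value.
        raise OverflowError(f"shard number {number} exceeds Integer.MAX_VALUE")
    return number


print(shard_id_to_partition_id("shardId-000000000001"))
```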

Stream Ingestion with Dedup

Deduplication support in Apache Pinot.

Pinot provides native support for deduplication (dedup) during the real-time ingestion (v0.11.0+).

Prerequisites for enabling dedup

To enable dedup on a Pinot table, make the following table configuration and schema changes:

Define the primary key in the schema

To be able to dedup records, a primary key is needed to uniquely identify a given record. To define a primary key, add the field primaryKeyColumns to the schema definition.

schemaWithPK.json

{
    "primaryKeyColumns": ["id"]
}

Note this field expects a list of columns, as the primary key can be composite.

While ingesting a record, if its primary key is found to be already present, the record will be dropped.
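
The drop-on-duplicate rule can be sketched as follows. This is a simplified illustration, not Pinot's implementation: it keeps a plain set of keys for one partition, whereas Pinot tracks a map from primary key to segment reference. The class name DedupPartition and the way the key is serialized before hashing are assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterable, Set


class DedupPartition:
    """Hypothetical per-partition dedup state keyed by primary key."""

    def __init__(self, primary_key_columns: Iterable[str],
                 hash_function: str = "NONE") -> None:
        self.primary_key_columns = list(primary_key_columns)
        self.hash_function = hash_function
        self.seen: Set = set()

    def _key(self, record: Dict):
        key = tuple(record[column] for column in self.primary_key_columns)
        if self.hash_function == "MD5":
            # Store a 128-bit digest instead of the (possibly large) raw key.
            return hashlib.md5(repr(key).encode("utf-8")).digest()
        return key

    def ingest(self, record: Dict) -> bool:
        """Return True if the record is kept, False if it is dropped."""
        key = self._key(record)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


partition = DedupPartition(["id"])
results = [
    partition.ingest({"id": 1, "value": "a"}),
    partition.ingest({"id": 1, "value": "b"}),  # same primary key: dropped
    partition.ingest({"id": 2, "value": "c"}),
]
print(results)
```

The sketch only works if every record with a given primary key reaches the same partition state, which is why the input stream must be partitioned by the primary key.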

Partition the input stream by the primary key

An important requirement for the Pinot dedup table is to partition the input stream by the primary key. For Kafka messages, this means the producer must set the key in the send API. If the original stream is not partitioned, a stream processing job (e.g., Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.

Use strictReplicaGroup for routing

The dedup Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment for the segments. Moreover, dedup poses the additional requirement that all segments of the same partition must be served from the same server to ensure the data consistency across the segments. Accordingly, it requires strictReplicaGroup as the routing strategy. To use that, configure instanceSelectorType in Routing as the following:

routing

{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}

Other limitations

The high-level consumer is not allowed for the input stream ingestion, which means stream.kafka.consumer.type must be lowLevel.
The incoming stream must be partitioned by the primary key so that all records with a given primary key are consumed by the same Pinot server instance.

Enable dedup in the table configurations

To enable dedup for a REALTIME table, add the following to the table config.

tableConfigWithDedup.json

{ 
 ...
  "dedupConfig": { 
        "dedupEnabled": true, 
        "hashFunction": "NONE" 
   }, 
 ...
}

Supported values for hashFunction are NONE, MD5 and MURMUR3, with the default being NONE.

Best practices

Unlike other real-time tables, a dedup table uses more memory because it needs to keep track of each primary key and its corresponding segment reference in memory. As a result, it's important to plan the capacity beforehand and monitor the resource usage. Here are some recommended practices for using a dedup table.

Create the Kafka topic with more partitions. The number of Kafka partitions determines the number of partitions of the Pinot table. The more partitions you have in the Kafka topic, the more Pinot servers you can distribute the Pinot table to, and therefore the more you can scale the table horizontally.
A dedup table maintains an in-memory map from the primary key to the segment reference, so it's recommended to use a simple primary key type and avoid composite primary keys to save memory. In addition, consider the hashFunction config in the dedup config, which can be MD5 or MURMUR3, to store the 128-bit hash code of the primary key instead. This is useful when your primary key takes more space. Keep in mind that this hash may introduce collisions, though the chance is very low.
Monitoring: Set up a dashboard over the metric pinot.server.dedupPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking the growth of the key count, which is proportional to the growth in memory usage.
Capacity planning: It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the number of primary keys in the Kafka throughput per partition and multiply it by the space cost of a primary key to approximate the memory usage. A heap dump is also useful for checking the memory usage so far on a dedup table instance.
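
The capacity planning estimate above is a simple multiplication. The following back-of-the-envelope sketch shows it with made-up inputs: the key rate, the accumulation window, the per-key cost, and the number of partitions per server are all hypothetical values that you would replace with measurements from your own stream and heap dumps.

```python
def estimate_dedup_memory_bytes(new_keys_per_second: int,
                                window_seconds: int,
                                bytes_per_key: int,
                                partitions_per_server: int) -> int:
    """Approximate the primary key memory on one server, in bytes."""
    keys_per_partition = new_keys_per_second * window_seconds
    return keys_per_partition * bytes_per_key * partitions_per_server


# Hypothetical inputs: 1,000 new keys per second per partition,
# accumulated over one day, 64 bytes per tracked key, 4 partitions per server.
estimate = estimate_dedup_memory_bytes(
    new_keys_per_second=1_000,
    window_seconds=24 * 60 * 60,
    bytes_per_key=64,
    partitions_per_server=4,
)
print(f"{estimate / 1e9:.1f} GB")
```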

File Systems

This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.

FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).

Pinot uses distributed file systems for the following purposes:

Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to DFS.
Controller: When a segment is uploaded to the controller, the controller saves it in the configured DFS.
Server: When a server is notified of a new segment, it copies the segment from the remote DFS to its local node using the DFS abstraction.

Supported file systems

Pinot lets you choose a distributed file system provider. The following file systems are supported by Pinot:

Amazon S3
Google Cloud Storage
HDFS
Azure Data Lake Storage

Enabling a file system

To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-plugin-to-include-1,pinot-plugin-to-include-2

You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.

#CONTROLLER

pinot.controller.storage.factory.class.[scheme]=className of the pinot file system
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

#SERVER

pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:

pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

Azure Data Lake Storage

This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2)

Enable the Azure Data Lake Storage using the pinot-adls plugin. In the controller or server, add the config:

By default Pinot loads all the plugins, so you can just drop this plugin there. Also, if you specify -Dplugins.include, you need to put all the plugins you want to use, e.g. pinot-json, pinot-avro , pinot-kafka-2.0...

Azure Blob Storage provides the following options:

accountName: Name of the Azure account under which the storage is created.
accessKey: Access key required for the authentication.
fileSystemName: Name of the file system to use, for example, the container name (similar to the bucket name in S3).
enableChecksum: Enable MD5 checksum for verification. Default is false.

Each of these properties should be prefixed by pinot.[node].storage.factory.class.adl2. where node is either controller or server depending on the config, like this:

Examples

Job spec

Controller config

Server config

Minion config

Google Cloud Storage

This guide shows you how to import data from GCP (Google Cloud Platform).

Enable Google Cloud Storage using the pinot-gcs plugin. In the controller or server, add the config:

The GCP file system provides the following options:

projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.
gcpKey - Location of the json file containing GCP keys. You can refer to download the keys.

Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:

Examples

Job spec

Controller config

Server config

Minion config

Reload a table segment

Reload a table segment in Apache Pinot.

When Pinot writes data to segments in a table, it saves those segments to a deep store location specified in your table configuration, such as a storage drive or Amazon S3 bucket.

To reload segments from your deep store, use the Pinot Controller API or Pinot Admin Console.

Use the Pinot Controller API to reload segments

To reload all segments from a table, use:

POST /segments/{tableName}/reload

To reload a specific segment from a table, use:

POST /segments/{tableName}/{segmentName}/reload

A successful API call returns the following response:

{
    "status": "200"
}
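
The two endpoints above differ only in the optional segment name. The following minimal sketch builds the corresponding POST request with the Python standard library without sending it. The function name reload_request is hypothetical, and the controller address http://localhost:9000 is an assumption about a local setup.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def reload_request(controller_url: str, table_name: str,
                   segment_name: Optional[str] = None) -> Request:
    """Build a POST request for the table-level or segment-level reload."""
    path = f"/segments/{quote(table_name, safe='')}"
    if segment_name is not None:
        path += f"/{quote(segment_name, safe='')}"
    path += "/reload"
    return Request(
        controller_url.rstrip("/") + path,
        method="POST",
        headers={"accept": "application/json"},
    )


request = reload_request("http://localhost:9000", "myTable")
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to urllib.request.urlopen against a running controller.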

Use the Pinot Admin Console to reload segments

To use the Pinot Admin Console, do the following:

From the left navigation menu, select Cluster Manager.
Under TENANTS, select the Tenant Name.
From the list of tables in the tenant, select the Table Name.
Do one of the following:
- To reload all segments, under OPERATIONS, click Reload All Segments.
- To reload a specific segment, under SEGMENTS, select the Segment Name, and then in the new OPERATIONS section, select Reload Segment.

Indexing

This page describes the indexing techniques available in Apache Pinot

Apache Pinot supports the following indexing techniques:

Bloom filter
Forward index
- Dictionary-encoded forward index with bit compression
- Raw value forward index
- Sorted forward index with run-length encoding
Geospatial
Inverted index
- Bitmap inverted index
- Sorted inverted index
JSON index
Range index
Star-tree index
Text Index
- Native text index
- Text search support
Timestamp index

By default, Pinot creates a dictionary-encoded forward index for each column.

Enabling indexes

There are two ways to enable indexes for a Pinot table.

As part of ingestion, during Pinot segment generation

Indexing is enabled by specifying the desired column names in the table configuration. More details about how to configure each type of index can be found in the respective index's section linked above or in the table configuration reference.

Dynamically added or removed

Indexes can also be dynamically added to or removed from segments at any point. Update your table configuration with the latest set of indexes you want to have.

For example, if you have an inverted index on the foo field and now want to also include the bar field, you would update your table configuration from this:

"tableIndexConfig": {
        "invertedIndexColumns": ["foo"],
        ...
    }

To this:

"tableIndexConfig": {
        "invertedIndexColumns": ["foo", "bar"],
        ...
    }

The updated index configuration won't be picked up unless you invoke the reload API. This API sends reload messages via Helix to all servers, as part of which indexes are added or removed from the local segments. This happens without any downtime and is completely transparent to the queries.

When adding an index, only the new index is created and appended to the existing segment. When removing an index, its related states are cleaned up from Pinot servers. You can find this API under the Segments tab on Swagger:

curl -X POST \
  "http://localhost:9000/segments/myTable/reload" \
  -H "accept: application/json"

You can also find this action on the Cluster Manager in the Pinot UI, on the specific table's page.
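
The table configuration change from the example above can also be scripted. The following is a minimal sketch that edits a table config held as a Python dictionary; the function name is hypothetical. It changes only the local copy, so the updated config still has to be submitted and the reload API invoked for servers to pick it up.

```python
import copy
from typing import Dict


def add_inverted_index_column(table_config: Dict, column: str) -> Dict:
    """Return a copy of the table config with the column added to
    tableIndexConfig.invertedIndexColumns (no duplicates)."""
    updated = copy.deepcopy(table_config)
    index_config = updated.setdefault("tableIndexConfig", {})
    columns = index_config.setdefault("invertedIndexColumns", [])
    if column not in columns:
        columns.append(column)
    return updated


before = {"tableIndexConfig": {"invertedIndexColumns": ["foo"]}}
after = add_inverted_index_column(before, "bar")
print(after["tableIndexConfig"]["invertedIndexColumns"])
```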

Not all indexes can be retrospectively applied to existing segments. For more detailed documentation on applying indexes, see the Indexing FAQ.

Tuning Index

The inverted index provides good performance for most use cases, especially if your use case doesn't have a strict low latency requirement.

You should start by using this, and if your queries aren't fast enough, switch to advanced indices like the sorted or star-tree index.

Inverted Index

This page describes configuring the inverted index for Apache Pinot

An inverted index stores a map of words to the documents that contain them.

Bitmap inverted index

When an inverted index is enabled for a column, Pinot maintains a map from each value to a bitmap of rows, which makes value lookup take constant time. If you have a column that is frequently used for filtering, adding an inverted index will improve performance greatly. You can create an inverted index on a multi-value column.

An inverted index can be configured for a table by setting it in the table configuration:

{
    "tableIndexConfig": {
        "invertedIndexColumns": [
            "column_name",
            ...
        ],
        ...
    }
}
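
The value-to-rows map behind the bitmap inverted index can be sketched in a few lines. This is a conceptual illustration only: it uses Python sets of row numbers where Pinot uses bitmaps, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(column_values: List) -> Dict[object, Set[int]]:
    """Map each distinct value to the set of row numbers that contain it.
    A row may hold one value or a list of values (multi-value column)."""
    index = defaultdict(set)
    for doc_id, value in enumerate(column_values):
        values = value if isinstance(value, (list, tuple)) else [value]
        for single_value in values:
            index[single_value].add(doc_id)
    return dict(index)


teams = ["BOS", "NYA", "BOS", ["NYA", "CHN"]]
index = build_inverted_index(teams)
print(sorted(index["BOS"]), sorted(index["NYA"]))
```

A filter such as team = 'BOS' then becomes a single dictionary lookup instead of a scan over every row.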

Sorted inverted index

A sorted forward index can be used directly as an inverted index, with O(log n) lookup time, and it can benefit from data locality.

For the following example, if the query has a filter on memberId, Pinot will perform a binary search on memberId values to find the range pair of docIds for corresponding filtering value. If the query needs to scan values for other columns after filtering, values within the range docId pair will be located together, which means we can benefit from data locality.

A sorted index performs much better than an inverted index, but it can only be applied to one column per table. When the query performance with an inverted index is not good enough and most queries are filtering on the same column (e.g. memberId), a sorted index can improve the query performance.
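
The binary search described above can be sketched with the standard bisect module. This is an illustration of the idea, not Pinot's code: the list stands in for a sorted memberId column, positions in the list stand in for docIds, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple


def doc_id_range(sorted_values: List[int],
                 target: int) -> Optional[Tuple[int, int]]:
    """Return the inclusive (first, last) docId pair whose value equals
    target, or None if the value is absent."""
    start = bisect_left(sorted_values, target)
    end = bisect_right(sorted_values, target)
    if start == end:
        return None
    return (start, end - 1)


member_ids = [101, 101, 102, 102, 102, 105]
print(doc_id_range(member_ids, 102))
```

Because all matching rows are contiguous, values of other columns for those rows can be read as one consecutive range.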

Bloom Filter

This page describes configuring the bloom filter for Apache Pinot

The bloom filter prunes segments that do not contain any record matching an EQUALITY predicate.

This is useful for a query like the following:

There are 3 parameters to configure the bloom filter:

fpp: False positive probability of the bloom filter (from 0 to 1, 0.05 by default). The lower the fpp, the more accurate the bloom filter is, but the larger it becomes.
maxSizeInBytes: Maximum size of the bloom filter (unlimited by default). If the fpp setting would generate a bloom filter larger than this size, the fpp is increased to keep the bloom filter size within this limit.
loadOnHeap: Whether to load the bloom filter using heap memory or off-heap memory (false by default).
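
The trade-off between fpp and size can be estimated with the standard textbook bloom filter sizing formula, m = -n ln(p) / (ln 2)^2, where n is the number of distinct entries and p is the false positive probability. The sketch below applies that formula; it is an approximation for planning purposes and is not guaranteed to match the exact sizes Pinot produces. The function names are hypothetical.

```python
import math


def bloom_filter_size_bits(num_entries: int, fpp: float) -> int:
    """Approximate number of bits needed for num_entries at the given fpp."""
    return math.ceil(-num_entries * math.log(fpp) / (math.log(2) ** 2))


def fpp_for_size(num_entries: int, max_size_in_bytes: float) -> float:
    """Approximate fpp achievable when the filter is capped at a size."""
    bits = max_size_in_bytes * 8
    return math.exp(-bits * (math.log(2) ** 2) / num_entries)


entries = 1_000_000
for fpp in (0.05, 0.01):
    size_kb = bloom_filter_size_bits(entries, fpp) / 8 / 1024
    print(f"fpp={fpp}: about {size_kb:.0f} KB")
```

The second function shows the effect of a size cap such as maxSizeInBytes: a smaller cap yields a higher effective fpp.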

There are 2 ways to configure a bloom filter for a table in the table configuration:

Default settings

Customized parameters

Range Index

This page describes configuring the range index for Apache Pinot

Range indexing allows you to get better performance for queries that involve filtering over a range.

It would be useful for a query like the following:

SELECT COUNT(*) 
FROM baseballStats 
WHERE hits > 11

A range index is a variant of an inverted index: instead of creating a mapping from individual values to rows, it creates a mapping from a range of values to rows. You can use the range index by setting the following config in the table configuration.

{
    "tableIndexConfig": {
        "rangeIndexColumns": [
            "column_name",
            ...
        ],
        ...
    }
}

Range index is supported for both dictionary and raw-encoded columns.

A good rule of thumb is to use a range index when you want to apply range predicates on metric columns that have a very large number of unique values. This is because using an inverted index for such columns will create a very large index that is inefficient in terms of storage and performance.
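
The idea of mapping ranges of values, rather than individual values, can be sketched as follows. This is a deliberately simplified illustration with fixed-width buckets and is not Pinot's range index format. The function names and the bucket width are assumptions for the example, which answers the hits > 11 filter from the query above.

```python
from collections import defaultdict
from typing import Dict, List


def build_range_index(values: List[int],
                      bucket_width: int) -> Dict[int, List[int]]:
    """Map each value bucket (value // bucket_width) to its row numbers."""
    buckets = defaultdict(list)
    for doc_id, value in enumerate(values):
        buckets[value // bucket_width].append(doc_id)
    return dict(buckets)


def docs_greater_than(values: List[int], index: Dict[int, List[int]],
                      bucket_width: int, threshold: int) -> List[int]:
    """Rows with value > threshold: whole buckets above the boundary are
    taken as-is; only the boundary bucket is checked value by value."""
    boundary = threshold // bucket_width
    matches: List[int] = []
    for bucket, doc_ids in index.items():
        if bucket > boundary:
            matches.extend(doc_ids)
        elif bucket == boundary:
            matches.extend(d for d in doc_ids if values[d] > threshold)
    return sorted(matches)


hits = [3, 12, 25, 11, 40, 10]
index = build_range_index(hits, bucket_width=10)
print(docs_greater_than(hits, index, 10, 11))
```

The index size depends on the number of buckets, not on the number of distinct values, which is why this approach suits columns with very many unique values.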

Timestamp Index

Use a timestamp index to speed up your time query with different granularities

This feature is supported from Pinot 0.11+.

Background

The TIMESTAMP data type stores each value as a millisecond epoch long value.

Typically, users won't need this low level granularity for analytics queries. Scanning the data and time value conversion can be costly for big data.

A common query pattern for timestamp columns is filtering on a time range and then grouping by different time granularities (day, month, and so on).

Typically, this requires the query executor to extract values, apply the transform functions, and then filter and group, without leveraging the dictionary or index.

This was the inspiration for the Pinot timestamp index, which is used to improve the query performance for range query and group by queries on TIMESTAMP columns.

Supported data type

A TIMESTAMP index can only be created on the TIMESTAMP data type.

Timestamp Index

You can configure the granularity for a Timestamp data type column. Then:

Pinot will pre-generate one column per time granularity using a forward index and range index. The naming convention is $${ts_column_name}$${ts_granularity}, where the timestamp column ts with granularities DAY, MONTH will have two extra columns generated: $ts$DAY and $ts$MONTH.
Query overwrite for predicate and selection/group by:
- GROUP BY: Functions like dateTrunc('DAY', ts) will be translated to use the underlying column $ts$DAY to fetch data.
- PREDICATE: A range index is auto-built for all granularity columns.
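
The pre-generated columns can be pictured with a small sketch. It follows the naming convention stated above and truncates millisecond epoch values to the start of the UTC day. It is an illustration of the concept, not Pinot's implementation, and the function names are hypothetical.

```python
from typing import Dict, List

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def derived_column_name(ts_column: str, granularity: str) -> str:
    """Name of the pre-generated column, e.g. $ts$DAY."""
    return f"${ts_column}${granularity}"


def trunc_to_day(epoch_millis: int) -> int:
    """Truncate a millisecond epoch value to the start of its UTC day."""
    return epoch_millis - epoch_millis % MILLIS_PER_DAY


def add_day_column(rows: List[Dict[str, int]],
                   ts_column: str) -> List[Dict[str, int]]:
    """Return rows extended with the DAY-granularity derived column."""
    name = derived_column_name(ts_column, "DAY")
    return [{**row, name: trunc_to_day(row[ts_column])} for row in rows]


rows = add_day_column([{"ts": 1_700_000_000_000}], "ts")
print(rows)
```

With such a column in place, a GROUP BY on dateTrunc('DAY', ts) can read precomputed values instead of transforming every timestamp at query time.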

Example query usage:

Some preliminary benchmarking shows the query performance across 2.7 billion records improved from 45 secs to 4.2 secs using a timestamp index and a query like this:

vs.

Usage

The timestamp index is configured on a per column basis inside the fieldConfigList section in the table configuration.

Specify TIMESTAMP as part of the indexTypes. Then, in the timestampConfig field, specify the granularities that you want to index.

Sample config:

Releases

The following summarizes Pinot's releases, from the latest one to the earliest one.

Note

Before upgrading from one version to another one, read the release notes. While the Pinot committers strive to keep releases backward-compatible and introduce new features in a compatible manner, your environment may have a unique combination of configurations/data/schema that may have been somehow overlooked. Before you roll out a new release of Pinot on your cluster, it is best that you run the compatibility test suite that Pinot provides. The tests can be easily customized to suit the configurations and tables in your pinot cluster(s). As a good practice, you should build your own test suite, mirroring the table configurations, schema, sample data, and queries that are used in your cluster.

1.0.0 (September 2023)

0.12.1 (March 2023)

0.12.0 (December 2022)

0.11.0 (September 2022)

0.10.0 (March 2022)

0.9.3 (December 2021)

0.9.2 (December 2021)

0.9.1 (December 2021)

0.9.0 (November 2021)

0.8.0 (August 2021)

0.7.1 (April 2021)

0.6.0 (November 2020)

0.5.0 (September 2020)

0.4.0 (June 2020)

0.3.0 (March 2020)

0.2.0 (November 2019)

0.1.0 (March 2019, First release)

0.12.1

Summary

This bug-fix release contains:

Use legacy case-when format

The release is based on the release 0.12.0 with the following cherry-picks:

0.9.3

Summary

This bug-fix release contains:

Update Log4j to 2.17.0 to address ()

The release is based on the release 0.9.2 with the following cherry-picks:

0.9.2

Summary

This is a bug-fix release that contains:

Upgrade log4j to 2.16.0 to fix ()
Upgrade swagger-ui to 3.23.11 to fix ()
Fix the bug that RealtimeToOfflineTask failed to progress with large time bucket gaps ().

The release is based on the release 0.9.1 with the following cherry-picks:

0.9.1

Summary

This release fixes the major issue of and a major bug fixing of pinot admin exit code issue().

The release is based on the release 0.9.0 with the following cherry-picks:

0.2.0

The 0.2.0 release is the first release after the initial one and includes several improvements, listed below.

New Features and Bug Fixes

Added support for Kafka 2.0
Table rebalancer now supports a minimum number of serving replicas during rebalance
Added support for UDF in filter predicates and selection
Added support to use hex string as the representation of byte array for queries (see PR #4041)
Added support for parquet reader (see PR #3852)
Introduced interface stability and audience annotations (see PR #4063)
Refactor HelixBrokerStarter to separate constructor and start() - backwards incompatible (see PR #4100)
Admin tool for listing segments with invalid intervals for offline tables
Migrated to log4j2 (see PR #4139)
Added simple avro msg decoder
Added support for passing headers in Pinot client
Support transform functions with AVG aggregation function (see PR #4557)
Configurations additions/changes
- Allow customized metrics prefix (see PR #4392)
- Controller.enable.batch.message.mode to false by default (see PR #3928)
- RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (see PR #3946)
- Config to control kafka fetcher size and increase default (see PR #3869)
- Added a percent threshold to consider startup of services (see PR #4011)
- Make SingleConnectionBrokerRequestHandler as default (see PR #4048)
- Always enable default column feature, remove the configuration (see PR #4074)
- Remove redundant default broker configurations (see PR #4106)
- Removed some config keys in server (see PR #4222)
- Add config to disable HLC realtime segment (see PR #4235)
- The following config variables are deprecated and will be removed in the next release:
  - pinot.broker.requestHandlerType will be removed, in favor of using the "singleConnection" broker request handler. If you have set this configuration, remove it and use the default type ("singleConnection") for broker request handler.

Work in Progress

We are in the process of separating Helix and Pinot controllers, so that administrators can have the option of running independent Helix controllers and Pinot controllers.
We are in the process of moving towards supporting SQL query format and results.
We are in the process of separating instance and segment assignment using instance pools to optimize the number of Helix state transitions in Pinot clusters with thousands of tables.

Other Notes

Task management does not work correctly in this release, due to bugs in Helix. We will upgrade to Helix version 0.9.2 (or later) to fix this.
You must upgrade to this release before moving on to newer Pinot releases. The protocol between Pinot-broker and Pinot-server has changed, and this release has the code to retain compatibility moving forward. Skipping this release may (depending on your environment) cause query errors if brokers are upgraded and servers are in the process of being upgraded.
As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.
Pull Request #4100 introduces a backwards incompatible change to Pinot broker. If you use the Java constructor on HelixBrokerStarter class, then you will face a compilation error with this version. You will need to construct the object and call start() method in order to start the broker.
Pull Request #4139 introduces a backwards incompatible change for log4j configuration. If you used a custom log4j configuration (log4j.xml), you need to write a new log4j2 configuration (log4j2.xml). In addition, you may need to change the arguments on the command line to start Pinot components.
If you used Pinot-admin command to start Pinot components, you don't need any change. If you used your own commands to start pinot components, you will need to pass the new log4j2 config as a jvm parameter (i.e. substitute -Dlog4j.configuration or -Dlog4j.configurationFile argument with -Dlog4j2.configurationFile=log4j2.xml).

0.1.0

0.1.0 is the first release of Pinot as an Apache project.

New Features

First release
Off-line data ingestion from Apache Hadoop
Real-time data ingestion from Apache Kafka

Recipes

Here you will find a collection of ready-made sample applications and examples for real-world data.

For Users

Query

Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.

Explore query syntax:

Query Options

This document lists all the available query options.

Supported Query Options

| Key | Description | Default Behavior |
| --- | --- | --- |
| timeoutMs | Timeout of the query in milliseconds | Use table/broker level timeout |
| enableNullHandling | Enable the null handling of the query (introduced in 0.11.0) | false (disabled) |
| explainPlanVerbose | Return verbose result for EXPLAIN query (introduced in 0.11.0) | false (not verbose) |
| useMultistageEngine | Use multi-stage engine to execute the query (introduced in 0.11.0) | false (use single-stage engine) |
| maxExecutionThreads | Maximum threads to use to execute the query. Useful to limit the resource usage for expensive queries | Half of the CPU cores for non-group-by queries; all CPU cores for group-by queries |
| numReplicaGroupsToQuery | When replica-group based routing is enabled, use it to query multiple replica-groups (introduced in 0.11.0) | 1 (only query servers within the same replica-group) |
| minSegmentGroupTrimSize | Minimum groups to keep when trimming groups at the segment level for group-by queries. See | Server level config |
| minServerGroupTrimSize | Minimum groups to keep when trimming groups at the server level for group-by queries. See | Server level config |
| skipUpsert | For upsert-enabled table, skip the effect of upsert and query all the records. See | false (exclude the replaced records) |
| useStarTree | Useful to debug the star-tree index (introduced in 0.11.0) | true (use star-tree if available) |
| AndScanReordering | | disabled |

Set Query Options

SET statement

After release 0.11.0, query options can be set using the SET statement:

SET key1 = 'value1';
SET key2 = 123;
SELECT * FROM myTable
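When queries are generated by client code, the SET prefix can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named with_query_options; it only builds the query text and does not talk to a Pinot broker. Rendering booleans as bare true/false is an assumption of this sketch.

```python
def with_query_options(sql: str, options: dict) -> str:
    """Prefix a query with one SET statement per option (hypothetical helper)."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are written as bare true/false.
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            # Quote strings and escape embedded single quotes.
            escaped = str(value).replace("'", "''")
            rendered = f"'{escaped}'"
        lines.append(f"SET {key} = {rendered};")
    lines.append(sql)
    return "\n".join(lines)


print(with_query_options(
    "SELECT * FROM myTable",
    {"timeoutMs": 30000, "useStarTree": False},
))
```

Keeping the options in a dictionary makes it easy to apply the same settings, such as timeoutMs, to every query an application issues.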

OPTION keyword (deprecated)

Before release 0.11.0, query options can be appended to the query with the OPTION keyword:

SELECT * FROM myTable OPTION(key1=value1, key2=123)
SELECT * FROM myTable OPTION(key1=value1) OPTION(key2=123)
SELECT * FROM myTable OPTION(timeoutMs=30000)

Running in Kubernetes

Pinot quick start in Kubernetes

Get started running Pinot in Kubernetes.

Note: The examples in this guide are sample configurations to be used as reference. For a production setup, you may want to customize them to your needs.

Prerequisites

Kubernetes

This guide assumes that you already have a running Kubernetes cluster.

If you haven't yet set up a Kubernetes cluster, see the links below for instructions:

Enable Kubernetes on Docker-Desktop
Install Minikube for local setup
- Make sure to run with enough resources: minikube start --vm=true --cpus=4 --memory=8g --disk-size=50g
Set up a Kubernetes Cluster using Amazon Elastic Kubernetes Service (Amazon EKS)
Set up a Kubernetes Cluster using Google Kubernetes Engine (GKE)
Set up a Kubernetes Cluster using Azure Kubernetes Service (AKS)

Pinot

Make sure that you've downloaded Apache Pinot. The scripts for the setup in this guide can be found in our open source project on GitHub.

# checkout pinot
git clone https://github.com/apache/pinot.git
cd pinot/helm/pinot

Set up a Pinot cluster in Kubernetes

Start Pinot with Helm

The Pinot repository has pre-packaged Helm charts for Pinot and Presto. The Helm repository index file is here.

helm repo add pinot https://raw.githubusercontent.com/apache/pinot/master/helm
kubectl create ns pinot-quickstart
helm install pinot pinot/pinot \
    -n pinot-quickstart \
    --set cluster.name=pinot \
    --set server.replicaCount=2

Note: Specify StorageClass based on your cloud vendor. Don't mount a blob store (such as AzureFile, GoogleCloudStorage, or S3) as the data serving file system. Use only Amazon EBS/GCP Persistent Disk/Azure Disk-style disks.

For AWS: "gp2"
For GCP: "pd-ssd" or "standard"
For Azure: "AzureDisk"
For Docker-Desktop: "hostpath"

1.1.1 Update Helm dependency

helm dependency update

1.1.2 Start Pinot with Helm

For Helm v2.12.1:

If your Kubernetes cluster is recently provisioned, ensure Helm is initialized by running:

helm init --service-account tiller

Then deploy a new HA Pinot cluster using the following command:

helm install --namespace "pinot-quickstart" --name "pinot" pinot

For Helm v3.0.0:

kubectl create ns pinot-quickstart
helm install -n pinot-quickstart pinot ./pinot

1.1.3 Troubleshooting (For helm v2.12.1)

If you see the error below:

Error: could not find tiller.

Run the following:

kubectl -n kube-system delete deployment tiller-deploy
kubectl -n kube-system delete service/tiller-deploy
helm init --service-account tiller

If you encounter a permission issue, like the following:

Error: release pinot failed: namespaces "pinot-quickstart" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "namespaces" in API group "" in the namespace "pinot-quickstart"

Run the command below:

kubectl apply -f helm-rbac.yaml

Check Pinot deployment status

kubectl get all -n pinot-quickstart

Load data into Pinot using Kafka

Bring up a Kafka cluster for real-time data ingestion

helm repo add kafka https://charts.bitnami.com/bitnami
helm install -n pinot-quickstart kafka kafka/kafka --set replicas=1,zookeeper.image.tag=latest

Check Kafka deployment status

Ensure the Kafka deployment is ready before executing the scripts in the following steps. Run the following command:

kubectl get all -n pinot-quickstart | grep kafka

Below is an example output showing the deployment is ready:

pod/kafka-0                                                 1/1     Running     0          2m
pod/kafka-zookeeper-0                                       1/1     Running     0          10m
pod/kafka-zookeeper-1                                       1/1     Running     0          9m
pod/kafka-zookeeper-2                                       1/1     Running     0          8m

Create Kafka topics

Run the scripts below to create two Kafka topics for data ingestion:

kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics.sh --bootstrap-server kafka-0:9092 --topic flights-realtime --create --partitions 1 --replication-factor 1
kubectl -n pinot-quickstart exec kafka-0 -- kafka-topics.sh --bootstrap-server kafka-0:9092 --topic flights-realtime-avro --create --partitions 1 --replication-factor 1

Load data into Kafka and create Pinot schema/tables

The script below does the following:

Ingests 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec
Ingests 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec
Uploads Pinot schema airlineStats
Creates Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime
Creates Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro

kubectl apply -f pinot/pinot-realtime-quickstart.yml

Query with the Pinot Data Explorer

Pinot Data Explorer

The script below, located at ./pinot/helm/pinot, performs local port forwarding, and opens the Pinot query console in your default web browser.

./query-pinot-data.sh

Query Pinot with Superset

Bring up Superset using Helm

Add the Superset Helm repository:

helm repo add superset https://apache.github.io/superset

Get the Helm values configuration file:

helm inspect values superset/superset > /tmp/superset-values.yaml

For Superset to install Pinot dependencies, edit the /tmp/superset-values.yaml file to add a pinotdb pip dependency to the bootstrapScript field.
You can also build your own image with this dependency or use the image apachepinot/pinot-superset:latest instead.

Replace the default admin credentials inside the init section with a meaningful user profile and stronger password.
Install Superset using Helm:

kubectl create ns superset
helm upgrade --install --values /tmp/superset-values.yaml superset superset/superset -n superset

Ensure your cluster is up by running:

kubectl get all -n superset

Access the Superset UI

Run the below command to port forward Superset to your localhost:18088.

kubectl port-forward service/superset 18088:8088 -n superset

Navigate to Superset in your browser with the admin credentials you set in the previous section.
Create a new database connection with the following URI: pinot+http://pinot-broker.pinot-quickstart:8099/query?controller=http://pinot-controller.pinot-quickstart:9000/
Once the database is added, you can add more data sets and explore the dashboard options.

Access Pinot with Trino

Deploy Trino

Deploy Trino with the Pinot plugin installed:

helm repo add trino https://trinodb.github.io/charts/

See the charts in the Trino Helm chart repository:

helm search repo trino

In order to connect Trino to Pinot, you'll need to add the Pinot catalog, which requires extra configurations. Run the below command to get all the configurable values.

helm inspect values trino/trino > /tmp/trino-values.yaml

To add the Pinot catalog, edit the additionalCatalogs section by adding:

additionalCatalogs:
  pinot: |
    connector.name=pinot
    pinot.controller-urls=pinot-controller.pinot-quickstart:9000

Pinot is deployed at namespace pinot-quickstart, so the controller serviceURL is pinot-controller.pinot-quickstart:9000

After modifying the /tmp/trino-values.yaml file, deploy Trino with:

kubectl create ns trino-quickstart
helm install my-trino trino/trino --version 0.2.0 -n trino-quickstart --values /tmp/trino-values.yaml

Once you've deployed Trino, check the deployment status:

kubectl get pods -n trino-quickstart

Query Pinot with the Trino CLI

Once Trino is deployed, run the below command to get a runnable Trino CLI.

Download the Trino CLI:

curl -L https://repo1.maven.org/maven2/io/trino/trino-cli/363/trino-cli-363-executable.jar -o /tmp/trino && chmod +x /tmp/trino

Port forward the Trino service to your local machine if it's not already exposed:

echo "Visit http://127.0.0.1:18080 to use your application"
kubectl port-forward service/my-trino 18080:8080 -n trino-quickstart

Use the Trino console client to connect to the Trino service:

/tmp/trino --server localhost:18080 --catalog pinot --schema default

Query Pinot data using the Trino CLI, like in the sample queries below.

Sample queries to execute

List all catalogs

trino:default> show catalogs;

  Catalog
---------
 pinot
 system
 tpcds
 tpch
(4 rows)

Query 20211025_010256_00002_mxcvx, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0.70 [0 rows, 0B] [0 rows/s, 0B/s]

List all tables

trino:default> show tables;

    Table
--------------
 airlinestats
(1 row)

Query 20211025_010326_00003_mxcvx, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0.28 [1 rows, 29B] [3 rows/s, 104B/s]

Show schema

trino:default> DESCRIBE airlinestats;

        Column        |      Type      | Extra | Comment
----------------------+----------------+-------+---------
 flightnum            | integer        |       |
 origin               | varchar        |       |
 quarter              | integer        |       |
 lateaircraftdelay    | integer        |       |
 divactualelapsedtime | integer        |       |
 divwheelsons         | array(integer) |       |
 divwheelsoffs        | array(integer) |       |
......

Query 20211025_010414_00006_mxcvx, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0.37 [79 rows, 5.96KB] [212 rows/s, 16KB/s]

Count total documents

trino:default> select count(*) as cnt from airlinestats limit 10;

 cnt
------
 9746
(1 row)

Query 20211025_015607_00009_mxcvx, FINISHED, 2 nodes
Splits: 17 total, 17 done (100.00%)
0.24 [1 rows, 9B] [4 rows/s, 38B/s]

Access Pinot with Presto

Deploy Presto with the Pinot plugin

First, deploy Presto with default configurations:

helm install presto pinot/presto -n pinot-quickstart

kubectl apply -f presto-coordinator.yaml

To customize your deployment, run the below command to get all the configurable values.

helm inspect values pinot/presto > /tmp/presto-values.yaml

After modifying the /tmp/presto-values.yaml file, deploy Presto:

helm install presto pinot/presto -n pinot-quickstart --values /tmp/presto-values.yaml

Once you've deployed the Presto instance, check the deployment status:

kubectl get pods -n pinot-quickstart

Query Presto using the Presto CLI

Once Presto is deployed, you can run the below command from here, or follow the steps below.

./pinot-presto-cli.sh

Download the Presto CLI:

curl -L https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.246/presto-cli-0.246-executable.jar -o /tmp/presto-cli && chmod +x /tmp/presto-cli

Port forward presto-coordinator port 8080 to localhost port 18080:

kubectl port-forward service/presto-coordinator 18080:8080 -n pinot-quickstart > /dev/null &

Start the Presto CLI with the Pinot catalog:

/tmp/presto-cli --server localhost:18080 --catalog pinot --schema default

Query Pinot data with the Presto CLI, like in the sample queries below.

Sample queries to execute

List all catalogs

presto:default> show catalogs;

 Catalog
---------
 pinot
 system
(2 rows)

Query 20191112_050827_00003_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

List all tables

presto:default> show tables;

    Table
--------------
 airlinestats
(1 row)

Query 20191112_050907_00004_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [1 rows, 29B] [1 rows/s, 41B/s]

Show schema

presto:default> DESCRIBE pinot.dontcare.airlinestats;

        Column        |  Type   | Extra | Comment
----------------------+---------+-------+---------
 flightnum            | integer |       |
 origin               | varchar |       |
 quarter              | integer |       |
 lateaircraftdelay    | integer |       |
 divactualelapsedtime | integer |       |
......

Query 20191112_051021_00005_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:02 [80 rows, 6.06KB] [35 rows/s, 2.66KB/s]

Count total documents

presto:default> select count(*) as cnt from pinot.dontcare.airlinestats limit 10;

 cnt
------
 9745
(1 row)

Query 20191112_051114_00006_xkm4g, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [1 rows, 8B] [2 rows/s, 19B/s]

Delete a Pinot cluster in Kubernetes

To delete your Pinot cluster in Kubernetes, run the following command:

kubectl delete ns pinot-quickstart

Stream Ingestion with Upsert

Upsert support in Apache Pinot.

Pinot provides native support of upsert during real-time ingestion. There are scenarios where records need modifications, such as correcting a ride fare or updating a delivery status.

Partial upsert is convenient as you only need to specify the columns where values change, and you ignore the rest.

To enable upsert on a Pinot table, make some configuration changes in the table configurations and on the input stream.

Define the primary key in the schema

To update a record, you need a primary key to uniquely identify the record. To define a primary key, add the field primaryKeyColumns to the schema definition. For example, the schema definition of UpsertMeetupRSVP in the quick start example has this definition.

upsert_meetupRsvp_schema.json

{
    "primaryKeyColumns": ["event_id"]
}

Note this field expects a list of columns, as the primary key can be a composite.

When two records of the same primary key are ingested, the record with the greater comparison value (timeColumn by default) is used. When records have the same primary key and event time, then the order is not determined. In most cases, the later ingested record will be used, but this may not be true in cases where the table has a column to sort by.

Partition the input stream by the primary key

An important requirement for the Pinot upsert table is to partition the input stream by the primary key. For Kafka messages, this means the producer must set the key in the send API. If the original stream is not partitioned, then a stream processing job (such as with Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
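The reason keyed partitioning matters is that every record for a given primary key must land in the same partition. The sketch below illustrates that property only. It uses zlib.crc32 as a stand-in hash (Kafka's own partitioner uses a different hash function), and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(primary_key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (stand-in hash, not Kafka's)."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions


def route(records, key_column: str, num_partitions: int) -> dict:
    """Group records by the partition their key maps to."""
    partitions = defaultdict(list)
    for record in records:
        partition = partition_for(record[key_column], num_partitions)
        partitions[partition].append(record)
    return dict(partitions)


records = [
    {"event_id": "aa", "rsvp_count": 1},
    {"event_id": "bb", "rsvp_count": 4},
    {"event_id": "aa", "rsvp_count": 2},
]
for partition, rows in sorted(route(records, "event_id", 4).items()):
    print(partition, rows)
```

Because the mapping depends only on the key, both records for "aa" always end up in the same partition, which is what lets a single server see every update for that key.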

Enable upsert in the table configurations

To enable upsert, make the following configurations in the table configurations.

Upsert modes

Full upsert

The upsert mode defaults to NONE for real-time tables. To enable full upsert, set the mode to FULL. FULL upsert means that a new record will replace the older record completely if they have the same primary key. Example config:

{
  "upsertConfig": {
    "mode": "FULL"
  }
}
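As a rough mental model of FULL mode, the sketch below keeps one record per primary key and replaces it whenever a record with a greater comparison value arrives. It is a simplified in-memory illustration, not Pinot's implementation; the function name full_upsert is hypothetical, and letting the later record win on ties is an assumption (the text notes that the order is not determined in that case).

```python
def full_upsert(records, primary_key_columns, comparison_column):
    """Return the surviving record per primary key under FULL upsert semantics."""
    latest = {}
    for record in records:
        # The primary key may be composite, so build a tuple from all key columns.
        key = tuple(record[column] for column in primary_key_columns)
        current = latest.get(key)
        # Assumption: on equal comparison values the later record wins.
        if current is None or record[comparison_column] >= current[comparison_column]:
            latest[key] = record
    return latest


stream = [
    {"event_id": "a", "mtime": 1, "venue_name": "Hall"},
    {"event_id": "a", "mtime": 3, "venue_name": "Park"},
    {"event_id": "a", "mtime": 2, "venue_name": "Cafe"},  # out of order, dropped
    {"event_id": "b", "mtime": 1, "venue_name": "Pier"},
]
for key, record in full_upsert(stream, ["event_id"], "mtime").items():
    print(key, record)
```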

Partial upserts

Partial upsert lets you choose to update only specific columns and ignore the rest.

To enable the partial upsert, set the mode to PARTIAL and specify partialUpsertStrategies for partial upsert columns. Since release-0.10.0, OVERWRITE is used as the default strategy for columns without a specified strategy. defaultPartialUpsertStrategy is also introduced to change the default strategy for all columns. For example:

release-0.8.0

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "partialUpsertStrategies":{
      "rsvp_count": "INCREMENT",
      "group_name": "IGNORE",
      "venue_name": "OVERWRITE"
    }
  }
}

release-0.10.0

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies":{
      "rsvp_count": "INCREMENT",
      "group_name": "IGNORE"
    }
  }
}

Pinot supports the following partial upsert strategies:

| Strategy | Description |
| --- | --- |
| OVERWRITE | Overwrite the column of the last record |
| INCREMENT | Add the new value to the existing values |
| APPEND | Add the new item to the Pinot unordered set |
| UNION | Add the new item to the Pinot unordered set if it does not already exist |
| IGNORE | Ignore the new value, keep the existing value (v0.10.0+) |
| MAX | Keep the maximum value between the existing value and new value (v0.12.0+) |
| MIN | Keep the minimum value between the existing value and new value (v0.12.0+) |

With partial upsert, if the value is null in either the existing record or the incoming record, Pinot will ignore the upsert strategy and the null value:

(null, newValue) -> newValue

(oldValue, null) -> oldValue

(null, null) -> null
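The strategies and the null rule above can be summarized in a short sketch. This is an illustration under stated assumptions rather than Pinot's code: the function names are hypothetical, multi-value columns are modeled as Python lists, and Python's None stands in for null.

```python
def merge_value(strategy, old, new):
    """Merge one column value according to a partial upsert strategy."""
    # Null rule from the text: a null on either side bypasses the strategy.
    if old is None:
        return new
    if new is None:
        return old
    if strategy == "OVERWRITE":
        return new
    if strategy == "IGNORE":
        return old
    if strategy == "INCREMENT":
        return old + new
    if strategy == "APPEND":
        return list(old) + list(new)
    if strategy == "UNION":
        merged = list(old)
        for item in new:
            if item not in merged:
                merged.append(item)
        return merged
    if strategy == "MAX":
        return max(old, new)
    if strategy == "MIN":
        return min(old, new)
    raise ValueError(f"unknown strategy: {strategy}")


def merge_record(existing, incoming, strategies, default_strategy="OVERWRITE"):
    """Merge an incoming record into the existing one, column by column."""
    merged = dict(existing)
    for column, new_value in incoming.items():
        strategy = strategies.get(column, default_strategy)
        merged[column] = merge_value(strategy, existing.get(column), new_value)
    return merged


existing = {"event_id": "aa", "rsvp_count": 2, "group_name": "hikers", "venue_name": "Park"}
incoming = {"event_id": "aa", "rsvp_count": 3, "group_name": "climbers", "venue_name": None}
strategies = {"rsvp_count": "INCREMENT", "group_name": "IGNORE"}
print(merge_record(existing, incoming, strategies))
```

In this example rsvp_count is summed, group_name keeps its existing value, and venue_name is preserved because the incoming value is null.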

Comparison column

By default, Pinot uses the value in the time column (timeColumn in tableConfig) to determine the latest record. That means, for two records with the same primary key, the record with the larger value of the time column is picked as the latest update. However, there are cases when users need to use another column to determine the order. In such cases, use the comparisonColumn option to override the column used for comparison. For example:

{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumn": "anotherTimeColumn",
    "hashFunction": "NONE"
  }
}

For partial upsert tables, out-of-order events are not consumed and indexed. For example, for two records with the same primary key, if the record with the smaller value of the comparison column arrives later than the other record, it is skipped.

Multiple comparison columns

In some cases, especially where partial upsert might be employed, there may be multiple producers of data each writing to a mutually exclusive set of columns, sharing only the primary key. In such a case, it may be helpful to use one comparison column per producer group so that each group can manage its own specific versioning semantics without the need to coordinate versioning across other producer groups.

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "defaultPartialUpsertStrategy": "OVERWRITE",
    "partialUpsertStrategies":{},
    "comparisonColumns": ["secondsSinceEpoch", "otherComparisonColumn"],
    "hashFunction": "NONE"
  }
}

Documents written to Pinot are expected to have exactly 1 non-null value out of the set of comparisonColumns; if more than 1 of the columns contains a value, the document will be rejected. When new documents are written, whichever comparison column is non-null will be compared against only that same comparison column seen in prior documents with the same primary key. Consider the following examples, where the documents are assumed to arrive in the order specified in the array.

[
  {
    "event_id": "aa",
    "orderReceived": 1,
    "description" : "first",
    "secondsSinceEpoch": 1567205394
  },
  {
    "event_id": "aa",
    "orderReceived": 2,
    "description" : "update",
    "secondsSinceEpoch": 1567205397
  },
  {
    "event_id": "aa",
    "orderReceived": 3,
    "description" : "update",
    "secondsSinceEpoch": 1567205396
  },
  {
    "event_id": "aa",
    "orderReceived": 4,
    "description" : "first arrival, other column",
    "otherComparisonColumn": 1567205395
  },
  {
    "event_id": "aa",
    "orderReceived": 5,
    "description" : "late arrival, other column",
    "otherComparisonColumn": 1567205392
  },
  {
    "event_id": "aa",
    "orderReceived": 6,
    "description" : "update, other column",
    "otherComparisonColumn": 1567205398
  }
]

The following would occur:

orderReceived: 1

Result: persisted
Reason: first doc seen for primary key "aa"

orderReceived: 2

Result: persisted (replacing orderReceived: 1)
Reason: comparison column (secondsSinceEpoch) larger than that previously seen

orderReceived: 3

Result: rejected
Reason: comparison column (secondsSinceEpoch) smaller than that previously seen

orderReceived: 4

Result: persisted (replacing orderReceived: 2)
Reason: comparison column (otherComparisonColumn) larger than previously seen (never seen previously), despite the value being smaller than that seen for secondsSinceEpoch

orderReceived: 5

Result: rejected
Reason: comparison column (otherComparisonColumn) smaller than that previously seen

orderReceived: 6

Result: persisted (replacing orderReceived: 4)
Reason: comparison column (otherComparisonColumn) larger than that previously seen
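The walkthrough above can be reproduced with a small simulation. It is a sketch of the comparison logic only, using the document's column names; the function name apply_documents is hypothetical, and treating equal comparison values as accepted is an assumption, since the example does not cover ties.

```python
def apply_documents(documents, comparison_columns, primary_key="event_id"):
    """Return (orderReceived, outcome) for each document, in arrival order."""
    latest_seen = {}  # (primary key, comparison column) -> largest value seen
    outcomes = []
    for doc in documents:
        present = [c for c in comparison_columns if doc.get(c) is not None]
        # Exactly one comparison column must carry a value.
        if len(present) != 1:
            outcomes.append((doc["orderReceived"], "rejected"))
            continue
        column = present[0]
        slot = (doc[primary_key], column)
        previous = latest_seen.get(slot)
        # Compare only against the same comparison column for the same key.
        if previous is None or doc[column] >= previous:
            latest_seen[slot] = doc[column]
            outcomes.append((doc["orderReceived"], "persisted"))
        else:
            outcomes.append((doc["orderReceived"], "rejected"))
    return outcomes


documents = [
    {"event_id": "aa", "orderReceived": 1, "secondsSinceEpoch": 1567205394},
    {"event_id": "aa", "orderReceived": 2, "secondsSinceEpoch": 1567205397},
    {"event_id": "aa", "orderReceived": 3, "secondsSinceEpoch": 1567205396},
    {"event_id": "aa", "orderReceived": 4, "otherComparisonColumn": 1567205395},
    {"event_id": "aa", "orderReceived": 5, "otherComparisonColumn": 1567205392},
    {"event_id": "aa", "orderReceived": 6, "otherComparisonColumn": 1567205398},
]
for order, outcome in apply_documents(
    documents, ["secondsSinceEpoch", "otherComparisonColumn"]
):
    print(order, outcome)
```

Tracking the largest value per (primary key, comparison column) pair is what allows document 4 to be accepted even though its value is smaller than the one already seen for secondsSinceEpoch.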

Delete column

An upsert Pinot table can support soft deletes of primary keys. This requires the incoming record to contain a dedicated single-value boolean column that serves as a delete marker for a primary key. Once the real-time engine encounters a record with the delete column set to true, the primary key will no longer be part of the queryable set of documents. This means the primary key will not be visible in queries, unless explicitly requested via the query option skipUpsert=true.

{ 
    "upsertConfig": {  
        ... 
        "deleteRecordColumn": <column_name>
    } 
}

Note that the delete column has to be a single-value boolean column.

// In the Schema
{
    ...
    {
      "name": "<delete_column_name>",
      "dataType": "BOOLEAN"
    },
    ...
}

Note that when deleteRecordColumn is added to an existing table, it will require a server restart to actually pick up the upsert config changes.

A deleted primary key can be revived by ingesting a record with the same primary key, but with higher comparison column value(s).

Note that when reviving a primary key in a partial upsert table, the revived record will be treated as the source of truth for all columns. This means any previous updates to the columns will be ignored and overwritten with the new record's values.
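The delete and revive behavior can be sketched as follows. This is a simplified in-memory model, not Pinot's implementation: the function name visible_records and the column names deleted and mtime are hypothetical, and the skip_upsert flag only mimics the effect of the skipUpsert query option.

```python
def visible_records(records, primary_key="event_id", comparison_column="mtime",
                    delete_column="deleted", skip_upsert=False):
    """Return the records a query would see for a soft-delete-enabled table."""
    if skip_upsert:
        # Mimics skipUpsert=true: every ingested record is returned.
        return list(records)
    latest = {}
    for record in records:
        current = latest.get(record[primary_key])
        if current is None or record[comparison_column] >= current[comparison_column]:
            latest[record[primary_key]] = record
    # Keys whose latest record carries the delete marker are hidden.
    return [r for r in latest.values() if not r.get(delete_column, False)]


stream = [
    {"event_id": "aa", "mtime": 1, "deleted": False},
    {"event_id": "aa", "mtime": 2, "deleted": True},
]
print(visible_records(stream))                    # key "aa" is hidden
print(visible_records(stream, skip_upsert=True))  # both records are returned

revived = stream + [{"event_id": "aa", "mtime": 3, "deleted": False}]
print(visible_records(revived))                   # key "aa" is visible again
```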

Use strictReplicaGroup for routing

The upsert Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment for the segments. Moreover, upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, upsert tables must use strictReplicaGroup as the routing strategy. To do so, configure instanceSelectorType in routing as follows:

{
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  }
}

Enable validDocIds snapshots for upsert metadata recovery

Upsert snapshot support was added in release-0.12.0. To enable the snapshot, set enableSnapshot to true. For example:

{
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "NONE",
    "enableSnapshot": true
  }
}

Upsert maintains metadata in memory containing which docIds are valid in a particular segment (ValidDocIndexes). This metadata is lost during server restarts and needs to be recreated. ValidDocIndexes cannot be recovered easily after out-of-TTL primary keys are removed. Enabling snapshots addresses this problem by adding functions to store and recover the validDocIds snapshot for immutable segments.

The snapshots are taken on every segment commit to ensure that they are consistent with the persisted data in case of abrupt shutdown. We recommend that you enable this feature so as to speed up server boot times during restarts.

The lifecycle of validDocIds snapshots is as follows:

If snapshot is enabled, load validDocIds from the snapshot when adding segments.
If snapshot is not enabled, delete any existing validDocIds snapshots when adding segments.
If snapshot is enabled, persist the validDocIds snapshot for immutable segments when removing a segment.

Enable preload for faster restarts

Upsert preload support is also added in master. To enable the preload, set the enablePreload to true. For example:

{
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "NONE",
    "enablePreload": true
  }
}

For preload to improve your restart times, enableSnapshot: true should also be set in the table config. Under the hood, it uses the snapshots to quickly insert the data instead of performing a whole upsert comparison flow for all the primary keys. The flow is triggered before the server is marked as ready to load segments without snapshots (hence the name preload).

The feature also requires you to specify pinot.server.instance.max.segment.preload.threads: N in the server config where N should be replaced with the number of threads that should be used for preload. This feature is still in beta.

Upsert table limitations

There are some limitations for the upsert Pinot tables.

The high-level consumer is not allowed for the input stream ingestion, which means stream.[consumerName].consumer.type must always be lowLevel.
The star-tree index cannot be used for indexing, as the star-tree index performs pre-aggregation during the ingestion.
Unlike append-only tables, a Pinot partial upsert table does not consume and index out-of-order events (where the comparison value in the incoming record is less than the latest available value); these late events are skipped.

Best practices

Unlike other real-time tables, an upsert table takes up more memory because it needs to keep track of the record locations in memory. As a result, it's important to plan the capacity beforehand and monitor the resource usage. Here are some recommended practices for using upsert tables.

Create the topic/stream with more partitions.

The number of partitions in the input stream determines the number of partitions of the Pinot table. The more partitions you have in the input topic/stream, the more Pinot servers you can distribute the Pinot table to, and therefore the more you can scale the table horizontally. Note that you can't increase the number of partitions later for upsert-enabled tables, so start with enough partitions (at least 2-3X the number of Pinot servers).

Memory usage

An upsert table maintains an in-memory map from the primary key to the record location, so it's recommended to use a simple primary key type and avoid composite primary keys to reduce the memory cost. In addition, consider the hashFunction config in the upsert config, which can be MD5 or MURMUR3, to store the 128-bit hashcode of the primary key instead. This is useful when your primary key takes up a lot of space. Keep in mind that hashing may introduce collisions, though the chance is very low.
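To make the space trade-off concrete, here is a minimal Python sketch that compares the raw size of a composite primary key with the fixed 16-byte (128-bit) size of its MD5 digest. The key values, the separator, and the `hashed_key` helper are hypothetical; Pinot's internal key serialization is not shown here and may differ.

```python
import hashlib

def hashed_key(primary_key_values):
    """Join the primary key column values and return a 16-byte MD5 digest."""
    # The NUL separator and UTF-8 encoding are illustrative choices,
    # not Pinot's actual serialization format.
    raw = "\x00".join(str(v) for v in primary_key_values).encode("utf-8")
    return hashlib.md5(raw).digest()

# Hypothetical composite primary key made of three column values.
composite_key = ("user-0123456789abcdef", "session-fedcba9876543210", "2023-09-01")

raw_size = sum(len(str(v).encode("utf-8")) for v in composite_key)
digest = hashed_key(composite_key)

print(raw_size)     # bytes needed for the raw key values
print(len(digest))  # 16 bytes, i.e. 128 bits, regardless of the key size
```

The digest size stays constant however long the key columns are, which is why hashing helps most with large or composite keys.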

Monitoring

Set up a dashboard over the metric pinot.server.upsertPrimaryKeysCount.tableName to watch the number of primary keys in a table partition. It's useful for tracking its growth, which is proportional to the memory usage growth. The total memory usage by upsert is roughly (primaryKeysCount * (sizeOfKeyInBytes + 24)).

Capacity planning

It's useful to plan the capacity beforehand to ensure you will not run into resource constraints later. A simple way is to measure the rate of the primary keys in the input stream per partition and extrapolate the data to a specific time period (based on table retention) to approximate the memory usage. A heap dump is also useful to check the memory usage so far on an upsert table instance.
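As a rough illustration of this extrapolation, the following Python sketch projects the number of primary keys over the retention period and applies the estimate from the Monitoring section, primaryKeysCount * (sizeOfKeyInBytes + 24). The input rate, retention, and key size are hypothetical numbers, and the function names are made up for this example.

```python
def extrapolate_key_count(new_keys_per_second, retention_days):
    """Project how many primary keys one partition accumulates over the retention period."""
    seconds = retention_days * 24 * 60 * 60
    return new_keys_per_second * seconds

def estimate_upsert_memory_bytes(primary_keys_count, size_of_key_in_bytes):
    """Rough estimate: primaryKeysCount * (sizeOfKeyInBytes + 24)."""
    return primary_keys_count * (size_of_key_in_bytes + 24)

# Hypothetical measurements for a single partition.
new_keys_per_second = 50   # new primary keys observed per second
retention_days = 1         # table retention
key_size_bytes = 16        # for example, a 128-bit hashed key

keys = extrapolate_key_count(new_keys_per_second, retention_days)
memory_bytes = estimate_upsert_memory_bytes(keys, key_size_bytes)

print(keys)                                  # 4320000
print(memory_bytes)                          # 172800000
print(round(memory_bytes / (1024 ** 2), 1))  # 164.8 (MiB)
```

Multiply the per-partition result by the number of partitions hosted on a server to approximate that server's upsert memory footprint.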

Example

Putting these together, the table configuration of the quick start example looks like the following:

{
  "tableName": "meetupRsvp",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "mtime",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "1",
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "meetupRsvp",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowLevel",
      "stream.kafka.topic.name": "meetupRSVPEvents",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.hlc.zk.connect.string": "localhost:2191/kafka",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "localhost:2191/kafka",
      "stream.kafka.broker.list": "localhost:19092",
      "realtime.segment.flush.threshold.rows": 30
    }
  },
  "metadata": {
    "customConfigs": {}
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": {
    "mode": "FULL"
  }
}

Pinot server maintains a primary key to record location map across all the segments served in an upsert-enabled table. As a result, when updating the config for an existing upsert table (e.g. change the columns in the primary key, change the comparison column), servers need to be restarted in order to apply the changes and rebuild the map.

Quick Start

To illustrate how full upsert works, the Pinot binary comes with a quick start example. Use the following command to create a real-time upsert table named meetupRsvp.

# stop previous quick start cluster, if any
bin/quick-start-upsert-streaming.sh

You can also run the partial upsert demo with the following command:

# stop previous quick start cluster, if any
bin/quick-start-partial-upsert-streaming.sh

As soon as data flows into the stream, the Pinot table consumes it and makes it ready for querying. Head over to the Query Console to check out the real-time data.

For partial upsert, only the values of the configured columns change, based on the specified partial upsert strategy.

In the partial upsert example, each event_id stays unique during ingestion while the value of rsvp_count is incremented.

To see the difference from the non-upsert table, you can use a query option skipUpsert to skip the upsert effect in the query result.

FAQ

Can I change primary key columns in existing upsert table?

Yes, you can add columns to or remove columns from the primary key as long as the input stream is partitioned on one of the primary key columns. However, you need to restart all Pinot servers so that they can rebuild the primary key to record location map with the new columns.

Text search support

This page talks about support for text search in Pinot.

Why do we need text search?

Pinot supports super-fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index.

This is useful for a query like the following, which filters on two columns of type STRING and INT respectively, with an exact match on the first and a range filter on the second:

SELECT COUNT(*) 
FROM Foo 
WHERE STRING_COL = 'ABCDCD' 
AND INT_COL > 2000

For arbitrary text data that falls into the BLOB/CLOB territory, we need more than exact matches. This often involves using regex, phrase, and fuzzy queries on BLOB-like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large BLOB of text using the TEXT_MATCH function, like this:

SELECT COUNT(*) 
FROM Foo 
WHERE TEXT_MATCH (<column_name>, '<search_expression>')

where <column_name> is the column the text index is created on and <search_expression> conforms to one of the following:

| Search Expression Type | Example |
| --- | --- |
| Phrase query | TEXT_MATCH (<column_name>, '"distributed system"') |
| Term query | TEXT_MATCH (<column_name>, 'Java') |
| Boolean query | TEXT_MATCH (<column_name>, 'Java AND c++') |
| Prefix query | TEXT_MATCH (<column_name>, 'stream*') |
| Regex query | TEXT_MATCH (<column_name>, '/Exception.*/') |

Current restrictions

Pinot supports text search with the following requirements:

The column type should be STRING.
The column should be single-valued.
Using a text index in coexistence with other Pinot indexes is not supported.

Sample Datasets

Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.

Apache Access Log

Consider the following snippet from an Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.

Let's say the following snippet of data is stored in the ACCESS_LOG_COL column in a Pinot table.

109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
83.167.113.100 - - [12/Dec/2015:18:31:25 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
83.167.113.100 - - [12/Dec/2015:18:31:25 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
95.29.198.15 - - [12/Dec/2015:18:32:10 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
95.29.198.15 - - [12/Dec/2015:18:32:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
91.227.29.79 - - [12/Dec/2015:18:33:51 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"

Here are some examples of search queries on this data:

Count the number of GET requests.

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'GET')

Count the number of POST requests that have administrator in the URL (administrator/index)

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index')

Count the number of POST requests that have a particular URL and were handled by the Firefox browser

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index AND firefox')

Resume text

Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.

This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.

Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,

Here are some examples of search queries on this data:

Find the candidates that have "machine learning" and "gpu processing": This is a phrase search (covered in more detail later in this document) that looks for exact matches of the phrases "machine learning" and "gpu processing", not necessarily in the same order in the original data.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "gpu processing"')

Find the candidates that have "distributed systems" and either 'Java' or 'C++': This combines a search for the exact phrase "distributed systems" with other terms.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')

Query Log

Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY_LOG_COL column in a Pinot table.

SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1545436800000 AND 1553212800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1537228800000 AND 1537660800000 GROUP BY dimensionCol3 TOP 2500
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1561366800000 AND 1561370399999 AND dimensionCol3 = 2019062409 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563807600000 AND 1563811199999 AND dimensionCol3 = 2019072215 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563811200000 AND 1563814799999 AND dimensionCol3 = 2019072216 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1566327600000 AND 1566329400000 AND dimensionCol3 = 2019082019 LIMIT 10000
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560834000000 AND 1560837599999 AND dimensionCol3 = 2019061805 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560870000000 AND 1560871800000 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560871800001 AND 1560873599999 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560873600000 AND 1560877199999 AND dimensionCol3 = 2019061816 LIMIT 0

Here are some examples of search queries on this data:

Count the number of queries that have GROUP BY

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(QUERY_LOG_COL, '"group by"')

Count the number of queries that have the SELECT count... pattern

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(QUERY_LOG_COL, '"select count"')

Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(QUERY_LOG_COL, '"timestamp between" AND "group by"')

Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.

A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.

The text index is an addition to the types of per-column indexes users can create in Pinot. However, a text index is only supported on a RAW column, not on a dictionary-encoded column.

Enable a text index

Enable a text index on a column in the table configuration by adding a new section with the name "fieldConfigList".

"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexTypes":["TEXT"]
  },
  {
     "name":"text_col_2",
     "encodingType":"RAW",
     "indexTypes":["TEXT"]
  }
]

Each column that has a text index should also be listed under noDictionaryColumns in tableIndexConfig:

"tableIndexConfig": {
   "noDictionaryColumns": [
     "text_col_1",
     "text_col_2"
   ]
}

You can configure text indexes in the following scenarios:

Adding a new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling a text index on an existing column.

When you're using a text index, add the indexed column to the noDictionaryColumns list to reduce unnecessary storage overhead.

For instructions on that configuration property, see the Raw value forward index documentation.
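Because the text index settings live in two places (fieldConfigList and tableIndexConfig), it is easy to update one and forget the other. The following Python sketch checks a table config dictionary for text-indexed columns that are missing from noDictionaryColumns. The helper name is hypothetical and this is not part of Pinot's tooling; it only inspects the JSON keys shown on this page.

```python
def text_columns_missing_from_no_dictionary(table_config):
    """Return text-indexed columns that are not listed in noDictionaryColumns."""
    index_config = table_config.get("tableIndexConfig", {})
    no_dict = set(index_config.get("noDictionaryColumns", []))
    missing = []
    for field in table_config.get("fieldConfigList", []):
        # Accept both the "indexTypes" list and the single "indexType" form,
        # since both spellings appear in the examples on this page.
        index_types = list(field.get("indexTypes", []))
        if "indexType" in field:
            index_types.append(field["indexType"])
        if "TEXT" in index_types and field["name"] not in no_dict:
            missing.append(field["name"])
    return missing

table_config = {
    "fieldConfigList": [
        {"name": "text_col_1", "encodingType": "RAW", "indexTypes": ["TEXT"]},
        {"name": "text_col_2", "encodingType": "RAW", "indexTypes": ["TEXT"]},
    ],
    "tableIndexConfig": {"noDictionaryColumns": ["text_col_1"]},
}

print(text_columns_missing_from_no_dictionary(table_config))  # ['text_col_2']
```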

Text index creation

Once the text index is enabled on one or more columns through a table configuration, segment generation code will automatically create the text index (per column).

Text index is supported for both offline and real-time segments.

Text parsing and tokenization

The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.

Pinot's text index is built on top of Lucene. Lucene's standard English text tokenizer generally works well for most classes of text. If a custom text parser and tokenizer is needed to suit particular requirements, this can be made configurable on a per-column text-index basis.

There is a default set of "stop words" built in Pinot's text index. This is a set of high frequency words in English that are excluded for search efficiency and index size, including:

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", 
"they", "this", "to", "was", "will", "with", "those"

Any occurrence of these words will be ignored by the tokenizer during index creation and search.

In some cases, users might want to customize the set. A good example is when IT (Information Technology) appears in the text and collides with "it", or when some context-specific words are not informative in the search. To do this, configure the words in fieldConfig to include in or exclude from the default stop words:

"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexType":"TEXT",
     "properties": {
        "stopWordInclude": "incl1, incl2, incl3",
        "stopWordExclude": "it"
     }
  }
]

The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.
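The effect of these two properties can be pictured with a small Python sketch. It is a toy model, not Lucene's analyzer: the tokenizer is a simple regex, and it assumes that included words are added to the default set first and excluded words are removed afterwards, so a word in both lists ends up not being a stop word. The function names are hypothetical.

```python
import re

# Default stop words as listed above.
DEFAULT_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "than", "there", "these", "they", "this", "to", "was",
    "will", "with", "those",
}

def parse_words(csv):
    """Split a comma-separated property value into a set of lowercase words."""
    return {w.strip().lower() for w in csv.split(",") if w.strip()}

def effective_stop_words(include="", exclude=""):
    """Assumed order: add stopWordInclude first, then remove stopWordExclude."""
    return (DEFAULT_STOP_WORDS | parse_words(include)) - parse_words(exclude)

def index_terms(text, stop_words):
    """Toy tokenizer: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

text = "IT is a team in the company"

print(index_terms(text, effective_stop_words()))
# ['team', 'company']
print(index_terms(text, effective_stop_words(include="team", exclude="it")))
# ['it', 'company']
```

With the default set, "IT" is dropped along with "is", "a", "in", and "the"; excluding "it" from the stop words makes it searchable again.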

Writing text search queries

The TEXT_MATCH function enables using text search in SQL/PQL.

TEXT_MATCH(text_column_name, search_expression)

text_column_name - name of the column to do text search on.
search_expression - search query

You can use the TEXT_MATCH function in the WHERE clause of a query, like this:

SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...)
SELECT * FROM Foo WHERE TEXT_MATCH(...)

You can also use the TEXT_MATCH filter clause with other filter operators. For example:

SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000 AND some_other_column_2 < 100000

You can combine multiple TEXT_MATCH filter clauses:

SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(text_col_1, ....) AND TEXT_MATCH(text_col_2, ...)

TEXT_MATCH can be used in the WHERE clause of all kinds of queries supported by Pinot:

Selection query which projects one or more columns
- You can also include the text column name in the select list
Aggregation query
Aggregation GROUP BY query

The search expression (the second argument to the TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.

Phrase query

This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.

The following example builds on the earlier resume text data to walk through queries; the 19 documents used are listed below. Here, "document" means a column value. The data is stored in the SKILLS_COL column and we have created a text index on this column.

Java, C++, worked on open source projects, coursera machine learning
Machine learning, Tensor flow, Java, Stanford university,
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrase "Distributed systems":

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Distributed systems"')

The search expression is '\"Distributed systems\"'

The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
- '\"<your phrase>\"'

The above query will match the following documents:

Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution

But it won't match the following document:

Distributed data processing, systems design experience

This is because the phrase query looks for the phrase occurring in the original document "as is". The terms as specified by the user in phrase should be in the exact same order in the original document for the document to be considered as a match.

NOTE: Matching is always done in a case-insensitive manner.

The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrase "query processing":

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"query processing"')

The above query will match the following documents:

Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution

Term query

Term queries are used to search for individual terms.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.

As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, 'Java')

Composite query using Boolean operators

The Boolean operators AND and OR are supported and we can use them to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrases "machine learning" and "tensor flow". This combines two phrases using the AND Boolean operator.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "Tensor Flow"')

The above query will match the following documents:

Machine learning, Tensor flow, Java, Stanford university,
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems

This example queries the SKILLS_COL column to look for documents where each document MUST contain the phrase "machine learning" and the terms 'gpu' and 'python'. This combines a phrase and two terms using Boolean operators.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND gpu AND python')

The above query will match the following documents:

CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems

When using Boolean operators to combine term(s) and phrase(s) or both, note that:

The matching document can contain the terms and phrases in any order.
The matching document may not have the terms adjacent to each other (if this is needed, use appropriate phrase query).

Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using the AND operator in the search expression, the OR operator is used by default.

This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:

phrase "distributed systems" OR
term 'java' OR
term 'C++'.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" Java C++')

Grouping using parentheses is supported:

This example queries the SKILLS_COL column to look for documents where each document MUST contain

phrase "distributed systems" AND
at least one of the terms Java or C++

Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND is used to combine this group with the phrase "distributed systems".

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')

Prefix query

Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.

This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams, etc.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, 'stream*')

The above query will match the following documents:

Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
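To see why these documents match, consider the following toy Python model of a prefix query. It is only a sketch: it tokenizes on commas and whitespace rather than using Lucene's analyzer, and the `matches_prefix` helper is hypothetical. The point it illustrates is that the prefix is compared against individual terms, not against the whole document or a phrase.

```python
import re

def tokenize(text):
    """Toy tokenizer: lowercase and split on commas and whitespace."""
    return [t for t in re.split(r"[,\s]+", text.lower()) if t]

def matches_prefix(document, prefix):
    """True if any single term in the document starts with the prefix."""
    prefix = prefix.lower()
    return any(token.startswith(prefix) for token in tokenize(document))

documents = [
    "Distributed systems, Java, realtime streaming systems, Machine learning",
    "Big data stream processing, Apache Flink, Apache Beam, database kernel",
    "Distributed systems, Java, database engine, cluster management",
]

print([matches_prefix(doc, "stream") for doc in documents])
# [True, True, False]
```

The first document matches through the term "streaming" and the second through "stream"; the third has no term beginning with "stream".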

Regular Expression Query

Phrase and term queries work on the fundamental logic of looking up the terms in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized, and individual "indexable" terms are extracted. These terms are inserted into the index.

Based on the nature of the original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.

Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.

Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).

SELECT SKILLS_COL 
FROM MyTable 
WHERE text_match(SKILLS_COL, '/.*Exception/')

The above query will match any text document containing "exception".

Deciding Query Types

Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.

The key thing to remember is that phrases should be used when the order of terms in the document is important and when separating the phrase into individual terms doesn't make sense from end user's perspective.

An example would be phrase "machine learning".

TEXT_MATCH(column, '"machine learning"')

However, if we are searching for documents matching the Java and C++ terms, using the phrase query "Java C++" will actually return partial results (possibly none), since we are now relying on the user specifying these skills in the exact same order (adjacent to each other) in the resume text.

TEXT_MATCH(column, '"Java C++"')

A term query using the Boolean AND operator is more appropriate for such cases:

TEXT_MATCH(column, 'Java AND C++')
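The difference between the two expressions can be demonstrated with a toy Python model. This is a sketch under simplifying assumptions: the tokenizer splits on commas and whitespace and keeps "c++" as one token, which is not how Lucene's standard tokenizer behaves, and the helper names are hypothetical. The documents are shortened versions of the resume lines used earlier.

```python
import re

def tokenize(text):
    """Toy tokenizer: lowercase and split on commas and whitespace."""
    return [t for t in re.split(r"[,\s]+", text.lower()) if t]

def matches_phrase(document, phrase):
    """True if the phrase terms appear adjacent and in order in the document."""
    doc_tokens = tokenize(document)
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def matches_all_terms(document, terms):
    """True if every term appears somewhere in the document (Boolean AND)."""
    doc_tokens = set(tokenize(document))
    return all(term.lower() in doc_tokens for term in terms)

documents = [
    "Distributed systems, Java, C++, Go, distributed query engines",
    "Java, Python, C++, Machine learning, concurrency",
    "C++, Python, Tensor flow, database kernel, storage",
]

print([matches_phrase(doc, "Java C++") for doc in documents])
# [True, False, False]
print([matches_all_terms(doc, ["Java", "C++"]) for doc in documents])
# [True, True, False]
```

The second document lists both skills but with "Python" in between, so the phrase query misses it while the AND query finds it.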

Text Index Tuning

To improve Lucene index creation time, some configs have been provided. The field config properties luceneUseCompoundFile and luceneMaxBufferSizeMB can provide faster index writing, but may increase file descriptors and/or memory pressure.

0.10.0

Summary

This release introduces some great new features, performance enhancements, UI improvements, and bug fixes, which are described in detail in the following sections. The release was cut from this commit fd9c58a.

Dependency Graph

The dependency graph for plug-and-play architecture that was introduced in release 0.3.0 has been extended and now it contains new nodes for Pinot Segment SPI.

SQL Improvements

Implement NOT Operator (#8148)
Add DistinctCountSmartHLLAggregationFunction which automatically store distinct values in Set or HyperLogLog based on cardinality (#8189)
Add LEAST and GREATEST functions (#8100)
Handle SELECT * with extra columns (#7959)
Add FILTER clauses for aggregates (#7916)
Add ST_Within function (#7990)
Handle semicolon in query (#7861)
Add EXPLAIN PLAN (#7568)

UI Enhancements

Show Reported Size and Estimated Size in human readable format in UI (#8199)
Make query console state URL based (#8194)
Improve query console to not show query result when multiple columns have the same name (#8131)
Improve Pinot dashboard tenant view to show correct amount of servers and brokers (#8115)
Fix issue with opening new tabs from Pinot Dashboard (#8021)
Fix issue with Query console going blank on syntax error (#8006)
Make query stats always show even when there's an error (#7981)
Implement OIDC auth workflow in UI (#7121)
Add tooltip and modal for table status (#7899)
Add option to wrap lines in custom code mirror (#7857)
Add ability to comment out queries with cmd + / (#7841)
Return exception when unavailable segments on empty broker response (#7823)
Properly handle the case where segments are missing in externalview (#7803)
Add TIMESTAMP to datetime column Type (#7746)

Performance Improvements

Reuse regex matcher in dictionary based LIKE queries (#8261)
Early terminate orderby when columns already sorted (#8228)
Do not do another pass of Query Automaton Minimization (#8237)
Improve RangeBitmap by upgrading RoaringBitmap (#8206)
Optimize geometry serializer usage when literal is available (#8167)
Improve performance of no-dictionary group by (#8195)
Allocation free DataBlockCache lookups (#8140)
Prune unselected THEN statements in CaseTransformFunction (#8138)
Aggregation delay conversion to double (#8139)
Reduce object allocation rate in ExpressionContext or FunctionContext (#8124)
Lock free DimensionDataTableManager (#8102)
Improve json path performance during ingestion by upgrading JsonPath (#7819)
Reduce allocations and speed up StringUtil.sanitizeString (#8013)
Faster metric scans - ForwardIndexReader (#7920)
Unpeel group by 3 ways to enable vectorization (#7949)
Power of 2 fixed size chunks (#7934)
Don't use mmap for compression except for huge chunks (#7931)
Exit group-by marking loop early (#7935)
Improve performance of base chunk forward index write (#7930)
Cache JsonPaths to prevent compilation per segment (#7826)
Use LZ4 as default compression mode (#7797)
Peel off special case for 1 dimensional groupby (#7777)
Bump roaringbitmap version to improve range queries performance (#7734)

Other Notable Features

Adding NoopPinotMetricFactory and corresponding changes (#8270)
Allow to specify fixed segment name for SegmentProcessorFramework (#8269)
Move all prestodb dependencies into a separated module (#8266)
Include docIds in Projection and Transform block (#8262)
Automatically update broker resource on broker changes (#8249)
Update ScalarFunction annotation from name to names to support function alias. (#8252)
Implemented BoundedColumnValue partition function (#8224)
Add copy recursive API to pinotFS (#8200)
Add Support for Getting Live Brokers for a Table (without type suffix) (#8188)
Pinot docker image - cache prometheus rules (#8241)
In BrokerRequestToQueryContextConverter, remove unused filterExpressionContext (#8238)
Adding retention period to segment delete REST API (#8122)
Pinot docker image - upgrade prometheus and scope rulesets to components (#8227)
Allow segment name postfix for SegmentProcessorFramework (#8230)
Superset docker image - update pinotdb version in superset image (#8231)
Add retention period to deleted segment files and allow table level overrides (#8176)
Remove incubator from pinot and superset (#8223)
Adding table config overrides for disabling groovy (#8196)
Optimise sorted docId iteration order in mutable segments (#8213)
Adding secure grpc query server support (#8207)
Move Tls configs and utils from pinot-core to pinot-common (#8210)
Reduce allocation rate in LookupTransformFunction (#8204)
Allow subclass to customize what happens pre/post segment uploading (#8203)
Enable controller service auto-discovery in Jersey framework (#8193)
Add support for pushFileNamePattern in pushJobSpec (#8191)
Add additionalMatchLabels to helm chart (#7177)
Simulate rsvps after meetup.com retired the feed (#8180)
Adding more checkstyle rules (#8197)
Add persistence.extraVolumeMounts and persistence.extraVolumes to Kubernetes statefulsets (#7486)
Adding scala profile for kafka 2.x build and remove root pom scala dependencies (#8174)
Allow real-time data providers to accept non-kafka producers (#8190)
Enhance revertReplaceSegments api (#8166)
Adding broker level config for disabling Pinot queries with Groovy (#8159)
Make presto driver query pinot server with SQL (#8186)
Adding controller config for disabling Groovy in ingestionConfig (#8169)
Adding main method for LaunchDataIngestionJobCommand for spark-submit command (#8168)
Add auth token for segment replace rest APIs (#8146)
Add allowRefresh option to UploadSegment (#8125)
Add Ingress to Broker and Controller helm charts (#7997)
Improve progress reporter in SegmentCreationMapper (#8129)
St_* function error messages + support literal transform functions (#8001)
Add schema and segment crc to SegmentDirectoryContext (#8127)
Extend enableParallelPushProtection support in UploadSegment API (#8110)
Support BOOLEAN type in Config Recommendation Engine (#8055)
Add a broker metric to distinguish exception happens when acquire channel lock or when send request to server (#8105)
Add pinot.minion prefix on minion configs for consistency (#8109)
Enable broker service auto-discovery in Jersey framework (#8107)
Timeout if waiting server channel lock takes a long time (#8083)
Wire EmptySegmentPruner to routing config (#8067)
Support for TIMESTAMP data type in Config Recommendation Engine (#8087)
Listener TLS customization (#8082)
Add consumption rate limiter for LLConsumer (#6291)
Implement Real Time Mutable FST (#8016)
Allow quickstart to get table files from filesystem (#8093)
Add support for instant segment deletion (#8077)
Add a config file to override quickstart configs (#8059)
Add pinot server grpc metadata acl (#8030)
Move compatibility verifier to a separate module (#8049)
Move hadoop and spark ingestion libs from plugins directory to external-plugins (#8048)
Add global strategy for partial upsert (#7906)
Upgrade kafka to 2.8.1 (#7883)
Created EmptyQuickstart command (#8024)
Allow SegmentPushUtil to push real-time segment (#8032)
Add ignoreMerger for partial upsert (#7907)
Make task timeout and concurrency configurable (#8028)
Return 503 response from health check on shut down (#7892)
Pinot-druid-benchmark: set the multiValueDelimiterEnabled to false when importing TPC-H data (#8012)
Cleanup: Remove remaining occurrences of incubator. (#8023)
Refactor segment loading logic in BaseTableDataManager to decouple it with local segment directory (#7969)
Improving segment replacement/revert protocol (#7995)
PinotConfigProvider interface (#7984)
Enhance listSegments API to exclude the provided segments from the output (#7878)
Remove outdated broker metric definitions (#7962)
Add skip key for realtimeToOffline job validation (#7921)
Upgrade async-http-client (#7968)
Allow Reloading Segments with Multiple Threads (#7893)
Ignore query options in commented out queries (#7894)
Remove TableConfigCache which does not listen on ZK changes (#7943)
Switch to zookeeper of helm 3.0x (#7955)
Use a single react hook for table status modal (#7952)
Add debug logging for real-time ingestion (#7946)
Separate the exception for transform and indexing for consuming records (#7926)
Disable JsonStatementOptimizer (#7919)
Make index readers/loaders pluggable (#7897)
Make index creator provision pluggable (#7885)
Support loading plugins from multiple directories (#7871)
Update helm charts to honour readinessEnabled probes flags on the Controller, Broker, Server and Minion StatefulSets (#7891)
Support non-selection-only GRPC server request handler (#7839)
GRPC broker request handler (#7838)
Add validator for SDF (#7804)
Support large payload in zk put API (#7364)
Push JSON Path evaluation down to storage layer (#7820)
When upserting new record, index the record before updating the upsert metadata (#7860)
Add Post-Aggregation Gapfilling functionality. (#7781)
Clean up deprecated fields from segment metadata (#7853)
Remove deprecated method from StreamMetadataProvider (#7852)
Obtain replication factor from tenant configuration in case of dimension table (#7848)
Use valid bucket end time instead of segment end time for merge/rollup delay metrics (#7827)
Make pinot start components command extensible (#7847)
Make upsert inner segment update atomic (#7844)
Clean up deprecated ZK metadata keys and methods (#7846)
Add extraEnv, envFrom to statefulset helm template (#7833)
Make openjdk image name configurable (#7832)
Add getPredicate() to PredicateEvaluator interface (#7840)
Make split commit the default commit protocol (#7780)
Pass Pinot connection properties from JDBC driver (#7822)
Add Pinot client connection config to allow skip fail on broker response exception (#7816)
Change default range index version to v2 (#7815)
Put thread timer measuring inside of wall clock timer measuring (#7809)
Add getRevertReplaceSegmentRequest method in FileUploadDownloadClient (#7796)
Add JAVA_OPTS env var in docker image (#7799)
Split thread cpu time into three metrics (#7724)
Add config for enabling real-time offset based consumption status checker (#7753)
Add timeColumn, timeUnit and totalDocs to the json segment metadata (#7765)
Set default Dockerfile CMD to -help (#7767)
Add getName() to PartitionFunction interface (#7760)
Support Native FST As An Index Subtype for FST Indices (#7729)
Add forceCleanup option for 'startReplaceSegments' API (#7744)
Add config for keystore types, switch tls to native implementation, and add authorization for server-broker tls channel (#7653)
Extend FileUploadDownloadClient to send post request with json body (#7751)

Major Bug Fixes

Fix string comparisons (#8253)
Bugfix for order-by all sorted optimization (#8263)
Fix dockerfile (#8239)
Ensure partition function never returns negative partition (#8221)
Handle indexing failures without corrupting inverted indexes (#8211)
Fixed broken HashCode partitioning (#8216)
Fix segment replace test (#8209)
Fix filtered aggregation when it is mixed with regular aggregation (#8172)
Fix FST Like query benchmark to remove SQL parsing from the measurement (#8097)
Do not identify function types by throwing exceptions (#8137)
Fix regression bug caused by sharing TSerializer across multiple threads (#8160)
Fix validation before creating a table (#8103)
Check cron schedules from table configs after subscribing child changes (#8113)
Disallow duplicate segment name in tar file (#8119)
Fix storage quota checker NPE for Dimension Tables (#8132)
Fix TraceContext NPE issue (#8126)
Update gcloud libraries to fix underlying issue with api's with CMEK (#8121)
Fix error handling in jsonPathArray (#8120)
Fix error handling in json functions with default values (#8111)
Fix controller config validation failure for customized TLS listeners (#8106)
Validate the numbers of input and output files in HadoopSegmentCreationJob (#8098)
Broker Side validation for the query with aggregation and col but without group by (#7972)
Improve the proactive segment clean-up for REVERTED (#8071)
Allow JSON forward indexes (#8073)
Fix the PinotLLCRealtimeSegmentManager on segment name check (#8058)
Always use smallest offset for new partitionGroups (#8053)
Fix RealtimeToOfflineSegmentsTaskExecutor to handle time gap (#8054)
Refine segment consistency checks during segment load (#8035)
Fixes for various JDBC issues (#7784)
Delete tmp- segment directories on server startup (#7961)
Fix ByteArray datatype column metadata getMaxValue NPE bug and expose maxNumMultiValues (#7918)
Fix the issues that Pinot upsert table's uploaded segments get deleted when a server restarts. (#7979)
Fixed segment upload error return (#7957)
Fix QuerySchedulerFactory to plug in custom scheduler (#7945)
Fix the issue with grpc broker request handler not started correctly (#7950)
Fix real-time ingestion when an entire batch of messages is filtered out (#7927)
Move decode method before calling acquireSegment to avoid reference count leak (#7938)
Fix semaphore issue in consuming segments (#7886)
Add bootstrap mode for PinotServiceManager to avoid glitch for health check (#7880)
Fix the broker routing when segment is deleted (#7817)
Fix obfuscator not capturing secretkey and keytab (#7794)
Fix segment merge delay metric when there is empty bucket (#7761)
Fix QuickStart by adding types for invalid/missing type (#7768)
Use oldest offset on newly detected partitions (#7756)
Fix javadoc to be compatible with jdk8 source (#7754)
Handle null segment lineage ZNRecord for getSelectedSegments API (#7752)
Handle fields missing in the source in ParquetNativeRecordReader (#7742)

Backward Incompatible Changes

Fix the issue with HashCode partitioning function (#8216)
Fix the issue with validation on table creation (#8103)
Change PinotFS API's (#8603)

GapFill Function For Time-Series Dataset

Many datasets are time series in nature, tracking the state changes of an entity over time. The recorded data points might be sparse, or events could be missing because of network and device issues in an IoT environment. However, analytics applications that track the state changes of these entities over time might query for values at a finer granularity than the metric interval.

Here is a sample data set tracking the status of parking lots in a parking space.

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:01:00.000 |  |
|  | 2021-10-01 09:17:00.000 |  |
|  | 2021-10-01 09:33:00.000 |  |
|  | 2021-10-01 09:47:00.000 |  |
|  | 2021-10-01 10:05:00.000 |  |
|  | 2021-10-01 10:06:00.000 |  |
|  | 2021-10-01 10:16:00.000 |  |
|  | 2021-10-01 10:31:00.000 |  |
|  | 2021-10-01 11:17:00.000 |  |
|  | 2021-10-01 11:54:00.000 |  |

We want to find the total number of parking lots that are occupied over a period of time, which is a common use case for a company that manages parking spaces.

Take a 30-minute time bucket as an example:

timeBucket/lotId

2021-10-01 09:00:00.000

2021-10-01 09:30:00.000

0,1

2021-10-01 10:00:00.000

0,1

2021-10-01 10:30:00.000

2021-10-01 11:00:00.000

2021-10-01 11:30:00.000

The table above has a lot of missing data for parking lots inside the time buckets. To calculate the number of occupied parking lots per time bucket, we need to gap fill the missing data.

The Ways of Gap Filling the Data

There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.

FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity (in this case, the parking lot) if a previous value exists. Otherwise, it will be filled with the default value.

FILL_DEFAULT_VALUE means that the missing data will be filled with the default value. For numeric columns, the default value is 0. For BOOLEAN columns, it is false. For TIMESTAMP, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON, and BYTES, it is an empty string. For array columns, it is an empty array.
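The two strategies can be sketched for a single entity as follows. This is an illustration of the rules described above, not Pinot's implementation; the function name, the bucket labels, and the observed values are hypothetical.

```python
def gapfill(buckets, observed, strategy, default=0):
    """Fill missing buckets for one entity.

    buckets  : ordered list of bucket labels
    observed : dict of bucket label -> value for buckets that have data
    strategy : "FILL_PREVIOUS_VALUE" or "FILL_DEFAULT_VALUE"
    """
    filled, previous = [], None
    for bucket in buckets:
        if bucket in observed:
            previous = observed[bucket]   # remember the latest real value
            filled.append(previous)
        elif strategy == "FILL_PREVIOUS_VALUE" and previous is not None:
            filled.append(previous)       # carry the previous value forward
        else:
            filled.append(default)        # no previous value, or default strategy
    return filled

buckets = ["09:00", "09:30", "10:00", "10:30"]
observed = {"09:30": 1, "10:30": 0}       # hypothetical readings for one lot

print(gapfill(buckets, observed, "FILL_PREVIOUS_VALUE"))  # [0, 1, 1, 0]
print(gapfill(buckets, observed, "FILL_DEFAULT_VALUE"))   # [0, 1, 0, 0]
```

The first bucket has no earlier reading, so both strategies fall back to the default there; they differ only at 10:00, where a previous value exists.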

We will use the following query to calculate the total number of occupied parking lots per time bucket.

Aggregation/Gapfill/Aggregation

Query Syntax

SELECT time_col, SUM(status) AS occupied_slots_count
FROM (
    SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
                   '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
                    TIMESERIESON(lotId)), lotId, status
    FROM (
        SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
               lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
        FROM parking_data
        WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
        GROUP BY 1, 2
        ORDER BY 1
        LIMIT 100)
    LIMIT 100)
GROUP BY 1
LIMIT 100
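As a sanity check on the query above, the two epoch-millisecond bounds in the WHERE clause line up with the GAPFILL start and end strings when the timestamps are read as UTC (an assumption of this sketch). The helper names are hypothetical, and the bucketing function only mimics rounding down to a 30-minute boundary.

```python
from datetime import datetime, timezone

BUCKET_MS = 30 * 60 * 1000  # '30:MINUTES' expressed in milliseconds

def to_utc_string(epoch_ms):
    """Format epoch milliseconds as yyyy-MM-dd HH:mm:ss.SSS in UTC."""
    dt = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S.") + f"{epoch_ms % 1000:03d}"

def bucket_start(epoch_ms, bucket_ms=BUCKET_MS):
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    return epoch_ms - epoch_ms % bucket_ms

print(to_utc_string(1633078800000))  # 2021-10-01 09:00:00.000
print(to_utc_string(1633089600000))  # 2021-10-01 12:00:00.000

# An event at 09:17 falls into the bucket that starts at 09:00.
event_ms = 1633078800000 + 17 * 60 * 1000
print(to_utc_string(bucket_start(event_ms)))  # 2021-10-01 09:00:00.000
```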

Workflow

The innermost SQL query converts the raw event table to the following table.

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:00:00.000 |  |
|  | 2021-10-01 09:30:00.000 |  |
|  | 2021-10-01 10:00:00.000 |  |
|  | 2021-10-01 10:30:00.000 |  |
|  | 2021-10-01 11:00:00.000 |  |
|  | 2021-10-01 11:30:00.000 |  |

The middle SQL query gap fills the returned data as follows:

timeBucket/lotId

2021-10-01 09:00:00.000

2021-10-01 09:30:00.000

2021-10-01 10:00:00.000

2021-10-01 10:30:00.000

2021-10-01 11:00:00.000

2021-10-01 11:30:00.000

The outermost query will aggregate the gapfilled data as follows:

| timeBucket | totalNumOfOccupiedSlots |
| 2021-10-01 09:00:00.000 |  |
| 2021-10-01 09:30:00.000 |  |
| 2021-10-01 10:00:00.000 |  |
| 2021-10-01 10:30:00.000 |  |
| 2021-10-01 11:00:00.000 |  |
| 2021-10-01 11:30:00.000 |  |

This example assumes that the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation steps do not sort the data.

The above example shows the use case where all three steps happen:

The raw data will be aggregated;
The aggregated data will be gapfilled;
The gapfilled data will be aggregated.
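The three steps above can be sketched in plain Python. This is a conceptual illustration rather than how Pinot executes the query: the events are hypothetical, time is expressed in minutes since the start of the window, and the function names are made up for this example.

```python
from collections import defaultdict

BUCKET = 30  # bucket width in minutes, mirroring '30:MINUTES'

# Hypothetical events: (lotId, minutes since the window start, is_occupied).
events = [("P1", 1, 1), ("P2", 17, 1), ("P1", 33, 0), ("P1", 47, 1), ("P2", 66, 0)]

def aggregate_last(rows):
    """Step 1: keep the latest is_occupied per (bucket, lot)."""
    latest = {}
    for lot, minute, occupied in rows:
        key = (minute - minute % BUCKET, lot)
        if key not in latest or minute >= latest[key][0]:
            latest[key] = (minute, occupied)
    return {key: occupied for key, (_, occupied) in latest.items()}

def gapfill_previous(status, start, end, lots, default=0):
    """Step 2: give every lot a value in every bucket of [start, end)."""
    filled = {}
    for lot in lots:
        previous = default
        for bucket in range(start, end, BUCKET):
            previous = status.get((bucket, lot), previous)
            filled[(bucket, lot)] = previous
    return filled

def sum_per_bucket(filled):
    """Step 3: total occupied lots per bucket."""
    totals = defaultdict(int)
    for (bucket, _lot), occupied in filled.items():
        totals[bucket] += occupied
    return dict(sorted(totals.items()))

status = aggregate_last(events)
lots = sorted({lot for lot, _, _ in events})
filled = gapfill_previous(status, 0, 120, lots)
print(sum_per_bucket(filled))  # {0: 2, 30: 2, 60: 1, 90: 1}
```

Without step 2, the bucket starting at minute 90 would have no rows at all and would be missing from the final result instead of reporting one occupied lot.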

There are three more scenarios we can support.

Select/Gapfill

If we want to gapfill the missing data per half-hour time bucket, here is the query:

Query Syntax

SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
FROM parking_data
WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
ORDER BY 1
LIMIT 100

Workflow

At first the raw data will be transformed as follows:

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:00:00.000 |  |
|  | 2021-10-01 09:30:00.000 |  |
|  | 2021-10-01 10:00:00.000 |  |
|  | 2021-10-01 10:30:00.000 |  |
|  | 2021-10-01 11:00:00.000 |  |
|  | 2021-10-01 11:30:00.000 |  |

Then it will be gapfilled as follows:

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:00:00.000 |  |
|  | 2021-10-01 09:30:00.000 |  |
|  | 2021-10-01 10:00:00.000 |  |
|  | 2021-10-01 10:30:00.000 |  |
|  | 2021-10-01 11:00:00.000 |  |
|  | 2021-10-01 11:30:00.000 |  |

Aggregate/Gapfill

Query Syntax

SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)), lotId, status
FROM (
    SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
           lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
    FROM parking_data
    WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
    GROUP BY 1, 2
    ORDER BY 1
    LIMIT 100)
LIMIT 100

Workflow

The nested SQL query converts the raw event table to the following table.

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:00:00.000 |  |
|  | 2021-10-01 09:30:00.000 |  |
|  | 2021-10-01 10:00:00.000 |  |
|  | 2021-10-01 10:30:00.000 |  |
|  | 2021-10-01 11:00:00.000 |  |
|  | 2021-10-01 11:30:00.000 |  |

The outer SQL query gap fills the returned data as follows:

timeBucket/lotId

2021-10-01 09:00:00.000

2021-10-01 09:30:00.000

2021-10-01 10:00:00.000

2021-10-01 10:30:00.000

2021-10-01 11:00:00.000

2021-10-01 11:30:00.000

Gapfill/Aggregate

Query Syntax

SELECT time_col, SUM(is_occupied) AS occupied_slots_count
FROM (
    SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
           '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
           TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
    FROM parking_data
    WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
    ORDER BY 1
    LIMIT 100)
GROUP BY 1
LIMIT 100

Workflow

First, the raw data will be transformed as follows:

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:00:00.000 |  |
|  | 2021-10-01 09:30:00.000 |  |
|  | 2021-10-01 10:00:00.000 |  |
|  | 2021-10-01 10:30:00.000 |  |
|  | 2021-10-01 11:00:00.000 |  |
|  | 2021-10-01 11:30:00.000 |  |

The transformed data will be gap filled as follows:

| lotId | event_time | is_occupied |
|  | 2021-10-01 09:00:00.000 |  |
|  | 2021-10-01 09:30:00.000 |  |
|  | 2021-10-01 10:00:00.000 |  |
|  | 2021-10-01 10:30:00.000 |  |
|  | 2021-10-01 11:00:00.000 |  |
|  | 2021-10-01 11:30:00.000 |  |

The aggregation will generate the following table:

| timeBucket | totalNumOfOccupiedSlots |
| 2021-10-01 09:00:00.000 |  |
| 2021-10-01 09:30:00.000 |  |
| 2021-10-01 10:00:00.000 |  |
| 2021-10-01 10:30:00.000 |  |
| 2021-10-01 11:00:00.000 |  |
| 2021-10-01 11:30:00.000 |  |

Apache Pinot™ 1.0.0 release notes

This page covers the latest changes included in the Apache Pinot™ 1.0.0 release, including new features, enhancements, and bug fixes.

1.0.0 (2023-09-19)

This release includes several new features, enhancements, and bug fixes, including the following highlights:

Multi-stage query engine: new features, enhancements, and bug fixes. Learn how to enable and use the multi-stage query engine, or learn more about how it works.

Multi-stage query engine new features

Support for window functions
- Initial (phase 1) Query runtime for window functions with ORDER BY within the OVER() clause (#10449)
- Support for the ranking ROW_NUMBER() window function (#10527, #10587)
Set operations support
- Support SetOperations (UNION, INTERSECT, MINUS) compilation in query planner (#10535)
Timestamp and Date Operations
- Support TIMESTAMP type and date ops functions (#11350)
Aggregate functions
- Support more aggregation functions that are currently implementable (#11208)
- Support multi-value aggregation functions (#11216)
Support Sketch based functions (#11153, #11517)
Make Intermediate Stage Worker Assignment Tenant Aware (#10617)
Evaluate literal expressions during query parsing, enabling more efficient query execution (#11438)
Added support for partition parallelism in partitioned table scans, allowing for more efficient data retrieval (#11266)
[multistage] Adding more tuple sketch scalar functions and integration tests (#11517)

Multi-stage query engine enhancements

Turn on v2 engine by default (#10543)
Introduced the ability to stream leaf stage blocks for more efficient data processing (#11472).
Early terminate SortOperator if there is a limit (#11334)
Implement ordering for SortExchange (#10408)
Table level Access Validation, QPS Quota, Phase Metrics for multistage queries (#10534)
Support partition based leaf stage processing (#11234)
Populate queryOption down to leaf (#10626)
Pushdown explain plan queries from the controller to the broker (#10505)
Enhanced the multi-stage group-by executor to support limiting the number of groups, improving query performance and resource utilization (#11424).
Improved resilience and reliability of the multi-stage join operator, now with added support for hash join right table protection (#11401).

Multi-stage query engine bug fixes

Fix Predicate Pushdown by Using Rule Collection (#10409)
Try fixing mailbox cancel race condition (#10432)
Catch Throwable to Propagate Proper Error Message (#10438)
Fix tenant detection issues (#10546)
Handle Integer.MIN_VALUE in hashCode based FieldSelectionKeySelector (#10596)
Improve error message in case of non-existent table queried from the controller (#10599)
Derive SUM return type to be PostgreSQL compatible (#11151)

Index SPI

Add the ability to include new index types at runtime in Apache Pinot. This opens the ability of adding third party indexes, including proprietary indexes. More details here

Null value support for Pinot queries

NULL support for ORDER BY, DISTINCT, GROUP BY, value transform functions and filtering.

Upsert enhancements

Delete support in upsert enabled tables (#10703)

Extends upserts to allow deleting records from a real-time table. The design details can be found here.

Preload segments with upsert snapshots to speed up table loading (#11020)

Adds a feature to preload segments from a table that uses the upsert snapshot feature. Segments with validDocIds snapshots can be preloaded more efficiently, which speeds up table loading (and therefore server restarts).

TTL configs for upsert primary keys (#10915)

Adds support for specifying expiry TTL for upsert primary key metadata cleanup.

Segment compaction for upsert real-time tables (#10463)

Adds a new minion task to compact segments belonging to a real-time table with upserts.

Pinot Spark Connector for Spark3 (#10394)

Added spark3 support for Pinot Spark Connector (#10394)
Also added support to pass pinot query options to spark connector (#10443)

PinotDataBufferFactory and new PinotDataBuffer implementations (#10528)

Adds new implementations of PinotDataBuffer that use Unsafe Java APIs and foreign memory APIs. Also adds support for PinotDataBufferFactory to allow plugging in custom PinotDataBuffer implementations.

Query functions enhancements

Add PercentileKLL aggregation function (#10643)
Support for ARG_MIN and ARG_MAX Functions (#10636)
refactor argmin/max to exprmin/max and make it calcite compliant (#11296)
Integer Tuple Sketch support (#10427)
Adding vector scalar functions (#11222)
[feature] multi-value datetime transform variants (#10841)
FUNNEL_COUNT Aggregation Function (#10867)
[multistage] Add support for RANK and DENSE_RANK ranking window functions (#10700)
add theta sketch scalar (#11153)
Register dateTimeConverter,timeConvert,dateTrunc, regexpReplace to v2 functions (#11097)
Add extract(quarter/dow/doy) support (#11388)
Funnel Count - Multiple Strategies (no partitioning requisites) (#11092)
Add Boolean assertion transform functions. (#11547)
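For readers unfamiliar with the two ranking window functions listed above, the following sketch shows the general SQL semantics of RANK and DENSE_RANK over a single partition ordered ascending. It is an illustration of the standard definitions, not Pinot's implementation, and the function name is hypothetical.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values sorted ascending."""
    ordered = sorted(values)
    result, rank, dense = [], 0, 0
    for position, value in enumerate(ordered, start=1):
        if position == 1 or value != ordered[position - 2]:
            rank = position   # RANK jumps to the row position after ties
            dense += 1        # DENSE_RANK advances by one per distinct value
        result.append((value, rank, dense))
    return result

print(rank_and_dense_rank([10, 20, 20, 30]))
# [(10, 1, 1), (20, 2, 2), (20, 2, 2), (30, 4, 3)]
```

Tied rows share a rank under both functions; RANK leaves a gap after the tie, while DENSE_RANK does not.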

JSON and CLP encoded message ingestion and querying

Add clpDecode transform function for decoding CLP-encoded fields. (#10885)
Add CLPDecodeRewriter to make it easier to call clpDecode with a column-group name rather than the individual columns. (#11006)
Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. (#11210)

Tier level index config override (#10553)

Allows overriding index configs at tier level, allowing for more flexible index configurations for different tiers.

Ingestion connectors and features

Kinesis stream header extraction (#9713)
Extract record keys, headers and metadata from Pulsar sources (#10995)
Realtime pre-aggregation for Distinct Count HLL & Big Decimal (#10926)
Added support to skip unparseable records in the csv record reader (#11487)
Null support for protobuf ingestion. (#11553)

UI enhancements

Adds persistence of authentication details in the browser session. This means that even if you refresh the app, you will still be logged in until the authentication session expires (#10389)
AuthProvider logic updated to decode the access token and extract user name and email. This information will now be available in the app for features to consume. (#10925)

Pinot docker image improvements and enhancements

Make Pinot base build and runtime images support Amazon Corretto and MS OpenJDK (#10422)
Support multi-arch pinot docker image (#10429)
Update dockerfile with recent jdk distro changes (#10963)

Operational improvements

Rebalance

Rebalance status API (#10359)
Tenant-level rebalance and status tracking APIs (#11128)

Config to use customized broker query thread pool (#10614)

Adds the configuration options below, which enable a bounded thread pool and set its capacity.

pinot.broker.enable.bounded.http.async.executor
pinot.broker.http.async.executor.max.pool.size
pinot.broker.http.async.executor.core.pool.size
pinot.broker.http.async.executor.queue.size

This feature allows better management of broker resources.

Drop results support (#10419)

Adds a parameter to queryOptions to drop the resultTable from the response. This mode can be used to troubleshoot a query (which may have sensitive data in the result) using metadata only.

Make column order deterministic in segment (#10468)

In segment metadata and index map, store columns in alphabetical order so that the result is deterministic. Segments generated before/after this PR will have different CRC, so during the upgrade, we might get segments with different CRC from old and new consuming servers. For the segment consumed during the upgrade, some downloads might be needed.
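Why column order affects the CRC can be shown with a toy example. The sketch below computes a CRC32 over a comma-joined list of column names, which is a hypothetical stand-in for the real segment metadata layout; only the idea that sorting makes the checksum independent of input order is taken from the text above.

```python
import zlib

def metadata_crc(columns, sort_columns):
    """CRC32 over a toy 'metadata' string built from column names."""
    names = sorted(columns) if sort_columns else list(columns)
    return zlib.crc32(",".join(names).encode("utf-8"))

# The same three columns, encountered in two different orders.
order_a = ["lotId", "event_time", "is_occupied"]
order_b = ["event_time", "is_occupied", "lotId"]

# Without sorting, the serialized bytes differ, so the checksums differ too.
print(metadata_crc(order_a, False), metadata_crc(order_b, False))

# With sorting, both orders serialize identically.
print(metadata_crc(order_a, True) == metadata_crc(order_b, True))  # True
```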

Allow configuring helix timeouts for EV dropped in Instance manager (#10510)

Adds options to configure Helix timeouts:

external.view.dropped.max.wait.ms - The duration of time in milliseconds to wait for the external view to be dropped. Default - 20 minutes.
external.view.check.interval.ms - The period in milliseconds in which to ping ZK for latest EV state.

Enable case insensitivity by default (#10771)

This PR makes Pinot case insensitive by default, and removes the deprecated property enable.case.insensitive.pql

Newly added APIs and client methods

Add Server API to get tenant pools (#11273)
Add new broker query point for querying multi-stage engine (#11341)
Add a new controller endpoint for segment deletion with a time window (#10758)
New API to get tenant tags (#10937)
Instance retag validation check api (#11077)
Use PUT request to enable/disable table/instance (#11109)
Update the pinot tenants tables api to support returning broker tagged tables (#11184)
Add requestId for BrokerResponse in pinot-broker and java-client (#10943)
Provide results in CompletableFuture for java clients and expose metrics (#10326)

Cleanup and backward incompatible changes

High level consumers are no longer supported

Cleanup HLC code (#11326)
Remove support for High level consumers in Apache Pinot (#11017)

Type information preservation of query literals

[feature] [backward-incompat] [null support # 2] Preserve null literal information in literal context and literal transform (#10380)

String versions of numerical values are no longer accepted. For example, "123" won't be treated as a number anymore.

Controller job status ZNode path update

Moving Zk updates for reload, force_commit to their own Znodes which … (#10451)

The status of previously completed reload jobs will not be available after this change is deployed.

Metric names for mutable indexes to change

Implement mutable index using index SPI (#10687)

Due to a change in the IndexType enum used for some logs and metrics in mutable indexes, the metric names may change slightly.

Update in controller API to enable / disable / drop instances

Update getTenantInstances call for controller and separate POST operations on it (#10993)

Change in substring query function definition

Change substring to comply with standard sql definition (#11502)
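For reference, the standard SQL definition of SUBSTRING uses a 1-based start position and an optional length. The sketch below models those standard semantics as commonly implemented (for example, a start position below 1 shortens the result). It is not taken from the Pinot source, the function name is hypothetical, and Pinot's behavior on edge cases should be confirmed against the function reference.

```python
def sql_substring(text, start, length=None):
    """Standard SQL SUBSTRING(text FROM start [FOR length]) with 1-based start."""
    if length is not None and length < 0:
        raise ValueError("negative substring length not allowed")
    # Exclusive end position, still 1-based.
    end = len(text) + 1 if length is None else start + length
    first = max(start, 1)
    last = min(end, len(text) + 1)
    if first >= last:
        return ""
    return text[first - 1:last - 1]

print(sql_substring("Pinot", 1, 3))  # Pin
print(sql_substring("Pinot", 2))     # inot
print(sql_substring("Pinot", 0, 3))  # Pi
```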

Full list of features added

Allow queries on multiple tables of same tenant to be executed from controller UI (#10336)
Encapsulate changes in IndexLoadingConfig and SegmentGeneratorConfig (#10352)
[Index SPI] IndexType (#10191)
Simplify filtered aggregate transform operator creation (#10410)
Introduce BaseProjectOperator and ValueBlock (#10405)
Add support to create realtime segment in local (#10433)
Refactor: Pass context instead on individual arguments to operator (#10413)
Add "processAll" mode for MergeRollupTask (#10387)
Upgrade h2 version from 1.x to 2.x (#10456)
Added optional force param to the table configs update API (#10441)
Enhance broker reduce to handle different column names from server response (#10454)
Adding fields to enable/disable dictionary optimization. (#10484)
Remove converted H2 type NUMERIC(200, 100) from BIG_DECIMAL (#10483)
Add JOIN support to PinotQuery (#10421)
Add testng on verifier (#10491)
Clean up temp consuming segment files during server start (#10489)
make pinot k8s sts and deployment start command configurable (#10509)
Fix Bottleneck for Server Bootstrap by Making maxConnsPerRoute Configurable (#10487)
Type match between resultType and function's dataType (#10472)
create segment zk metadata cache (#10455)
Allow ValueBlock length to increase in TransformFunction (#10515)
Allow configuring helix timeouts for EV dropped in Instance manager (#10510)
Enhance error reporting (#10531)
Combine "GET /segments" API & "GET /segments/{tableName}/select" (#10412)
Exposed the CSV header map as part of CSVRecordReader (#10542)
Moving Zk updates for reload,force_commit to their own Znodes which will spread out Zk write load across jobTypes (#10451)
Enabling dictionary override optimization on the segment reload path as well. (#10557)
Make broker's rest resource packages configurable (#10588)
Check EV not exist before allowing creating the table (#10593)
Adding a parameter (toSegments) to the endSegmentReplacement API (#10630)
update target tier for segments if tierConfigs is provided (#10642)
Add support for custom compression factor for Percentile TDigest aggregation functions (#10649)
Utility to convert table config into updated format (#10623)
Segment lifecycle event listener support (#10536)
Add server metrics to capture gRPC activity (#10678)
Separate and parallelize BloomFilter based segment pruner (#10660)
API to expose the contract/rules imposed by pinot on tableConfig #10655
Add description field to metrics in Pinot (#10744)
changing the dedup store to become pluggable #10639
Make the TimeUnit in the DATETRUNC function case insensitive. (#10750)
[feature] Consider tierConfigs when assigning new offline segment #10746
Compress idealstate according to estimated size #10766
10689: Update for pinot helm release version 0.2.7 (#10723)
Fail the query if a filter's rhs contains NULL. (#11188)
Support Off Heap for Native Text Indices (#10842)
refine segment reload executor to avoid creating threads unbounded #10837
compress nullvector bitmap upon seal (#10852)
Enable case insensitivity by default (#10771)
Push out-of-order events metrics for full upsert (#10944)
[feature] add requestId for BrokerResponse in pinot-broker and java-client #10943
Provide results in CompletableFuture for java clients and expose metrics #10326
Add minion observability for segment upload/download failures (#10978)
Enhance early terminate for combine operator (#10988)
Add fromController method that accepts a PinotClientTransport (#11013)
Ensure min/max value generation in the segment metadata. (#10891)
Apply some allocation optimizations on GrpcSendingMailbox (#11015)
When case insensitivity is enabled, don't allow adding a new column whose lowercase name matches that of an existing column. (#10991)
Replace Long attributes with primitive values to reduce boxing (#11059)
retry KafkaConsumer creation in KafkaPartitionLevelConnectionHandler.java (#253) (#11040)
Support for new dateTime format in DateTimeGranularitySpec without explicitly setting size (#11057)
Returning 403 status code in case of authorization failures (#11136)
Simplify compatible test to avoid test against itself (#11163)
Updated code for setting value of segment min/max property. (#10990)
Add stat to track number of segments that have valid doc id snapshots (#11110)
Add brokerId and brokerReduceTimeMs to the broker response stats (#11142)
safely multiply integers to prevent overflow (#11186)
Move largest comparison value update logic out of map access (#11157)
Optimize DimensionTableDataManager to abort unnecessary loading (#11192)
Refine isNullsLast and isAsc functions. (#11199)
Update the pinot tenants tables api to support returning broker tagged tables (#11184)
add multi-value support for native text index (#11204)
Add percentiles report in QuerySummary (#11299)
Add meter for broker responses with unavailable segments (#11301)
Enhance Minion task management (#11315)
add additional lucene index configs (#11354)
Add DECIMAL data type to orc record reader (#11377)
add configuration to fail server startup on non-good status checker (#11347)
allow passing freshness checker after an idle threshold (#11345)
Add broker validation for hybrid tableConfig creation (#7908)
Support partition parallelism for partitioned table scan (#11266)

Vulnerability fixes, bugfixes, cleanups and deprecations

Remove support for High level consumers in Apache Pinot (#11017)
Fix JDBC driver check for username (#10416)
[Clean up] Remove getColumnName() from AggregationFunction interface (#10431)
fix jersey TerminalWriterInterceptor MessageBodyWriter not found issue (#10462)
Bug fix: Start counting operator execution time from first NoOp block (#10450)
Fix unavailable instances issues for StrictReplicaGroup (#10466)
Change shell to bash (#10469)
Fix the double destroy of segment data manager during server shutdown (#10475)
Remove "isSorted()" precondition check in the ForwardIndexHandler (#10476)
Fix null handling in streaming selection operator (#10453)
Fix jackson dependencies (#10477)
Startree index build enhancement (#10905)
optimize queries where lhs and rhs of predicate are equal (#10444)
Trivial fix on a warning detected by static checker (#10492)
wait for full segment commit protocol on force commit (#10479)
Fix bug and add test for noDict -> Dict conversion for sorted column (#10497)
Make column order deterministic in segment (#10468)
Allow empty segmentsTo for segment replacement protocol (#10511)
Use string as default compatible type for coalesce (#10516)
Use threadlocal variable for genericRow to make the MemoryOptimizedTable threadsafe (#10502)
Fix shading in spark2 connector pom file (#10490)
Fix ramping delay caused by long lasting sequence of unfiltered messa… (#10418)
Do not serialize metrics in each Operator (#10473)
Make pinot-controller apply webpack production mode when bin-dist profile is used. (#10525)
Fix FS props handling when using /ingestFromUri (#10480)
Clean up v0_deprecated batch ingestion jobs (#10532)
Deprecate kafka 0.9 support (#10522)
Reduce timeout for codecov and not fail the job in any case (#10547)
Fix DataTableV3 serde bug for empty array (#10583)
Do not record operator stats when tracing is enabled (#10447)
Forward auth token for logger APIs from controller to other controllers and brokers (#10590)
Bug fix: Partial upsert default strategy is null (#10610)
Fix flaky test caused by EV check during table creation (#10616)
Fix withDissabledTrue typo (#10624)
Cleanup unnecessary mailbox id ser/de (#10629)
no error metric for queries where all segments are pruned (#10589)
bug fix: to keep QueryParser thread safe when handling many read requests on class RealtimeLuceneTextIndex (#10620)
Fix static DictionaryIndexConfig.DEFAULT_OFFHEAP being actually onheap (#10632)
10567: [cleanup pinot-integration-test-base], clean query generations and some other refactoring. (#10648)
Fixes backward incompatibility with SegmentGenerationJobSpec for segment push job runners (#10645)
Bug fix to get the toSegments list correctly (#10659)
10661: Fix for failing numeric comparison in where clause for IllegalStateException. (#10662)
Fixes partial upsert not reflecting multiple comparison column values (#10693)
Fix Bug in Reporting Timer Value for Min Consuming Freshness (#10690)
Fix typo of rowSize -> columnSize (#10699)
update segment target tier before table rebalance (#10695)
Fix a bug in star-tree filter operator which can incorrectly filter documents (#10707)
Enhance the instrumentation for a corner case where the query doesn't go through DocIdSetOp (#10729)
bug fix: add missing properties when edit instance config (#10741)
Making segmentMapper do the init and cleanup of RecordReader (#10874)
Fix githubEvents table for quickstart recipes (#10716)
Minor Realtime Segment Commit Upload Improvements (#10725)
Return 503 for all interrupted queries. Refactor the query killing code. (#10683)
Add decoder initialization error to the server's error cache (#10773)
bug fix: add @JsonProperty to SegmentAssignmentConfig (#10759)
ensure we wait the full no query timeout before shutting down (#10784)
Clean up KLL functions with deprecated convention (#10795)
Redefine the semantics of SEGMENT_STREAMED_DOWNLOAD_UNTAR_FAILURES metric to count individual segment fetch failures. (#10777)
fix exception during exchange routing that causes a stuck pipeline (#10802)
[bugfix] fix floating point and integral type backward incompatible issue (#10650)
[pinot-core] Start consumption after creating segment data manager (#11227)
Fix IndexOutOfBoundException in filtered aggregation group-by (#11231)
Fix null pointer exception in segment debug endpoint #11228
Clean up RangeIndexBasedFilterOperator. (#11219)
Fix the escape/unescape issue for property value in metadata (#11223)
Fix a bug in the order by comparator (#10818)
Keeps nullness attributes of merged in comparison column values (#10704)
Add required JSON annotation in H3IndexResolution (#10792)
Fix a bug in SELECT DISTINCT ORDER BY. (#10827)
jsonPathString should return null instead of string literal "null" (#10855)
Bug Fix: Segment Purger cannot purge old segments after schema evolution (#10869)
Fix #10713 by giving metainfo more priority than config (#10851)
Close PinotFS after Data Manager Shutdowns (#10888)
bump awssdk version for a bugfix on http conn leakage (#10898)
Fix MultiNodesOfflineClusterIntegrationTest.testServerHardFailure() (#10909)
Fix a bug in SELECT DISTINCT ORDER BY LIMIT. (#10887)
Fix an integer overflow bug. (#10940)
Return true when _resultSet is not null (#10899)
Fixing table name extraction for lateral join queries (#10933)
Fix casting when prefetching mmap'd segment larger than 2GB (#10936)
Null check before closing reader (#10954)
Fixes SQL wildcard escaping in LIKE queries (#10897)
[Clean up] Do not count DISTINCT as aggregation (#10985)
do not re-add lucene readers to queue if segment is destroyed #10989
Message batch ingestion lag fix (#10983)
Fix a typo in snapshot lock (#11007)
When extracting root-level field name for complex type handling, use the whole delimiter (#11005)
update jersey to fix Denial of Service (DoS) (#11021)
Update getTenantInstances call for controller and separate POST operations on it (#10993)
update freemarker to fix Server-side Template Injection (#11019)
format double 0 properly to compare with h2 results (#11049)
Fix double-checked locking in ConnectionFactory (#11014)
Remove presto-pinot-driver and pinot-java-client-jdk8 module (#11051)
Make RequestUtils always return a string array when getTableNames (#11069)
Fix BOOL_AND and BOOL_OR result type (#11033)
[cleanup] Consolidate some query and controller/broker methods in integration tests (#11064)
Fix grpc regression on multi-stage engine (#11086)
Delete an obsolete TODO. (#11080)
Minor fix on AddTableCommand.toString() (#11082)
Allow using Lucene text indexes on mutable MV columns. (#11093)
Allow offloading multiple segments from same table in parallel (#11107)
Added serviceAccount to minion-stateless (#11095)
Bug fix: TableUpsertMetadataManager is null (#11129)
Fix reload bug (#11131)
Allow extra aggregation types in RealtimeToOfflineSegmentsTask (#10982)
Fix a bug when using range index to solve EQ predicate (#11146)
Sanitise API inputs used as file path variables (#11132)
Fix NPE when nested query doesn't have gapfill (#11155)
Fix the NPE when query response error stream is null (#11154)
Make interface methods non private, for java 8 compatibility (#11164)
Increment nextDocId even if geo indexing fails (#11158)
Fix the issue of consuming segment entering ERROR state due to stream connection errors (#11166)
In TableRebalancer, remove instance partitions only when reassigning instances (#11169)
Remove JDK 8 unsupported code (#11176)
Fix compat test by adding -am flag to build pinot-integration-tests (#11181)
dont duplicate register scalar function in CalciteSchema (#11190)
Fix the storage quota check for metadata push (#11193)
Delete filtering NULL support dead code paths. (#11198)
[bugfix] Do not move real-time segments to working dir on restart (#11226)
Fix a bug in ExpressionScanDocIdIterator for multi-value. (#11253)
Exclude NULLs when PredicateEvaluator::isAlwaysTrue is true. (#11261)
UI: fix sql query options separator (#10770)
Fix a NullPointerException bug in ScalarTransformFunctionWrapper. (#11309)
[refactor] improve disk read for partial upsert handler (#10927)
Fix the wrong query time when the response is empty (#11349)
getMessageAtIndex should actually return the value in the streamMessage for compatibility (#11355)
Remove presto jdk8 related dependencies (#11285)
Remove special routing handling for multiple consuming segments (#11371)
Properly handle shutdown of TableDataManager (#11380)
Fixing the stale pinot ServerInstance in _tableTenantServersMap (#11386)
Fix the thread safety issue for mutable forward index (#11392)
Fix RawStringDistinctExecutor integer overflow (#11403)
[logging] fix consume rate logging bug to respect 1 minute threshold (#11421)

0.12.0

Multi-Stage Query Engine

New join semantics support

Left join (#9466)
In-equi join (#9448)
Full join (#9907)
Right join (#9907)
Semi join (#9367)
Using keyword (#9373)

New SQL semantics support

Having (#9274)
Order by (#9279)
In/NotIn clause (#9374)
Cast (#9384)
Like/RegexpLike (#9654)
Range predicate (#9445)

Performance enhancement

Thread safe query planning (#9344)
Partial query execution and round robin scheduling (#9753)
Improve data table serde (#9731)

Major updates

Force commit consuming segments by @sajjad-moradi in #9197
add a freshness based consumption status checker by @jadami10 in #9244
Add metrics to track controller segment download and upload requests in progress by @gviedma in #9258
Adding endpoint to download local log files for each component by @xiangfu0 in #9259
[Feature] Add an option to search input files recursively in ingestion job. The default is set to true to be backward compatible. by @61yao in #9265
add query cancel APIs on controller backed by those on brokers by @klsince in #9276
Add Spark Job Launcher tool by @KKcorps in #9288
Enable Consistent Data Push for Standalone Segment Push Job Runners by @yuanbenson in #9295
Allow server to directly return the final aggregation result by @Jackie-Jiang in #9304
TierBasedSegmentDirectoryLoader to keep segments in multi-datadir by @klsince in #9306
Adaptive Server Selection by @vvivekiyer in #9311
[Feature] Support IsDistinctFrom and IsNotDistinctFrom by @61yao in #9312
Allow ingestion of errored records with incorrect datatype by @KKcorps in #9320
Allow setting custom time boundary for hybrid table queries by @saurabhd336 in #9356
skip late cron job with max allowed delay by @klsince in #9372
Do not allow implicit cast for BOOLEAN and TIMESTAMP by @Jackie-Jiang in #9385
Add missing properties in CSV plugin by @KKcorps in #9399
set MDC so that one can route minion task logs to separate files cleanly by @klsince in #9400
Add a new API to fix segment date time in metadata by @KKcorps in #9413
Update get bytes to return raw bytes of string and support getBytesMV by @61yao in #9441
Exposing consumer's record lag in /consumingSegmentsInfo by @navina in #9515
Do not create dictionary for high-cardinality columns by @KKcorps in #9527
get task runtime configs tracked in Helix by @klsince in #9540
Add more options to json index by @Jackie-Jiang in #9543
add SegmentTierAssigner and refine restful APIs to get segment tier info by @klsince in #9598
Add segment level debug API by @saurabhd336 in #9609
Add record availability lag for Kafka connector by @navina in #9621
notify servers that need to move segments to new tiers via SegmentReloadMessage by @klsince in #9624
Allow to configure multi-datadirs as instance configs and a Quickstart example about them by @klsince in #9705
Customize stopword for Lucene Index by @jasperjiaguo in #9708
Add memory optimized dimension table by @KKcorps in #9802
ADLS file system upgrade by @xiangfu0 in #9855
Added Delete Schema/Table pinot admin commands by @bagipriyank in #9857
Adding new ADLSPinotFS auth type: DEFAULT by @xiangfu0 in #9860
Add rate limit to Kinesis requests by @KKcorps in #9863
Adding configs for zk client timeout by @xiangfu0 in #9975

Other features/changes

Show most recent scheduling errors by @satishwaghela in #9161
Do not use aggregation result for distinct query in IntermediateResultsBlock by @Jackie-Jiang in #9262
Emit metrics for ratio of actual consumption rate to rate limit in real-time tables by @sajjad-moradi in #9201
add metrics entry offlineTableCount by @walterddr in #9270
refine query cancel resp msg by @klsince in #9242
add @ManualAuthorization annotation for non-standard endpoints by @apucher in #9252
Optimize ser/de to avoid using output stream by @Jackie-Jiang in #9278
Add Support for Covariance Function by @SabrinaZhaozyf in #9236
Throw an exception when MV columns are present in the order-by expression list in selection order-by only queries by @somandal in #9078
Improve server query cancellation and timeout checking during execution by @jasperjiaguo in #9286
Add capabilities to ingest from another stream without disabling the real-time table by @sajjad-moradi in #9289
Add minMaxInvalid flag to avoid unnecessary needPreprocess by @npawar in #9238
Add array cardinality function by @walterddr in #9300
Add support for custom null values in CSV record reader by @KKcorps in #9318
Infer parquet reader type based on file metadata by @saurabhd336 in #9294
Add Support for Cast Function on MV Columns by @SabrinaZhaozyf in #9296
[Feature] Not Operator Transformation by @61yao in #9330
Handle null string in CSV decoder by @KKcorps in #9340
[Feature] Not scalar function by @61yao in #9338
Add support for EXTRACT syntax and converts it to appropriate Pinot expression by @tanmesh in #9184
Add support for Auth in controller requests in java query client by @KKcorps in #9230
delete all related minion task metadata when deleting a table by @zhtaoxiang in #9339
BloomFilterRule should only recommend for supported column type by @yuanbenson in #9364
Support all the types in ParquetNativeRecordReader by @xiangfu0 in #9352
Improve segment name check in metadata push by @zhtaoxiang in #9359
Allow expression transformer to continue on error by @xiangfu0 in #9376
Enhance and filter predicate evaluation efficiency by @jasperjiaguo in #9336
Deprecate instanceId Config For Broker/Minion Specific Configs by @ankitsultana in #9308
Optimize combine operator to fully utilize threads by @Jackie-Jiang in #9387
Terminate the query after plan generation if timeout by @jasperjiaguo in #9386
[Feature] Support Coalesce for Column Names by @61yao in #9327
Disable logging for interrupted exceptions in kinesis by @KKcorps in #9405
Benchmark thread cpu time by @jasperjiaguo in #9408
Use ISODateTimeFormat as default for SIMPLE_DATE_FORMAT by @KKcorps in #9378
Extract the common logic for upsert metadata manager by @Jackie-Jiang in #9435
Make minion task metadata manager methods more generic by @saurabhd336 in #9436
Always pass clientId to kafka's consumer properties by @navina in #9444
Refine IndexHandler methods a bit to make them reentrant by @klsince in #9440
use MinionEventObserver to track finer grained task progress status on worker by @klsince in #9432
Allow spaces in input file paths by @KKcorps in #9426
Add support for gracefully handling the errors while transformations by @KKcorps in #9377
Cache Deleted Segment Names in Server to Avoid SegmentMissingError by @ankitsultana in #9423
Handle Invalid timestamps by @KKcorps in #9355
refine minion worker event observer to track finer grained progress for tasks by @klsince in #9449
spark-connector should use v2/brokers endpoint by @itschrispeck in #9451
Remove netty server query support from presto-pinot-driver to remove pinot-core and pinot-segment-local dependencies by @xiangfu0 in #9455
Adaptive Server Selection: Address pending review comments by @vvivekiyer in #9462
track progress from within segment processor framework by @klsince in #9457
Decouple ser/de from DataTable by @Jackie-Jiang in #9468
collect file info like mtime, length while listing files for free by @klsince in #9466
Extract record keys, headers and metadata from Stream sources by @navina in #9224
[pinot-spark-connector] Bump spark connector max inbound message size by @cbalci in #9475
refine the minion task progress api a bit by @klsince in #9482
add parsing for AT TIME ZONE by @agavra in #9477
Eliminate explosion of metrics due to gapfill queries by @elonazoulay in #9490
ForwardIndexHandler: Change compressionType during segmentReload by @vvivekiyer in #9454
Introduce Segment AssignmentStrategy Interface by @GSharayu in #9309
Add query interruption flag check to broker groupby reduction by @jasperjiaguo in #9499
adding optional client payload by @walterddr in #9465
[feature] distinct from scalar functions by @61yao in #9486
Check data table version on server only for null handling by @Jackie-Jiang in #9508
Add docId and column name to segment read exception by @KKcorps in #9512
Sort scanning based operators by cardinality in AndDocIdSet evaluation by @jasperjiaguo in #9420
Do not fail CI when codecov upload fails by @Jackie-Jiang in #9522
[Upsert] persist validDocsIndex snapshot for Pinot upsert optimization by @deemoliu in #9062
broker filter by @dongxiaoman in #9391
[feature] coalesce scalar by @61yao in #9487
[GHA] add cache timeout by @walterddr in #9524
Optimize PinotHelixResourceManager.hasTable() by @Jackie-Jiang in #9526
Include exception when upsert metadata manager cannot be created by @Jackie-Jiang in #9532
allow to config task expire time by @klsince in #9530
expose task finish time via debug API by @klsince in #9534
Remove the wrong warning log in KafkaPartitionLevelConsumer by @Jackie-Jiang in #9536
starting http server for minion worker conditionally by @klsince in #9542
Make StreamMessage generic and a bug fix by @vvivekiyer in #9544
Improve primary key serialization performance by @KKcorps in #9538
[Upsert] Skip removing upsert metadata when shutting down the server by @Jackie-Jiang in #9551
add array element at function by @walterddr in #9554
Handle the case when enableNullHandling is true and an aggregation function is used w/ a column that has an empty null bitmap by @nizarhejazi in #9566
Support segment storage format without forward index by @somandal in #9333
Adding SegmentNameGenerator type inference if not explicitly set in config by @timsants in #9550
add version information to JMX metrics & component logs by @agavra in #9578
remove unused RecordTransform/RecordFilter classes by @agavra in #9607
Support rewriting forward index upon changing compression type for existing raw MV column by @vvivekiyer in #9510
Support Avro's Fixed data type by @sajjad-moradi in #9642
[feature] [kubernetes] add loadBalancerSourceRanges to service-external.yaml for controller and broker by @jameskelleher in #9494
Limit up to 10 unavailable segments to be printed in the query exception by @Jackie-Jiang in #9617
remove more unused filter code by @agavra in #9620
Do not cache record reader in segment by @Jackie-Jiang in #9604
make first part of user agent header configurable by @rino-kadijk in #9471
optimize order by sorted ASC, unsorted and order by DESC cases by @gortiz in #8979
Enhance cluster config update API to handle non-string values properly by @Jackie-Jiang in #9635
Reverts recommender REST API back to PUT (reverts PR #9326) by @yuanbenson in #9638
Remove invalid pruner names from server config by @Jackie-Jiang in #9646
Using usageHelp instead of deprecated help in picocli commands by @navina in #9608
Handle unique query id on server by @Jackie-Jiang in #9648
stateless group marker missing several by @walterddr in #9673
Support reloading consuming segment using force commit by @Jackie-Jiang in #9640
Improve star-tree to use star-node when the predicate matches all the non-star nodes by @Jackie-Jiang in #9667
add FetchPlanner interface to decide what column index to prefetch by @klsince in #9668
Improve star-tree traversal using ArrayDeque by @Jackie-Jiang in #9688
Handle errors in combine operator by @Jackie-Jiang in #9689
return different error code if old version is not on master by @SabrinaZhaozyf in #9686
Support creating dictionary at runtime for an existing column by @vvivekiyer in #9678
check mutable segment explicitly instead of checking existence of indexDir by @klsince in #9718
Remove leftover file before downloading segmentTar by @npawar in #9719
add index key and size map to segment metadata by @walterddr in #9712
Use ideal state as source of truth for segment existence by @Jackie-Jiang in #9735
Close Filesystem on exit with Minion Tasks by @KKcorps in #9681
render the tables list even as the table sizes are loading by @jadami10 in #9741
Add Support for IP Address Function by @SabrinaZhaozyf in #9501
bubble up error messages from broker by @agavra in #9754
Add support to disable the forward index for existing columns by @somandal in #9740
show table metadata info in aggregate index size form by @walterddr in #9733
Preprocess immutable segments from REALTIME table conditionally when loading them by @klsince in #9772
revert default timeout nano change in QueryConfig by @agavra in #9790
AdaptiveServerSelection: Update stats for servers that have not responded by @vvivekiyer in #9801
Add null value index for default column by @KKcorps in #9777
[MergeRollupTask] include partition info into segment name by @zhtaoxiang in #9815
Adding a consumer lag as metric via a periodic task in controller by @navina in #9800
Deserialize Hyperloglog objects more optimally by @priyen in #9749
Download offline segments from peers by @wirybeaver in #9710
Thread Level Usage Accounting and Query Killing on Server by @jasperjiaguo in #9727
Add max merger and min mergers for partial upsert by @deemoliu in #9665
#9518 added pinot helm 0.2.6 with secure version pinot 0.11.0 by @bagipriyank in #9519
Combine the read access for replication config by @snleee in #9849
add v1 ingress in helm chart by @jhisse in #9862
Optimize AdaptiveServerSelection for replicaGroup based routing by @vvivekiyer in #9803
Do not sort the instances in InstancePartitions by @Jackie-Jiang in #9866
Merge new columns in existing record with default merge strategy by @navina in #9851
Support disabling dictionary at runtime for an existing column by @vvivekiyer in #9868
support BOOL_AND and BOOL_OR aggregate functions by @agavra in #9848
Use Pulsar AdminClient to delete unused subscriptions by @navina in #9859
add table sort function for table size by @jadami10 in #9844
In Kafka consumer, seek offset only when needed by @Jackie-Jiang in #9896
fallback if no broker found for the specified table name by @klsince in #9914
Allow liveness check during server shutting down by @Jackie-Jiang in #9915
Allow segment upload via Metadata in MergeRollup Minion task by @KKcorps in #9825
Add back the Helix workaround for missing IS change by @Jackie-Jiang in #9921
Allow uploading real-time segments via CLI by @KKcorps in #9861
Add capability to update and delete table config via CLI by @KKcorps in #9852
default to TAR if push mode is not set by @klsince in #9935
load startree index via segment reader interface by @klsince in #9828
Allow collections for MV transform functions by @saurabhd336 in #9908
Construct new IndexLoadingConfig when loading completed real-time segments by @vvivekiyer in #9938
Make GET /tableConfigs backwards compatible in case schema does not match raw table name by @timsants in #9922
feat: add compressed file support for ORCRecordReader by @etolbakov in #9884
Add Variance and Standard Deviation Aggregation Functions by @snleee in #9910
enable MergeRollupTask on real-time tables by @zhtaoxiang in #9890
Update cardinality when converting raw column to dict based by @vvivekiyer in #9875
Add back auth token for UploadSegmentCommand by @timsants in #9960
Improving gz support for avro record readers by @snleee in #9951
Default column handling of noForwardIndex and regeneration of forward index on reload path by @somandal in #9810
[Feature] Support coalesce literal by @61yao in #9958
Ability to initialize S3PinotFs with serverSideEncryption properties when passing client directly by @npawar in #9988
handle pending minion tasks properly when getting the task progress status by @klsince in #9911
allow gauge stored in metric registry to be updated by @zhtaoxiang in #9961
support case-insensitive query options in SET syntax by @agavra in #9912
pin versions-maven-plugin to 2.13.0 by @jadami10 in #9993
Pulsar Connection handler should not spin up a consumer / reader by @navina in #9893
Handle in-memory segment metadata for index checking by @Jackie-Jiang in #10017
Support the cross-account access using IAM role for S3 PinotFS by @snleee in #10009
report minion task metadata last update time as metric by @zhtaoxiang in #9954
support SKEWNESS and KURTOSIS aggregates by @agavra in #10021
emit minion task generation time and error metrics by @zhtaoxiang in #10026
Use the same default time value for all replicas by @Jackie-Jiang in #10029
Reduce the number of segments to wait for convergence when rebalancing by @saurabhd336 in #10028

UI Update & Improvement

Allow hiding query console tab based on cluster config (#9261)
Allow hiding pinot broker swagger UI by config (#9343)
Add UI to show fine-grained minion task progress (#9488)
Add UI to track segment reload progress (#9521)
Show minion task runtime config details in UI (#9652)
Redefine the segment status (#9699)
Show an option to reload the segments during edit schema (#9762)
Load schema UI async (#9781)
Fix blank screen when redirect to unknown app route (#9888)

Library version upgrade

Upgrade h3 lib from 3.7.2 to 4.0.0 to lower glibc requirement (#9335)
Upgrade ZK version to 3.6.3 (#9612)
Upgrade snakeyaml from 1.30 to 1.33 (#9464)
Upgrade RoaringBitmap from 0.9.28 to 0.9.35 (#9730)
Upgrade spotless-maven-plugin from 2.9.0 to 2.28.0 (#9877)
Upgrade decode-uri-component from 0.2.0 to 0.2.2 (#9941)

Bug fixes

Fix bug with logging request headers by @abhs50 in #9247
Fix a UT that only shows up on host with more cores by @klsince in #9257
Fix message count by @Jackie-Jiang in #9271
Fix issue with auth AccessType in Schema REST endpoints by @sajjad-moradi in #9293
Fix PerfBenchmarkRunner to skip the tmp dir by @Jackie-Jiang in #9298
Fix thrift deserializer thread safety issue by @saurabhd336 in #9299
Fix transformation to string for BOOLEAN and TIMESTAMP by @Jackie-Jiang in #9287
[hotfix] Add VARBINARY column to switch case branch by @walterddr in #9313
Fix annotation for "/recommender" endpoint by @sajjad-moradi in #9326
Fix jdk8 build issue due to missing pom dependency by @somandal in #9351
Fix pom to use pinot-common-jdk8 for pinot-connector jdk8 java client by @somandal in #9353
Fix log to reflect job type by @KKcorps in #9381
[Bugfix] schema update bug fix by @MeihanLi in #9382
fix histogram null pointer exception by @jasperjiaguo in #9428
Fix thread safety issues with SDF (WIP) by @saurabhd336 in #9425
Bug fix: failure status in ingestion jobs doesn't reflect in exit code by @KKcorps in #9410
Fix skip segment logic in MinMaxValueBasedSelectionOrderByCombineOperator by @Jackie-Jiang in #9434
Fix the bug of hybrid table request using the same request id by @Jackie-Jiang in #9443
Fix the range check for range index on raw column by @Jackie-Jiang in #9453
Fix Data-Correctness Bug in GTE Comparison in BinaryOperatorTransformFunction by @ankitsultana in #9461
extend PinotFS impls with listFilesWithMetadata and some bugfix by @klsince in #9478
fix null transform bound check by @walterddr in #9495
Fix JsonExtractScalar when no value is extracted by @Jackie-Jiang in #9500
Fix AddTable for real-time tables by @npawar in #9506
Fix some type convert scalar functions by @Jackie-Jiang in #9509
fix spammy logs for ConfluentSchemaRegistryRealtimeClusterIntegrationTest [MINOR] by @agavra in #9516
Fix timestamp index on column of preserved key by @Jackie-Jiang in #9533
Fix record extractor when ByteBuffer can be reused by @Jackie-Jiang in #9549
Fix explain plan ALL_SEGMENTS_PRUNED_ON_SERVER node by @somandal in #9572
Fix time validation when data type needs to be converted by @Jackie-Jiang in #9569
UI: fix incorrect task finish time by @jayeshchoudhary in #9557
Fix the bug where uploaded segments cannot be deleted on real-time table by @Jackie-Jiang in #9579
[bugfix] correct the dir for building segments in FileIngestionHelper by @zhtaoxiang in #9591
Fix NonAggregationGroupByToDistinctQueryRewriter by @Jackie-Jiang in #9605
fix distinct result return by @walterddr in #9582
Fix GcsPinotFS by @lfernandez93 in #9556
fix DataSchema thread-safety issue by @walterddr in #9619
Bug fix: Add missing table config fetch for /tableConfigs list all by @timsants in #9603
Fix re-uploading segment when the previous upload failed by @Jackie-Jiang in #9631
Fix string split which should be on whole separator by @Jackie-Jiang in #9650
Fix server request sent delay to be non-negative by @Jackie-Jiang in #9656
bugfix: Add missing BIG_DECIMAL support for GenericRow serde by @timsants in #9661
Fix extra restlet resource test which should be stateless by @Jackie-Jiang in #9674
AdaptiveServerSelection: Fix timer by @vvivekiyer in #9697
fix PinotVersion to be compatible with prometheus by @agavra in #9701
Fix the setup for ControllerTest shared cluster by @Jackie-Jiang in #9704
[hotfix] groovy class cache leak by @walterddr in #9716
Fix TIMESTAMP index handling in SegmentMapper by @Jackie-Jiang in #9722
Fix the server admin endpoint cache to reflect the config changes by @Jackie-Jiang in #9734
[bugfix] fix case-when issue by @walterddr in #9702
[bugfix] Let StartControllerCommand also handle "pinot.zk.server", "pinot.cluster.name" in default conf/pinot-controller.conf by @thangnd197 in #9739
[hotfix] semi-join opt by @walterddr in #9779
Fixing the rebalance issue for real-time table with tier by @snleee in #9780
UI: show segment debug details when segment is in bad state by @jayeshchoudhary in #9700
Fix the replication in segment assignment strategy by @GSharayu in #9816
fix potential fd leakage for SegmentProcessorFramework by @klsince in #9797
Fix NPE when reading ZK address from controller config by @Jackie-Jiang in #9751
have query table list show search bar; fix InstancesTables filter by @jadami10 in #9742
[pinot-spark-connector] Fix empty data table handling in GRPC reader by @cbalci in #9837
[bugfix] fix mergeRollupTask metrics by @zhtaoxiang in #9864
Bug fix: Get correct primary key count by @KKcorps in #9876
Fix issues for real-time table reload by @Jackie-Jiang in #9885
UI: fix segment status color remains same in different table page by @jayeshchoudhary in #9891
Fix bloom filter creation on BYTES by @Jackie-Jiang in #9898
[hotfix] broker selection not using table name by @walterddr in #9902
Fix race condition when 2 segment uploads occur for the same segment by @jackjlli in #9905
fix timezone_hour/timezone_minute functions by @agavra in #9949
[Bugfix] Move brokerId extraction to BaseBrokerStarter by @jackjlli in #9965
Fix ser/de for StringLongPair by @Jackie-Jiang in #9985
bugfix dir check for HadoopPinotFS.copyFromLocalDir by @klsince in #9979
Bugfix: Use correct exception import in TableRebalancer by @mayankshriv in #10025
Fix NPE in AbstractMetrics From Race Condition by @ankitsultana in #10022