Concepts

Explore the fundamental concepts of Apache Pinot™ as a distributed OLAP database.

Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:

  • Storing data in columnar form to support high-performance scanning

  • Sharding of data to scale both storage and computation

  • A distributed architecture designed to scale capacity linearly

  • A tabular data model read by SQL queries

To learn about Pinot components, terminology, and gain a conceptual understanding of how data is stored in Pinot, review the following sections:

  • Pinot storage model

  • Pinot architecture

  • Pinot components

General

This page has a collection of frequently asked questions of a general nature with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.

How does Apache Pinot use deep storage?

When data is pushed to Apache Pinot, Pinot makes a backup copy of the data and stores it on the configured deep storage (S3/GCP/ADLS/NFS/etc.). This copy is stored as tar.gz Pinot segments. Note that Pinot servers also keep an (untarred) copy of the segments on their local disk, for performance reasons.

How does Pinot use Zookeeper?

Pinot uses Apache Helix for cluster management, which in turn is built on top of Zookeeper. Helix uses Zookeeper to store the cluster state, including Ideal State, External View, Participants, and so on. Pinot also uses Zookeeper to store information such as Table configurations, schemas, Segment Metadata, and so on.

Why am I getting "Could not find or load class" error when running Quickstart using 0.8.0 release?

Check the JDK version you are using. You may be getting this error if you are using a JDK version older than the one the current Pinot binary release was built with. If so, you have two options: switch to the same JDK release that Pinot was built with, or download the source code for the Pinot release and build it locally.

How to change TimeZone when running Pinot?

Pinot uses the local timezone by default. To change the timezone, set the pinot.timezone value in the .conf config file. It is set once for all Pinot components (Controller, Broker, Server, Minion). See the following sample configuration:

pinot.timezone=UTC

0.9.3

Summary

This is a bug-fixing release that contains:

  • Update Log4j to 2.17.0 to address CVE-2021-45105 (#7933)

The release is based on the release 0.9.2 with the following cherry-picks:

  • 93c0404

Multi-stage query

Learn more about multi-stage query engine and how to troubleshoot issues.

The general explanation of the multi-stage query engine is provided in the Multi-stage query engine reference documentation. This section provides a deep dive into the multi-stage query engine. Most of the concepts explained here are related to the internals of the multi-stage query engine and users don't need to know about them in order to write queries. However, understanding these concepts can help you to take advantage of the engine's capabilities and to troubleshoot issues.

Query Syntax

Query Pinot using supported syntax.

0.12.1

Summary

This is a bug-fixing release that contains:

  • use legacy case-when format (https://github.com/apache/pinot/pull/10291)

The release is based on the release 0.12.0 with the following cherry-picks:

  • 6f5a8fc883e1d576117fdb92f09103067672aaca

0.1.0

Frequently Asked Questions (FAQs)

This page lists pages with frequently asked questions with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.

  • General

  • Pinot On Kubernetes FAQ

  • Ingestion FAQ

  • Query FAQ

  • Operations FAQ

Running on public clouds

This page links to multiple quick start guides for deploying Pinot to different public cloud providers.

These quickstart guides show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.

  • Running on Azure

  • Running on GCP

  • Running on AWS

Components

Discover the core components of Apache Pinot, enabling efficient data processing and analytics. Unleash the power of Pinot's building blocks for high-performance data-driven applications.

Apache Pinot™ is a database designed to deliver highly concurrent, ultra-low-latency queries on large datasets through a set of common data model abstractions. Delivering on these goals requires several foundational architectural commitments, including:

  • Storing data in columnar form to support high-performance scanning

  • Sharding of data to scale both storage and computation

  • A distributed architecture designed to scale capacity linearly

  • A tabular data model read by SQL queries

Components

Learn about the major components and logical abstractions used in Pinot.

Operator reference

Developer reference

Segment retention

In this Apache Pinot concepts guide, we'll learn how segment retention works.

Segments in Pinot tables have a retention time, after which the segments are deleted. Typically, offline tables retain segments for a longer period of time than real-time tables.

The removal of segments is done by the retention manager. By default, the retention manager runs once every 6 hours.

The retention manager purges two types of segments:

  • Expired segments: Segments whose end time has exceeded the retention period.

  • Replaced segments: Segments that have been replaced as part of the merge rollup task.

There are a couple of scenarios where segments in offline tables won't be purged:

  • If the segment doesn't have an end time. This would happen if the segment doesn't contain a time column.

  • If the segment's table has a segmentIngestionType of REFRESH.

If the retention period isn't specified, segments aren't purged from tables.
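
For reference, the retention period is typically set in the table config's segmentsConfig; a minimal sketch might look like the following (keys shown for an offline table):

{
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "30"
  }
}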

The retention manager initially moves these segments into a Deleted Segments area, from where they will eventually be permanently removed.

Controller

Discover the controller component of Apache Pinot, enabling efficient data and query management.

The Pinot controller schedules and reschedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, the Pinot controller schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (for example, ingest of real-time tables and offline tables). The Pinot controller can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.

The Pinot controller is responsible for the following:

  • Maintaining global metadata (e.g., configs and schemas) of the system with the help of Zookeeper which is used as the persistent metadata store.

  • Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions)

  • Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.

  • Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.

  • Serving endpoints for segment uploads, which are used in offline data pushes. They are responsible for initializing real-time consumption and coordination of persisting real-time segments into the segment store periodically.

  • Undertaking other management activities such as managing segment retention and running validations.

For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or ADLS.

Running the periodic task manually

The controller runs several periodic tasks in the background to perform activities such as management and validation. Each periodic task has its own configuration defining the run frequency and default frequency. Each task runs on its own schedule and can also be triggered manually if needed. The task runs on the lead controller for each table.

For periodic task configuration details, see the Controller configuration reference.

Use the GET /periodictask/names API to fetch the names of all the periodic tasks running on your Pinot cluster.

curl -X GET "http://localhost:9000/periodictask/names" -H "accept: application/json"

[
  "RetentionManager",
  "OfflineSegmentIntervalChecker",
  "RealtimeSegmentValidationManager",
  "BrokerResourceValidationManager",
  "SegmentStatusChecker",
  "SegmentRelocator",
  "StaleInstancesCleanupTask",
  "TaskMetricsEmitter"
]

To manually run a named periodic task, use the GET /periodictask/run API:

curl -X GET "http://localhost:9000/periodictask/run?taskname=SegmentStatusChecker&tableName=jsontypetable&type=OFFLINE" -H "accept: application/json"

{
  "Log Request Id": "api-09630c07",
  "Controllers notified": true
}

The Log Request Id (api-09630c07) in the response can be used to search through the pinot-controller log file for entries related to the execution of the periodic task that was manually run.

If tableName (and its type OFFLINE or REALTIME) is not provided, the task will run against all tables.

Starting a controller

Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a controller:

docker run \
    --network=pinot-demo \
    --name pinot-controller \
    -p 9000:9000 \
    -d ${PINOT_IMAGE} StartController \
    -zkAddress pinot-zookeeper:2181

bin/pinot-admin.sh StartController \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -controllerPort 9000

Introduction

Apache Pinot is a real-time distributed OLAP datastore purpose-built for low-latency, high-throughput analytics, and perfect for user-facing analytical workloads.

Apache Pinot™ is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch data sources (including, Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage).

We'd love to hear from you! Join us in our Slack channel to ask questions, troubleshoot, and share feedback.

Apache Pinot includes the following:

  • Ultra low-latency analytics even at extremely high throughput.

  • Columnar data store with several smart indexing and pre-aggregation techniques.

  • Scaling up and out with no upper bound.

  • Consistent performance based on the size of your cluster and an expected query per second (QPS) threshold.

It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.

User-facing real-time analytics

User-facing analytics refers to the analytical tools exposed to the end users of your product. In a user-facing analytics application, all users receive personalized analytics on their devices, resulting in hundreds of thousands of queries per second. Queries triggered by apps may grow quickly in proportion to the number of active users on the app, reaching millions of events per second. Data ingested into Pinot is immediately available for analytics, with latencies under one second.

User-facing real-time analytics requires the following:

  • Fresh data. The system needs to be able to ingest data in real time and make it available for querying, also in real time.

  • Support for high-velocity, highly dimensional event data from a wide range of actions and from multiple sources.

  • Low latency. Queries are triggered by end users interacting with apps, resulting in hundreds of thousands of queries per second with arbitrary patterns.

  • Reliability and high availability.

  • Scalability.

  • Low cost to serve.

Why Pinot?

Pinot is designed to execute OLAP queries with low latency. It works well where you need fast analytics, such as aggregations, on both mutable and immutable data.

User-facing, real-time analytics

Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications, such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a user-facing analytics app built with Pinot.

Real-time dashboards for business metrics

Pinot can perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. Connect various business intelligence (BI) tools such as Superset, Tableau, or PowerBI to visualize data in Pinot.

Enterprise business intelligence

For analysts and data scientists, Pinot works well as a highly-scalable data platform for business intelligence. Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.

Enterprise application development

For application developers, Pinot works well as an aggregate store that sources events from streaming data sources, such as Kafka, and makes it available for a query using SQL. You can also use Pinot to aggregate data across a microservice architecture into one easily queryable view of the domain.

Pinot tenants prevent any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent.

Get started

If you're new to Pinot, take a look at our Getting Started guide:

To start importing data into Pinot, see how to import batch and stream data:

To start querying data in Pinot, check out our Query guide:

Learn

For a conceptual overview that explains how Pinot works, check out the Concepts guide:

To understand the distributed systems architecture that explains Pinot's operating model, take a look at our basic architecture section:

Time boundary

Learn about time boundaries in hybrid tables.

A hybrid table is the combination of an offline table and a real-time table that share the same name.

When querying these tables, the Pinot broker decides which records to read from the offline table and which to read from the real-time table. It does this using the time boundary.

How is the time boundary determined?

The time boundary is determined by looking at the maximum end time of the offline segments and the segment ingestion frequency specified for the offline table.

If it's set to hourly, then:

timeBoundary = Maximum end time of offline segments - 1 hour

Otherwise:

timeBoundary = Maximum end time of offline segments - 1 day

It is possible to force the hybrid table to use max(all offline segments' end time) by calling the following API (v0.12.0+):

curl -X POST \
  "http://localhost:9000/tables/{tableName}/timeBoundary" \
  -H "accept: application/json"

Note that this will not automatically update the time boundary as more segments are added to the offline table; it must be called each time a segment with a more recent end time is uploaded to the offline table. You can revert back to using the derived time boundary by calling the following API:

curl -X DELETE \
  "http://localhost:9000/tables/{tableName}/timeBoundary" \
  -H "accept: application/json"

Querying

When a Pinot broker receives a query for a hybrid table, the broker sends a time boundary annotated version of the query to the offline and real-time tables.

For example, if we executed the following query:

SELECT count(*)
FROM events

The broker would send the following query to the offline table:

SELECT count(*)
FROM events_OFFLINE
WHERE timeColumn <= $timeBoundary

And the following query to the real-time table:

SELECT count(*)
FROM events_REALTIME
WHERE timeColumn > $timeBoundary

The results of the two queries are merged by the broker before being returned to the client.

Getting Started

This section contains quick start guides to help you get up and running with Pinot.

Running Pinot

To simplify the getting started experience, Apache Pinot™ ships with quick start guides that launch Pinot components in a single process and import pre-built datasets.

For a full list of these guides, see Quick Start Examples.

  • Running Pinot locally

  • Running Pinot in Docker

  • Running in Kubernetes

Deploy to a public cloud

  • Running on Azure

  • Running on GCP

  • Running on AWS

Data import examples

Getting data into Pinot is easy. Take a look at these two quick start guides, which will help you get up and running with sample data for offline and real-time tables.

  • Batch import example

  • Stream ingestion example

Server

Uncover the efficient data processing and storage capabilities of Apache Pinot's server component, optimizing performance for data-driven applications.

Pinot servers provide the primary storage for segments and perform the computation required to execute queries. A production Pinot cluster contains many servers. In general, the more servers, the more data the cluster can retain in tables, the lower latency the cluster can deliver on queries, and the more concurrent queries the cluster can process.

Servers are typically segregated into real-time and offline workloads, with "real-time" servers hosting only real-time tables, and "offline" servers hosting only offline tables. This is a ubiquitous operational convention, not a difference or an explicit configuration in the server process itself. There are two types of servers:

Offline

Offline servers are responsible for downloading segments from the segment store, to host and serve queries off. When a new segment is uploaded to the controller, the controller decides the servers (as many as the replication factor) that will host the new segment and notifies them to download the segment from the segment store. On receiving this notification, the servers download the segment file and load the segment, to serve queries off it.

Real-time

Real-time servers directly ingest from a real-time stream (such as Kafka or EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.

Pinot servers are modeled as Helix participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more Helix partitions of one or more helix resources (i.e. one or more segments of one or more tables).

Starting a server

Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a server:

Usage: StartServer
    -serverHost               <String>                      : Host name for controller. (required=false)
    -serverPort               <int>                         : Port number to start the server at. (required=false)
    -serverAdminPort          <int>                         : Port number to serve the server admin API at. (required=false)
    -dataDir                  <string>                      : Path to directory containing data. (required=false)
    -segmentDir               <string>                      : Path to directory containing segments. (required=false)
    -zkAddress                <http>                        : Http address of Zookeeper. (required=false)
    -clusterName              <String>                      : Pinot cluster name. (required=false)
    -configFileName           <Config File Name>            : Broker Starter Config file. (required=false)
    -help                                                   : Print this message. (required=false)

docker run \
    --network=pinot-demo \
    --name pinot-server \
    -d ${PINOT_IMAGE} StartServer \
    -zkAddress pinot-zookeeper:2181

bin/pinot-admin.sh StartServer \
    -zkAddress localhost:2181

0.9.1

Summary

This release fixes the major issue of CVE-2021-44228 and a major bug fix of the pinot admin exit code issue (#7798).

The release is based on the release 0.9.0 with the following cherry-picks:

  • e44d2e4

  • af2858a

FST index

The FST index supports regex queries on text and decreases the on-disk index size by 4-6 times.

  • Only supports regex queries

  • Only supported on stored or completed Pinot segments (no consuming segments).

  • Only supported on dictionary-encoded columns.

  • Works better for prefix queries

Note: Lucene is case-sensitive, so when querying FST-indexed columns, users need to take this into account. For example, SELECT * FROM T WHERE colA LIKE '%Value%', where colA has an FST index, will only return rows containing the string "Value" but not "value".

For more information on the FST construction and code, see the Lucene documentation.

Enable the FST index

To enable the FST index on a dictionary-encoded column, include the following configuration:

"fieldConfigList": [
  {
    "name": "text_col_1",
    "encodingType": "DICTIONARY",
    "indexType": "FST"
  }
]

The FST index generates one FST index file (.lucene.fst). If the inverted index is enabled, the FST index can further take advantage of it.

Case-insensitive FST index (IFST)

The case-insensitive FST index (IFST) provides the same functionality as the standard FST index but with case-insensitive matching. This eliminates the need to handle case sensitivity manually in queries.

  • Supports case-insensitive regex queries

  • Only supported on stored or completed Pinot segments (no consuming segments).

  • Only supported on dictionary-encoded columns.

  • Works better for prefix queries with case-insensitive matching

Enable the case-insensitive FST index

To enable the case-insensitive FST index on a dictionary-encoded column, include the following configuration:

{
  "fieldConfigList": [
    {
      "name": "notes",
      "encodingType": "DICTIONARY",
      "indexes": {
        "ifst": {
          "enabled": true
        }
      }
    }
  ]
}

The case-insensitive FST index generates one FST index file (.lucene.ifst) and provides case-insensitive matching for regex queries without requiring manual case handling in your queries.

For more information about enabling the FST index, see ways to enable indexes.

0.9.2

Summary

This is a bug-fixing release that contains:

  • Upgrade log4j to 2.16.0 to fix CVE-2021-45046 (#7903)

  • Upgrade swagger-ui to 3.23.11 to fix CVE-2019-17495 (#7902)

  • Fix the bug that RealtimeToOfflineTask failed to progress with large time bucket gaps (#7814).

The release is based on the release 0.9.1 with the following cherry-picks:

  • 9ed6498

  • 50e1613

  • 767aa8a

Multistage Lite Mode

Introduces the new Multistage Engine Lite Mode

MSE Lite Mode is included in Pinot 1.4 and is currently in Beta.

Multistage Engine (MSE) Lite Mode is a new query mode that aims to enable safe access to the MSE for all Pinot users. One of the risks with running regular MSE queries is that users can easily write queries that scan a lot of records or run significantly expensive operations. Such queries can impact the reliability of a tenant and create friction in onboarding new use cases. Lite Mode aims to address this problem. It is based on the observation that most users need access to advanced SQL features like Window Functions, Subqueries, etc., but aren't interested in scanning a lot of data or running fully Distributed Joins.

Design

MSE Lite Mode has the following key characteristics:

  • Users can still use all MSE query features like Window Functions, Subqueries, Joins, etc.

  • But the maximum number of rows returned by a Leaf Stage is capped at a user-configurable value. The default value is 100,000.

  • Query execution follows a scatter-gather paradigm, similar to the Single-stage Engine. This is different from regular MSE that uses shuffles across Pinot Servers.

  • Leaf stage(s) are run in the Servers, and all other operators are run using a single thread in the Broker.

Leaf Stage in a Multistage Engine query usually refers to Table Scan, an optional Project, an optional Filter and an optional Aggregate Plan Node.

At present, all joins in MSE Lite Mode are run in the Broker. This may change with the next release, since Colocated Joins can theoretically be run in the Servers.

Enabling Lite Mode

To use Lite Mode, you can use the following query options.

SET useMultistageEngine=true;
SET usePhysicalOptimizer=true;  -- enables the new Physical MSE Query Optimizer
SET useLiteMode=true;           -- enables Lite Mode

Range index

This page describes configuring the range index for Apache Pinot

Range indexing allows you to get better performance for queries that involve filtering over a range.

It would be useful for a query like the following:

SELECT COUNT(*)
FROM baseballStats
WHERE hits > 11

A range index is a variant of an inverted index, where instead of creating a mapping from values to columns, we create a mapping from a range of values to columns. You can use the range index by setting the following config in the table configuration:

{
    "tableIndexConfig": {
        "rangeIndexColumns": [
            "column_name",
            ...
        ],
        ...
    }
}

Range index is supported for dictionary encoded columns of any type as well as raw encoded columns of a numeric type. Note that the range index can also be used on a dictionary encoded time column using STRING type, since Pinot only supports datetime formats that are in lexicographical order.

A good rule of thumb is to use a range index when you want to apply range predicates on metric columns that have a very large number of unique values. Using an inverted index for such columns would create a very large index that is inefficient in terms of storage and performance.

Random + broadcast join strategy

In order to execute joins, Pinot creates virtual partitions at query time. The more general way to do so is to assign a random partition to each row of the table. Each partition is then assigned to a server, and the join is executed in a distributed manner.

This partition technique can be applied to one of the tables in the join, but not to both. Otherwise, the result wouldn't be correct as some of the pairs of rows would be lost and never joined. Therefore what Pinot does is partition one of the tables and broadcast the other one.

This technique is used by Pinot when no other technique (like semantic virtual partition or colocated joins) can be used. For example, on queries like:

SELECT A.col1, B.col2
FROM A
JOIN B
ON A.col2 > B.col3 or A.col4 < B.col4

As always, Pinot assumes that the right table is the smaller one, so that is the one that is broadcast. When this technique is used, the number of rows that are shuffled is upper-bounded by count(A) + count(B) * number of servers.

Join strategies

In order to execute a join, all the rows of the tables to be joined need to be in the same place. In classical databases like Postgres this is not a problem, as there is usually a single server (or all servers have all the data). But in distributed databases like Pinot, where rows of the tables are distributed across servers, data needs to be shuffled between servers (at least in the general case). This data shuffle is expensive and can be a bottleneck for the query performance.

The simplest way to execute the join would be to move all data onto a single server, as shown in the diagram below.

This approach may work for small tables, but it would not scale for large tables that do not fit into a single server. Pinot assumes this is going to be the common case, so it never uses this technique. It is shown here only to help understand the shuffling problem.

What Pinot does is to create virtual partitions at query time. These virtual partitions are created in such a way that Pinot can guarantee that rows that need to be joined are sent to the same server but at the same time it tries to minimize the amount of data that needs to be shuffled between servers.

There are several strategies Pinot can use to reduce data shuffle. Some of them are so effective that they can be used to execute the join without any data shuffle at all, but they are only applicable in some cases.

The strategies, in order of effectiveness, are:

  • Lookup joins

  • Colocated joins

  • Query time partition join

  • Random + broadcast join

These techniques are explained in more detail in their own pages. More join strategies will be added in the future. They are listed in the GitHub issue #14518.


Native text index

This page talks about native text indices and corresponding search functionality in Apache Pinot.

Deprecated

This index is deprecated and subject to removal after the 1.4.0 release. Please use the Lucene-based text index instead.

Native text index

Pinot supports text indexing and search by building Lucene indices as sidecars to the main Pinot segments. While this is a great technique, it essentially limits the avenues of optimizations that can be done for Pinot specific use cases of text search.

How is Pinot different?

Pinot, like any other database/OLAP engine, does not need to conform to the entire full text search domain-specific language (DSL) that is traditionally used by full-text search (FTS) engines like ElasticSearch and Solr. In traditional SQL text search use cases, the majority of text searches belong to one of three patterns: prefix wildcard queries (like pino*), postfix or suffix wildcard queries (like *inot), and term queries (like pinot).

Native text indices in Pinot

In Pinot, native text indices are built from the ground up. They use a custom text-indexing engine, coupled with Pinot's powerful inverted indices, to provide a fast text search experience.

The benefits are that native text indices are 80-120% faster than Lucene-based indices for the text search use cases mentioned above. They are also 40% smaller on disk.

Native text indices support real-time text search. For REALTIME tables, native text indices allow data to be indexed in memory in the text index, while concurrently supporting text searches on the same index.

Historically, most text indices depend on the in-memory text index being written to first and then sealed, before searches are possible. This limits the freshness of the search, being near-real-time at best.

Native text indices come with a custom in-memory text index, which allows for real-time indexing and search.

Searching Native Text Indices

The TEXT_CONTAINS function supports text search on native text indices.

SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, <search_expression>)

Examples:

SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, "foo.*")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, ".*bar")
SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS (<column_name>, "foo")

TEXT_CONTAINS can be combined using standard boolean operators:

SELECT COUNT(*) FROM Foo WHERE TEXT_CONTAINS ("col1", "foo") AND TEXT_CONTAINS ("col2", "bar")

Note: TEXT_CONTAINS supports regex and term queries and works only on native indices. TEXT_CONTAINS supports standard regex patterns (as used by LIKE in the SQL standard), so there might be some syntactical differences from Lucene queries.

Creating Native Text Indices

Native text indices are created using field configurations. To indicate that an index type is native, specify it using properties in the field configuration:

"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexTypes": ["TEXT"],
     "properties":{"fstType":"native"}
  }
]

Segment threshold

Learn how segment thresholds work in Pinot.

The segment threshold determines when a segment is committed in real-time tables.

When data is first ingested from a streaming provider like Kafka, Pinot stores the data in a consuming segment.

This segment is on the disk of the server(s) processing a particular partition from the streaming provider.

However, it's not until a segment is committed that the segment is written to the deep store. The segment threshold decides when that should happen.

Why is the segment threshold important?

The segment threshold is important because it ensures segments are a reasonable size.

When queries are processed, smaller segments may increase query latency due to more overhead (number of threads spawned, metadata processing, and so on).

Larger segments may cause servers to run out of memory. When a server is restarted, the consuming segment must start consuming from the first row again, causing a lag between Pinot and the streaming provider.
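
For reference, these thresholds are usually controlled through the real-time table's streamConfigs; a minimal sketch, assuming the commonly documented flush-threshold keys (verify against the stream configuration reference for your version), might look like this:

{
  "tableIndexConfig": {
    "streamConfigs": {
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.time": "6h",
      "realtime.segment.flush.threshold.segment.size": "200M"
    }
  }
}

Setting the row threshold to 0 lets Pinot size segments based on the desired segment size instead of a fixed row count.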

Mark Needham explains the segment threshold

Understanding Stages

Learn more about multi-stage stages and how to extract stages from query plans.

Deep dive into stages

As explained in the Multi-stage query engine reference documentation, the multi-stage query engine breaks down a query into multiple stages. Each stage corresponds to a subset of the query plan and is executed independently. Stages are connected in a tree-like structure where the output of one stage is the input to another stage. The stage that is at the root of the tree sends the final results to the client. The stages that are at the leaves of the tree read from the tables. The intermediate stages process the data and send it to the next stage.

When the broker receives a query, it generates a query plan. This is a tree-like structure where each node is an operator. The plan is then optimized, moving and changing nodes to generate a plan that is semantically equivalent (it returns the same rows) but more efficient. During this phase the broker colors the nodes of the plan, assigning them to a stage. The broker also assigns a parallelism to each stage and defines which servers are going to execute each stage. For example, if a stage has a parallelism of 10, then at most 10 servers will execute that stage in parallel. One single server can execute multiple stages in parallel and it can even execute multiple instances of the same stage in parallel.

Stages are identified by their stage ID, which is a unique identifier for each stage. In the current implementation the stage ID is a number and the root stage has a stage ID of 0, although this may change in the future.

The current implementation has some properties that are worth mentioning:

  • The leaf stages execute a slightly modified version of the single-stage query engine. Therefore these stages cannot execute joins or aggregations, which are always executed in the intermediate stages.

  • Intermediate stages execute operations using a new query execution engine that has been created for the multi-stage query engine. This is why some of the functions that are supported in the single-stage query engine are not supported in the multi-stage query engine and vice versa.

  • An intermediate stage can only have one join, one window function or one set operation. If a query has more than one of these operations, the broker will create multiple stages, each with one of these operations.

Extracting Stages from Query Plans

As explained in Explain Plan (Multi-Stage), you can use the EXPLAIN PLAN syntax to obtain the logical plan of a query. This logical plan can be used to extract the stages of the query.

For example, if the query is:

explain plan for
select customer.c_address, orders.o_shippriority
from customer
join orders
    on customer.c_custkey = orders.o_custkey
limit 10

A possible output of the EXPLAIN PLAN command is:

LogicalSort(offset=[0], fetch=[10])
  PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
    LogicalSort(fetch=[10])
      LogicalProject(c_address=[$0], o_shippriority=[$3])
        LogicalJoin(condition=[=($1, $2)], joinType=[inner])
          PinotLogicalExchange(distribution=[hash[1]])
            LogicalProject(c_address=[$4], c_custkey=[$6])
              LogicalTableScan(table=[[default, customer]])
          PinotLogicalExchange(distribution=[hash[0]])
            LogicalProject(o_custkey=[$5], o_shippriority=[$10])
              LogicalTableScan(table=[[default, orders]])

As it happens with all queries, the logical plan forms a tree-like structure. In this default explain format, the tree-like structure is represented with indentation. The root of the tree is the first line, which is the last operator to be executed and marks the root stage. The boundaries between stages are the PinotLogicalExchange operators. In the example above, there are four stages:

  • The root stage starts with the LogicalSort operator at the root of the plan and ends with the PinotLogicalSortExchange operator. This is the last stage to be executed and the only one that is executed on the broker, which sends the result directly to the client once it is computed.

  • The next stage starts with this PinotLogicalSortExchange operator and includes the LogicalSort operator, the LogicalProject operator, the LogicalJoin operator, and the two PinotLogicalExchange operators. This stage is clearly not the root stage, and it is not reading data from the segments, so it is not a leaf stage. Therefore it has to be an intermediate stage.

  • The join has two children, which are the PinotLogicalExchange operators. In this specific case, both sides are very similar. They start with a PinotLogicalExchange operator and end with a LogicalTableScan operator. All stages that end with a LogicalTableScan operator are leaf stages.

Now that we have identified the stages, we can understand what each stage is doing by understanding multi-stage explain plans.

Recipes

Here you will find a collection of ready-made sample applications and examples for real-world data.


Broker

Discover how Apache Pinot's broker component optimizes query processing, data retrieval, and enhances data-driven applications.

Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return results to the client. The controller shares cluster metadata with the brokers, which allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.

A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.

Broker interaction with other components

Pinot brokers are modeled as Helix spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.

The broker ensures that all the rows of the table are queried exactly once so as to return correct, consistent results for a query. The brokers may optimize to prune some of the segments as long as accuracy is not sacrificed.

Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.

In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.

Let's take an example: we have real-time data for five days (March 23 to March 27), and offline data has been pushed until March 25, which is two days behind real-time. The broker maintains this time boundary.

Suppose we get a query to this table: select sum(metric) from table. The broker will split the query into two queries based on this time boundary, one for offline and one for real-time. The query becomes select sum(metric) from table_REALTIME where date >= Mar 25 and select sum(metric) from table_OFFLINE where date < Mar 25.

The broker merges results from both these queries before returning the result to the client.

Starting a broker

Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a broker:

docker run \
    --network=pinot-demo \
    --name pinot-broker \
    -d ${PINOT_IMAGE} StartBroker \
    -zkAddress pinot-zookeeper:2181
bin/pinot-admin.sh StartBroker \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -brokerPort 7000

Inverted index

This page describes configuring the inverted index for Apache Pinot

We can define the forward index as a mapping from document IDs (also known as rows) to values. Similarly, an inverted index establishes a mapping from values to a set of document IDs, making it the "inverted" version of the forward index. When you frequently use a column for filtering operations like EQ (equal), IN (membership check), GT (greater than), etc., incorporating an inverted index can significantly enhance query performance.

Pinot supports two distinct types of inverted indexes: bitmap inverted indexes and sorted inverted indexes. Bitmap inverted indexes represent the actual inverted index type, whereas the sorted type is automatically available when the column is sorted. Both types of indexes necessitate the enabling of a dictionary for the respective column.

Bitmap inverted index

When a column is not sorted, and an inverted index is enabled for that column, Pinot maintains a mapping from each value to a bitmap of rows. This design ensures that value lookup operations take constant time, providing efficient querying capabilities.

If you have a column that is frequently used for filtering, adding an inverted index will improve performance greatly. You can also create an inverted index on a multi-value column.

Inverted indexes are disabled by default and can be enabled for a column by specifying the configuration within the table configuration:

inverted index defined in tableConfig
{
  "fieldConfigList": [
    {
      "name": "theColumnName",
      "indexes": {
        "inverted": {}
      }
    }
  ],
  ...
}

The older way to configure inverted indexes can also be used, although it is not recommended:

old way to define inverted index in tableConfig
{
    "tableIndexConfig": {
        "invertedIndexColumns": [
            "theColumnName",
            ...
        ],
        ...
    }
}

When the index is created

By default, bitmap inverted indexes are not generated when the segment is initially created; instead, they are created when the segment is loaded by Pinot. This behavior is governed by the table configuration option indexingConfig.createInvertedIndexDuringSegmentGeneration, which is set to false by default.

Sorted inverted index

As explained in the forward index section, a column that is both sorted and equipped with a dictionary is encoded in a specialized manner that serves the purpose of implementing both forward and inverted indexes. Consequently, when these conditions are met, an inverted index is effectively created without additional configuration, even if the configuration suggests otherwise. This sorted version of the forward index offers a lookup time complexity of log(n) and leverages data locality.

For instance, consider the following example: if a query includes a filter on the memberId column, Pinot will perform a binary search on memberId values to find the docId range pair for the corresponding filter value. If the query needs to scan values of other columns after filtering, the values within that docId range will be located together, which means we can benefit from data locality.
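
For example, a filter like the following (myTable is a hypothetical table sorted on memberId) can be answered with a binary search over the sorted values:

SELECT COUNT(*)
FROM myTable
WHERE memberId = 12345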


A sorted inverted index indeed offers superior performance compared to a bitmap inverted index, but it's important to note that it can only be applied to sorted columns. In cases where query performance with a regular inverted index is unsatisfactory, especially when a large portion of queries involve filtering on the same column (e.g., _memberId_), using a sorted index can substantially enhance query performance.

Release notes

The following summarizes Apache Pinot™ releases, from the latest one to the earliest one.

Note

Before upgrading from one version to another one, read the release notes. While the Pinot committers strive to keep releases backward-compatible and introduce new features in a compatible manner, your environment may have a unique combination of configurations/data/schema that may have been somehow overlooked. Before you roll out a new release of Pinot on your cluster, it is best that you run the compatibility test suite that Pinot provides. The tests can be easily customized to suit the configurations and tables in your pinot cluster(s). As a good practice, you should build your own test suite, mirroring the table configurations, schema, sample data, and queries that are used in your cluster.

1.3.0 (February 2025)

1.2.0 (August 2024)

1.1.0 (March 2024)

1.0.0 (September 2023)

0.12.1 (March 2023)

0.12.0 (December 2022)

0.11.0 (September 2022)

0.10.0 (March 2022)

0.9.3 (December 2021)

0.9.2 (December 2021)

0.9.1 (December 2021)

0.9.0 (November 2021)

0.8.0 (August 2021)

0.7.1 (April 2021)

0.6.0 (November 2020)

0.5.0 (September 2020)

0.4.0 (June 2020)

0.3.0 (March 2020)

0.2.0 (November 2019)

0.1.0 (March 2019, First release)

Query

Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.

Explore query syntax:

Running on Azure

This quickstart guide helps you get started running Pinot on Microsoft Azure.

In this quickstart guide, you will set up a Kubernetes cluster on Azure Kubernetes Service (AKS).

1. Tooling Installation

1.1 Install Kubectl

Follow the Kubernetes documentation to install kubectl.

For Mac users
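
On macOS, kubectl can typically be installed with Homebrew (assuming Homebrew is available):

brew install kubectl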

Check kubectl version after installation.
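
For example:

kubectl version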

Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12

1.2 Install Helm

To install Helm, see the Helm documentation.

For Mac users
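
On macOS, Helm can typically be installed with Homebrew (assuming Homebrew is available):

brew install helm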

Check helm version after installation.
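
For example:

helm version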

This quickstart supports Helm v3.0.0 and v2.12.1. Pick the script based on your Helm version.

1.3 Install Azure CLI

Follow the Azure documentation to install the Azure CLI.

For Mac users
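
On macOS, the Azure CLI can typically be installed with Homebrew:

brew update && brew install azure-cli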

2. (Optional) Log in to your Azure account

This script will open your default browser so you can sign in to your Azure account.
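
The standard Azure CLI login command is:

az login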

3. (Optional) Create a Resource Group

Use the following script to create a resource group in the location eastus.
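
A typical command, using pinot-quickstart-rg as a placeholder resource group name:

az group create --name pinot-quickstart-rg --location eastus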

4. (Optional) Create a Kubernetes cluster (AKS) in Azure

This script will create a 3-node cluster named pinot-quickstart for demo purposes.

Modify the parameters in the following example command with your resource group and cluster details:
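
A typical AKS creation command, reusing the placeholder resource group above:

az aks create \
  --resource-group pinot-quickstart-rg \
  --name pinot-quickstart \
  --node-count 3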

Once the command succeeds, the cluster is ready to be used.

5. Connect to an existing cluster

Run the following command to get the credential for the cluster pinot-quickstart that you just created:
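
A typical form of this command, using the placeholder names above:

az aks get-credentials \
  --resource-group pinot-quickstart-rg \
  --name pinot-quickstart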

To verify the connection, run the following:
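
A typical check is:

kubectl get nodes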

6. Pinot quickstart

Follow the Kubernetes quickstart to deploy your Pinot demo.

7. Delete a Kubernetes Cluster
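
A typical deletion command, using the placeholder names above:

az aks delete \
  --resource-group pinot-quickstart-rg \
  --name pinot-quickstart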

Troubleshooting Pinot

Find debug information in Pinot

Pinot offers various ways to assist with troubleshooting and debugging problems that might happen.

Start with the table debug API, which will surface many of the commonly occurring problems. The debug API provides information such as tableSize, ingestion status, and error messages related to state transitions on the server.

The table debug API can be invoked via the Swagger UI, as in the following image:

It can also be invoked directly by accessing the URL as follows. The API requires the tableName, and can optionally take tableType (offline|realtime) and verbosity level.
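
A typical invocation of the controller's table debug endpoint looks like the following (myTable is a placeholder table name; add the optional type and verbosity parameters as needed):

curl -X GET "http://localhost:9000/debug/tables/myTable?verbosity=0" -H "accept: application/json"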

Pinot also provides a variety of operational metrics that can be used for creating dashboards, alerting, and monitoring.

Finally, all Pinot components log debug information related to error conditions.

Debug a slow query or a query which keeps timing out

Use the following steps:

  1. If the query executes, look at the query result. Specifically look at numEntriesScannedInFilter and numDocsScanned.

    1. If numEntriesScannedInFilter is very high, consider adding indexes for the corresponding columns being used in the filter predicates. You should also think about partitioning the incoming data based on the dimension most heavily used in your filter queries.

    2. If numDocsScanned is very high, that means the selectivity for the query is low and lots of documents need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.

  2. If the query is not executing, you can extend the query timeout by appending a timeoutMs parameter to the query, for example, select * from mytable limit 10 option(timeoutMs=60000). Then repeat step 1, as needed.

  3. Look at garbage collection (GC) stats for the corresponding Pinot servers. If a particular server seems to be running full GC all the time, you can do a couple of things such as

    1. Increase Java Virtual Machine (JVM) heap (java -Xmx<size>).

    2. Consider using off-heap memory for segments.

    3. Decrease the total number of segments per server (by partitioning the data in a more efficient way).

HDFS as Deep Storage

This guide shows how to set up HDFS as deep storage for a Pinot segment.

To use HDFS as deep storage you need to include HDFS dependency jars and plugins.

Server Setup

Configuration

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
pinot.set.instance.id.to.hostname=true
pinot.server.instance.dataDir=/path/in/local/filesystem/for/pinot/data/server/index
pinot.server.instance.segmentTarDir=/path/in/local/filesystem/for/pinot/data/server/segment
pinot.server.grpc.enable=true
pinot.server.grpc.port=8090

Executable

export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181

export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-server.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-server.sh -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName ${SERVER_CONF_DIR}/server.conf

Controller Setup

Configuration

Executable

Broker Setup

Configuration

Executable

Troubleshooting

If you receive an error that says No FileSystem for scheme "hdfs", the problem is likely to be a class loading issue.

To fix, try adding the following property to core-site.xml:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>

Then include /opt/pinot/lib/hadoop-common-<release-version>.jar in the classpath.

Deep Store

Leverage Apache Pinot's deep store component for efficient large-scale data storage and management, enabling impactful data processing and analysis.

The deep store (or deep storage) is the permanent store for segment files.

It is used for backup and restore operations. New nodes in a cluster will pull down a copy of segment files from the deep store. If the local segment files on a server get damaged in some way (or accidentally deleted), a new copy will be pulled down from the deep store on server restart.

The deep store stores a compressed version of the segment files and it typically won't include any indexes. These compressed files can be stored on a local file system or on a variety of other file systems. For more details on supported file systems, see File Systems.

Note: Deep store by itself is not sufficient for restore operations. Pinot stores metadata such as table config, schema, segment metadata in Zookeeper. For restore operations, both Deep Store as well as Zookeeper metadata are required.

How do segments get into the deep store?

There are several different ways that segments are persisted in the deep store.

For offline tables, the batch ingestion job writes the segment directly into the deep store, as shown in the diagram below:

The ingestion job then sends a notification about the new segment to the controller, which in turn notifies the appropriate server to pull down that segment.

For real-time tables, by default, a segment is first built-in memory by the server. It is then uploaded to the lead controller (as part of the Segment Completion Protocol sequence), which writes the segment into the deep store, as shown in the diagram below:

Having all segments go through the controller can become a system bottleneck under heavy load, in which case you can use the peer download policy.

When using this configuration, the server will directly write a completed segment to the deep store, as shown in the diagram below:

Configuring the deep store

For hands-on examples of how to configure the deep store, see the following tutorials:

Optimizing joins

Tips and tricks that can be used to optimize joins

Read the join strategies page for a detailed explanation of how joins are implemented.

The order of input relations matters

Apache Pinot does not rely on table statistics to optimize the join order. Instead, it prioritizes the input relations from right to left (based on the order of the tables in the SQL query). The right relation is fully consumed to create an in-memory hash table and may be broadcast to all workers. It is less expensive to do a join between a large table and a small table than the other way around, therefore it's important to specify the smaller relation as the right input.

Here, left means the first relation in the explain plan and right the second one. In SQL, when two tables are joined, the left relation is the first one specified and the right relation is the second. But this gets more complicated when three or more tables are joined, so it is strongly recommended to use the explain plan to be sure which input is left and which is right.

For example, this query:
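
(A sketch with hypothetical tables, where orders is assumed to be large and customer small; customer is the right input here.)

SELECT orders.o_orderkey, customer.c_name
FROM orders
JOIN customer
    ON orders.o_custkey = customer.c_custkey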

is more efficient than:
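
(Same hypothetical tables, but with the large orders table as the right input.)

SELECT orders.o_orderkey, customer.c_name
FROM customer
JOIN orders
    ON orders.o_custkey = customer.c_custkey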

Predicate push-down

Usually it is faster to filter data before joining it. Pinot automatically pushes down predicates to the individual tables before joining them when it can prove the change doesn't break semantics.

For example, consider the following query:
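
(A sketch using the same hypothetical tables; the rewritten form shown afterwards is illustrative rather than the exact plan Pinot produces.)

SELECT orders.o_orderkey, customer.c_name
FROM orders
JOIN customer
    ON orders.o_custkey = customer.c_custkey
WHERE customer.c_nationkey = 5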

Is automatically transformed by Pinot into:
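
(Illustrative rewritten form.)

SELECT orders.o_orderkey, customer.c_name
FROM orders
JOIN (
    SELECT c_custkey, c_name
    FROM customer
    WHERE c_nationkey = 5
) AS customer
    ON orders.o_custkey = customer.c_custkey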

This optimization not only reduces the amount of data that needs to be shuffled and joined but also opens the possibility of using indexes to speed up the query.

Remember that sometimes the predicate push-down is not possible. One example is when one of the inputs is a subquery with a limit like:
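
(Illustrative, continuing with the hypothetical tables.)

SELECT orders.o_orderkey, c.c_name
FROM orders
JOIN (
    SELECT c_custkey, c_name
    FROM customer
    LIMIT 10
) AS c
    ON orders.o_custkey = c.c_custkey
WHERE c.c_name = 'John'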

In this case, although Pinot will push down the predicate into the subquery, it won't be able to push it down into the table scan of the subquery because it would break the semantics of the original limit.

Therefore the final query will be:
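
(Continuing the hypothetical example; the filter is pushed into the subquery but stays above the limit.)

SELECT orders.o_orderkey, c.c_name
FROM orders
JOIN (
    SELECT c_custkey, c_name
    FROM (
        SELECT c_custkey, c_name
        FROM customer
        LIMIT 10
    ) AS limited
    WHERE c_name = 'John'
) AS c
    ON orders.o_custkey = c.c_custkey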

This new query is equivalent to the original one and reduces the amount of data that needs to be shuffled and joined, but it cannot use indexes to speed up the query. In case you want to apply the filter before the limit, you can rewrite the query as:
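
(Continuing the hypothetical example, with the filter applied before the limit.)

SELECT orders.o_orderkey, c.c_name
FROM orders
JOIN (
    SELECT c_custkey, c_name
    FROM customer
    WHERE c_name = 'John'
    LIMIT 10
) AS c
    ON orders.o_custkey = c.c_custkey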

This optimization can be easily seen in the explain plan, where the filter operator will be pushed as one of the sides of the join.

Optimizing semi-join to use indexes

Semi-joins are a special case of joins where the result is not the combination of rows from both tables but only the rows of the first table that have a match in the second table.

Queries using semi-joins are usually not written as such but as a query with a subquery in the WHERE clause like:
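
(Illustrative, using the hypothetical tables.)

SELECT c_name
FROM customer
WHERE c_custkey IN (
    SELECT o_custkey
    FROM orders
    WHERE o_totalprice > 1000
)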

Or
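
(Equivalently, using EXISTS.)

SELECT c_name
FROM customer
WHERE EXISTS (
    SELECT 1
    FROM orders
    WHERE orders.o_custkey = customer.c_custkey
      AND orders.o_totalprice > 1000
)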

In order to use indexes, Pinot needs to know the actual values of the subquery at optimization time. Therefore, what Pinot does internally is execute the subquery first and then replace the subquery with the actual values in the main query.

For example, if the subquery in the previous example returns the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, the query is transformed into:

Which can then be optimized using indexes.

Currently, this optimization cannot be seen in the Pinot explain plan.

Rewriting joins as semi-joins

Sometimes, inner or left joins can be converted into semi-joins. Specifically, the requirements are:

  1. Only columns from the left side must be projected.

  2. The join condition must be equality.

  3. The right side must be unique.

For example, the following two queries are equivalent:

But they wouldn't be equivalent if we used orders instead of distinct_orders. This is because a join repeats rows from the left side when there are repeated matches on the right side, while a semi-join never repeats rows.

Pinot applies this optimization automatically if the three conditions explained above are fulfilled. Given that columns in Pinot cannot be marked as unique, the only way to indicate to Pinot that the right-hand side is unique is to apply a SQL construct that guarantees it, for example DISTINCT or GROUP BY on the columns used in the join condition. Sometimes it is simply easier to rewrite the original SQL to replace the JOIN with a WHERE EXISTS.

Reduce data shuffle

Pinot supports different types of join strategies. It is important to understand them and to use the less expensive ones when possible, because data shuffle is expensive and can be a bottleneck for query performance. Remember to use stageStats (especially the mailbox send and mailbox receive stats) and the different explain plan modes to understand how your data is being shuffled.
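For instance, here is a minimal sketch of inspecting a join before running it (the tables and columns are the hypothetical largeTable and smallTable used in the examples above); the plan output shows which relation ends up on the right-hand, broadcast side and where the data exchanges happen:

EXPLAIN PLAN FOR
SELECT largeTable.col1, smallTable.col2
FROM largeTable
JOIN smallTable
  ON largeTable.col1 = smallTable.col2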

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
pinot.set.instance.id.to.hostname=true
pinot.server.instance.dataDir=/path/in/local/filesystem/for/pinot/data/server/index
pinot.server.instance.segmentTarDir=/path/in/local/filesystem/for/pinot/data/server/segment
pinot.server.grpc.enable=true
pinot.server.grpc.port=8090
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181


export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-server.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-server.sh  -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName ${SERVER_CONF_DIR}/server.conf
controller.data.dir=hdfs://path/in/hdfs/for/controller/segment
controller.local.temp.dir=/tmp/pinot/
controller.zk.str=<ZOOKEEPER_HOST:ZOOKEEPER_PORT>
controller.enable.split.commit=true
controller.access.protocols.http.port=9000
controller.helix.cluster.name=PinotCluster
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf/directory/
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
controller.vip.port=9000
controller.port=9000
pinot.set.instance.id.to.hostname=true
pinot.server.grpc.enable=true
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181


export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms8G -Xmx12G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-controller.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-controller.sh -configFileName ${SERVER_CONF_DIR}/controller.conf
pinot.set.instance.id.to.hostname=true
pinot.server.grpc.enable=true
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export GC_LOG_LOCATION=/path/to/gc/log/file
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/path/to/apache-pinot-${PINOT_VERSION}-bin/
export SERVER_CONF_DIR=/path/to/pinot/conf/dir/
export ZOOKEEPER_ADDRESS=localhost:2181


export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"
export JAVA_OPTS="-Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:${GC_LOG_LOCATION}/gc-pinot-broker.log"
${PINOT_DISTRIBUTION_DIR}/bin/start-broker.sh -zkAddress ${ZOOKEEPER_ADDRESS} -configFileName  ${SERVER_CONF_DIR}/broker.conf
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
brew update && brew install azure-cli
az login
AKS_RESOURCE_GROUP=pinot-demo
AKS_RESOURCE_GROUP_LOCATION=eastus
az group create --name ${AKS_RESOURCE_GROUP} \
                --location ${AKS_RESOURCE_GROUP_LOCATION}
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks create --resource-group ${AKS_RESOURCE_GROUP} \
              --name ${AKS_CLUSTER_NAME} \
              --node-count 3
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} \
                       --name ${AKS_CLUSTER_NAME}
kubectl get nodes
AKS_RESOURCE_GROUP=pinot-demo
AKS_CLUSTER_NAME=pinot-quickstart
az aks delete --resource-group ${AKS_RESOURCE_GROUP} \
              --name ${AKS_CLUSTER_NAME}
Azure Kubernetes Service (AKS)
https://kubernetes.io/docs/tasks/tools/install-kubectl
Installing Helm
https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest
Kubernetes quickstart
select largeTable.col1, smallTable.col2
from largeTable 
cross join smallTable
select largeTable.col1, smallTable.col2
from smallTable 
cross join largeTable
SELECT customer.c_address, orders.o_shippriority
FROM customer
JOIN orders
    ON customer.c_custkey = orders.o_custkey
WHERE customer.c_nationkey = 1
SELECT customer.c_address, orders.o_shippriority
FROM (SELECT * FROM customer WHERE c_nationkey = 1) AS customer
JOIN orders
    ON customer.c_custkey = orders.o_custkey
SELECT customer.c_address, orders.o_shippriority
FROM (select * from customer LIMIT 10) as customer
JOIN orders
    ON customer.c_custkey = orders.o_custkey
WHERE customer.c_nationkey = 1
SELECT customer.c_address, orders.o_shippriority
FROM (select * from 
        (select * from customer LIMIT 10) as temp WHERE temp.c_nationkey = 1
     ) as customer
JOIN orders
    ON customer.c_custkey = orders.o_custkey
SELECT customer.c_address, orders.o_shippriority
FROM (select * from customer WHERE c_nationkey = 1 LIMIT 10) as customer
JOIN orders
    ON customer.c_custkey = orders.o_custkey
SELECT customer.c_address, customer.c_nationkey
FROM customer
WHERE EXISTS (SELECT 1 FROM orders WHERE customer.c_custkey = orders.o_custkey)
SELECT customer.c_address, customer.c_nationkey
FROM customer
WHERE c_custkey IN (SELECT o_custkey FROM orders)
SELECT customer.c_address, customer.c_nationkey
FROM customer
WHERE customer.c_custkey IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
SELECT customer.c_address, customer.c_nationkey
FROM customer
JOIN (SELECT DISTINCT o_custkey FROM orders) AS distinct_orders
on customer.c_custkey = distinct_orders.o_custkey
SELECT customer.c_address, customer.c_nationkey
FROM customer
WHERE c_custkey IN (SELECT o_custkey FROM orders)

Query FAQ

This page has a collection of frequently asked questions about queries with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.

Querying

I get the following error when running a query, what does it mean?

{'errorCode': 410, 'message': 'BrokerResourceMissingError'}

This implies that the Pinot broker assigned to the table specified in the query was not found. A common root cause is a typo in the table name in the query. Another, less common, reason is that there isn't actually a broker with the required broker tenant tag for the table.

What are all the fields in the Pinot query's JSON response?

See this page explaining the Pinot response format: https://docs.pinot.apache.org/users/api/querying-pinot-using-standard-sql/response-format.

SQL Query fails with "Encountered 'timestamp' was expecting one of..."

"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.

select "timestamp" from myTable

Other commonly encountered reserved keywords are date, time, table.

Filtering on STRING column WHERE column = "foo" does not work?

For filtering on STRING columns, use single quotes:

SELECT COUNT(*) from myTable WHERE column = 'foo'

ORDER BY using an alias doesn't work?

The fields in the ORDER BY clause must be one of the GROUP BY expressions or aggregations, written as they appear BEFORE applying the alias. Therefore, this will not work:

SELECT count(colA) as aliasA, colA from tableA GROUP BY colA ORDER BY aliasA

But, this will work:

SELECT count(colA) as aliasA, colA from tableA GROUP BY colA ORDER BY count(colA)

Does pagination work in GROUP BY queries?

No. Pagination only works for SELECTION queries.

How do I increase timeout for a query ?

You can add this at the end of your query: option(timeoutMs=X). The following example uses a timeout of 20 seconds for the query:

SELECT COUNT(*) from myTable option(timeoutMs=20000)

You can also use SET "timeoutMs" = 20000; SELECT COUNT(*) from myTable.

For changing the timeout on the entire cluster, set this property pinot.broker.timeoutMs in either broker configs or cluster configs (using the POST /cluster/configs API from Swagger).
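For example, a minimal sketch of setting the cluster-wide timeout through the POST /cluster/configs API mentioned above (assuming the controller runs at localhost:9000 and a 30-second timeout is desired):

curl -X POST "http://localhost:9000/cluster/configs" \
     -H "Content-Type: application/json" \
     -d '{"pinot.broker.timeoutMs": "30000"}'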

How do I cancel a query?

See Query Cancellation

How do I optimize my Pinot table for doing aggregations and group-by on high cardinality columns ?

In order to speed up aggregations, you can enable metrics aggregation on the required column by adding a metric field in the corresponding schema and setting aggregateMetrics to true in the table configuration. You can also use a star-tree index for columns like these (see the star-tree index documentation for more details).
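As a minimal sketch (the table name is hypothetical, and the aggregated column must be declared as a metric field in the schema), the flag goes into the tableIndexConfig section of the real-time table config:

{
  "tableName": "myTable_REALTIME",
  "tableIndexConfig": {
    "aggregateMetrics": true,
    ...
  },
  ...
}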

How do I verify that an index is created on a particular column ?

There are two ways to verify this:

  1. Log in to a server that hosts segments of this table. Inside the data directory, locate the segment directory for this table. In this directory, there is a file named index_map which lists all the indexes and other data structures created for each segment. Verify that the requested index is present here.

  2. During query: Use the column in the filter predicate and check the value of numEntriesScannedInFilter in the query response metadata. If this value is 0, the index is being used to prune documents as expected (this check works for the inverted index); see the example below.
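For instance, a quick sketch (table and column names are hypothetical): run a query whose only filter is on the indexed column and then inspect the response metadata.

SELECT COUNT(*)
FROM myTable
WHERE invertedIndexedColumn = 'someValue'

If the inverted index is being used, numEntriesScannedInFilter should be 0 for this query.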

Does Pinot use a default value for LIMIT in queries?

Yes, Pinot uses a default value of LIMIT 10 in queries. The reason behind this default value is to avoid unintentionally submitting expensive queries that end up fetching or processing a lot of data from Pinot. Users can always overwrite this by explicitly specifying a LIMIT value.
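For example, reusing the baseballStats table from the sample queries in these docs, an explicit LIMIT overrides the default of 10:

SELECT playerName, hits
FROM baseballStats
LIMIT 100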

Does Pinot cache query results?

Pinot does not cache query results; each query is computed in its entirety. Note, though, that running the same or a similar query multiple times will naturally pull segment pages into memory, making subsequent calls faster. Also, for real-time systems, the data is changing in real time, so results cannot be cached. For offline-only systems, a caching layer can be built on top of Pinot, with an invalidation mechanism to invalidate the cache when new data is pushed into Pinot.

I'm noticing that the first query is slower than subsequent queries. Why is that?

Pinot memory-maps segments. It warms up during the first query, when segments are pulled into memory by the OS. Subsequent queries will find the segments already loaded in memory and hence will be faster. The OS is responsible for bringing segments into memory, and also for evicting them in favor of other segments that are accessed but not yet in memory.

How do I determine if the star-tree index is being used for my query?

The query execution engine will prefer to use the star-tree index for all queries where it can be used. The criteria to determine whether the star-tree index can be used are as follows:

  • All aggregation function + column pairs in the query must exist in the star-tree index.

  • All dimensions that appear in filter predicates and group-by should be star-tree dimensions.

For queries where the above criteria are met, the star-tree index is used. For other queries, the execution engine will default to using the next-best index available.
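As an illustrative sketch (table, dimension, and metric names are hypothetical), a star-tree index defined like this in tableIndexConfig:

{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "browser"],
        "functionColumnPairs": ["SUM__impressions", "COUNT__*"],
        "maxLeafRecords": 10000
      }
    ],
    ...
  },
  ...
}

can serve a query such as SELECT country, SUM(impressions) FROM myTable WHERE browser = 'chrome' GROUP BY country, because all filter and group-by dimensions and the aggregation function + column pair are covered. A query filtering on a column outside dimensionsSplitOrder would fall back to other indexes.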


Cluster

Learn to build and manage Apache Pinot clusters, uncovering key components for efficient data processing and optimized analysis.

A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.

A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop:

  • Controller: Maintains cluster metadata and manages cluster resources.

  • Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.

  • Broker: Accepts queries from client processes and forwards them to servers for processing.

  • Server: Provides storage for segment files and compute for query processing.

  • (Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).

The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.

Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.

Helix is a cluster management solution that maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. Helix constantly monitors the cluster to ensure that the right hardware resources are allocated for the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.

Cluster configuration

For details of cluster configuration settings, see Cluster configuration reference.

Cluster components

Helix divides nodes into logical components based on their responsibilities:

Participant

Participants are the nodes that host distributed, partitioned resources.

Pinot servers are modeled as participants. For details about server nodes, see Server.

Spectator

Spectators are the nodes that observe the current state of each participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).

Pinot brokers are modeled as spectators. For details about broker nodes, see Broker.

Controller

The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.

Pinot controllers are modeled as controllers. For details about controller nodes, see Controller.

Logical view

Another way to visualize the cluster is a logical view, where:

  • A cluster contains tenants

  • Tenants contain tables

  • Tables contain segments

Set up a Pinot cluster

Typically, there is only one cluster per environment/data center. There is no need to create multiple Pinot clusters because Pinot supports tenants.

To set up a cluster, see one of the following guides:

  • Running Pinot in Docker

  • Running Pinot locally

Tenant

Discover the tenant component of Apache Pinot, which facilitates efficient data isolation and resource management within Pinot clusters.

Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data in separate workloads from being stored or processed on the same physical hardware.

By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster. If the cluster is planned to have multiple tenants, consider setting cluster.tenant.isolation.enable=false so that servers and brokers won't be tagged with DefaultTenant automatically when they are added to the cluster.
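As a minimal sketch, this property is set in the controller configuration file (the Zookeeper address below is just an example):

controller.zk.str=localhost:2181
cluster.tenant.isolation.enable=false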

To support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant, which controls the nodes used by the table as servers and brokers. Multi-tenancy lets Pinot group all tables belonging to a particular use case under a single tenant name.

The concept of tenants is very important when multiple use cases are using Pinot and there is a need to provide quotas or some sort of isolation across tenants. For example, consider two tables, Table A and Table B, in the same Pinot cluster.

Defining tenants for tables

We can configure Table A with server tenant Tenant A and Table B with server tenant Tenant B. We can tag some of the server nodes for Tenant A and some for Tenant B. This will ensure that segments of Table A only reside on servers tagged with Tenant A, and segments of Table B only reside on servers tagged with Tenant B. The same isolation can be achieved at the broker level, by configuring broker tenants for the tables.

Table isolation using tenants

No need to create separate clusters for every table or use case!

Tenant configuration

Tenants are defined in the tenants section of the table config.

This section contains two main fields, broker and server, which decide the tenants used for the broker and server components of this table.

"tenants": {
  "broker": "brokerTenantName",
  "server": "serverTenantName"
}

In the above example:

  • The table will be served by brokers that have been tagged as brokerTenantName_BROKER in Helix.

  • If this were an offline table, the offline segments for the table will be hosted in Pinot servers tagged in Helix as serverTenantName_OFFLINE

  • If this were a real-time table, the real-time segments (both consuming and completed ones) will be hosted in Pinot servers tagged in Helix as serverTenantName_REALTIME.

Create a tenant

Broker tenant

Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant by tagging three untagged broker nodes as sampleBrokerTenant_BROKER.

sample-broker-tenant.json
{
     "tenantRole" : "BROKER",
     "tenantName" : "sampleBrokerTenant",
     "numberOfInstances" : 3
}

To create this tenant use the following command. The creation will fail if the number of untagged broker nodes is less than numberOfInstances.

Follow instructions in Getting Pinot to get Pinot locally, and then

bin/pinot-admin.sh AddTenant \
    -name sampleBrokerTenant 
    -role BROKER 
    -instanceCount 3 -exec
curl -i -X POST -H 'Content-Type: application/json' -d @sample-broker-tenant.json localhost:9000/tenants

Check out the tenant in the Rest API to make sure it was successfully created.

Server tenant

Here's a sample server tenant config. This will create a server tenant sampleServerTenant by tagging 1 untagged server node as sampleServerTenant_OFFLINE and 1 untagged server node as sampleServerTenant_REALTIME.

sample-server-tenant.json
{
     "tenantRole" : "SERVER",
     "tenantName" : "sampleServerTenant",
     "offlineInstances" : 1,
     "realtimeInstances" : 1
}

To create this tenant use the following command. The creation will fail if the number of untagged server nodes is less than offlineInstances + realtimeInstances.

Follow instructions in Getting Pinot to get Pinot locally, and then

bin/pinot-admin.sh AddTenant \
    -name sampleServerTenant \
    -role SERVER \
    -offlineInstanceCount 1 \
    -realtimeInstanceCount 1 -exec
curl -i -X POST -H 'Content-Type: application/json' -d @sample-server-tenant.json localhost:9000/tenants

Check out the tenant in the Rest API to make sure it was successfully created.

Timestamp index

Use a timestamp index to speed up your time query with different granularities

This feature is supported from Pinot 0.11+.

Background

The TIMESTAMP data type introduced in the Pinot 0.8.0 release stores values as millisecond epoch long values.

Typically, users don't need this low-level granularity for analytics queries, and scanning the data and converting time values can be costly for big data.

A common query pattern for timestamp columns is filtering on a time range and then grouping by different time granularities (day/month/etc.).

Typically, this requires the query executor to extract values and apply transform functions before doing the filter/group-by, with no leverage of the dictionary or index.

This was the inspiration for the Pinot timestamp index, which is used to improve query performance for range queries and group-by queries on TIMESTAMP columns.

Supported data type

A TIMESTAMP index can only be created on the TIMESTAMP data type.

Timestamp Index

You can configure the granularity for a Timestamp data type column. Then:

  1. Pinot will pre-generate one column per time granularity using a forward index and range index. The naming convention is $${ts_column_name}$${ts_granularity}, where the timestamp column ts with granularities DAY, MONTH will have two extra columns generated: $ts$DAY and $ts$MONTH.

  2. Query rewrite for predicate and selection/group by: 2.1 GROUP BY: functions like dateTrunc('DAY', ts) will be translated to use the underlying column $ts$DAY to fetch data. 2.2 PREDICATE: a range index is auto-built for all granularity columns.

Example query usage:

select count(*), 
       datetrunc('WEEK', ts) as tsWeek 
from airlineStats 
WHERE datetrunc('WEEK', ts) > fromDateTime('2014-01-16', 'yyyy-MM-dd') 
group by tsWeek
limit 10

Some preliminary benchmarking shows the query performance across 2.7 billion records improved from 45 secs to 4.2 secs using a timestamp index and a query like this:

select dateTrunc('YEAR', event_time) as y, 
       dateTrunc('MONTH', event_time) as m,  
       sum(pull_request_commits) 
from githubEvents 
group by y, m
limit 1000
Option(timeoutMs=3000000)
Without Timestamp Index

vs.

With Timestamp Index

Usage

The timestamp index is configured on a per column basis inside the fieldConfigList section in the table configuration.

Specify the timestampConfig field. This object must contain a field called granularities, which is an array with at least one of the following values:

  • MILLISECOND

  • SECOND

  • MINUTE

  • HOUR

  • DAY

  • WEEK

  • MONTH

  • QUARTER

  • YEAR

Sample config:

{
  "fieldConfigList": [
    {
      "name": "ts",
      "timestampConfig": {
        "granularities": [
          "DAY",
          "WEEK",
          "MONTH"
        ]
      }
    }
    ...
  ]
  ...
}

Pinot On Kubernetes FAQ

This page has a collection of frequently asked questions about Pinot on Kubernetes with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.

How to increase server disk size on AWS

The following is an example using Amazon Elastic Kubernetes Service (Amazon EKS).

1. Update Storage Class

In the Kubernetes (k8s) cluster, check the storage class: in Amazon EKS, it should be gp2.

Then update StorageClass to ensure:

allowVolumeExpansion: true

Once StorageClass is updated, it should look like this:
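As a sketch, assuming the default gp2 StorageClass, the flag can be set and verified from the command line:

kubectl patch storageclass gp2 -p '{"allowVolumeExpansion": true}'
kubectl get storageclass gp2 -o yaml | grep allowVolumeExpansion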

2. Update PVC

Once the storage class is updated, then we can update the PersistentVolumeClaim (PVC) for the server disk size.

Now we want to double the disk size for pinot-server-3.

The following is an example of current disks:

The following is the output of data-pinot-server-3:

PVC data-pinot-server-3

Now, let's change the PVC size to 2T by editing the server PVC.

kubectl edit pvc data-pinot-server-3 -n pinot

Once updated, the specification's PVC size is updated to 2T, but the status's PVC size is still 1T.

3. Restart pod to let it reflect

Restart the pinot-server-3 pod:
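For example (assuming the pinot namespace used above), delete the pod and let its StatefulSet recreate it with the resized volume:

kubectl delete pod pinot-server-3 -n pinot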

Recheck the PVC size:
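For example:

kubectl get pvc data-pinot-server-3 -n pinot

The CAPACITY column should now report the new 2T size.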

Query time partition join strategy

Although in the general case the two inputs of a join cannot be pre-partitioned without breaking the join semantics, there are some cases where it is possible. For example, if the join condition is an equality between two columns like ON A.col2 = B.col3, it is possible to assign a partition function to each table that guarantees that partition(A.col2) <> partition(B.col3) => A.col2 <> B.col3. The most common case is to use a hash function as the partition function. The corollary of this property is that rows that end up on different servers after shuffling never needed to be joined with each other.

Dotted arrows mean shuffle while solid arrows mean in-server transfer

This technique is used by Pinot whenever it can infer it is possible, like when the join condition is an equality between two columns or a conjunction of equalities, for example:

SELECT A.col1, B.col2 
FROM A
JOIN B
ON A.col2 = B.col3 AND A.col5 = B.col2

When this technique is used, the number of rows that are shuffled is count(A) + count(B).

Schema

Explore the Schema component in Apache Pinot, vital for defining the structure and data types of Pinot tables, enabling efficient data processing and analysis.

Each table in Pinot is associated with a schema. A schema defines:

  • Fields in the table with their data types.

  • Whether the table uses column-based or table-based null handling. For more information, see Null value support.

The schema is stored in Zookeeper along with the table configuration.

Schema naming in Pinot follows typical database table naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.

Categories

A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:

  • Dimension: columns typically used for filtering and grouping, to answer slice-and-dice business queries.

  • Metric: columns that represent the quantitative data of the table and are typically used in aggregations.

  • DateTime: columns that represent time; used for filtering, grouping, the offline/real-time time boundary, and retention management.

Pinot does not enforce strict rules on which of these categories columns belong to; rather, the categories can be thought of as hints that allow Pinot to do internal optimizations.

For example, metrics may be stored without a dictionary and can have a different default null value.

The categories are also relevant when doing segment merge and rollups. Pinot uses the dimension and time fields to identify records against which to apply merge/rollups.

Metrics aggregation is another example, where Pinot uses the dimension and time fields as the key and automatically aggregates values for the metric columns.

For configuration details, see Schema configuration reference.

Date and time fields

Since Pinot doesn't have a dedicated DATETIME datatype, you need to input time in STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format, such as an epoch timestamp, to do operations. Refer to DateTime field spec configs for more details on supported formats.

Creating a schema

First, make sure your cluster is up and running.

Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.

For more details on constructing a schema file, see the Schema configuration reference.

Then, we can upload the sample schema provided above using either a Bash command or REST API call.

Check out the schema in the Rest API to make sure it was successfully uploaded.

Bloom filter

This page describes configuring the Bloom filter for Apache Pinot

When a column is configured to use this filter, Pinot creates one Bloom filter per segment. The Bloom filter helps to prune segments that do not contain any record matching an EQUALITY or IN predicate.

Note: Support for the IN clause is limited to predicates with <= 10 values; this keeps the pruning overhead minimal.

This is useful for query patterns like the ones below, where a Bloom filter is defined on the playerID column of the table:

Details

A Bloom filter is a probabilistic data structure used to definitively determine if an element is not present in a dataset, but it cannot be employed to determine if an element is present in the dataset. This limitation arises because Bloom filters may produce false positives but never yield false negatives.

An intriguing aspect of these filters is the existence of a mathematical formula that establishes a relationship between their size, the cardinality of the dataset they index, and the rate of false positives.

In Pinot, this cardinality corresponds to the number of unique values expected within each segment. If necessary, the false positive rate and the index size can be configured.
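As a reference, the standard Bloom filter sizing formula (not Pinot-specific) relates these quantities: for n expected unique values and a target false positive probability p, the filter needs roughly m = -n * ln(p) / (ln 2)^2 bits. For example, with n = 1,000,000 unique values per segment and p = 0.05, that is about 6.2 million bits, or roughly 780 KB per segment.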

Configuration

Bloom filters are deactivated by default, meaning that columns will not be indexed unless they are explicitly configured within the table configuration.

There are 3 optional parameters to configure the Bloom filter:

  • fpp (default: 0.05): False positive probability of the Bloom filter (from 0 to 1).

  • maxSizeInBytes (default: 0, unlimited): Maximum size of the Bloom filter.

  • loadOnHeap (default: false): Whether to load the Bloom filter using heap memory or off-heap memory.

The lower the fpp (false positive probability), the greater the accuracy of the Bloom filter, but this reduction in fpp will also lead to an increase in the index size. It's important to note that maxSizeInBytes takes precedence over fpp. If maxSizeInBytes is set to a value greater than 0 and the calculated size of the Bloom filter, based on the specified fpp, exceeds this size limit, Pinot will adjust the fpp to ensure that the Bloom filter size remains within the specified limit.

Similar to other indexes, a Bloom filter can be explicitly deactivated by setting the special parameter disabled to true.

Example

For example, the following table config enables the Bloom filter on the playerID column using the default values:

If some parameters need to be customized, they can be included in fieldConfigList.indexes.bloom. Note that even though the example customizes all parameters, you can modify just the ones you need.

Older configuration

Use default settings

To use default values, include the name of the column in tableIndexConfig.bloomFilterColumns.

For example:

Customized parameters

To specify custom parameters, add a new entry to the tableIndexConfig.bloomFilterConfigs object. The key should be the name of the column and the value should be an object similar to the one that can be used in the bloom section of fieldConfigList.

For example:

Pinot Data Explorer

Pinot Data Explorer is a user-friendly interface in Apache Pinot for interactive data exploration, querying, and visualization.

Once you have set up a cluster, you can start exploring the data and the APIs using the Pinot Data Explorer.

Navigate to http://localhost:9000 in your browser to open the Data Explorer UI.

Cluster Manager

The first screen that you'll see when you open the Pinot Data Explorer is the Cluster Manager. The Cluster Manager provides a UI to operate and manage your cluster.

If you want to view the contents of a server, click on its instance name. You'll then see the following:

To view the baseballStats table, click on its name, which will show the following screen:

From this screen, we can edit or delete the table, edit or adjust its schema, as well as several other operations.

For example, if we want to add yearID to the list of inverted indexes, click on Edit Table, add the extra column, and click Save:

Query Console

Let's run some queries on the data in the Pinot cluster. Navigate to the Query Console to see the querying interface.

We can see our baseballStats table listed on the left (you will see meetupRSVP or airlineStats if you used the streaming or the hybrid quick start). Click on the table name to display all the column names along with their data types.

You can also execute a sample query select * from baseballStats limit 10 by typing it in the text box and clicking the Run Query button.

Cmd + Enter can also be used to run the query when focused on the console.

Here are some sample queries you can try:

Pinot supports a subset of standard SQL. For more information, see Pinot Query Language.

Rest API

The Pinot Admin UI contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management, including health check, instances management, schema and table management, and data segments management.

Let's check out the tables in this cluster by going to Table -> List all tables in cluster, click Try it out, and then click Execute. We can see the baseballStats table listed here. We can also see the exact cURL call made to the controller API.

You can look at the configuration of this table by going to Tables -> Get/Enable/Disable/Drop a table, click Try it out, type baseballStats in the table name, and then click Execute.

Let's check out the schemas in the cluster by going to Schema -> List all schemas in the cluster, click Try it out, and then click Execute. We can see a schema called baseballStats in this list.

Take a look at the schema by going to Schema -> Get a schema, click Try it out, type baseballStats in the schema name, and then click Execute.

Finally, let's check out the data segments in the cluster by going to Segment -> List all segments, click Try it out, type baseballStats in the table name, and then click Execute. There's 1 segment for this table, called baseballStats_OFFLINE_0.

To learn how to upload your own data and schema, see Batch Ingestion or Stream ingestion.

Vector index

Overview

Apache Pinot now supports a Vector Index for efficient similarity searches over high-dimensional vector embeddings. This feature introduces the capability to store and query float array columns (multi-valued) using a vector similarity algorithm.

Key Features

  • Vector Index is implemented using HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor (ANN) search.

  • Adds support for a predicate and function:

    • VECTOR_SIMILARITY(v1, v2, [optional topK]) to retrieve the topK closest vectors based on similarity.

    • The similarity function can be used as part of a query to filter and rank results.

Examples

Below is an example schema designed for a use case involving product reviews with vector embeddings for each review.

Schema

In this schema:

• The embedding column is a multi-valued float array designed to store high-dimensional vector embeddings (e.g., 1536 dimensions from an NLP model).

• Other fields, such as ProductId, UserId, and Text, store metadata and review text.

Table Config

To enable the Vector Index, configure the table with the appropriate fieldConfigList. The embedding column is specified to use the Vector Index with HNSW for similarity searches.

Explanation of Properties:

  1. vectorIndexType:

Specifies the type of vector index to use. Currently supports HNSW.

  2. vectorDimension:

Defines the dimensionality of the vectors stored in the column. (e.g., 1536 for typical embeddings from models like OpenAI or BERT).

  3. vectorDistanceFunction:

Specifies the distance metric for similarity computation. Options include:

  • INNER_PRODUCT:

    • Computes the inner product (dot product) of the two vectors.

    • Typically used when vectors are normalized and higher scores indicate greater similarity.

  • L2:

    • Measures the Euclidean distance between vectors.

    • Suitable for tasks where spatial closeness in high-dimensional space indicates similarity.

  • L1:

    • Measures the Manhattan distance between vectors (sum of absolute differences of coordinates).

    • Useful for some scenarios where simpler distance metrics are preferred.

  • COSINE:

    • Measures cosine similarity, which considers the angle between vectors.

    • Ideal for normalized vectors where orientation matters more than magnitude.

  4. version:

Specifies the version of the Vector Index implementation.

Query

VECTOR_SIMILARITY:

A predicate that retrieves the top k closest vectors to the query vector.

Inputs:

  • embedding: The vector column.

  • Query vector (literal array).

  • Optional topK parameter (default: 10).

Running on AWS

This quickstart guide helps you get started running Pinot on Amazon Web Services (AWS).

In this quickstart guide, you will set up a Kubernetes Cluster on Amazon Elastic Kubernetes Service (Amazon EKS).

1. Tooling Installation

1.1 Install Kubectl

To install kubectl, see Install kubectl.

For Mac users

Check kubectl version after installation.

Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12

1.2 Install Helm

Follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.

For Mac users

Check helm version after installation.

This quickstart provides helm support for helm v3.0.0 and v2.12.1. Pick the script based on your helm version.

1.3 Install AWS CLI

Follow this link (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html#install-tool-bundled) to install the AWS CLI.

For Mac users

1.4 Install Eksctl

Follow this link (https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html#installing-eksctl) to install eksctl.

For Mac users

2. (Optional) Log in to your AWS account

For first-time AWS users, register your account at https://aws.amazon.com/.

Once you have created the account, go to AWS Identity and Access Management (IAM) to create a user and create access keys under the Security Credentials tab.

Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY will override the AWS configuration stored in file ~/.aws/credentials

3. (Optional) Create a Kubernetes cluster(EKS) in AWS

The script below will create a 1 node cluster named pinot-quickstart in us-west-2 with a t3.xlarge machine for demo purposes:

For k8s 1.23+, run the following commands to allow the containers to provision their storage:

Use the following command to monitor the cluster status:

Once the cluster is in ACTIVE status, it's ready to be used.

4. Connect to an existing cluster

Run the following command to get the credential for the cluster pinot-quickstart that you just created:

To verify the connection, run the following:

5. Pinot quickstart

Follow this Kubernetes quickstart to deploy your Pinot demo.

6. Delete a Kubernetes Cluster

Running on GCP

This quickstart guide helps you get started running Pinot on Google Cloud Platform (GCP).

In this quickstart guide, you will set up a Kubernetes Cluster on Google Kubernetes Engine (GKE).

1. Tooling Installation

1.1 Install Kubectl

Follow this link (https://kubernetes.io/docs/tasks/tools/install-kubectl) to install kubectl.

For Mac users

Check kubectl version after installation.

Quickstart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12

1.2 Install Helm

Follow this link (https://helm.sh/docs/using_helm/#installing-helm) to install helm.

For Mac users

Check helm version after installation.

This quickstart provides helm support for helm v3.0.0 and v2.12.1. Choose the script based on your helm version.

1.3 Install Google Cloud SDK

To install Google Cloud SDK, see Install the gcloud CLI.

1.3.1 For Mac users

  • Install Google Cloud SDK

Restart your shell

2. (Optional) Initialize Google Cloud Environment

3. (Optional) Create a Kubernetes cluster(GKE) in Google Cloud

This script will create a 3 node cluster named pinot-quickstart in us-west1-b with n1-standard-2 machines for demo purposes.

Modify the parameters in the following example command with your gcloud details:

Use the following command to monitor the cluster status:

Once the cluster is in RUNNING status, it's ready to be used.

4. Connect to an existing cluster

Run the following command to get the credential for the cluster pinot-quickstart that you just created:

To verify the connection, run the following:

5. Pinot quickstart

Follow this Kubernetes quickstart to deploy your Pinot demo.

6. Delete a Kubernetes Cluster

0.2.0

The 0.2.0 release is the first release after the initial one and includes the improvements and bug fixes listed below.

New Features and Bug Fixes

  • Added support for Kafka 2.0

  • Table rebalancer now supports a minimum number of serving replicas during rebalance

  • Added support for UDF in filter predicates and selection

  • Added support to use hex string as the representation of byte array for queries (see PR )

  • Added support for parquet reader (see PR )

  • Introduced interface stability and audience annotations (see PR )

  • Refactor HelixBrokerStarter to separate constructor and start() - backwards incompatible (see PR )

  • Admin tool for listing segments with invalid intervals for offline tables

  • Migrated to log4j2 (see PR )

  • Added simple avro msg decoder

  • Added support for passing headers in Pinot client


  • Support transform functions with AVG aggregation function (see PR )

  • Configurations additions/changes

    • Allow customized metrics prefix (see PR )

    • Controller.enable.batch.message.mode to false by default (see PR )

    • RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (see PR )

    • Config to control kafka fetcher size and increase default (see PR )

    • Added a percent threshold to consider startup of services (see PR )

    • Make SingleConnectionBrokerRequestHandler as default (see PR )

    • Always enable default column feature, remove the configuration (see PR )

    • Remove redundant default broker configurations (see PR )

    • Removed some config keys in server(see PR )

    • Add config to disable HLC realtime segment (see PR )

    • Make RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (see PR )

    • The following config variables are deprecated and will be removed in the next release:

      • pinot.broker.requestHandlerType will be removed, in favor of using the "singleConnection" broker request handler. If you have set this configuration, remove it and use the default type ("singleConnection") for broker request handler.

Work in Progress

  • We are in the process of separating Helix and Pinot controllers, so that administrators can have the option of running independent Helix controllers and Pinot controllers.

  • We are in the process of moving towards supporting SQL query format and results.

  • We are in the process of separating instance and segment assignment using instance pools to optimize the number of Helix state transitions in Pinot clusters with thousands of tables.

Other Notes

  • Task management does not work correctly in this release, due to bugs in Helix. We will upgrade to Helix 0.9.2 (or later) version to get this fixed.

  • You must upgrade to this release before moving on to newer versions of Pinot. The protocol between Pinot broker and Pinot server has been changed, and this release has the code to retain compatibility moving forward. Skipping this release may (depending on your environment) cause query errors if brokers are upgraded while servers are still in the process of being upgraded.

  • As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.

  • Pull Request introduces a backwards incompatible change to Pinot broker. If you use the Java constructor on HelixBrokerStarter class, then you will face a compilation error with this version. You will need to construct the object and call start() method in order to start the broker.

  • Pull Request introduces a backwards incompatible change for log4j configuration. If you used a custom log4j configuration (log4j.xml), you need to write a new log4j2 configuration (log4j2.xml). In addition, you may need to change the arguments on the command line to start Pinot components.

    If you used Pinot-admin command to start Pinot components, you don't need any change. If you used your own commands to start pinot components, you will need to pass the new log4j2 config as a jvm parameter (i.e. substitute -Dlog4j.configuration or -Dlog4j.configurationFile argument with -Dlog4j2.configurationFile=log4j2.xml).

{
  "metricFieldSpecs": [],
  "dimensionFieldSpecs": [
    {
      "dataType": "STRING",
      "name": "ProductId"
    },
    {
      "dataType": "STRING",
      "name": "UserId"
    },
    {
      "dataType": "INT",
      "name": "Score"
    },
    {
      "dataType": "STRING",
      "name": "Summary"
    },
    {
      "dataType": "STRING",
      "name": "Text"
    },
    {
      "dataType": "STRING",
      "name": "combined"
    },
    {
      "dataType": "INT",
      "name": "n_tokens"
    },
    {
      "dataType": "FLOAT",
      "name": "embedding",
      "singleValueField": false
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:TIMESTAMP",
      "granularity": "1:SECONDS"
    }
  ],
  "schemaName": "fineFoodReviews"
}
{
  ...
  "fieldConfigList": [
    {
      "encodingType": "RAW",
      "indexType": "VECTOR",
      "name": "embedding",
      "properties": {
        "vectorIndexType": "HNSW",
        "vectorDimension": 1536,
        "vectorDistanceFunction": "COSINE",
        "version": 1
      }
    }
  ]
}
SELECT ProductId, 
       UserId, 
       l2_distance(embedding, ARRAY[-0.0013143676, -0.011042999, ...]) AS l2_dist, 
       n_tokens, 
       combined
FROM fineFoodReviews
WHERE VECTOR_SIMILARITY(embedding, ARRAY[-0.0013143676, -0.011042999, ...], 5)  
ORDER BY l2_dist ASC 
LIMIT 10;

Dimension

Dimension columns are typically used in slice and dice operations for answering business queries. Some operations for which dimension columns are used:

  • GROUP BY - group by one or more dimension columns along with aggregations on one or more metric columns

  • Filter clauses such as WHERE

Metric

These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns.

Some operation for which metric columns are used:

  • Aggregation - SUM, MIN, MAX, COUNT, AVG etc

  • Filter clause such as WHERE

DateTime

This column represents time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config. The primary time column is used by Pinot to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH .

Common operations that can be done on time column:

  • GROUP BY

  • Filter clauses such as WHERE

flights-schema.json
{
  "schemaName": "flights",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    {
      "name": "flightNumber",
      "dataType": "LONG",
      "notNull": true
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "notNull": true,
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "notNull": true,
      "format": "EPOCH|HOURS",
      "granularity": "1:HOURS"
    },
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}
bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec

OR

bin/pinot-admin.sh AddTable -schemaFile flights-schema.json -tableFile flights-table.json -exec
curl -F [email protected]  localhost:9000/schemas
SELECT COUNT(*) 
FROM baseballStats 
WHERE playerID = 12345

OR

SELECT COUNT(*) 
FROM baseballStats 
WHERE playerID IN(12345, 45668, 56789)


Configured in tableConfig fieldConfigList
{
  "tableName": "somePinotTable",
  "fieldConfigList": [
    {
      "name": "playerID",
      "encodingType": "RAW",
      "indexes": {
        "bloom": {}
      }
    },
    ...
  ],
  ...
}
Configured in tableConfig fieldConfigList
{
  "tableName": "somePinotTable",
  "fieldConfigList": [
    {
      "name": "playerID",
      "encodingType": "RAW",
      "indexes": {
        "bloom": {
          "fpp": 0.01,
          "maxSizeInBytes": 1000000,
          "loadOnHeap": true
        }
      }
    },
    ...
  ],
  ...
}
Part of a tableConfig
{
  "tableName": "somePinotTable",
  "tableIndexConfig": {
    "bloomFilterColumns": [
      "playerID",
      ...
    ],
    ...
  },
  ...
}
Part of a tableConfig
{
  "tableIndexConfig": {
    "bloomFilterConfigs": {
      "playerID": {
        "fpp": 0.01,
        "maxSizeInBytes": 1000000,
        "loadOnHeap": true
      },
      ...
    },
    ...
  },
  ...
}
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
curl "https://d1vvhvl2y92vvt.cloudfront.net/awscli-exe-macos.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
aws configure
EKS_CLUSTER_NAME=pinot-quickstart
eksctl create cluster \
--name ${EKS_CLUSTER_NAME} \
--version 1.16 \
--region us-west-2 \
--nodegroup-name standard-workers \
--node-type t3.xlarge \
--nodes 1 \
--nodes-min 1 \
--nodes-max 1
eksctl utils associate-iam-oidc-provider --region=us-east-2 --cluster=pinot-quickstart --approve

eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster pinot-quickstart \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve \
  --role-only \
  --role-name AmazonEKS_EBS_CSI_DriverRole

eksctl create addon --name aws-ebs-csi-driver --cluster pinot-quickstart --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force
EKS_CLUSTER_NAME=pinot-quickstart
aws eks describe-cluster --name ${EKS_CLUSTER_NAME} --region us-west-2
EKS_CLUSTER_NAME=pinot-quickstart
aws eks update-kubeconfig --name ${EKS_CLUSTER_NAME}
kubectl get nodes
EKS_CLUSTER_NAME=pinot-quickstart
aws eks delete-cluster --name ${EKS_CLUSTER_NAME}
brew install kubernetes-cli
kubectl version
brew install kubernetes-helm
helm version
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
GCLOUD_PROJECT=[your gcloud project name]
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
GCLOUD_MACHINE_TYPE=n1-standard-2
GCLOUD_NUM_NODES=3
gcloud container clusters create ${GCLOUD_CLUSTER} \
  --num-nodes=${GCLOUD_NUM_NODES} \
  --machine-type=${GCLOUD_MACHINE_TYPE} \
  --zone=${GCLOUD_ZONE} \
  --project=${GCLOUD_PROJECT}
gcloud compute instances list
GCLOUD_PROJECT=[your gcloud project name]
GCLOUD_ZONE=us-west1-b
GCLOUD_CLUSTER=pinot-quickstart
gcloud container clusters get-credentials ${GCLOUD_CLUSTER} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT}
kubectl get nodes
GCLOUD_ZONE=us-west1-b
gcloud container clusters delete pinot-quickstart --zone=${GCLOUD_ZONE}
#4041
#3852
#4063
#4100
#4139
#4557
#4392
#3928
#3946
#3869
#4011
#4048
#4074
#4106
#4222
#4235
#3946
#4100
#4139
select playerName, max(hits) 
from baseballStats 
group by playerName 
order by max(hits) desc
select sum(hits), sum(homeRuns), sum(numberOfGames) 
from baseballStats 
where yearID > 2010
select * 
from baseballStats 
order by league

Pinot storage model

Apache Pinot™ uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system, including:

  • Tables to store data

  • Segments to partition data

  • Tenants to isolate data

  • Clusters to manage data

Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. To achieve this, all data needs to be distributed across multiple nodes. Pinot achieves this by breaking data into smaller chunks known as segments (similar to shards/partitions in HA relational databases). Segments can also be seen as time-based partitions.

Table

Similar to traditional databases, Pinot has the concept of a table—a logical abstraction to refer to a collection of related data. As is the case with relational database management systems (RDBMS), a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a schema, which defines the columns in a table as well as their data types.

As opposed to RDBMS schemas, multiple tables can be created in Pinot (real-time or batch) that inherit a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, and replication.

Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table with rows and columns. Each column has a name and a data type, and together these column definitions are known as the table's schema.

Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.

Pinot table types include:

  • real-time: Ingests data from a streaming source like Apache Kafka®

  • offline: Loads data from a batch source

  • hybrid: Loads data from both a batch source and a streaming source

Segment

Pinot tables are stored in one or more independent shards called segments. A small table may be contained in a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ingestion). Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.

Tenant

To support multi-tenancy, Pinot has first class support for tenants. A table is associated with a tenant. This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation between tenants provides different namespaces for applications and teams to prevent sharing tables or schemas. Development teams building applications do not have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.

Every table is associated with a tenant, or a logical namespace that restricts where the cluster processes queries on the table. A Pinot tenant takes the form of a text tag in the logical tenant namespace. Physical cluster hardware resources (i.e., brokers and servers) are also associated with a tenant tag in the common tenant namespace. Tables of a particular tenant tag will only be scheduled for storage and query processing on hardware resources that belong to the same tenant tag. This lets Pinot cluster operators assign specified workloads to certain hardware resources, preventing data from separate workloads from being stored or processed on the same physical hardware.

By default, all tables, brokers, and servers belong to a tenant called DefaultTenant, but you can configure multiple tenants in a Pinot cluster.

Cluster

A Pinot cluster is a collection of the software processes and hardware resources required to ingest, store, and process data. For detail about Pinot cluster components, see Physical architecture.

Physical architecture

A Pinot cluster consists of the following processes, which are typically deployed on separate hardware resources in production. In development, they can fit comfortably into Docker containers on a typical laptop.

  • Controller: Maintains cluster metadata and manages cluster resources.

  • Zookeeper: Manages the Pinot cluster on behalf of the controller. Provides fault-tolerant, persistent storage of metadata, including table configurations, schemas, segment metadata, and cluster state.

  • Broker: Accepts queries from client processes and forwards them to servers for processing.

  • Server: Provides storage for segment files and compute for query processing.

  • (Optional) Minion: Computes background tasks other than query processing, minimizing impact on query latency. Optimizes segments, and builds additional indexes to ensure performance (even if data is deleted).

The simplest possible Pinot cluster consists of four components: a server, a broker, a controller, and a Zookeeper node. In production environments, these components typically run on separate server instances, and scale out as needed for data volume, load, availability, and latency. Pinot clusters in production range from fewer than ten total instances to more than 1,000.

Pinot uses Apache Zookeeper as a distributed metadata store and Apache Helix for cluster management.

Helix is a cluster management solution created by the authors of Pinot. Helix maintains a persistent, fault-tolerant map of the intended state of the Pinot cluster. It constantly monitors the cluster to ensure that the right hardware resources are allocated to implement the present configuration. When the configuration changes, Helix schedules or decommissions hardware resources to reflect the new configuration. When elements of the cluster change state catastrophically, Helix schedules hardware resources to keep the actual cluster consistent with the ideal represented in the metadata. From a physical perspective, Helix takes the form of a controller process plus agents running on servers and brokers.

Controller

A controller is the core orchestrator that drives consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and have visibility of the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent that is a participant responsible for observing and driving state changes that are subscribed to by other components.

The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time tables and offline tables). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

The controller exposes a REST API endpoint for cluster-wide administrative operations as well as a web-based query console to execute interactive SQL queries and perform simple administrative tasks.

Server

Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can either be a real-time server or an offline server.

Real-time and offline servers have very different resource usage requirements. Real-time servers continually consume new messages from external systems (such as Kafka topics), which are ingested and allocated to segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.

Broker

Pinot brokers take query requests from client processes, scatter them to applicable servers, gather the results, and return them to the client. The controller shares cluster metadata with the brokers that allows the brokers to create a plan for executing the query involving a minimal subset of servers with the source data and, when required, other servers to shuffle and consolidate results.

A production Pinot cluster contains many brokers. In general, the more brokers, the more concurrent queries a cluster can process, and the lower latency it can deliver on queries.

Pinot minion

Pinot minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation). As Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a solution for this purpose that complies with GDPR while optimizing Pinot segments and building additional indexes that guarantee performance even when data is deleted. One can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (Minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.

A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.

Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.


Create and update a table configuration

Create and edit a table configuration in the Pinot UI or with the API.

In Apache Pinot, create a table by creating a JSON file, generally referred to as your table config. Update, add, or delete parameters as needed, and then reload the file.

Create a Pinot table configuration

Before you create a Pinot table configuration, you must first have a running Pinot cluster with broker and server tenants.

  1. Create a plaintext table configuration file locally using the available properties for your use case. You may find it useful to download an example from the Pinot GitHub and then modify it. One such example is included at the end of this page in Example Pinot table config file.

  2. Use the Pinot API to upload your table config file: POST @fileName.json URL:9000/tables
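For example, with curl (a minimal sketch, assuming a controller reachable at localhost:9000 and a table config saved locally as fileName.json):

curl -X POST -H "Content-Type: application/json" \
  -d @fileName.json \
  http://localhost:9000/tables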

Update a Pinot table configuration

To modify your Pinot table configuration, use the Pinot UI or the API.

Any time you make a change to your table config, you may need to do one or more of the following, depending on the change.

Simple changes only require updating and saving your modified table configuration file. These include:

  • Changing the data or segment retention time

  • Changing the realtime consumption rate limiter settings

To update existing data and segments, after you update and save the changes to the table config file, do the following as applicable:

When you add or modify indexes or the table schema, perform a segment reload. To reload all segments:

  • In the Pinot UI, from the table page, click Reload All Segments.

  • Using the Pinot API, send POST /segments/{tableName}/reload.

When you re-partition data, perform a segment refresh. To refresh, replace an existing segment with a new one by uploading a segment that reuses the existing filename. Using the Pinot API, send POST /segments?tableName={yourTableName}.

When you change the transform function used to populate a derived field or increase the number of partitions in an upsert-enabled table, perform a table re-bootstrap. One way to do this is to delete and recreate the table:

  • Using the Pinot API, first send DELETE /tables/{tableName} followed by POST /tables with the new table configuration.

When you change the stream topic or change the Kafka cluster containing the Kafka topic you want to consume from, perform a real-time ingestion pause and resume. To pause and resume real-time ingestion:

  • Using the Pinot API, first send POST /tables/{tableName}/pauseConsumption followed by POST /tables/{tableName}/resumeConsumption.

Update a Pinot table in the UI

To update a table configuration in the Pinot UI, do the following:

  1. In the Cluster Manager, click the Tenant Name of the tenant that hosts the table you want to modify.

  2. Click the Table Name in the list of tables in the tenant.

  3. Click the Edit Table button. This creates a pop-up window containing the table configuration. Edit the contents in this window. Click Save when you are done.

Update a Pinot table using the API

To update a table configuration using the Pinot API, do the following:

  1. Get the current table configuration with GET /tables/{tableName}.

  2. Modify the file locally.

  3. Upload the edited file with PUT /tables/{tableName}, passing the modified configuration (fileName.json) as the request body, as shown in the example below.
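For example, with curl (a minimal sketch, assuming a controller at localhost:9000 and a hypothetical table named myTable; note that the GET response may wrap the config in an OFFLINE or REALTIME key, which you should unwrap before uploading):

# Fetch the current table config and save it locally
curl -X GET http://localhost:9000/tables/myTable > myTable.json
# Edit myTable.json as needed, then upload the modified config
curl -X PUT -H "Content-Type: application/json" \
  -d @myTable.json \
  http://localhost:9000/tables/myTable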

Example Pinot table configuration file

This example comes from the Apache Pinot Quickstart Examples. This table configuration defines a table called airlineStats_OFFLINE, which you can interact with by running the example.

Stream ingestion example

The Docker instructions on this page are still WIP

This example assumes you have set up your cluster using Pinot in Docker.

Data Stream

First, we need to set up a stream. Pinot has out-of-the-box real-time ingestion support for Kafka. Other streams can be plugged in as well; see Pluggable Streams.

Let's set up a demo Kafka cluster locally, and create a sample topic transcript-topic.

Start Kafka

Create a Kafka Topic

Start Kafka

Start Kafka cluster on port 9876 using the same Zookeeper from the quick-start examples.

Create a Kafka topic

Download the latest Kafka release. Create a topic.

Creating a schema

If you followed Batch upload sample data, you have already pushed a schema for your sample table. If not, see Creating a schema to learn how to create a schema for your sample data.

Creating a table configuration

If you followed Batch upload sample data, you pushed an offline table and schema. To create a real-time table configuration for the sample, use the table configuration below for the transcript table. For a more detailed overview of tables, see Table.

Uploading your schema and table configuration

Next, upload the table and schema to the cluster. As soon as the real-time table is created, it will begin ingesting from the Kafka topic.

Loading sample data into stream

Use the following sample JSON file for transcript table data in the following step.

Push the sample JSON file into the Kafka topic, using the Kafka script from the Kafka download.

Ingesting streaming data

As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Browse to the Query Console running in your Pinot instance (we use localhost in this link as an example) to examine the real-time data.

Physical Optimizer

Describes the new Multistage Engine Physical Query Optimizer

MSE Physical Optimizer is included in Pinot 1.4 and is currently in Beta.

We have added a new query optimizer in the Multistage Engine that computes and tracks precise Data Distribution across the entire plan before running some critical optimizations like Sort Pushdown, Aggregate Split/Pushdown, etc.

One of the biggest features of this Optimizer is that it can eliminate Shuffles or simplify Exchanges, when applicable, for arbitrarily complex queries, without requiring any Query Hints.

To enable this Optimizer for your MSE query, you can use the following Query Options:
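These are the same options used in the example queries later on this page:

SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;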

Key Features

The examples below are based on the COLOCATED_JOIN Quickstart.

Automatic Colocated Joins and Shuffle Simplification

Consider the query below which consists of 3 Joins. With the new query optimizer, the entire query can run without any cross-server data exchange, since the data is partitioned by userUUID into a compatible number of partitions (see the "Setting Up Table Data Distribution" section below).

The query plan for this query is shown below. You can see that the entire query leverages IDENTITY_EXCHANGE, which is a 1:1 Exchange as defined in Exchange Types below.

Shuffle Simplification with Different Servers / Partition Count

The new optimizer can simplify shuffles even if:

  • The Servers used by either side of a Join are different

  • The Partition Count for the join inputs is different

In the example below, we have a Join performed across two tables: orange (left) and green (right).

The orange table has 4 partitions and the green table has 2 partitions. The servers selected for the Orange and Green tables are [S0, S1] and [S0, S2] respectively. The Join is performed in the servers [S0, S1], because Physical Optimizer by default uses the same Workers as the leftmost input operator.

If the hash-function used for partitioning the two tables is the same, we can leverage an Identity Exchange and skip re-partitioning the data on either side of the join. This is because S0 will consist of records from partitions P0 and P2 of the Orange table, which together contain all records that would make up partition P0 modulo 2, i.e. (P0 ∪ P2) mod 4 = (P0) mod 2.

Note that Identity Exchange does not imply that the servers in the sender and receiver will be the same. It only implies that there will be a 1:1 mapping from senders to receivers. In the example below, the data transfer from S2 to S1 will be over the network.

Automatically Skip Aggregate Exchange

To evaluate something like GROUP BY userUUID accurately you would need to distribute records based on the userUUID column. The old query optimizer would add a Partitioning Exchange under each Aggregate, unless one used the query hint is_partitioned_by_group_by_keys.

The Physical Optimizer can detect when data is already partitioned by the required column, and will automatically skip adding an Exchange. This has two advantages:

  • We avoid unnecessary Data Exchanges

  • We avoid splitting the Aggregate, since by default when an Aggregate exists on top of an Exchange, a copy of the Aggregate is added under the Exchange (unless is_skip_leaf_stage_group_by query hint is set)

This optimization can be seen in action in the query example shared above. Since data is already partitioned by userUUID, all aggregations are run in DIRECT mode, i.e. without splitting the aggregate into multiple aggregates.

Segment / Server Pruning

Similar to the Single Stage Engine, if you have enabled segmentPrunerTypes in your table's Routing config, the Physical Optimizer will prune segments and servers using time, partition or other pruner types for the Leaf Stage. e.g. the following query will only select segments which satisfy the following constraint:

If partitioning is done in a way that segments corresponding to a given partition are present on only 1 server, then the entire query above will run within a single server, simulating shard-local execution from other systems.

Solve Constant Queries in Pinot Broker

Apache Calcite is capable of detecting Filter Expressions that will always evaluate to False. In such cases, the query plan may not have any Table Scans at all. Physical Optimizer solves such queries within the Broker itself, without involving any servers.

Worker Assignment

At present, Worker Assignment follows these simple rules:

  • Leaf Stage will have workers assigned based on Table Scan and Filters, using the Routing configs set in the Table Config.

  • Other Stages will use the same workers as the left-most input stage.

  • Some Plan Nodes, such as Sort(fetch=..), may require data to be collected in a single Worker. In such a case, that stage will be run on a single Worker, which will be randomly selected from one of the input workers.

Limitations

Some of the features of the existing MSE query optimizer are not yet available in the Physical Optimizer. We aim to add support for most of these in Pinot 1.5:

  • Spools.

  • Dynamic filters for semi-join

JOINs

Pinot supports JOINs, including left, right, full, semi, anti, lateral, and equi JOINs. Use JOINs to connect two tables to generate a unified view, based on a related column between the tables.

This page explains the syntax used to write joins. For a more in-depth understanding of how joins work, it is recommended to read Optimizing joins and also this blog from StarTree.

Important: To query using JOINs, you must use Pinot's multi-stage query engine (v2).

INNER JOIN

The inner join selects rows that have matching values in both tables.

Syntax

Example of inner join

Joins a table containing user transactions with a table containing promotions shown to the users, to show the spending for every userID.

LEFT JOIN

A left join returns all values from the left relation and the matched values from the right table, or appends NULL if there is no match. Also referred to as a left outer join.

Syntax:

RIGHT JOIN

A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.

Syntax:

FULL JOIN

A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.

Syntax:

CROSS JOIN

A cross join returns the Cartesian product of two relations. If no WHERE clause is used along with CROSS JOIN, this produces a result set that is the number of rows in the first table multiplied by the number of rows in the second table. If a WHERE clause is included with CROSS JOIN, it functions like an INNER JOIN.

Syntax:

SEMI JOIN

Semi-join returns rows from the first table where matches are found in the second table. Returns one copy of each row in the first table for which a match is found.

Syntax:

Some subqueries, like the following, are also implemented as a semi-join under the hood:

ANTI JOIN

Anti-join returns rows from the first table where no matches are found in the second table. Returns one copy of each row in the first table for which no match is found.

Syntax:

Some subqueries, like the following, are also implemented as an anti-join under the hood:

Equi join

An equi join uses an equality operator to match single or multiple column values between the respective tables.

Syntax:

ASOF JOIN

An ASOF JOIN selects rows from two tables based on a "closest match" algorithm.

Syntax:

The comparison operator in the MATCH_CONDITION can be one of <, >, <=, >=. Similar to an inner join, an ASOF join first calculates the set of matching rows in the right table for each row in the left table based on the ON condition. But instead of returning all of these rows, only the closest match (if one exists) based on the match condition is returned. Note that the two columns in the MATCH_CONDITION should be of the same type.

The join condition in ON is mandatory and has to be a conjunction of equality comparisons (i.e., non-equi join conditions and clauses joined with OR aren't allowed). ON true can be used in case the join should only be performed using the MATCH_CONDITION.

LEFT ASOF JOIN

A LEFT ASOF JOIN is similar to the ASOF JOIN, with the only difference being that all rows from the left table are returned, even those without a match in the right table, with the unmatched rows padded with NULL values (similar to the difference between an INNER JOIN and a LEFT JOIN).

Syntax:

{
  "OFFLINE": {
    "tableName": "airlineStats_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeType": "DAYS",
      "replication": "1",
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
      "timeColumnName": "DaysSinceEpoch",
      "segmentPushType": "APPEND",
      "minimizeDataMovement": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "loadMode": "MMAP",
      "enableDefaultStarTree": false,
      "starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "AirlineID",
            "Origin",
            "Dest"
          ],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": [
            "COUNT__*",
            "MAX__ArrDelay"
          ],
          "maxLeafRecords": 10
        },
        {
          "dimensionsSplitOrder": [
            "Carrier",
            "CancellationCode",
            "Origin",
            "Dest"
          ],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": [
            "MAX__CarrierDelay",
            "AVG__CarrierDelay"
          ],
          "maxLeafRecords": 10
        }
      ],
      "enableDynamicStarTreeCreation": true,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "optimizeDictionary": false,
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0
    },
    "metadata": {
      "customConfigs": {}
    },
    "fieldConfigList": [
      {
        "name": "ts",
        "encodingType": "DICTIONARY",
        "indexType": "TIMESTAMP",
        "indexTypes": [
          "TIMESTAMP"
        ],
        "timestampConfig": {
          "granularities": [
            "DAY",
            "WEEK",
            "MONTH"
          ]
        }
      }
    ],
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "ts",
          "transformFunction": "fromEpochDays(DaysSinceEpoch)"
        },
        {
          "columnName": "tsRaw",
          "transformFunction": "fromEpochDays(DaysSinceEpoch)"
        }
      ],
      "continueOnError": false,
      "rowTimeValueCheck": false,
      "segmentTimeValueCheck": true
    },
    "tierConfigs": [
      {
        "name": "hotTier",
        "segmentSelectorType": "time",
        "segmentAge": "3130d",
        "storageType": "pinot_server",
        "serverTag": "DefaultTenant_OFFLINE"
      },
      {
        "name": "coldTier",
        "segmentSelectorType": "time",
        "segmentAge": "3140d",
        "storageType": "pinot_server",
        "serverTag": "DefaultTenant_OFFLINE"
      }
    ],
    "isDimTable": false
  }
}
the available properties
an example from the Pinot GitHub
Example Pinot table config file
consumption rate limiter
segment reload
reload
refresh
Apache Pinot Quickstart Examples
SELECT myTable.column1, myTable.column2, myOtherTable.column1, ...
FROM myTable INNER JOIN myOtherTable
ON myTable.matching_column = myOtherTable.matching_column;
SELECT 
  p.userID, t.spending_val

FROM promotion AS p JOIN transaction AS t 
  ON p.userID = t.userID

WHERE
  p.promotion_val > 10
  AND t.transaction_type IN ('CASH', 'CREDIT')  
  AND t.transaction_epoch >= p.promotion_start_epoch
  AND t.transaction_epoch < p.promotion_end_epoch  
SELECT myTable.column1, myTable.column2, myOtherTable.column1, ...
FROM myTable LEFT JOIN myOtherTable
ON myTable.matching_column = myOtherTable.matching_column;
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1 
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1 
FULL JOIN table2
ON table1.matching_column = table2.matching_column;
SELECT * 
FROM table1 
CROSS JOIN table2;
SELECT myTable.column1
 FROM myTable
 WHERE EXISTS [ join_criteria ]
SELECT table1.strCol
 FROM  table1
 WHERE table1.intCol IN (select table2.anotherIntCol from table2 where ...)
SELECT myTable.column1
 FROM myTable
 WHERE NOT EXISTS [ join_criteria ]
SELECT table1.strCol
 FROM  table1
 WHERE table1.intCol NOT IN (select table2.anotherIntCol from table2 where ...)
SELECT *
FROM table1 
JOIN table2
[ON (join_condition)]

OR

SELECT column_list 
FROM table1, table2....
WHERE table1.column_name =
table2.column_name; 
SELECT * FROM table1 ASOF JOIN table2
MATCH_CONDITION(table1.col1 <comparison_operator> table2.col1)
ON table1.col2 = table2.col2;
SELECT * FROM table1 LEFT ASOF JOIN table2
MATCH_CONDITION(table1.col1 <comparison_operator> table2.col1)
ON table1.col2 = table2.col2;
Optimizing joins
this blog
use Pinot's multi-stage query engine (v2).
INNER JOIN
SET useMultistageEngine=true;
SET usePhysicalOptimizer=true;
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;

WITH filtered_users AS (
  SELECT 
    userUUID
  FROM userAttributes
  WHERE userUUID NOT IN (
    SELECT 
      userUUID
    FROM userGroups
      WHERE groupUUID = 'group-1'
  )
  AND userUUID IN (
    SELECT
      userUUID
    FROM userGroups
      WHERE groupUUID = 'group-2'
  )
)
SELECT 
  userUUID,
  SUM(tripAmount)
FROM userFactEvents
WHERE
  userUUID IN (
    SELECT userUUID FROM filtered_users
  )
GROUP BY userUUID
PhysicalExchange(exchangeStrategy=[SINGLETON_EXCHANGE])
  PhysicalAggregate(group=[{1}], agg#0=[$SUM0($0)], aggType=[DIRECT])
    PhysicalJoin(condition=[=($1, $2)], joinType=[semi])
      PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
        PhysicalProject(tripAmount=[$7], userUUID=[$10])
          PhysicalTableScan(table=[[default, userFactEvents]])
      PhysicalJoin(condition=[=($0, $1)], joinType=[semi])
        PhysicalProject(userUUID=[$0])
          PhysicalFilter(condition=[IS NOT TRUE($3)])
            PhysicalJoin(condition=[=($1, $2)], joinType=[left])
              PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
                PhysicalProject(userUUID=[$6], userUUID0=[$6])
                  PhysicalTableScan(table=[[default, userAttributes]])
              PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
                PhysicalAggregate(group=[{0}], agg#0=[MIN($1)], aggType=[DIRECT])
                  PhysicalProject(userUUID=[$4], $f1=[true])
                    PhysicalFilter(condition=[=($3, _UTF-8'group-1')])
                      PhysicalTableScan(table=[[default, userGroups]])
        PhysicalExchange(exchangeStrategy=[IDENTITY_EXCHANGE])
          PhysicalProject(userUUID=[$4])
            PhysicalFilter(condition=[=($3, _UTF-8'group-2')])
              PhysicalTableScan(table=[[default, userGroups]])
(P0 ∪ P2) mod 4 = (P0) mod 2
segmentPartition = Murmur("user-1") % numPartitions
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;

WITH user_events AS (
  SELECT
    productCode, tripAmount
  FROM
    userFactEvents
  WHERE
    userUUID = 'user-1'
  ORDER BY
    ts
  DESC
  LIMIT 100
)
SELECT
  productCode,
  SUM(tripAmount)
FROM
  user_events
GROUP BY productCode
    
SET useMultistageEngine = true;
SET usePhysicalOptimizer = true;

SELECT
  COUNT(*)
FROM
  userFactEvents
WHERE
  userUUID = 'user-1' AND userUUID = 'user-2'

Filtering with IdSet

Learn how to write fast queries for looking up IDs in a list of values.

Filtering with IdSet is only supported with the single-stage query engine (v1).

A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but using IN doesn't perform well with large lists of IDs. For large lists of IDs, we recommend using an IdSet.

Functions

ID_SET

ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03')

This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:

  • INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.

  • LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.

  • Other types - Bloom Filter

The following parameters are used to configure the Bloom Filter:

  • expectedInsertions - Number of expected insertions for the BloomFilter, must be positive

  • fpp - False positive probability to use for the BloomFilter. Must be positive and less than 1.0.

Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.

IN_ID_SET

IN_ID_SET(columnName, base64EncodedIdSet)

This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.

IN_SUBQUERY

IN_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.

IN_PARTITIONED_SUBQUERY

IN_PARTITIONED_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.

This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.

The query passed to IN_SUBQUERY can be run on any table - it isn't restricted to the table used in the parent query.

The query passed to IN_PARTITIONED_SUBQUERY must be run on the same table as the parent query.

Examples

Create IdSet

You can create an IdSet of the values in the yearID column by running the following:

SELECT ID_SET(yearID)
FROM baseballStats
WHERE teamID = 'WS1'
idset(yearID)

ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc=

When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:

SELECT ID_SET(playerName, 'expectedInsertions=10')
FROM baseballStats
WHERE teamID = 'WS1'
idset(playerName)

AwIBBQAAAAL/////////////////////

SELECT ID_SET(playerName, 'expectedInsertions=100')
FROM baseballStats
WHERE teamID = 'WS1'
idset(playerName)

AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8=

We can also configure the fpp parameter:

SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
FROM baseballStats
WHERE teamID = 'WS1'
idset(playerName)

AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8=

Filter by values in IdSet

We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_ID_SET(
 yearID,   
 'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
  ) = 1 
GROUP BY yearID

Filter by values not in IdSet

To return rows for yearIDs not in the IdSet, run the following:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_ID_SET(
  yearID,   
  'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
  ) = 0 
GROUP BY yearID

Filter on broker

To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 1
GROUP BY yearID  

To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 0
GROUP BY yearID  

Filter on server

To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_PARTITIONED_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 1
GROUP BY yearID  

To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_PARTITIONED_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 0
GROUP BY yearID  

/tmp/pinot-quick-start/transcript-table-realtime.json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "kafka:9092",
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.threshold.segment.size": "50M",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}
docker run \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-streaming-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
    -controllerHost manual-pinot-controller \
    -controllerPort 9000 \
    -exec
bin/pinot-admin.sh AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
    -exec
/tmp/pinot-quick-start/rawData/transcript.json
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestampInEpoch":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestampInEpoch":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestampInEpoch":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestampInEpoch":1572418800000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestampInEpoch":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestampInEpoch":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestampInEpoch":1572678000000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestampInEpoch":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestampInEpoch":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestampInEpoch":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestampInEpoch":1572854400000}
{"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestampInEpoch":1572854400000}
bin/kafka-console-producer.sh \
    --broker-list localhost:9876 \
    --topic transcript-topic < /tmp/pinot-quick-start/rawData/transcript.json
Pinot in Docker
Pluggable Streams
docker run \
    --network pinot-demo --name=kafka \
    -e KAFKA_ZOOKEEPER_CONNECT=manual-zookeeper:2181/kafka \
    -e KAFKA_BROKER_ID=0 \
    -e KAFKA_ADVERTISED_HOST_NAME=kafka \
    -d bitnami/kafka:latest
docker exec \
  -t kafka \
  /opt/kafka/bin/kafka-topics.sh \
  --zookeeper manual-zookeeper:2181/kafka \
  --partitions=1 --replication-factor=1 \
  --create --topic transcript-topic
bin/pinot-admin.sh  StartKafka -zkAddress=localhost:2123/kafka -port 9876
Kafka
Batch upload sample data
Creating a schema
Batch upload sample data
Table
Query Console

Running Pinot locally

This quick start guide will help you bootstrap a Pinot standalone instance on your local machine.

In this guide, you'll learn how to download and install Apache Pinot as a standalone instance.

  • Download Apache Pinot

  • Set up a cluster

  • Start a Pinot component in debug mode with IntelliJ

Download Apache Pinot

First, download the Pinot distribution for this tutorial. You can either download a packaged release or build a distribution from the source code.

Prerequisites

  • Install with JDK 11 or 21. JDK 17 should work, but it is not officially supported.

  • For JDK 8 support, Pinot 0.12.1 is the last version compilable from the source code.

  • Pinot 1.0+ doesn't support JDK 8 anymore, build with JDK 11+

Note that some installations of the JDK do not contain the JNI bindings necessary to run all tests. If you see an error like java.lang.UnsatisfiedLinkError while running tests, you might need to change your JDK.

Download the distribution or build from source by selecting one of the following tabs:

Download the latest binary release from Apache Pinot, or use this command:
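For example (a sketch; set PINOT_VERSION to the release you want, and note that older releases move to the archive linked below):

PINOT_VERSION=1.2.0  # the Pinot release to download
wget https://downloads.apache.org/pinot/apache-pinot-$PINOT_VERSION/apache-pinot-$PINOT_VERSION-bin.tar.gz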

Extract the TAR file:

tar -zxvf apache-pinot-$PINOT_VERSION-bin.tar.gz

Navigate to the directory containing the launcher scripts:

cd apache-pinot-$PINOT_VERSION-bin

You can also find older versions of Apache Pinot at https://archive.apache.org/dist/pinot/. For example, to download Pinot 0.10.0, run the following command:

OLDER_VERSION="0.10.0"
wget https://archive.apache.org/dist/pinot/apache-pinot-$OLDER_VERSION/apache-pinot-$OLDER_VERSION-bin.tar.gz

Follow these steps to check out the code from GitHub and build Pinot locally:

Prerequisites

Install Apache Maven 3.6 or higher

Check out Pinot:
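For example (a sketch using the official apache/pinot GitHub repository):

git clone https://github.com/apache/pinot.git
cd pinot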

Build Pinot:
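A typical Maven build looks like the following (a sketch; -DskipTests skips the test suite and the bin-dist profile builds the binary distribution):

mvn install package -DskipTests -Pbin-dist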

If you're building with JDK 8, add Maven option -Djdk.version=8.

Navigate to the directory containing the setup scripts. Note that Pinot scripts are located under pinot-distribution/target, not the target directory under root.

Pinot can also be installed on Mac OS using the Brew package manager. For instructions on installing Brew, see the Brew documentation.

brew install pinot

Set up a cluster

Now that we've downloaded Pinot, it's time to set up a cluster. There are two ways to do this: through quick start or through setting up a cluster manually.

Quick start

Pinot comes with quick start commands that launch instances of Pinot components in the same process and import pre-built datasets.

For example, the following quick start command launches Pinot with a baseball dataset pre-loaded:

./bin/pinot-admin.sh QuickStart -type batch

For a list of all the available quick start commands, see the Quick Start Examples.

Manual cluster

If you want to play with bigger datasets (more than a few megabytes), you can launch each component individually.

The video below is a step-by-step walk through for launching the individual components of Pinot and scaling them to multiple instances.

You can find the commands that are shown in this video in this GitHub repository.

The examples below assume that you are using Java 11+.

If you are using Java 8, add the following settings inside JAVA_OPTS. So, for example, instead of this:

export JAVA_OPTS="-Xms4G -Xmx8G"

Use the following:

export JAVA_OPTS="-Xms4G -Xmx8G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log"

Start Zookeeper

./bin/pinot-admin.sh StartZookeeper \
  -zkPort 2191

You can use Zooinspector to browse the Zookeeper instance.

Start Pinot Controller

export JAVA_OPTS="-Xms4G -Xmx8G"
./bin/pinot-admin.sh StartController \
    -zkAddress localhost:2191 \
    -controllerPort 9000

Start Pinot Broker

export JAVA_OPTS="-Xms4G -Xmx4G"
./bin/pinot-admin.sh StartBroker \
    -zkAddress localhost:2191

Start Pinot Server

export JAVA_OPTS="-Xms4G -Xmx16G"
./bin/pinot-admin.sh StartServer \
    -zkAddress localhost:2191

Start Pinot Minion

export JAVA_OPTS="-Xms4G -Xmx4G"
./bin/pinot-admin.sh StartMinion \
    -zkAddress localhost:2191

Start Kafka

./bin/pinot-admin.sh StartKafka \
  -zkAddress=localhost:2191/kafka \
  -port 19092

Once your cluster is up and running, you can head over to Exploring Pinot to learn how to run queries against the data.

Setup cluster with config files

You can customize the cluster by modifying the config files and starting each component with its config file:

./bin/pinot-admin.sh StartController -config conf/pinot-controller.conf
./bin/pinot-admin.sh StartBroker -config conf/pinot-broker.conf
./bin/pinot-admin.sh StartServer -config conf/pinot-server.conf
./bin/pinot-admin.sh StartMinion -config conf/pinot-minion.conf

Start a Pinot component in debug mode with IntelliJ

Set break points and inspect variables by starting a Pinot component with debug mode in IntelliJ.

The following example demonstrates server debugging:

  1. First, start Zookeeper, controller, and broker using the steps described above.

  2. Then, use the following configuration (under $PROJECT_DIR$/.run) to start the server, replacing the metrics-core version and cluster name as needed. This commit is an example of how to use it.

<component name="ProjectRunConfigurationManager">
  <configuration default="false" name="HelixServerStarter" type="Application" factoryName="Application" nameIsGenerated="true">
    <classpathModifications>
      <entry path="$PROJECT_DIR$/pinot-plugins/pinot-metrics/pinot-yammer/target/classes" />
      <entry path="$MAVEN_REPOSITORY$/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar" />
    </classpathModifications>
    <option name="MAIN_CLASS_NAME" value="org.apache.pinot.server.starter.helix.HelixServerStarter" />
    <module name="pinot-server" />
    <extension name="coverage">
      <pattern>
        <option name="PATTERN" value="org.apache.pinot.server.starter.helix.*" />
        <option name="ENABLED" value="true" />
      </pattern>
    </extension>
    <method v="2">
      <option name="Make" enabled="true" />
    </method>
  </configuration>
</component>

0.5.0

This release includes many new features on Pinot ingestion and connectors, query capability and a revamped controller UI.

Summary

This release includes many new features on Pinot ingestion and connectors (e.g., support for filtering during ingestion which is configurable in table config; support for json during ingestion; proto buf input format support and a new Pinot JDBC client), query capability (e.g., a new GROOVY transform function UDF) and admin functions (a revamped Cluster Manager UI & Query Console UI). It also contains many key bug fixes. See details below.

The release was cut from the following commit: and the following cherry-picks:

Notable New Features

  • Allowing update on an existing instance config: PUT /instances/{instanceName} with Instance object as the pay-load ()

  • Add PinotServiceManager to start Pinot components ()

  • Support for protocol buffers input format. ()

  • Add GenericTransformFunction wrapper for simple ScalarFunctions () — Adding support to invoke any scalar function via GenericTransformFunction

  • Add Support for SQL CASE Statement ()

  • Support distinctCountRawThetaSketch aggregation that returns serialized sketch. ()

  • Add multi-value support to SegmentDumpTool () — add segment dump tool as part of the pinot-tool.sh script

  • Add json_format function to convert json object to string during ingestion. () — Can be used to store complex objects as a json string (which can later be queries using jsonExtractScalar)

  • Support escaping single quote for SQL literal () — This is especially useful for DistinctCountThetaSketch because it stores expression as literal E.g. DistinctCountThetaSketch(..., 'foo=''bar''', ...)

  • Support expression as the left-hand side for BETWEEN and IN clause ()

  • Add a new field IngestionConfig in TableConfig — FilterConfig: ingestion level filtering of records, based on filter function. () — TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release ().

  • Allow star-tree creation during segment load () — Introduced a new boolean config enableDynamicStarTreeCreation in IndexingConfig to enable/disable star-tree creation during segment load.

  • Support for Pinot clients using JDBC connection ()

  • Support customized accuracy for distinctCountHLL, distinctCountHLLMV functions by adding log2m value as the second parameter in the function. () — Adding cluster config: default.hyperloglog.log2m to allow users to set the default log2m value.

  • Add segment encryption on Controller based on table config ()

  • Add a constraint to the message queue for all instances in Helix, with a large default value of 100000. ()

  • Support order-by aggregations not present in SELECT () — Example: "select subject from transcript group by subject order by count(*) desc" This is equivalent to the following query but the return response should not contain count(*). "select subject, count(*) from transcript group by subject order by count(*) desc"

  • Add geo support for Pinot queries () — Added geo-spatial data model and geospatial functions

  • Cluster Manager UI & Query Console UI revamp ( and ) — updated the Cluster Manager UI and added a table details page and a segment details page

  • Add Controller API to explore Zookeeper ()

  • Support BYTES type for distinctCount and group-by ( and ) — Add BYTES type support to DistinctCountAggregationFunction — Correctly handle BYTES type in DictionaryBasedAggregationOperator for DistinctCount

  • Support for ingestion job spec in JSON format ()

  • Improvements to RealtimeProvisioningHelper command () — Improved docs related to ingestion and plugins

  • Added GROOVY transform function UDF () — Ability to run a groovy script in the query as a UDF. e.g. string concatenation: SELECT GROOVY('{"returnType": "INT", "isSingleValue": true}', 'arg0 + " " + arg1', columnA, columnB) FROM myTable

Special notes

  • Changed the stream and metadata interface () — This PR concludes the work for the issue to extend offset support for other streams

  • TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release ().

  • Config key enable.case.insensitive.pql in Helix cluster config is deprecated, and replaced with enable.case.insensitive. ()

  • Change default segment load mode to MMAP. () — The load mode for segments previously defaulted to heap.

Major Bug fixes

  • Fix bug in distinctCountRawHLL on SQL path ()

  • Fix backward incompatibility for existing stream implementations ()

  • Fix backward incompatibility in StreamFactoryConsumerProvider ()

  • Fix logic in isLiteralOnlyExpression. ()

  • Fix double memory allocation during operator setup ()

  • Allow segment download url in Zookeeper to be deep store uri instead of hardcoded controller uri ()

  • Fix a backward compatible issue of converting BrokerRequest to QueryContext when querying from Presto segment splits ()

  • Fix the issue that PinotSegmentToAvroConverter does not handle BYTES data type. ()

Backward Incompatible Changes

  • PQL queries with HAVING clause will no longer be accepted for the following reasons: () — HAVING clause does not apply to PQL GROUP-BY semantic where each aggregation column is ordered individually — The current behavior can produce inaccurate results without any notice — HAVING support will be added for SQL queries in the next release

  • Because of the standardization of the DistinctCountThetaSketch predicate strings, upgrade Broker before Server. The new Broker can handle both standard and non-standard predicate strings for backward-compatibility. ()

Querying Pinot

Learn how to query Pinot using SQL

SQL Interface

Pinot provides a SQL interface for querying, which uses the Calcite SQL parser to parse queries and the MYSQL_ANSI dialect. For details on the syntax, see the . To find supported SQL operators, see .

Pinot 1.0

In Pinot 1.0, the multi-stage query engine supports inner join, left-outer join, semi-join, and nested queries out of the box. It's optimized for in-memory processing and low latency. For more information, see how to .

Pinot also supports using simple Data Definition Language (DDL) to insert data into a table directly from a file. For details, see . More DDL support will be added in the future. For now, the most common way to define data is using the .

Note: For queries that require a large amount of data shuffling, require spill-to-disk, or are hitting any other limitations of the multi-stage query engine (v2), we still recommend using Presto.

Identifier vs Literal

In Pinot SQL:

  • Double quotes (") are used to force string identifiers, e.g. column names

  • Single quotes (') are used to enclose string literals. If the string literal also contains a single quote, escape it with another single quote, e.g. '''Pinot''' to match the string literal 'Pinot'

Misusing those might cause unexpected query results, like the following examples:

  • WHERE a='b' means the predicate compares the value of column a with the string literal 'b'

  • WHERE a="b" means the predicate compares the value of column a with the value of column b

If your column names use reserved keywords (e.g. timestamp or date) or special characters, you will need to use double quotes when referring to them in queries.

Note: Define decimal literals within quotes to preserve precision.

Example Queries

Selection

Aggregation

Grouping on Aggregation

Ordering on Aggregation

Filtering

For performant filtering of IDs in a list, see .

Filtering with NULL predicate

Selection (Projection)

Ordering on Selection

Pagination on Selection

Note that results might not be consistent if the ORDER BY column has the same value in multiple rows.

Wild-card match (in WHERE clause only)

The example below counts rows where the column airlineName starts with U:
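A minimal sketch of such a query (assuming a hypothetical table named mytable with an airlineName column):

SELECT COUNT(*)
FROM mytable
WHERE REGEXP_LIKE(airlineName, '^U.*')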

Note: REGEXP_LIKE also supports case insensitive search using the i flag as the third parameter.

Case-When Statement

Pinot supports the CASE-WHEN-ELSE statement, as shown in the following two examples:
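As a sketch of the general form (using columns from the baseballStats quickstart dataset for illustration):

SELECT playerName,
  CASE
    WHEN hits > 100 THEN 'high'
    WHEN hits > 10 THEN 'medium'
    ELSE 'low'
  END AS hitsBucket
FROM baseballStats

SELECT SUM(
  CASE WHEN homeRuns > 0 THEN 1 ELSE 0 END
) AS playersWithHomeRuns
FROM baseballStats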

UDF

Pinot doesn't currently support injecting functions. Functions have to be implemented within Pinot, as shown below:

For more examples, see .

BYTES column

Pinot supports queries on BYTES column using hex strings. The query response also uses hex strings to represent bytes values.

The query below fetches all the rows for a given UID:
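A sketch of such a lookup (the table name, the BYTES column uid, and the hex value are all illustrative):

SELECT *
FROM myTable
WHERE uid = 'c8b6b9f6ca0a'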

Stats

Learn more about multi-stage stats and how to use them to improve your queries.

Multi-stage engine (MSE) stats are more complex but also more expressive than single-stage stats. While for single-stage queries Apache Pinot returns a single set of statistics for the whole query, for multi-stage queries Apache Pinot returns a set of statistics for each operator of the query execution. These stats are collected by default and included in the response of any MSE query.

Each operator has its own set of statistics, which are collected during the execution of the query. See the section to learn more about the different operator types and their statistics.

Multi-stage stats visualizer

The recommended way to analyze the multi-stage stats is to use the visualizer included in the Pinot UI. It can be accessed by running a query in the Pinot controller UI and clicking on the Visual button.

Then, the view is changed to only show the multi-stage stats in a graph format like the following, where each node represents an operator. Inside each node, you can see the operator type and the statistics collected for that operator. Nodes are connected with edges that represent the relationship between the operators. Parent operators are above their children, and the edges' width represents the time spent on the child operator.

For example, the following query in ColocatedJoinQuickStart:

Creates the following graph:

Here we can see there are 5 stages (one for each MAILBOX_SEND operator). A significant part of the time is spent in HASH_JOIN on stage 1, followed by the read on userFactEvents. We can also see that stage 5, the one that reads from userFactEvents, returns 40000 rows while the other stage returns 2494 rows. As explained in , it is better to have the smaller table on the right side of the join, so the query would be faster if written as:

By default, the visualizer will only show the most important stats. To show all the stats, click the Show details button in the bottom left corner of the visualizer.

The graph being drawn is usually a tree-like structure, but it can be a directed acyclic graph (DAG) in some cases, like when using spools.

The JSON format

The Pinot UI stats visualizer is a convenient way to see the multi-stage stats, but sometimes you may want to see the raw JSON format. For example, you may want to analyze the stats programmatically or use a different visualization tool. To do so, you can read the stageStats field in the JSON response of the query.

For example, the same query used in the previous section returns the following stageStats:

Each node in the tree represents an operation that is executed and the tree structure form is similar (but not equal) to the logical plan of the query that can be obtained with the EXPLAIN PLAN command.

The stats are always a tree structure when using the JSON format, even when spools are used. In that case, the spooled stages will be included more than once in the tree. You will need to create the DAG yourself by looking at the stage field for each operator and connecting the operators with the same stage ID.

Visualize data with Redash

  1. Install Redash and start a running instance, following the .

  2. Configure Redash to query Pinot, by doing the following:

  3. Create visualizations, by doing the following:

Add pinot db dependency

Apache Pinot provides a Python client library pinotdb to query Pinot from Python applications. Install pinotdb inside the Redash worker instance to make network calls to Pinot.

  1. Navigate to the root directory where you’ve cloned Redash. Run the following command to get the name of the Redash worker container (by default, redash_worker_1):

docker-compose ps

  2. Run the following command (change redash_worker_1 to your own Redash worker container name, if applicable):

  3. Restart Docker.

Add Python data source for Pinot

  1. In Redash, select Settings > Data Sources.

  2. Select New Data Source, and then select Python from the list.

  3. On the Redash Settings - Data Source page, add Pinot as the name of the data source, enter pinotdb in the Modules to import prior to running the script field.

  4. Enter the following optional fields as needed:

    • AdditionalModulesPaths: Enter a comma-separated list of absolute paths on the Redash server to Python modules to make available when querying from Redash. Useful for private modules unavailable in pip.

    • AdditionalBuiltins: Specify additional built-in functions as needed. By default, Redash automatically includes 25 Python built-in functions.

  5. Click Save.

Start Pinot

Run the following command in a new terminal to spin up an Apache Pinot Docker container in the quick start mode with a baseball stats dataset built in.

Run a query in Redash

  1. In Redash, select Queries > New Query, and then select the Python data source you created in .

  2. Add Python code to query data. For more information, see the .

  3. Click Execute to run the query and view results.

You can also include libraries like Pandas to perform more advanced data manipulation on Pinot’s data and visualize the output with Redash.

For more information, see in Redash documentation.

Example Python queries

Query top 10 players by total runs

The following query connects to Pinot and queries the baseballStats table to retrieve the top ten players with the highest total runs. The results are transformed into a dictionary format supported by Redash.

Query top 10 teams by total runs

Query total strikeouts by year

Add a visualization and dashboard in Redash

Add a visualization

In Redash, after you've run your query, click the New Visualization tab, and select the type of visualization you want to create, for example, Bar Chart. The Visualization Editor appears with your chart.

For example, you may want to create a bar chart to view the top 10 players with highest scores.

You may want to create a line chart to view the total variation in strikeouts over time.

For more information, see .

Add a dashboard

Create a dashboard with one or more visualizations (widgets).

  1. In Redash, go to Dashboards > New Dashboards.

  2. Add the widgets to your dashboard. For example, by adding the three example visualizations from above, you can create a Baseball stats dashboard.

For more information, see in the Redash documentation.

Explain Plan (Single-Stage)

This page describes the explain plan in the single-stage query engine. For a more general view of the different explain plans supported by Pinot, see .

Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The output of the EXPLAIN PLAN statement can be used to see how queries are being run or to further optimize queries.

Introduction

EXPLAIN PLAN can be run in two modes: verbose and non-verbose (default), controlled via a query option. To enable verbose mode, the query option explainPlanVerbose=true must be passed.

In the non-verbose EXPLAIN PLAN output above, the Operator column describes the operator that Pinot will run, whereas the Operator_Id and Parent_Id columns show the parent-child relationship between operators.

This parent-child relationship shows the order in which operators execute. For example, FILTER_MATCH_ENTIRE_SEGMENT will execute before and pass its output to PROJECT. Similarly, PROJECT will execute before and pass its output to TRANSFORM_PASSTHROUGH operator and so on.

Although the EXPLAIN PLAN query produces tabular output, in this document we show a tree representation of the EXPLAIN PLAN output so that the parent-child relationships between operators are easy to see and users can visualize the bottom-up flow of data in the operator tree execution.

Note a special node with an Operator_Id and Parent_Id of -1 called PLAN_START(numSegmentsForThisPlan:1). This node indicates the number of segments which match a given plan. The EXPLAIN PLAN query can be run with verbose mode enabled using the query option explainPlanVerbose=true, which will show the varying deduplicated query plans across all segments across all servers.

EXPLAIN PLAN output should only be used for informational purposes because it is likely to change from version to version as Pinot is further developed and enhanced. Pinot uses a "Scatter Gather" approach to query evaluation (see for more details). At the Broker, an incoming query is split into several server-level queries for each backend server to evaluate. At each Server, the query is further split into segment-level queries that are evaluated against each segment on the server. The results of segment queries are combined and sent to the Broker. The Broker in turn combines the results from all the Servers and sends the final results back to the user. Note that if the EXPLAIN PLAN query runs without the verbose mode enabled, a single plan will be returned (the heuristic used is to return the deepest plan tree) and this may not be an accurate representation of all plans across all segments. Different segments may execute the plan in a slightly different way.

Reading the EXPLAIN PLAN output from bottom to top will show how data flows from a table to query results. In the example shown above, the FILTER_MATCH_ENTIRE_SEGMENT operator shows that all 97889 records of the segment matched the query. The DOC_ID_SET over the filter operator gets the set of document IDs matching the filter operator. The PROJECT operator over the DOC_ID_SET operator pulls only those columns that were referenced in the query. The TRANSFORM_PASSTHROUGH operator just passes the column data from the PROJECT operator to the SELECT operator. At SELECT, the query has been successfully evaluated against one segment. Results from different data segments are then combined (COMBINE_SELECT) and sent to the Broker. The Broker combines and reduces the results from different servers (BROKER_REDUCE) into a final result that is sent to the user. The PLAN_START(numSegmentsForThisPlan:1) node indicates that a single segment matched this query plan. If verbose mode is enabled, many plans can be returned and each will contain a node indicating the number of matched segments.

The rest of this document illustrates the EXPLAIN PLAN output with examples and describes the operators that show up in the output of EXPLAIN PLAN.

EXPLAIN PLAN using verbose mode for a query that evaluates filters with and without index

Since verbose mode is enabled, the EXPLAIN PLAN output returns two plans matching one segment each (assuming 2 segments for this table). The first EXPLAIN PLAN output above shows that Pinot used an inverted index to evaluate the predicate "playerID = 'aardsda01'" (FILTER_INVERTED_INDEX). The result was then fully scanned (FILTER_FULL_SCAN) to evaluate the second predicate "playerName = 'David Allan'". Note that the two predicates are combined using AND in the query; hence, only the data that satisfied the first predicate needs to be scanned for evaluating the second predicate. However, if the predicates were combined using OR, the query would run very slowly because the entire "playerName" column would need to be scanned from top to bottom to look for values satisfying the second predicate. To improve query efficiency in such cases, one should consider indexing the "playerName" column as well. The second plan output shows a FILTER_EMPTY indicating that no matching documents were found for one segment.
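To see this effect yourself, you could compare the verbose plan of the OR variant of the same query (a sketch; without an index on playerName, its predicate is typically evaluated with a full scan):

SET explainPlanVerbose=true;
EXPLAIN PLAN FOR
  SELECT playerID, playerName
    FROM baseballStats
   WHERE playerID = 'aardsda01' OR playerName = 'David Allan'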

EXPLAIN PLAN ON GROUP BY QUERY

The EXPLAIN PLAN output above shows how GROUP BY queries are evaluated in Pinot. GROUP BY results are created on the server (AGGREGATE_GROUPBY_ORDERBY) for each segment on the server. The server then combines segment-level GROUP BY results (COMBINE_GROUPBY_ORDERBY) and sends the combined result to the Broker. The Broker combines the GROUP BY results from all the servers to produce the final result, which is sent to the user. Note that the COMBINE_SELECT operator from the previous query was not used here; instead, a different COMBINE_GROUPBY_ORDERBY operator was used. Depending upon the type of query, different combine operators such as COMBINE_DISTINCT, COMBINE_ORDERBY, etc. may be seen.

EXPLAIN PLAN OPERATORS

The root operator of the EXPLAIN PLAN output is BROKER_REDUCE. BROKER_REDUCE indicates that the Broker is processing and combining server results into the final result that is sent back to the user. BROKER_REDUCE has a COMBINE operator as its child. The Combine operator combines the results of query evaluation from each segment on the server and sends the combined result to the Broker. There are several combine operators (COMBINE_GROUPBY_ORDERBY, COMBINE_DISTINCT, COMBINE_AGGREGATE, etc.) that run depending upon the operations being performed by the query. Under the Combine operator, either a Select (SELECT, SELECT_ORDERBY, etc.) or an Aggregate (AGGREGATE, AGGREGATE_GROUPBY_ORDERBY, etc.) can appear. The Aggregate operator is present when the query performs aggregation (count(*), min, max, etc.); otherwise, a Select operator is present. If the query performs scalar transformations (addition, multiplication, concat, etc.), then a TRANSFORM operator appears under the SELECT operator. Often a TRANSFORM_PASSTHROUGH operator is present instead of the TRANSFORM operator. TRANSFORM_PASSTHROUGH just passes results from operators that appear lower in the operator execution hierarchy to the SELECT operator. The DOC_ID_SET operator usually appears above FILTER operators and indicates that a list of matching document IDs is assessed. FILTER operators usually appear at the bottom of the operator hierarchy and show index use. For example, the presence of FILTER_FULL_SCAN indicates that an index was not used (and hence the query is likely to run relatively slowly). However, if the query used an index, one of the indexed filter operators (FILTER_SORTED_INDEX, FILTER_RANGE_INDEX, FILTER_INVERTED_INDEX, FILTER_JSON_INDEX, etc.) will show up.
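For instance, a query with a scalar transformation in its select list would typically show a TRANSFORM operator instead of TRANSFORM_PASSTHROUGH. A sketch (the runs and homeRuns columns are assumed to exist in the baseballStats quickstart schema):

EXPLAIN PLAN FOR
  SELECT playerID, runs + homeRuns
    FROM baseballStats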

//default to limit 10
SELECT * 
FROM myTable 

SELECT * 
FROM myTable 
LIMIT 100
SELECT "date", "timestamp"
FROM myTable 
SELECT COUNT(*), MAX(foo), SUM(bar) 
FROM myTable
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
FROM myTable
GROUP BY bar, baz 
LIMIT 50
SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
FROM myTable
GROUP BY bar, baz 
ORDER BY bar, MAX(foo) DESC 
LIMIT 50
SELECT COUNT(*) 
FROM myTable
  WHERE foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
SELECT COUNT(*) 
FROM myTable
  WHERE foo IS NOT NULL
  AND foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))
SELECT * 
FROM myTable
  WHERE quux < 5
  LIMIT 50
SELECT foo, bar 
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 100
SELECT foo, bar 
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100
SELECT COUNT(*) 
FROM myTable
  WHERE REGEXP_LIKE(airlineName, '^U.*')
  GROUP BY airlineName LIMIT 10
SELECT
    CASE
      WHEN price > 30 THEN 3
      WHEN price > 20 THEN 2
      WHEN price > 10 THEN 1
      ELSE 0
    END AS price_category
FROM myTable
SELECT
  SUM(
    CASE
      WHEN price > 30 THEN 30
      WHEN price > 20 THEN 20
      WHEN price > 10 THEN 10
      ELSE 0
    END) AS total_cost
FROM myTable
SELECT COUNT(*)
FROM myTable
GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')
SELECT * 
FROM myTable
WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'
Calcite documentation
Class SqlLibraryOperators
enable and use the multi-stage query engine
programmatically access the multi-stage query engine
Controller Admin API
Filtering with IdSet
Transform Function in Aggregation Grouping
EXPLAIN PLAN FOR SELECT playerID, playerName FROM baseballStats

+---------------------------------------------|------------|---------|
| Operator                                    | Operator_Id|Parent_Id|
+---------------------------------------------|------------|---------|
|BROKER_REDUCE(limit:10)                      | 1          | 0       |
|COMBINE_SELECT                               | 2          | 1       |
|PLAN_START(numSegmentsForThisPlan:1)         | -1         | -1      |
|SELECT(selectList:playerID, playerName)      | 3          | 2       |
|TRANSFORM_PASSTHROUGH(playerID, playerName)  | 4          | 3       |
|PROJECT(playerName, playerID)                | 5          | 4       |
|DOC_ID_SET                                   | 6          | 5       |
|FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)      | 7          | 6       |
+---------------------------------------------|------------|---------|
BROKER_REDUCE(limit:10)
└── COMBINE_SELECT
    └── PLAN_START(numSegmentsForThisPlan:1)
        └── SELECT(selectList:playerID, playerName)
            └── TRANSFORM_PASSTHROUGH(playerID, playerName)
                └── PROJECT(playerName, playerID)
                    └── DOC_ID_SET
                        └── FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)
SET explainPlanVerbose=true;
EXPLAIN PLAN FOR
  SELECT playerID, playerName
    FROM baseballStats
   WHERE playerID = 'aardsda01' AND playerName = 'David Allan'

BROKER_REDUCE(limit:10)
└── COMBINE_SELECT
    └── PLAN_START(numSegmentsForThisPlan:1)
        └── SELECT(selectList:playerID, playerName)
            └── TRANSFORM_PASSTHROUGH(playerID, playerName)
                └── PROJECT(playerName, playerID)
                    └── DOC_ID_SET
                        └── FILTER_AND
                            ├── FILTER_INVERTED_INDEX(indexLookUp:inverted_index,operator:EQ,predicate:playerID = 'aardsda01')
                            └── FILTER_FULL_SCAN(operator:EQ,predicate:playerName = 'David Allan')
    └── PLAN_START(numSegmentsForThisPlan:1)
        └── SELECT(selectList:playerID, playerName)
            └── TRANSFORM_PASSTHROUGH(playerID, playerName)
                └── PROJECT(playerName, playerID)
                    └── DOC_ID_SET
                        └── FILTER_EMPTY
EXPLAIN PLAN FOR
  SELECT playerID, count(*)
    FROM baseballStats
   WHERE playerID != 'aardsda01'
   GROUP BY playerID

BROKER_REDUCE(limit:10)
└── COMBINE_GROUPBY_ORDERBY
    └── PLAN_START(numSegmentsForThisPlan:1)
        └── AGGREGATE_GROUPBY_ORDERBY(groupKeys:playerID, aggregations:count(*))
            └── TRANSFORM_PASSTHROUGH(playerID)
                └── PROJECT(playerID)
                    └── DOC_ID_SET
                        └── FILTER_INVERTED_INDEX(indexLookUp:inverted_index,operator:NOT_EQ,predicate:playerID != 'aardsda01')
Explain Plan
Pinot Architecture
select * 
from userAttributes a 
join userGroups g
on a.userUUID = g.userUUID
join userFactEvents fe
on fe.userUUID = g.userUUID
select *
from userFactEvents fe
join (
    select *
    from userAttributes a
    join userGroups g
    on a.userUUID = g.userUUID
) as g
on fe.userUUID = g.userUUID
{
  ...,
  "stageStats": {
    "type": "MAILBOX_RECEIVE",
    "executionTimeMs": 18,
    "emittedRows": 2494,
    "fanIn": 4,
    "rawMessages": 18,
    "deserializedBytes": 219393,
    "upstreamWaitMs": 80,
    "children": [
      {
        "type": "MAILBOX_SEND",
        "executionTimeMs": 75,
        "emittedRows": 2494,
        "stage": 1,
        "parallelism": 4,
        "fanOut": 1,
        "rawMessages": 14,
        "serializedBytes": 216854,
        "serializationTimeMs": 4,
        "children": [
          {
            "type": "HASH_JOIN",
            "executionTimeMs": 70,
            "emittedRows": 2494,
            "timeBuildingHashTableMs": 73,
            "children": [
              {
                "type": "MAILBOX_RECEIVE",
                "emittedRows": 2494,
                "fanIn": 4,
                "inMemoryMessages": 18,
                "rawMessages": 12,
                "deserializedBytes": 2085,
                "upstreamWaitMs": 131,
                "children": [
                  {
                    "type": "MAILBOX_SEND",
                    "executionTimeMs": 23,
                    "emittedRows": 2494,
                    "stage": 2,
                    "parallelism": 4,
                    "fanOut": 4,
                    "inMemoryMessages": 14,
                    "children": [
                      {
                        "type": "HASH_JOIN",
                        "executionTimeMs": 21,
                        "emittedRows": 2494,
                        "timeBuildingHashTableMs": 20,
                        "children": [
                          {
                            "type": "MAILBOX_RECEIVE",
                            "executionTimeMs": 1,
                            "emittedRows": 10000,
                            "fanIn": 2,
                            "inMemoryMessages": 6,
                            "rawMessages": 18,
                            "deserializedBytes": 221576,
                            "deserializationTimeMs": 3,
                            "upstreamWaitMs": 61,
                            "children": [
                              {
                                "type": "MAILBOX_SEND",
                                "executionTimeMs": 11,
                                "emittedRows": 10000,
                                "stage": 3,
                                "parallelism": 2,
                                "fanOut": 4,
                                "inMemoryMessages": 4,
                                "rawMessages": 12,
                                "serializedBytes": 220890,
                                "serializationTimeMs": 6,
                                "children": [
                                  {
                                    "type": "LEAF",
                                    "table": "userAttributes",
                                    "executionTimeMs": 8,
                                    "emittedRows": 10000,
                                    "numDocsScanned": 10000,
                                    "totalDocs": 10000,
                                    "numEntriesScannedPostFilter": 40000,
                                    "numSegmentsQueried": 4,
                                    "numSegmentsProcessed": 4,
                                    "numSegmentsMatched": 4,
                                    "threadCpuTimeNs": 4733524
                                  }
                                ]
                              }
                            ]
                          },
                          {
                            "type": "MAILBOX_RECEIVE",
                            "executionTimeMs": 7,
                            "emittedRows": 2494,
                            "fanIn": 2,
                            "inMemoryMessages": 10,
                            "rawMessages": 26,
                            "deserializedBytes": 46102,
                            "deserializationTimeMs": 3,
                            "upstreamWaitMs": 40,
                            "children": [
                              {
                                "type": "MAILBOX_SEND",
                                "executionTimeMs": 4,
                                "emittedRows": 2494,
                                "stage": 4,
                                "parallelism": 2,
                                "fanOut": 4,
                                "inMemoryMessages": 8,
                                "rawMessages": 20,
                                "serializedBytes": 45422,
                                "serializationTimeMs": 4,
                                "children": [
                                  {
                                    "type": "LEAF",
                                    "table": "userGroups",
                                    "executionTimeMs": 5,
                                    "emittedRows": 2494,
                                    "numDocsScanned": 2494,
                                    "totalDocs": 2494,
                                    "numEntriesScannedPostFilter": 4988,
                                    "numSegmentsQueried": 8,
                                    "numSegmentsProcessed": 8,
                                    "numSegmentsMatched": 8,
                                    "threadCpuTimeNs": 1423051
                                  }
                                ]
                              }
                            ]
                          }
                        ]
                      }
                    ]
                  }
                ]
              },
              {
                "type": "MAILBOX_RECEIVE",
                "executionTimeMs": 48,
                "emittedRows": 40000,
                "fanIn": 2,
                "inMemoryMessages": 10,
                "rawMessages": 30,
                "deserializedBytes": 1755012,
                "deserializationTimeMs": 7,
                "upstreamWaitMs": 133,
                "children": [
                  {
                    "type": "MAILBOX_SEND",
                    "executionTimeMs": 30,
                    "emittedRows": 40000,
                    "stage": 5,
                    "parallelism": 2,
                    "fanOut": 4,
                    "inMemoryMessages": 8,
                    "rawMessages": 24,
                    "serializedBytes": 1754652,
                    "serializationTimeMs": 15,
                    "children": [
                      {
                        "type": "LEAF",
                        "table": "userFactEvents",
                        "executionTimeMs": 21,
                        "emittedRows": 40000,
                        "numDocsScanned": 40000,
                        "totalDocs": 40000,
                        "numEntriesScannedPostFilter": 320000,
                        "numSegmentsQueried": 8,
                        "numSegmentsProcessed": 8,
                        "numSegmentsMatched": 8,
                        "threadCpuTimeNs": 32716947
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  },
  ...
}
Operator Types
Optimizing joins
spools
spools
What is Apache Pinot? (and User-Facing Analytics) by Tim Berglund
docker exec -it redash_worker_1 /bin/sh                                
pip install pinotdb
docker run \
  --name pinot-quickstart \
  -p 2123:2123 \
  -p 9000:9000 \
  -p 8000:8000 \
  apachepinot/pinot:0.9.3 QuickStart -type batch
from pinotdb import connect

conn = connect(host='host.docker.internal', port=8000, path='/query/sql', scheme='http')
curs = conn.cursor()
curs.execute("""
    select 
playerName, sum(runs) as total_runs
from baseballStats
group by playerName
order by total_runs desc
limit 10
""")

result = {}
result['columns'] = [
    {
      "name": "player_name",
      "type": "string",
      "friendly_name": "playerName"
    },
    {
      "name": "total_runs",
      "type": "integer",
      "friendly_name": "total_runs"
    }
  ]

rows = []

for row in curs:
    record = {}
    record['player_name'] = row[0]
    record['total_runs'] = row[1]


    rows.append(record)

result["rows"] = rows
from pinotdb import connect

conn = connect(host='host.docker.internal', port=8000, path='/query/sql', scheme='http')
curs = conn.cursor()
curs.execute("""
    select 
teamID, sum(runs) as total_runs
from baseballStats
group by teamID
order by total_runs desc
limit 10
""")

result = {}
result['columns'] = [
    {
      "name": "teamID",
      "type": "string",
      "friendly_name": "Team"
    },
    {
      "name": "total_runs",
      "type": "integer",
      "friendly_name": "Total Runs"
    }
  ]

rows = []

for row in curs:
    record = {}
    record['teamID'] = row[0]
    record['total_runs'] = row[1]


    rows.append(record)

result["rows"] = rows
from pinotdb import connect

conn = connect(host='host.docker.internal', port=8000, path='/query/sql', scheme='http')
curs = conn.cursor()
curs.execute("""
    select 
yearID, sum(strikeouts) as total_so
from baseballStats
group by yearID
order by yearID asc
limit 1000
""")

result = {}
result['columns'] = [
    {
      "name": "yearID",
      "type": "integer",
      "friendly_name": "Year"
    },
    {
      "name": "total_so",
      "type": "integer",
      "friendly_name": "Total Strikeouts"
    }
  ]

rows = []

for row in curs:
    record = {}
    record['yearID'] = row[0]
    record['total_so'] = row[1]


    rows.append(record)

result["rows"] = rows
Docker Based Developer Installation Guide
Add pinotdb dependency
Add a Python data source for Pinot
Start Pinot
Query in Redash
Add a visualization and dashboard in Redash
Add a Python data source for Pinot
Python query runner
Querying
Visualizations
three example queries
Dashboards
Bar chart configuration
Baseball stats dashboard

Ingestion FAQ

This page has a collection of frequently asked questions about ingestion with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, .

Data processing

What is a good segment size?

While Apache Pinot can work with segments of various sizes, for optimal use of Pinot, you want to get your segments sized in the 100MB to 500MB (un-tarred/uncompressed) range. Having too many (thousands or more) tiny segments for a single table creates overhead in terms of the metadata storage in Zookeeper as well as in the Pinot servers' heap. At the same time, having too few really large (GBs) segments reduces parallelism of query execution, as on the server side, the thread parallelism of query execution is at segment level.

Can multiple Pinot tables consume from the same Kafka topic?

Yes. Each table can be independently configured to consume from any given Kafka topic, regardless of whether there are other tables that are also consuming from the same Kafka topic.

If I add a partition to a Kafka topic, will Pinot automatically ingest data from this partition?

Pinot automatically detects new partitions in Kafka topics. It checks for new partitions whenever RealtimeSegmentValidationManager periodic job runs and starts consumers for new partitions.

You can configure the interval for this job using the controller.realtime.segment.validation.frequencyPeriod property in the controller configuration.

Does Pinot support partition pruning on multiple partition columns?

Pinot supports multi-column partitioning for offline tables. Map multiple columns under tableIndexConfig.segmentPartitionConfig.columnPartitionMap. Pinot assigns the input data to each partition according to the partition configuration individually for each column.

The following example partitions the segment based on two columns, memberID and caseNumber. Note that each partition column is handled separately, so in this case the segment is partitioned on memberID (partition ID 1) and also partitioned on caseNumber (partition ID 2).

For multi-column partitioning to work, you must also set routing.segmentPrunerTypes as follows:

How do I enable partitioning in Pinot when using Kafka stream?

Set up partitioner in the Kafka producer:

The partitioning logic in the stream should match the partitioning config in Pinot. Kafka uses murmur2, and the equivalent in Pinot is the Murmur function.

Set the partitioning configuration as below using same column used in Kafka:

and also set:

To learn how partition works, see .

How do I store BYTES column in JSON data?

For JSON, you can use a hex encoded string to ingest BYTES.

How do I flatten my JSON Kafka stream?

See the function which can store a top level json field as a STRING in Pinot.

Then you can use these during query time, to extract fields from the json string.

NOTE This works well if some of your fields are nested json, but most of your fields are top level json keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as 1 column, which is not ideal.
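For example, if the payload were stored as a STRING column named myJsonField via json_format, nested fields could be extracted at query time with JSON_EXTRACT_SCALAR (a sketch; the column name and JSON paths are hypothetical):

SELECT JSON_EXTRACT_SCALAR(myJsonField, '$.user.name', 'STRING') AS userName -- myJsonField and the paths are assumptions
FROM myTable
WHERE JSON_EXTRACT_SCALAR(myJsonField, '$.user.country', 'STRING') = 'US'
LIMIT 10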

How do I escape Unicode in my Job Spec YAML file?

To use explicit code points, you must double-quote (not single-quote) the string, and escape the code point via "\uHHHH", where HHHH is the four digit hex code for the character. See for more details.

Is there a limit on the maximum length of a string column in Pinot?

By default, Pinot limits the length of a String column to 512 bytes. If you want to overwrite this value, you can set the maxLength attribute in the schema as follows:

When are new events queryable when getting ingested into a real-time table?

Events are available to queries as soon as they are ingested. This is because events are instantly indexed in memory upon ingestion.

The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.

However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all its replicas are guaranteed to be consistent, with the .

How to reset a CONSUMING segment stuck on an offset which has expired from the stream?

This typically happens if:

  1. The consumer is lagging a lot.

  2. The consumer was down (server down, cluster down), and the stream moved on, resulting in offset not found when consumer comes back up.

In case of Kafka, to recover, set property "auto.offset.reset":"earliest" in the streamConfigs section and reset the CONSUMING segment. See for more details about the configuration.

You can also use the "Resume Consumption" endpoint with the "resumeFrom" parameter set to "smallest" (or "largest" if you want). See for more details.

Indexing

How to set inverted indexes?

Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. For more info on table configuration, see . For an example showing how to configure an inverted index, see .

Applying inverted indexes to a table configuration will generate an inverted index for all new segments. To apply the inverted indexes to all existing segments, see

How to apply an inverted index to existing segments?

  1. Add the columns you want to index to the tableIndexConfig -> invertedIndexColumns list. To update the table configuration, use the Pinot Swagger API: .

  2. Invoke the reload API: .

Once you've done that, you can check whether the index has been applied by querying the segment metadata API at . Don't forget to include the names of the columns on which you have applied the index.

The output from this API should look something like the following:

Can I retrospectively add an index to any segment?

Not all indexes can be retrospectively applied to existing segments.

If you want to add or change the sorted index column or adjust the dictionary encoding of the default forward index, you will need to manually re-load any existing segments.

How to create star-tree indexes?

Star-tree indexes are configured in the table config under the tableIndexConfig -> starTreeIndexConfigs (list) and enableDefaultStarTree (boolean). See here for more about how to configure star-tree indexes:

The new segments will have star-tree indexes generated after applying the star-tree index configurations to the table configuration.

Handling time in Pinot

How does Pinot’s real-time ingestion handle out-of-order events?

Pinot does not require ordering of event time stamps. Out of order events are still consumed and indexed into the "currently consuming" segment. In a pathological case, if you have a 2 day old event come in "now", it will still be stored in the segment that is open for consumption "now". There is no strict time-based partitioning for segments, but star-indexes and hybrid tables will handle this as appropriate.

See the for more details about how hybrid tables handle this. Specifically, the time-boundary is computed as max(OfflineTIme) - 1 unit of granularity. Pinot does store the min-max time for each segment and uses it for pruning segments, so segments with multiple time intervals may not be perfectly pruned.

When generating star-indexes, the time column will be part of the star-tree so the tree can still be efficiently queried for segments with multiple time intervals.

What is the purpose of a hybrid table not using max(OfflineTime) to determine the time-boundary, and instead using an offset?

This lets you have an old event come in without building complex offline pipelines that perfectly partition your events by event timestamps. With this offset, even if your offline data pipeline produces segments with a maximum timestamp, Pinot will not use the offline dataset for that last chunk of segments. The expectation is that if you process the next time range of data offline, your data pipeline will include any late events.

Why are segments not strictly time-partitioned?

It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time intervals for the performance of hybrid tables and time-filtered data.

When generating offline segments, the segments are generated such that each segment contains only one time interval and is well partitioned by the time column.

Dictionary index

When dealing with extensive datasets, it's common for values to be repeated multiple times. To enhance storage efficiency and reduce query latencies, we strongly recommend employing a dictionary index for repetitive data. This is the reason Pinot enables dictionary encoding by default, even though it is advisable to disable it for columns with high cardinality.

Influence on other indexes

In Pinot, dictionaries serve as both an index and actual encoding. Consequently, when dictionaries are enabled, the behavior or layout of certain other indexes undergoes modification. The relationship between dictionaries and other indexes is outlined in the following table:

Index
Conditional
Description

Configuration

Deterministically enable or disable dictionaries

Unlike many other indexes, dictionary indexes are enabled by default, under the assumption that the count of unique values will be significantly lower than the number of rows.

If this assumption does not hold true, you can deactivate the dictionary for a specific column by setting the disabled property to true within indexes.dictionary:

Alternatively, the encodingType property can be changed. For example:

You may choose the option you prefer, but it's essential to maintain consistency, as Pinot will reject table configurations where the same column and index are defined in different locations.

Heuristically enable dictionaries

Most of the time, the domain expert who creates the table knows whether a dictionary will be useful or not. For example, a column with random values or public IPs will probably have a large cardinality, so it can immediately be targeted as raw encoded, while a column like employee ID will have a small cardinality and therefore can easily be recognized as a good dictionary candidate. But sometimes the decision may not be clear. To help in these situations, Pinot can be configured to heuristically create the dictionary depending on the actual values and a relation factor.

When this heuristic is enabled, Pinot calculates a saving factor for each candidate column. This factor is the ratio between the forward index size encoded as raw and the same index encoded as a dictionary. If the saving factor for a candidate column is less than a saving ratio, the dictionary is not created.

In order to be considered as a candidate for the heuristic, a column must:

  • Be marked as dictionary encoded (columns marked as raw are always encoded as raw).

  • Be single valued (multi-valued columns are never considered by the heuristic).

  • Be of a fixed size type such as int, long, double, timestamp, etc. Variable size types like json, strings or bytes are never considered by the heuristic.

  • Not indexed by text index or JSON index (as they are only useful when cardinality is very large).

Optionally this feature can be applied only to metric columns, skipping dimension columns.

This functionality can be enabled within the indexingConfig object within the table configuration. The parameters that govern these heuristics are:

Parameter
Default
Description

It's important to emphasize that:

  • These parameters are configured for all columns within the table.

  • optimizeDictionary takes precedence over optimizeDictionaryForMetrics.

Parameters

Dictionaries can be configured with the following options

Parameter
Default
Description

Variable length dictionaries

The useVarLengthDictionary parameter only impacts columns whose values vary in the number of bytes they occupy. This includes column types that require a variable number of bytes, such as strings, bytes, or big decimals, and scenarios where not all values within a segment occupy the same number of bytes. For example, even though strings in general require a variable number of bytes to be stored, if a segment contains only the values "a", "b", and "c", Pinot will identify that all values in the segment can be represented with the same number of bytes.

By default, useVarLengthDictionary is set to false, which means Pinot will calculate the length of the largest value contained within the segment. This length will then be used for all values. This approach ensures that all values can be stored efficiently, resulting in faster access and a more compressed layout when the lengths of values are similar.

If your dataset includes a few very large values and a multitude of very small ones, it is advisable to instruct Pinot to utilize variable-length encoding by setting useVarLengthDictionary to true. When variable encoding is employed, Pinot is required to store the length of each entry. Consequently, the cost of storing an entry becomes its actual size plus an additional 4 bytes for the offset.

On-heap dictionaries

Dictionary data is always stored off-heap. In general, it is recommended to keep dictionaries that way. However, in cases where the cardinality is small, and the on-heap memory usage is acceptable, you can copy them into memory by setting the onHeap parameter to true.

Remember: On-heap dictionaries are not recommended.

On-heap dictionaries can slightly reduce latency but will significantly increase the heap memory used by Pinot and increase garbage collection times, which may result in out of memory issues.

When off-heap dictionaries are used, data is deserialized each time it is accessed. This isn't a problem for primitive types (such as int or long), but for complex types (like strings or bytes) it means a new object is allocated on every access. On-heap dictionaries solve this problem by keeping the data in memory in deserialized format, so no allocations are needed at query time.

However, on-heap dictionaries have a cost in terms of memory usage and that cost is proportional to the number of segments that are accessed concurrently. It is important to note that, as with all other indexes, the dictionary scope is limited to segments. This means that if we have a table with 1,000 segments and a dictionary for a column, we may have 1,000 dictionaries in memory. This can be a waste of memory in cases where unique values are repeated across segments. To solve this problem, Pinot can retain a cache of the dictionary values and reuse them across segments. This cache is not shared between different tables or columns and its maximum size is controlled by the dictionary.intern.capacity option.

Only string and byte columns can be interned. Pinot ignores the intern configuration when used on columns with a different data type.

Here's an example of configuring a dictionary to use on-heap dictionaries with intern mode enabled:

Geospatial

This page talks about geospatial support in Pinot.

Pinot supports SQL/MM geospatial data and is compliant with the . This includes:

  • Geospatial data types, such as point, line and polygon;

  • Geospatial functions, for querying of spatial properties and relationships.

  • Geospatial indexing, used for efficient processing of spatial operations

Geospatial data types

Geospatial data types abstract and encapsulate spatial structures such as boundary and dimension. In many respects, spatial data types can be understood simply as shapes. Pinot supports the Well-Known Text (WKT) and Well-Known Binary (WKB) forms of geospatial objects, for example:

  • POINT (0 0)

  • LINESTRING (0 0, 1 1, 2 1, 2 2)

  • POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0), (1 1, 1 2, 2 2, 2 1, 1 1))

  • MULTIPOINT (0 0, 1 2)

  • MULTILINESTRING ((0 0, 1 1, 1 2), (2 3, 3 2, 5 4))

  • MULTIPOLYGON (((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1)), ((-1 -1, -1 -2, -2 -2, -2 -1, -1 -1)))

  • GEOMETRYCOLLECTION(POINT(2 0),POLYGON((0 0, 1 0, 1 1, 0 1, 0 0)))

Geometry vs geography

It is common to have data in which the coordinates are geographic, or latitude/longitude. Unlike coordinates in Mercator or UTM, geographic coordinates are not Cartesian coordinates.

  • Geographic coordinates do not represent a linear distance from an origin as plotted on a plane. Rather, these spherical coordinates describe angular coordinates on a globe.

  • Spherical coordinates specify a point by the angle of rotation from a reference meridian (longitude), and the angle from the equator (latitude).

You can treat geographic coordinates as approximate Cartesian coordinates and continue to do spatial calculations. However, measurements of distance, length and area will be nonsensical. Since spherical coordinates measure angular distance, the units are in degrees.

Pinot supports both geometry and geography types, which can be constructed by the corresponding functions as shown in . And for the geography types, the measurement functions such as ST_Distance and ST_Area calculate the spherical distance and area on earth respectively.

Geospatial functions

For manipulating geospatial data, Pinot provides a set of functions for analyzing geometric components, determining spatial relationships, and manipulating geometries. In particular, geospatial functions that begin with the ST_ prefix support the SQL/MM specification.

Following geospatial functions are available out of the box in Pinot:

Aggregations

This aggregate function returns a MULTI geometry or NON-MULTI geometry from a set of geometries. It ignores NULL geometries.
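A minimal sketch of how such an aggregation might be used, assuming the aggregate function described above is ST_UNION and that geoColumn is a hypothetical column storing serialized geometries:

SELECT ST_UNION(geoColumn) -- ST_UNION and geoColumn are assumptions for illustration
FROM myTable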

Constructors

  • Returns a geometry type object from WKT representation, with the optional spatial system reference.

  • Returns a geometry type object from WKB representation.

  • Returns a geometry type point object with the given coordinate values.

  • Returns a geometry type polygon object from .

  • Creates a geography instance from a

  • Returns a specified geography value from .

Measurements

  • ST_Area(Geometry/Geography g) → double For geometry type, it returns the 2D Euclidean area of a geometry. For geography, returns the area of a polygon or multi-polygon in square meters using a spherical model for Earth.

  • For geometry type, returns the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units. For geography, returns the great-circle distance in meters between two SphericalGeography points. Note that g1, g2 shall have the same type.

  • Returns the type of the geometry as a string, e.g., ST_Linestring, ST_Polygon, ST_MultiPolygon, etc.

Outputs

  • Returns the WKB representation of the geometry.

  • Returns the WKT representation of the geometry/geography.

Conversion

  • Converts a Geometry object to a spherical geography object.

  • Converts a spherical geographical object to a Geometry object.

Relationship

  • Returns true if and only if no points of the second geometry/geography lie in the exterior of the first geometry/geography, and at least one point of the interior of the first geometry lies in the interior of the second geometry. Warning: ST_Contains on Geography only gives a close approximation.

  • ST_Equals(Geometry, Geometry) → boolean Returns true if the given geometries represent the same geometry/geography.

  • ST_Within(Geometry, Geometry) → boolean Returns true if first geometry is completely inside second geometry.

Geospatial index

Geospatial functions are typically expensive to evaluate, and using geoindex can greatly accelerate the query evaluation. Geoindexing in Pinot is based on Uber’s , a hexagon-based hierarchical gridding.

A given geospatial location (longitude, latitude) can map to one hexagon (represented as H3Index). And its neighbors in H3 can be approximated by a ring of hexagons. To quickly identify the distance between any given two geospatial locations, we can convert the two locations in the H3Index, and then check the H3 distance between them. H3 distance is measured as the number of hexagons.

For example, in the diagram below, the red hexagons are within the 1 distance of the central hexagon. The size of the hexagon is determined by the resolution of the indexing. Check this table for the level of and the corresponding precision (measured in km).

How to use geoindex

To use the geoindex, first declare the geolocation field as bytes in the schema, as in the example of the .

Note the use of transformFunction that converts the created point into SphericalGeography format, which is needed by the ST_Distance function.

Next, declare the geospatial index in the table configuration. To do so, you need to:

  • Verify the dictionary is disabled (see how to ).

  • Enable the H3 index.

It is recommended to do the latter by using the indexes section:

Alternatively, the older way to configure H3 indexes is still supported:

The query below will use the geoindex to filter the Starbucks stores within 5km of the given point in the bay area.
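Such a query might look like the following sketch (the table and column names are hypothetical, and location_st_point is assumed to hold SphericalGeography points created at ingestion time):

SELECT address, ST_Distance(location_st_point, ST_Point(-122.0007, 37.368832, 1)) AS distanceMeters
FROM starbucksStores -- starbucksStores, address, and location_st_point are hypothetical names
WHERE ST_Distance(location_st_point, ST_Point(-122.0007, 37.368832, 1)) < 5000
LIMIT 1000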

How geoindex works

The Pinot geoindex accelerates query evaluation while maintaining accuracy. Currently, geoindex supports the ST_Distance function in the WHERE clause.

At a high level, the geoindex is used to retrieve the records within the nearby hexagons of the given location, and then ST_Distance is used to accurately filter the matched results.

As in the example diagram above, if we want to find all relevant points within a given distance around San Francisco (area within the red circle), then the algorithm with geoindex will:

  • First find the H3 distance x that contains the range (for example, within a red circle).

  • Then, for the points within the H3 distance (those covered by the hexagons completely within the circle), directly accept those points without filtering.

  • Finally, for the points contained in the hexagons of kRing(x) at the outer edge of the red circle, the algorithm will filter them by evaluating the condition ST_Distance(loc1, loc2) < x to find only those that are within the circle.

bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 1 --topic transcript-topic
"tableIndexConfig": {
      ..
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "memberId": {
            "functionName": "Modulo",
            "numPartitions": 3 
          },
          "caseNumber": {
            "functionName": "Murmur",
            "numPartitions": 12 
          }
        }
      }
"routing": {
      "segmentPrunerTypes": ["partition"]
    }
"tableIndexConfig": {
      ..
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "column_foo": {
            "functionName": "Murmur",
            "numPartitions": 12 // same as number of kafka partitions
          }
        }
      }
"routing": {
      "segmentPrunerTypes": ["partition"]
    }
    {
      "dataType": "STRING",
      "maxLength": 1000,
      "name": "textDim1"
    },
{
  "<segment-name>": {
    "segmentName": "<segment-name>",
    "indexes": {
      "<columnName>": {
        "bloom-filter": "NO",
        "dictionary": "YES",
        "forward-index": "YES",
        "inverted-index": "YES",
        "null-value-vector-reader": "NO",
        "range-index": "NO",
        "json-index": "NO"
      }
    }
  }
}
make a pull request
tableIndexConfig.segmentPartitionConfig.columnPartitionMap
.
https://docs.confluent.io/current/clients/producer.html
routing tuning
json_format(field)
json functions
https://yaml.org/spec/spec.html#escaping/in%20double-quoted%20scalars/
commit protocol
Real-time table configs
Pause Stream Ingestion
Table Config Reference
Inverted Index
How to apply an inverted index to existing segments?
http://localhost:9000/help#!/Table/updateTableConfig
http://localhost:9000/help#!/Segment/reloadAllSegments
http://localhost:9000/help#/Segment/getServerMetadata
sorted index column
the dictionary encoding of the default forward index
https://docs.pinot.apache.org/basics/indexing/star-tree-index#index-generation
Components > Broker

forward

Implementation depends on whether the dictionary is enabled or not.

range

Implementation depends on whether the dictionary is enabled or not.

inverted

Requires the dictionary index to be enabled.

json

when optimizeDictionary

Disables dictionary.

text

when optimizeDictionary

Disables dictionary.

FST

Requires dictionary.

H3 (or geospatial)

Incompatible with dictionary.

Configured in tableConfig fieldConfigList
{
  "fieldConfigList": [
    {
      "name": "col1",
      "indexes": {
        "dictionary": {
          "disabled": true
        }
      }
    },
    ...
  ],
...
}
{
  "fieldConfigList": [
    {
      "name": "col1",
      "encodingType": "RAW"
    },
    ...
  ],
...
}

optimizeDictionary

false

Enables the heuristic for all columns and activates some extra rules.

optimizeDictionaryForMetrics

false

Enables the heuristic for metric columns.

noDictionarySizeRatioThreshold

0.85

The saving ratio used in the heuristics.

onHeap

false

Specifies whether the index should be loaded on heap or off heap.

useVarLengthDictionary

false

Determines how to store variable-length values.

intern

empty object

Configuration for interning. Only for on-heap dictionaries. Read about that below.

intern.capacity

null

How many values should be interned.

{
  "tableName": "somePinotTable",
  "fieldConfigList": [
    {
      "name": "strColumn",
      "encodingType": "DICTIONARY",
      "indexes": {
        "dictionary": {
          "onHeap": true,
          "intern": {
            "capacity":32000
          }
        }
      }
    }
  ]
}
text index
JSON index

Segment

Discover the segment component in Apache Pinot for efficient data storage and querying within Pinot clusters, enabling optimized data processing and analysis.

Pinot tables are stored in one or more independent shards called segments. A small table may be contained in a single segment, but Pinot lets tables grow to an unlimited number of segments. There are different processes for creating segments (see ingestion). Segments are time-based partitions of table data, and are stored on Pinot servers that scale horizontally as needed for both storage and computation.

Pinot achieves this by breaking the data into smaller chunks known as segments (similar to shards/partitions in relational databases). Segments can be seen as time-based partitions.

A segment is a horizontal shard representing a chunk of table data with some number of rows. The segment stores data for all columns of the table. Each segment packs the data in a columnar fashion, along with the dictionaries and indices for the columns. The segment is laid out in a columnar format so that it can be directly mapped into memory for serving queries.

Columns can be single or multi-valued and the following types are supported: STRING, BOOLEAN, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, and BYTES. The BIG_DECIMAL data type is supported only as a single-valued column.

Columns may be declared to be metric or dimension (or specifically as a time dimension) in the schema. Columns can have default null values. For example, the default null value of an integer column can be 0. The default value for bytes columns must be hex-encoded before it's added to the schema.

Pinot uses dictionary encoding to store values as a dictionary ID. Columns may be configured to be “no-dictionary” column in which case raw values are stored. Dictionary IDs are encoded using minimum number of bits for efficient storage (e.g. a column with a cardinality of 3 will use only 2 bits for each dictionary ID).

A forward index is built for each column and compressed for efficient memory use. In addition, you can optionally configure inverted indices for any set of columns. Inverted indices take up more storage, but improve query performance. Specialized indexes like Star-Tree index are also supported. For more details, see Indexing.

Creating a segment

Once the table is configured, we can load some data. Loading data involves generating pinot segments from raw data and pushing them to the pinot cluster. Data can be loaded in batch mode or streaming mode. For more details, see the ingestion overview page.

Load data in batch

Prerequisites

  1. Set up a cluster

  2. Create broker and server tenants

  3. Create an offline table

Below are instructions to generate and push segments to Pinot via standalone scripts. For a production setup, you should use frameworks such as Hadoop or Spark. For more details on setting up data ingestion jobs, see Import Data.

Job Spec YAML

To generate a segment, we need to first create a job spec YAML file. This file contains all the information regarding data format, input data location, and pinot cluster coordinates. Note that this assumes that the controller is RUNNING to fetch the table config and schema. If not, you will have to configure the spec to point at their location. For full configurations, see Ingestion Job Spec.

job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'

jobType: SegmentCreationAndTarPush
inputDirURI: 'examples/batch/baseballStats/rawdata'
includeFileNamePattern: 'glob:**/*.csv'
excludeFileNamePattern: 'glob:**/*.tmp'
outputDirURI: 'examples/batch/baseballStats/segments'
overwriteOutput: true

pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:

tableSpec:
  tableName: 'baseballStats'
  schemaURI: 'http://localhost:9000/tables/baseballStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/baseballStats'
  
segmentNameGeneratorSpec:

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000

Create and push segment

To create and push the segment in one go, use the following:

docker run \
    --network=pinot-demo \
    --name pinot-data-ingestion-job \
    ${PINOT_IMAGE} LaunchDataIngestionJob \
    -jobSpecFile examples/docker/ingestion-job-specs/airlineStats.yaml

Sample Console Output

SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.avro
inputDirURI: examples/batch/airlineStats/rawdata
jobType: SegmentCreationAndTarPush
outputDirURI: examples/batch/airlineStats/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://pinot-controller:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: {pushAttempts: 2, pushParallelism: 1, pushRetryIntervalMillis: 1000,
  segmentUriPrefix: null, segmentUriSuffix: null}
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.avro.AvroRecordReader,
  configClassName: null, configs: null, dataFormat: avro}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://pinot-controller:9000/tables/airlineStats/schema',
  tableConfigURI: 'http://pinot-controller:9000/tables/airlineStats', tableName: airlineStats}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 403 documents
Created dictionary for INT column: FlightNum with cardinality: 386, range: 14 to 7389
Using fixed bytes value dictionary for column: Origin, size: 294
Created dictionary for STRING column: Origin with cardinality: 98, max length in bytes: 3, range: ABQ to VPS
Created dictionary for INT column: Quarter with cardinality: 1, range: 1 to 1
Created dictionary for INT column: LateAircraftDelay with cardinality: 50, range: -2147483648 to 303
......
......
Pushing segment: airlineStats_OFFLINE_16085_16085_29 to location: http://pinot-controller:9000 for table airlineStats
Sending request: http://pinot-controller:9000/v2/segments?tableName=airlineStats to controller: a413b0013806, version: Unknown
Response for pushing table airlineStats segment airlineStats_OFFLINE_16085_16085_29 to location http://pinot-controller:9000 - 200: {"status":"Successfully uploaded segment: airlineStats_OFFLINE_16085_16085_29 of table: airlineStats"}
Pushing segment: airlineStats_OFFLINE_16084_16084_30 to location: http://pinot-controller:9000 for table airlineStats
Sending request: http://pinot-controller:9000/v2/segments?tableName=airlineStats to controller: a413b0013806, version: Unknown
Response for pushing table airlineStats segment airlineStats_OFFLINE_16084_16084_30 to location http://pinot-controller:9000 - 200: {"status":"Successfully uploaded segment: airlineStats_OFFLINE_16084_16084_30 of table: airlineStats"}
bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile examples/batch/airlineStats/ingestionJobSpec.yaml

Alternately, you can create and then push separately by changing the jobType to SegmentCreation or SegmentTarPush.

Templating Ingestion Job Spec

The Ingestion job spec supports templating with Groovy Syntax.

This is convenient if you want to generate one ingestion job template file and schedule it on a daily basis with extra parameters updated daily.

For example, you could set inputDirURI with parameters to indicate the date, so that the ingestion job only processes the data for a particular date. Below is an example that templates the date for the input and output directories.

inputDirURI: 'examples/batch/airlineStats/rawdata/${year}/${month}/${day}'
outputDirURI: 'examples/batch/airlineStats/segments/${year}/${month}/${day}'

You can pass in arguments containing values for ${year}, ${month}, and ${day} when kicking off the ingestion job with -values param1=value1 param2=value2 ...

docker run \
    --network=pinot-demo \
    --name pinot-data-ingestion-job \
    ${PINOT_IMAGE} LaunchDataIngestionJob \
    -jobSpecFile examples/docker/ingestion-job-specs/airlineStats.yaml \
    -values year=2014 month=01 day=03

This ingestion job only generates segments for the date 2014-01-03.

Load data in streaming

Prerequisites

  1. Set up a cluster

  2. Create broker and server tenants

  3. Create a real-time table and set up a real-time stream

Below is an example of how to publish sample data to your stream. As soon as data is available to the real-time stream, it starts getting consumed by the real-time servers.

Kafka

Run the command below to stream JSON data into the Kafka topic flights-realtime:

docker run \
  --network pinot-demo \
  --name=loading-airlineStats-data-to-kafka \
  ${PINOT_IMAGE} StreamAvroIntoKafka \
  -avroFile examples/stream/airlineStats/sample_data/airlineStats_data.avro \
  -kafkaTopic flights-realtime -kafkaBrokerList kafka:9092 -zkAddress pinot-zookeeper:2181/kafka

Run the command below to stream JSON data into the Kafka topic flights-realtime:

bin/pinot-admin.sh StreamAvroIntoKafka \
  -avroFile examples/stream/airlineStats/sample_data/airlineStats_data.avro \
  -kafkaTopic flights-realtime -kafkaBrokerList localhost:19092 -zkAddress localhost:2191/kafka

0.4.0

The 0.4.0 release introduced the theta-sketch based distinct count function, an S3 filesystem plugin, a unified star-tree index implementation, migration from TimeFieldSpec to DateTimeFieldSpec, and more.

Summary

The 0.4.0 release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, and the deprecation of TimeFieldSpec in favor of DateTimeFieldSpec. Miscellaneous refactoring, performance improvements, and bug fixes were also included in this release. See details below.

Notable New Features

  • Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)

    • Used time column from table config instead of schema (#5320)

    • Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392

    • Used DATE_TIME as the primary time column for Pinot tables (#5399)

  • Supported range queries using indexes (#5240)

  • Supported complex aggregation functions

    • Supported Aggregation functions with multiple arguments (#5261)

    • Added api in AggregationFunction to get compiled input expressions (#5339)

  • Added a simple PinotFS benchmark driver (#5160)

  • Supported default star-tree (#5147)

  • Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)

    • One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregationFunction (#5382)

  • Added access control for Pinot server segment download api (#5260)

  • Added Pinot S3 Filesystem Plugin (#5249)

  • Text search improvement

    • Pruned stop words for text index (#5297)

    • Used 8byte offsets in chunk based raw index creator (#5285)

    • Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)

    • Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)

    • Made text index query cache a configurable option (#5176)

    • Added Lucene DocId to PinotDocId cache to improve performance (#5177)

    • Removed the construction of second bitmap in text index reader to improve performance (#5199)

  • Tooling/usability improvement

    • Added template support for Pinot Ingestion Job Spec (#5341)

    • Allowed user to specify zk data dir and don't do clean up during zk shutdown (#5295)

    • Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)

    • Update JVM settings for scripts (#5127)

    • Added Stream github events demo (#5189)

    • Moved docs link from gitbook to docs.pinot.apache.org (#5193)

  • Re-implemented ORCRecordReader (#5267)

  • Evaluated schema transform expressions during ingestion (#5238)

  • Handled count distinct query in selection list (#5223)

  • Enabled async processing in pinot broker query api (#5229)

  • Supported bootstrap mode for table rebalance (#5224)

  • Supported order-by on BYTES column (#5213)

  • Added Nightly publish to binary (#5190)

  • Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)

  • Supported built-in transform functions (#5312)

    • Added date time transform functions (#5326)

  • Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)

  • APIs Additions/Changes

    • Added a new server api for download of segments

      • /GET /segments/{tableNameWithType}/{segmentName}

  • Upgraded helix to 0.9.7 (#5411)

  • Added support to execute functions during query compilation (#5406)

  • Other notable refactoring

    • Moved table config into pinot-spi (#5194)

    • Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)

    • Added jsonExtractScalar function to extract field from json object (#4597)

    • Added template support for Pinot Ingestion Job Spec #5372

    • Cleaned up AggregationFunctionContext (#5364)

    • Optimized real-time range predicate when cardinality is high (#5331)

    • Made PinotOutputFormat use table config and schema to create segments (#5350)

    • Tracked unavailable segments in InstanceSelector (#5337)

    • Added a new best effort segment uploader with bounded upload time (#5314)

    • In SegmentPurger, used table config to generate the segment (#5325)

    • Decoupled schema from RecordReader and StreamMessageDecoder (#5309)

    • Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)

    • Improved GroupBy query performance (#5291)

    • Optimized ExpressionFilterOperator (#5132)

Major Bug Fixes

  • Do not release the PinotDataBuffer when closing the index (#5400)

  • Handled a no-arg function in query parsing and expression tree (#5375)

  • Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)

  • Fixed missing error message from pinot-admin command (#5305)

  • Fixed HDFS copy logic (#5218)

  • Fixed spark ingestion issue (#5216)

  • Fixed the capacity of the DistinctTable (#5204)

  • Fixed various links in the Pinot website

Work in Progress

  • Upsert: support overriding data in the real-time table (#4261).

    • Add pinot upsert features to pinot common (#5175)

  • Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc

Backward Incompatible Changes

  • TableConfig no longer supports de-serialization from a JSON string of nested JSON strings (i.e. no \" inside the JSON) (#5194)

  • The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):

    void aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
    void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
    void aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
geoindex schema
{
  "dataType": "BYTES",
  "name": "location_st_point",
  "transformFunction": "toSphericalGeography(stPoint(lon,lat))"
}
geoindex tableConfig
{
  "fieldConfigList": [
    {
      "name": "location_st_point",
      "encodingType":"RAW", // this actually disables the dictionary
      "indexes": {
        "h3": {
          "resolutions": [13, 5, 6]
        }
      }
    }
  ],
  ...
}
geoindex tableConfig
{
  "fieldConfigList": [{
    "name": "location_st_point",
    "encodingType":"RAW", // this actually disables the dictionary
    "indexTypes":["H3"],
    "properties": {
      "resolutions": "13, 5, 6" // Here resolutions must be a string with ints separated by commas 
    }
  }],
  ...
}
SELECT address, ST_DISTANCE(location_st_point, ST_Point(-122, 37, 1))
FROM starbucksStores
WHERE ST_DISTANCE(location_st_point, ST_Point(-122, 37, 1)) < 5000
limit 1000
ST_Union(geometry[] g1_array) → Geometry
ST_GeomFromText(String wkt) → Geometry
ST_GeomFromWKB(bytes wkb) → Geometry
ST_Point(double x, double y) → Point
ST_Polygon(String wkt) → Polygon (from WKT representation)
ST_GeogFromWKB(bytes wkb) → Geography (from Well-Known Binary geometry representation (WKB))
ST_GeogFromText(String wkt) → Geography (from Well-Known Text representation or extended (WKT))
ST_Distance(Geometry/Geography g1, Geometry/Geography g2) → double
ST_GeometryType(Geometry g) → String
ST_AsBinary(Geometry/Geography g) → bytes
ST_AsText(Geometry/Geography g) → string
toSphericalGeography(Geometry g) → Geography
toGeometry(Geography g) → Geometry
ST_Contains(Geometry/Geography, Geometry/Geography) → boolean

Explain Plan

Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The EXPLAIN PLAN FOR syntax can be used to obtain the execution plan of a query, which can be useful for further optimizing it.

The explain plan is a feature that is still under development and may change in future releases. Pinot explain plans are human-readable and are intended to be used for debugging and optimization purposes. This is especially important to keep in mind when using the explain plan in automated scripts or tools: explain plans, even the ones returned as tables or JSON, are not guaranteed to be stable across releases.

Pinot supports different types of explain plans depending on the query engine and the granularity or level of detail we want to obtain.

Different plans for different segments

Segments are the basic unit of data storage and processing in Pinot. When a query is executed, it is executed on each segment and the results are merged together. Not all segments have the same data distribution, indexes, and so on, so the query engine may decide to execute the query differently on different segments. This includes:

  • Segments that were not refreshed since indexes were added or removed on the table config.

  • Realtime segments that are being ingested, where some indexes (like range indexes) cannot be used.

  • Data distribution, especially the min and max values of columns, which can affect the query plan.

Given that a Pinot query can touch thousands of segments, Pinot tries to minimize the number of plans shown when explaining a query. By default, Pinot analyzes the plan for each segment and returns a simplified plan. How this simplification is done depends on the query engine; you can read more about that below.

There is a verbose mode that can be used to show the plan for each segment. This mode is activated by setting the explainPlanVerbose query option to true, by prefixing SET explainPlanVerbose=true; to the explain statement.
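For example, a minimal verbose explain (reusing the baseballStats table that appears later on this page as a stand-in) looks like this:

SET explainPlanVerbose=true;
EXPLAIN PLAN FOR
SELECT playerID, playerName
FROM baseballStats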

Explain on multi-stage query engine

Owing to the more complex nature of the multi-stage query engine, its explain plan can be customized to report on different aspects of query execution.

There are 3 different types of explain plans for the multi-stage query engine:

Segment plan
  Syntax by default: SET explainAskingServers=true; EXPLAIN PLAN FOR
  Syntax if segment plan is enabled: EXPLAIN PLAN FOR
  Description: Includes the segment specific information (like indexes).

Logical plan
  Syntax by default: EXPLAIN PLAN FOR or EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR
  Syntax if segment plan is enabled: EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR
  Description: Simplest multi-stage plan. No index or data shuffle information.

Workers plan
  Syntax by default: EXPLAIN IMPLEMENTATION PLAN FOR
  Syntax if segment plan is enabled: EXPLAIN IMPLEMENTATION PLAN FOR
  Description: Used to understand data shuffle between servers. Note: The name of this mode is open to discussion and may change in the future.

The syntax used to select each explain plan mode is confusing and it may be changed in the future.

Segment plan

The plan with segments is a detailed representation of the query execution plan that includes the segment specific information, like data distribution, indexes, etc.

This mode was introduced in Pinot 1.3.0 and is planned to become the default in future releases. Meanwhile, it can be enabled by setting the explainAskingServers query option to true, by prefixing SET explainAskingServers=true; to the explain statement. Alternatively, this mode can be activated by default by setting the broker configuration pinot.query.multistage.explain.include.segment.plan to true.

Independently of how it is activated, once this mode is enabled, EXPLAIN PLAN FOR syntax will include segment information.

Verbose and brief mode

As explained in Different plans for different segments, by default Pinot tries to minimize the number of plans shown when explaining a query. In multi-stage, the brief mode includes all the different plans, but equivalent plans are aggregated. For example, if the same plan is executed on 100 segments, the brief mode shows it only once and stats like the number of docs are summed.

In the verbose mode, one plan is shown per segment, including the segment name and all the segment-specific information. This can be useful for finding which segments are not using indexes, or which segments have a different data distribution.

Example

-- SET explainAskingServers=true is required if
-- pinot.query.multistage.explain.include.segment.plan is false,
-- optional otherwise
SET explainAskingServers=true;
EXPLAIN PLAN FOR
SELECT DISTINCT deviceOS, groupUUID
FROM userAttributes AS a
JOIN userGroups AS g
ON a.userUUID = g.userUUID
WHERE g.groupUUID = 'group-1'
LIMIT 100

Returns

Execution Plan
LogicalSort(offset=[0], fetch=[100])
  PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
    LogicalSort(fetch=[100])
      PinotLogicalAggregate(group=[{0, 1}])
        PinotLogicalExchange(distribution=[hash[0, 1]])
          PinotLogicalAggregate(group=[{0, 2}])
            LogicalJoin(condition=[=($1, $3)], joinType=[inner])
              PinotLogicalExchange(distribution=[hash[1]])
                LeafStageCombineOperator(table=[userAttributes])
                  StreamingInstanceResponse
                    StreamingCombineSelect
                      SelectStreaming(table=[userAttributes], totalDocs=[10000])
                        Project(columns=[[deviceOS, userUUID]])
                          DocIdSet(maxDocs=[40000])
                            FilterMatchEntireSegment(numDocs=[10000])
              PinotLogicalExchange(distribution=[hash[1]])
                LeafStageCombineOperator(table=[userGroups])
                  StreamingInstanceResponse
                    StreamingCombineSelect
                      SelectStreaming(table=[userGroups], totalDocs=[2478])
                        Project(columns=[[groupUUID, userUUID]])
                          DocIdSet(maxDocs=[50000])
                            FilterInvertedIndex(predicate=[groupUUID = 'group-1'], indexLookUp=[inverted_index], operator=[EQ])
                      SelectStreaming(segment=[userGroups_OFFLINE_4], table=[userGroups], totalDocs=[4])
                        Project(columns=[[groupUUID, userUUID]])
                          DocIdSet(maxDocs=[10000])
                            FilterEmpty
                      SelectStreaming(segment=[userGroups_OFFLINE_6], table=[userGroups], totalDocs=[4])
                        Project(columns=[[groupUUID, userUUID]])
                          DocIdSet(maxDocs=[10000])
                            FilterMatchEntireSegment(numDocs=[4])

Logical Plan

The logical plan is a high-level representation of the query execution plan. This plan is calculated on the broker without asking the servers for their segment specific plans. This means that the logical plan does not include the segment specific information, like data distribution, indexes, etc.

In Pinot 1.3.0, the logical plan is enabled by default and can be obtained by using EXPLAIN PLAN FOR syntax. Optionally, the segment plan can be enabled by default, in which case the logical plan can be obtained by using EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR syntax.

The recommended way to ask for the logical plan is to use EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR, given that this syntax is available in all versions of Pinot, regardless of the configuration.

Example:

-- The WITHOUT IMPLEMENTATION qualifier can be used to ensure the logical plan is used
-- It can be used in any version of Pinot even when segment plan is enabled by default
EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR 
SELECT DISTINCT deviceOS, groupUUID
FROM userAttributes AS a
JOIN userGroups AS g
ON a.userUUID = g.userUUID
WHERE g.groupUUID = 'group-1'
LIMIT 100

Returns:

Execution Plan
LogicalSort(offset=[0], fetch=[100])
  PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
    LogicalSort(fetch=[100])
      PinotLogicalAggregate(group=[{0, 1}])
        PinotLogicalExchange(distribution=[hash[0, 1]])
          PinotLogicalAggregate(group=[{0, 2}])
            LogicalJoin(condition=[=($1, $3)], joinType=[inner])
              PinotLogicalExchange(distribution=[hash[1]])
                LogicalProject(deviceOS=[$4], userUUID=[$6])
                  LogicalTableScan(table=[[default, userAttributes]])
              PinotLogicalExchange(distribution=[hash[1]])
                LogicalProject(groupUUID=[$3], userUUID=[$4])
                  LogicalFilter(condition=[=($3, _UTF-8'group-1')])
                    LogicalTableScan(table=[[default, userGroups]])

Workers plan

There has been some discussion about how to name this explain mode, and it may change in future versions. The term worker leaks an implementation detail that is not explained anywhere else in the user documentation.

The workers plan is a detailed representation of the query execution plan that includes information on how the query is distributed among different servers and the workers inside them. This plan does not include the segment specific information, like data distribution, indexes, etc., and it is probably the least useful of the plans for normal use cases.

Its main use case is to reduce data shuffling between workers by verifying that, for example, a join is executed in a colocated fashion.

Example

EXPLAIN IMPLEMENTATION PLAN FOR
SELECT DISTINCT deviceOS, groupUUID
FROM userAttributes AS a
JOIN userGroups AS g
ON a.userUUID = g.userUUID
WHERE g.groupUUID = 'group-1'
LIMIT 100

Returns:

[0]@192.168.0.98:54196|[0] MAIL_RECEIVE(BROADCAST_DISTRIBUTED)
├── [1]@192.168.0.98:54227|[3] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]} (Subtree Omitted)
├── [1]@192.168.0.98:54220|[2] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]} (Subtree Omitted)
├── [1]@192.168.0.98:54214|[1] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]} (Subtree Omitted)
└── [1]@192.168.0.98:54206|[0] MAIL_SEND(BROADCAST_DISTRIBUTED)->{[0]@192.168.0.98:54196|[0]}
    └── [1]@192.168.0.98:54206|[0] SORT LIMIT 100
        └── [1]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
            ├── [2]@192.168.0.98:54227|[3] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]} (Subtree Omitted)
            ├── [2]@192.168.0.98:54220|[2] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]} (Subtree Omitted)
            ├── [2]@192.168.0.98:54214|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]} (Subtree Omitted)
            └── [2]@192.168.0.98:54206|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[1]@192.168.0.98:54207|[0],[1]@192.168.0.98:54215|[1],[1]@192.168.0.98:54221|[2],[1]@192.168.0.98:54228|[3]}
                └── [2]@192.168.0.98:54206|[0] SORT LIMIT 100
                    └── [2]@192.168.0.98:54206|[0] AGGREGATE_FINAL
                        └── [2]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                            ├── [3]@192.168.0.98:54227|[3] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]} (Subtree Omitted)
                            ├── [3]@192.168.0.98:54220|[2] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]} (Subtree Omitted)
                            ├── [3]@192.168.0.98:54214|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]} (Subtree Omitted)
                            └── [3]@192.168.0.98:54206|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[2]@192.168.0.98:54207|[0],[2]@192.168.0.98:54215|[1],[2]@192.168.0.98:54221|[2],[2]@192.168.0.98:54228|[3]}
                                └── [3]@192.168.0.98:54206|[0] AGGREGATE_LEAF
                                    └── [3]@192.168.0.98:54206|[0] JOIN
                                        ├── [3]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                                        │   ├── [4]@192.168.0.98:54227|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                        │   └── [4]@192.168.0.98:54214|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]}
                                        │       └── [4]@192.168.0.98:54214|[0] PROJECT
                                        │           └── [4]@192.168.0.98:54214|[0] TABLE SCAN (userAttributes) null
                                        └── [3]@192.168.0.98:54206|[0] MAIL_RECEIVE(HASH_DISTRIBUTED)
                                            ├── [5]@192.168.0.98:54227|[1] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]} (Subtree Omitted)
                                            └── [5]@192.168.0.98:54214|[0] MAIL_SEND(HASH_DISTRIBUTED)->{[3]@192.168.0.98:54207|[0],[3]@192.168.0.98:54215|[1],[3]@192.168.0.98:54221|[2],[3]@192.168.0.98:54228|[3]}
                                                └── [5]@192.168.0.98:54214|[0] PROJECT
                                                    └── [5]@192.168.0.98:54214|[0] FILTER
                                                        └── [5]@192.168.0.98:54214|[0] TABLE SCAN (userGroups) null

Explain on single stage query engine

The explain plan for the single-stage query engine is described in depth in Explain Plan (Single-Stage).

The explain plan for the single-stage query engine is simpler and less customizable, but returns the information in a tabular format. For example, the query EXPLAIN PLAN FOR SELECT playerID, playerName FROM baseballStats returns the following table:

+---------------------------------------------|------------|---------|
| Operator                                    | Operator_Id|Parent_Id|
+---------------------------------------------|------------|---------|
|BROKER_REDUCE(limit:10)                      | 1          | 0       |
|COMBINE_SELECT                               | 2          | 1       |
|PLAN_START(numSegmentsForThisPlan:1)         | -1         | -1      |
|SELECT(selectList:playerID, playerName)      | 3          | 2       |
|TRANSFORM_PASSTHROUGH(playerID, playerName)  | 4          | 3       |
|PROJECT(playerName, playerID)                | 5          | 4       |
|DOC_ID_SET                                   | 6          | 5       |
|FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)      | 7          | 6       |
+---------------------------------------------|------------|---------|

The Operator column describes the operator that Pinot will run, whereas the Operator_Id and Parent_Id columns show the parent-child relationship between operators, which forms the execution tree. For example, the plan above should be understood as:

BROKER_REDUCE(limit:10)
└── COMBINE_SELECT
    └── PLAN_START(numSegmentsForThisPlan:1)
        └── SELECT(selectList:playerID, playerName)
            └── TRANSFORM_PASSTHROUGH(playerID, playerName)
                └── PROJECT(playerName, playerID)
                    └── DOC_ID_SET
                        └── FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)

Running Pinot in Docker

This guide will show you how to run a Pinot cluster using Docker.

Get started setting up a Pinot cluster with Docker using the guide below.

Prerequisites:

  • Install Docker

  • Configure Docker memory with the following minimum resources:

    • CPUs: 8

    • Memory: 16.00 GB

    • Swap: 4 GB

    • Disk Image size: 60 GB

The latest Pinot Docker image is published at apachepinot/pinot:latest. View a list of all published tags on Docker Hub.

Pull the latest Docker image onto your machine by running the following command:

To pull a specific version, modify the command like below:

Set up a cluster

Once you've downloaded the Pinot Docker image, it's time to set up a cluster. There are two ways to do this.

Quick start

Pinot comes with quick start commands that launch instances of Pinot components in the same process and import pre-built datasets.

For example, the following quick start command launches Pinot with a baseball dataset pre-loaded:

For a list of all available quick start commands, see Quick Start Examples.

Below are the usages of different ports:

2123: Zookeeper Port

9000: Pinot Controller Port

8000: Pinot Broker Port

7050: Pinot Server Port

6000: Pinot Minion Port

Manual cluster

The quick start scripts launch Pinot with minimal resources. If you want to play with bigger datasets (more than a few MB), you can launch each of the Pinot components individually.

Note that these are sample configurations to be used as references. You will likely want to customize them to meet your needs for production use.

Docker

Create a Network

Create an isolated bridge network in docker

Export Docker Image tags

Export the necessary docker image tags for Pinot, Zookeeper, and Kafka.

Start Zookeeper

Start Zookeeper in daemon mode. This is a single-node Zookeeper setup. Zookeeper is the central metadata store for Pinot and should be set up with replication for production use. For more information, see Running Replicated Zookeeper.

Start Pinot Controller

Start the Pinot Controller in daemon mode and connect it to Zookeeper.

The command below expects a 4GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.

Start Pinot Broker

Start the Pinot Broker in daemon mode and connect it to Zookeeper.

The command below expects a 4GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.

Start Pinot Server

Start the Pinot Server in daemon mode and connect it to Zookeeper.

The command below expects a 16GB memory container. Tune -Xms and -Xmx if your machine doesn't have enough resources.

Start Kafka

Optionally, you can also start Kafka for setting up real-time streams. This brings up the Kafka broker on port 9092.

All Pinot-related components are now running as an empty cluster.

Run the command below to check container status:

Sample Console Output

Docker Compose

Export Docker Image tags

Optionally, export the necessary docker image tags for Pinot, Zookeeper, and Kafka.

Create docker-compose.yml file

Create a file called docker-compose.yml that contains the following:

Launch the components

Run the following command to launch all the required components:

OR, optionally, run the following command to launch all the components, including kafka:

Run the below command to check the container status:

Sample Console Output

Once your cluster is up and running, see Exploring Pinot to learn how to run queries against the data.

If you have minikube or Docker Kubernetes installed, you can also try running the Kubernetes quick start.

Grouping Algorithm

In this guide we will learn about the heuristics used for trimming results in Pinot's grouping algorithm (used when processing GROUP BY queries) to make sure that the server doesn't run out of memory.

V1 / Single Stage Query Engine

Within segment

When grouping rows within a segment, Pinot keeps a maximum of numGroupsLimit groups per segment. This value is set to 100,000 by default and can be configured by the pinot.server.query.executor.num.groups.limit property.

If the number of groups of a segment reaches this value, the extra groups will be ignored and the results returned may not be completely accurate. The numGroupsLimitReached property will be set to true in the query response if the value is reached.
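For example, the limit can be raised for a single query with the numGroupsLimit query option (the value below is illustrative; the table and columns reuse names from the examples later on this page):

SET numGroupsLimit = 200000;
SELECT colB, SUM(colA)
FROM myTable
GROUP BY colB
LIMIT 10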

Trimming tail groups

After the inner segment groups have been computed, the Pinot query engine optionally trims tail groups. Tail groups are ones that have a lower rank based on the ORDER BY clause used in the query.

When segment group trim is enabled, the query engine trims the tail groups and keeps only max(minSegmentGroupTrimSize, 5 * LIMIT) groups, where LIMIT is the maximum number of records returned by the query (usually set via the LIMIT clause). Pinot keeps at least 5 * LIMIT groups when trimming tail groups to ensure the accuracy of results. Trimming is performed only when both ordering and a limit are specified.

This value can be overridden on a query by query basis by passing the following option:
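A minimal sketch (the table, columns, and trim size value are placeholders):

SELECT colB, SUM(colA)
FROM myTable
GROUP BY colB
ORDER BY SUM(colA) DESC
LIMIT 10
OPTION(minSegmentGroupTrimSize=10000)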

Cross segments

Once grouping has been done within a segment, Pinot merges the segment results, trims the tail groups, and keeps max(minServerGroupTrimSize, 5 * LIMIT) groups if it gets more groups than that.

minServerGroupTrimSize is set to 5,000 by default and can be adjusted by configuring the pinot.server.query.executor.min.server.group.trim.size property. Cross segments trim can be disabled by setting the property to -1.

When cross segments trim is enabled, the server trims the tail groups before sending the results back to the broker. To reduce memory usage while merging per-segment results, it also trims the tail groups when the number of groups reaches the trimThreshold.

trimThreshold is the upper bound of groups allowed in a server for each query, to protect servers from running out of memory. To avoid trimming too frequently, the actual trim size is bounded to trimThreshold / 2. Combining this with the above equation, the actual trim size for a query is calculated as min(max(minServerGroupTrimSize, 5 * LIMIT), trimThreshold / 2). For example, with the defaults (minServerGroupTrimSize = 5,000 and trimThreshold = 1,000,000) and LIMIT 10, the actual trim size is min(max(5000, 50), 500000) = 5,000.

This configuration is set to 1,000,000 by default and can be adjusted by configuring the pinot.server.query.executor.groupby.trim.threshold property.

A higher threshold reduces the amount of trimming done, but consumes more heap memory. If the threshold is set to more than 1,000,000,000, the server will only trim the groups once before returning the results to the broker.

This value can be overridden on a query by query basis by passing the following option:
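Again a minimal sketch (the threshold value is illustrative):

SELECT colB, SUM(colA)
FROM myTable
GROUP BY colB
ORDER BY SUM(colA) DESC
LIMIT 10
OPTION(groupTrimThreshold=2000000)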

At Broker

When the broker performs the final merge of the groups returned by the various servers, another level of trimming takes place. The tail groups are trimmed and max(minBrokerGroupTrimSize, 5 * LIMIT) groups are retained.

The default value of minBrokerGroupTrimSize is 5,000. This can be adjusted by configuring the pinot.broker.min.group.trim.size property.

GROUP BY behavior

Pinot sets a default LIMIT of 10 if one isn't defined and this applies to GROUP BY queries as well. Therefore, if no limit is specified, Pinot will return 10 groups.

Pinot will trim tail groups based on the ORDER BY clause to reduce the memory footprint and improve the query performance. It keeps at least 5 * LIMIT groups so that the results give good enough approximation in most cases. The configurable min trim size can be used to increase the groups kept to improve the accuracy but has a larger extra memory footprint.

HAVING behavior

If the query has a HAVING clause, it is applied on the merged GROUP BY results that already have the tail groups trimmed. If the HAVING clause filters in the opposite direction of the ORDER BY order (for example, ORDER BY SUM(colA) DESC combined with HAVING SUM(colA) < 100), groups matching the condition might already be trimmed and not returned.

Increase min trim size to keep more groups in these cases.

Examples

For a simple keyed aggregation query such as:

a simplified execution plan, showing where trimming happens, looks like:

For the sake of brevity, the plan above doesn't mention that the actual number of groups kept is min(trim_value, 5 * LIMIT).

V2 / Multi Stage Query Engine

Compared to V1, the V2 engine uses a similar algorithm, but there are notable differences:

  • V2 doesn't implicitly limit number of query results (to 10)

  • V2 doesn't limit number of groups when aggregating cross-segment data

  • V2 doesn't trim results by default in any stage

  • V2 doesn't aggregate results in the broker, pushing final aggregation processing to server(s)

The default V2 algorithm is shown on the following diagram:

Apart from limiting the number of groups at the segment level, a similar limit is applied at the intermediate stage. Since the V2 query engine allows subqueries, an execution plan can contain an arbitrary number of stages doing intermediate aggregation between the leaf (bottom-most) and top-most stages, and each stage can be implemented with many instances of AggregateOperator (shown as PinotLogicalAggregate in EXPLAIN's output). The operator limits the number of distinct groups to 100,000 by default, which can be overridden with the numGroupsLimit option or the num_groups_limit aggregate hint. The limit applies to a single operator instance, meaning that the next stage could receive a total of num_instances * num_groups_limit groups.
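For illustration, a sketch of how the aggregate hint can be attached to a multi-stage query (reusing the tab table from the examples below; the limit value is arbitrary):

SELECT /*+ aggOptions(num_groups_limit='200000') */
  i, j, COUNT(*) AS cnt
FROM tab
GROUP BY i, j
ORDER BY i ASC, j ASC
LIMIT 3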

It is possible to enable group limiting and trimming at other stages with:

  • is_enable_group_trim hint - it enables trimming at all V1/V2 levels and group limiting at cross-segment level. minSegmentGroupTrimSize value needs to be set separately. Default value: false

  • mse_min_group_trim_size hint - triggers sorting and trimming of group by results at intermediate stage. Requires is_enable_group_trim hint. Default value: 5000

When the above hints are used, query processing looks as follows:

The actual processing depends on the query, which may not contain a V1 leaf-stage aggregate component and may rely on AggregateOperator at all levels. Moreover, since trimming relies on order and limit propagation, it may not happen in a subquery if the order by column(s) are not available.

Examples

  • If hints are applied to the query mentioned in the V1 examples above, that is:

    then the execution plan should be as follows:

    In the plan above, trimming happens in three operators: GroupBy, CombineGroupBy, and AggregateOperator (which is the physical implementation of PinotLogicalAggregate).

  • Aggregating over the result of a join, e.g.

    should produce the following execution plan:

    in which there is no leaf-stage V1 operator and all aggregation stages are implemented with the V2 operator, PinotLogicalAggregate.

Configuration Parameters/hints

Parameter
Default
Query Override
Description

(*) SSQ - Single-Stage Query

(**) MSQ - Multi-Stage Query

Lookup UDF Join

For more information about using JOINs with the multi-stage query engine, see JOINs.

Lookup UDF Join is only supported with the single-stage query engine (v1). Lookup joins can be executed using query hints in the multi-stage query engine. For more information about using JOINs with the multi-stage query engine, see JOINs.

The lookup UDF is used to get dimension data via a primary key from a dimension table, allowing decoration-join functionality. The lookup UDF can only be used with a dimension table in Pinot.

Syntax

The UDF function syntax is listed below:
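A minimal sketch of a query using the UDF, with placeholder table and column names matching the parameter descriptions below:

SELECT
  factJoinKey,
  LOOKUP('dimTable', 'dimColToLookUp', 'dimJoinKey', factJoinKey) AS dimColToLookUp
FROM factTable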

  • dimTable Name of the dim table to perform the lookup on.

  • dimColToLookUp The column name of the dim table to be retrieved to decorate our result.

  • dimJoinKey The column name on which we want to perform the lookup i.e. the join column name for dim table.

  • factJoinKey The column name on which we want to perform the lookup against e.g. the join column name for fact table

Note that:

  1. All the dim-table-related expressions are expressed as literal strings. This is a limitation of the LOOKUP UDF syntax: we cannot express a column identifier that doesn't exist in the query's main table, which is the fact table.

  2. The syntax definition of [ '''dimJoinKey''', factJoinKey ]* indicates that if there are multiple dim partition columns, multiple join key pairs should be expressed.

Examples

Here are some examples.

Single-partition-key-column Example

Consider the table baseballStats

Column
Type

and dim table dimBaseballTeams

Column
Type

several acceptable queries are:

Dim-Fact LOOKUP example

playerName
teamID
teamName
teamAddress

Self LOOKUP example

teamID
nameFromLocal
nameFromLookup

Complex-partition-key-columns Example

Consider a single dimension table with schema:

BILLING SCHEMA

Column
Type

Self LOOKUP example

customerId
missedPayment
lookedupCity

Usage FAQ

  • The data return type of the UDF will be that of the dimColToLookUp column type.

  • When multiple primary key columns are used for the dimension table (e.g., a composite primary key), ensure that the order of keys appearing in the lookup() UDF is the same as the order defined in primaryKeyColumns in the dimension table schema.

docker pull apachepinot/pinot:latest
docker pull apachepinot/pinot:1.2.0
docker run \
    -p 2123:2123 \
    -p 9000:9000 \
    -p 8000:8000 \
    -p 7050:7050 \
    -p 6000:6000 \
    apachepinot/pinot:1.2.0 QuickStart \
    -type batch
docker network create -d bridge pinot-demo
export PINOT_IMAGE=apachepinot/pinot:1.2.0
export ZK_IMAGE=zookeeper:3.9.2
export KAFKA_IMAGE=bitnami/kafka:3.6
docker run \
    --network=pinot-demo \
    --name pinot-zookeeper \
    --restart always \
    -p 2181:2181 \
    -d ${ZK_IMAGE}
docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-controller \
    -p 9000:9000 \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log" \
    -d ${PINOT_IMAGE} StartController \
    -zkAddress pinot-zookeeper:2181
docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-broker \
    -p 8099:8099 \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log" \
    -d ${PINOT_IMAGE} StartBroker \
    -zkAddress pinot-zookeeper:2181
docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-server \
    -p 8098:8098 \
    -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log" \
    -d ${PINOT_IMAGE} StartServer \
    -zkAddress pinot-zookeeper:2181
docker run --rm -ti \
    --network pinot-demo --name=kafka \
    -e KAFKA_ZOOKEEPER_CONNECT=pinot-zookeeper:2181/kafka \
    -e KAFKA_BROKER_ID=0 \
    -e KAFKA_ADVERTISED_HOST_NAME=kafka \
    -p 9092:9092 \
    -d ${KAFKA_IMAGE}
docker container ls -a
CONTAINER ID   IMAGE                     COMMAND                  CREATED              STATUS              PORTS                                                       NAMES
accc70bc7f07   bitnami/kafka:3.6         "/opt/bitnami/script…"   About a minute ago   Up About a minute   0.0.0.0:9092->9092/tcp                                      kafka
1b8b80395959   apachepinot/pinot:1.2.0   "./bin/pinot-admin.s…"   About a minute ago   Up About a minute   8096-8097/tcp, 8099/tcp, 9000/tcp, 0.0.0.0:8098->8098/tcp   pinot-server
134a67eec957   apachepinot/pinot:1.2.0   "./bin/pinot-admin.s…"   About a minute ago   Up About a minute   8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp             pinot-broker
4fcc72cb7302   apachepinot/pinot:1.2.0   "./bin/pinot-admin.s…"   About a minute ago   Up About a minute   8096-8099/tcp, 0.0.0.0:9000->9000/tcp                       pinot-controller
144304524f6c   zookeeper:3.9.2           "/docker-entrypoint.…"   About a minute ago   Up About a minute   2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp        pinot-zookeeper
export PINOT_IMAGE=apachepinot/pinot:1.2.0
export ZK_IMAGE=zookeeper:3.9.2
export KAFKA_IMAGE=bitnami/kafka:3.6
docker-compose.yml
version: '3.7'

services:
  pinot-zookeeper:
    image: ${ZK_IMAGE:-zookeeper:3.9.2}
    container_name: "pinot-zookeeper"
    restart: unless-stopped
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    networks:
      - pinot-demo
    healthcheck:
      test: ["CMD", "zkServer.sh", "status"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 10s

  pinot-kafka:
    image: ${KAFKA_IMAGE:-bitnami/kafka:3.6}
    container_name: "kafka"
    restart: unless-stopped
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: pinot-zookeeper:2181/kafka
      KAFKA_BROKER_ID: 0
      KAFKA_ADVERTISED_HOST_NAME: kafka
    depends_on:
      pinot-zookeeper:
        condition: service_healthy
    networks:
      - pinot-demo
    healthcheck:
      test: [ "CMD-SHELL", "kafka-broker-api-versions.sh -bootstrap-server kafka:9092" ]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 10s
    deploy:
      replicas: ${KAFKA_REPLICAS:-0}  # Default to 0, meaning Kafka won't start unless KAFKA_REPLICAS is set

  pinot-controller:
    image: ${PINOT_IMAGE:-apachepinot/pinot:1.2.0}
    command: "StartController -zkAddress pinot-zookeeper:2181"
    container_name: "pinot-controller"
    restart: unless-stopped
    ports:
      - "9000:9000"
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log"
    depends_on:
      pinot-zookeeper:
        condition: service_healthy
    networks:
      - pinot-demo
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 10s

  pinot-broker:
    image: ${PINOT_IMAGE:-apachepinot/pinot:1.2.0}
    command: "StartBroker -zkAddress pinot-zookeeper:2181"
    container_name: "pinot-broker"
    restart: unless-stopped
    ports:
      - "8099:8099"
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log"
    depends_on:
      pinot-controller:
        condition: service_healthy
    networks:
      - pinot-demo
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8099/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 10s

  pinot-server:
    image: ${PINOT_IMAGE:-apachepinot/pinot:1.2.0}
    command: "StartServer -zkAddress pinot-zookeeper:2181"
    container_name: "pinot-server"
    restart: unless-stopped
    ports:
      - "8098:8098"
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log"
    depends_on:
      pinot-broker:
        condition: service_healthy
    networks:
      - pinot-demo
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8097/health/readiness || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 10s

networks:
  pinot-demo:
    name: pinot-demo
    driver: bridge
docker compose --project-name pinot-demo up
export KAFKA_REPLICAS=1
docker compose --project-name pinot-demo up
docker container ls -a
CONTAINER ID   IMAGE                     COMMAND                  CREATED          STATUS                        PORTS                                                       NAMES
f34a046ac69f   bitnami/kafka:3.6         "/opt/bitnami/script…"   9 minutes ago    Up About a minute (healthy)   0.0.0.0:9092->9092/tcp                                      kafka
f28021bd5b1d   apachepinot/pinot:1.2.0   "./bin/pinot-admin.s…"   18 minutes ago   Up About a minute (healthy)   8096-8097/tcp, 8099/tcp, 9000/tcp, 0.0.0.0:8098->8098/tcp   pinot-server
e938453054b0   apachepinot/pinot:1.2.0   "./bin/pinot-admin.s…"   18 minutes ago   Up About a minute (healthy)   8096-8098/tcp, 9000/tcp, 0.0.0.0:8099->8099/tcp             pinot-broker
e0d0c71303a8   apachepinot/pinot:1.2.0   "./bin/pinot-admin.s…"   18 minutes ago   Up About a minute (healthy)   8096-8099/tcp, 0.0.0.0:9000->9000/tcp                       pinot-controller
4be5f168f252   zookeeper:3.9.2           "/docker-entrypoint.…"   18 minutes ago   Up About a minute (healthy)   2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, 8080/tcp        pinot-zookeeper
lookupUDFSpec:
    LOOKUP
    '('
    '''dimTable'''
    '''dimColToLookup'''
    [ '''dimJoinKey''', factJoinKey ]*
    ')'

baseballStats columns:
playerID STRING
yearID INT
teamID STRING
league STRING
playerName STRING
playerStint INT
numberOfGames INT
numberOfGamesAsBatter INT
AtBatting INT
runs INT

dimBaseballTeams columns:
teamID STRING
teamName STRING
teamAddress STRING

SELECT 
  playerName, 
  teamID, 
  LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName, 
  LOOKUP('dimBaseballTeams', 'teamAddress', 'teamID', teamID) AS teamAddress
FROM baseballStats 

playerName | teamID | teamName | teamAddress
David Allan | BOS | Boston Red Caps/Beaneaters (from 1876–1900) or Boston Red Sox (since 1953) | 4 Jersey Street, Boston, MA
David Allan | CHA | null | null
David Allan | SEA | Seattle Mariners (since 1977) or Seattle Pilots (1969) | 1250 First Avenue South, Seattle, WA
David Allan | SEA | Seattle Mariners (since 1977) or Seattle Pilots (1969) | 1250 First Avenue South, Seattle, WA

SELECT 
  teamID, 
  teamName AS nameFromLocal,
  LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS nameFromLookup
FROM dimBaseballTeams

teamID | nameFromLocal | nameFromLookup
ANA | Anaheim Angels | Anaheim Angels
ARI | Arizona Diamondbacks | Arizona Diamondbacks
ATL | Atlanta Braves | Atlanta Braves
BAL | Baltimore Orioles (original- 1901–1902 current- since 1954) | Baltimore Orioles (original- 1901–1902 current- since 1954)

billing columns:
customerId INT
creditHistory STRING
firstName STRING
lastName STRING
isCarOwner BOOLEAN
city STRING
maritalStatus STRING
buildingType STRING
missedPayment STRING
billingMonth STRING

select 
  customerId,
  missedPayment, 
  LOOKUP('billing', 'city', 'customerId', customerId, 'creditHistory', creditHistory) AS lookedupCity 
from billing

customerId | missedPayment | lookedupCity
341 | Paid | Palo Alto
374 | Paid | Mountain View
398 | Paid | Palo Alto
427 | Paid | Cupertino
435 | Paid | Cupertino

SELECT * 
FROM ...
OPTION(minSegmentGroupTrimSize=value)
SELECT * 
FROM ...
OPTION(groupTrimThreshold=value)
SELECT SUM(colA) 
FROM myTable 
GROUP BY colB 
HAVING SUM(colA) < 100 
ORDER BY SUM(colA) DESC 
LIMIT 10
SELECT i, j, count(*) AS cnt
FROM tab
GROUP BY i, j
ORDER BY i ASC, j ASC
LIMIT 3;
BROKER_REDUCE(sort:[i, j],limit:10) <- sort and trim groups to minBrokerGroupTrimSize
  COMBINE_GROUP_BY <- sort and trim groups to minServerGroupTrimSize
    PLAN_START
      GROUP_BY <- limit to numGroupsLimit, then sort and trim to minSegmentGroupTrimSize
        PROJECT(i, j)
          DOC_ID_SET
            FILTER_MATCH_ENTIRE_SEGMENT
SELECT /*+ aggOptions(is_enable_group_trim='true', mse_min_group_trim_size='10') */        
i, j, count(*) as cnt
 FROM myTable
 GROUP BY i, j
 ORDER BY i ASC, j ASC
 LIMIT 3
LogicalSort
  PinotLogicalSortExchange(distribution=[hash])
    LogicalSort
      PinotLogicalAggregate <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
        PinotLogicalExchange(distribution=[hash[0, 1]])
          LeafStageCombineOperator(table=[mytable])
            StreamingInstanceResponse
              CombineGroupBy <- aggregate up to minSegmentGroupTrimSize groups
                GroupBy <- aggregate up to numGroupsLimit groups, optionally sort and trim to minSegmenGroupTrimSize
                  Project
                    DocIdSet
                      FilterMatchEntireSegment
select /*+  aggOptions(is_enable_group_trim='true', mse_min_group_trim_size='3') */ 
       t1.i, t1.j, count(*) as cnt
from tab t1
join tab t2 on 1=1
group by t1.i, t1.j
order by t1.i asc, t1.j asc
limit 5
LogicalSort
  PinotLogicalSortExchange(distribution=[hash])
    LogicalSort
      PinotLogicalAggregate(aggType=[FINAL]) <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
        PinotLogicalExchange(distribution=[hash[0, 1]])
          PinotLogicalAggregate(aggType=[LEAF]) <- aggregate up to num_groups_limit groups, then sort and trim output to group_trim_size
            LogicalJoin(condition=[true])
              PinotLogicalExchange(distribution=[random])
                LeafStageCombineOperator(table=[mytable])
                  ...
                    FilterMatchEntireSegment
              PinotLogicalExchange(distribution=[broadcast])
                LeafStageCombineOperator(table=[mytable])
                  ...
                    FilterMatchEntireSegment

pinot.server.query.executor.max.execution.threads
  Default: -1 (use all execution threads)
  Query Override: SET maxExecutionThreads = value;
  Description: The maximum number of execution threads (parallelism of segment processing) used per query.

pinot.server.query.executor.num.groups.limit
  Default: 100,000
  Query Override: SET numGroupsLimit = value;
  Description: The maximum number of groups allowed per segment.

pinot.server.query.executor.min.segment.group.trim.size
  Default: -1 (disabled)
  Query Override: SET minSegmentGroupTrimSize = value;
  Description: The minimum number of groups to keep when trimming groups at the segment level.

pinot.server.query.executor.min.server.group.trim.size
  Default: 5,000
  Query Override: SET minServerGroupTrimSize = value;
  Description: The minimum number of groups to keep when trimming groups at the server level.

pinot.server.query.executor.groupby.trim.threshold
  Default: 1,000,000
  Query Override: SET groupTrimThreshold = value;
  Description: The number of groups to trigger the server level trim.

pinot.broker.min.group.trim.size
  Default: 5,000
  Query Override: SET minBrokerGroupTrimSize = value;
  Description: The minimum number of groups to keep when trimming groups at the broker. Applies only to SSQ(*).

pinot.broker.mse.enable.group.trim
  Default: false (disabled)
  Query Override: /*+ aggOptions(is_enable_group_trim='value') */
  Description: Enable group trim for the query (if possible). Applies only to MSQ(**).

pinot.server.query.executor.mse.min.group.trim.size
  Default: 5,000
  Query Override: /*+ aggOptions(mse_min_group_trim_size='value') */ or SET mseMinGroupTrimSize = value;
  Description: The number of groups to keep when trimming groups at the intermediate stage. Applies only to MSQ(**).

[Figure: Group by results approximation at various stages of V1 query execution]
[Figure: Default V2 engine group by results approximation]
[Figure: Group by results trimming at various stages of V2 query execution utilizing V1 in leaf stage]

0.6.0

This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of the HAVING clause, more validations on table config and schema, and support of ordinals in GROUP BY and ORDER BY clauses.

Summary

This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals in GROUP BY and ORDER BY clause, array transform functions, adding push job type of segment metadata only mode, and some new APIs like updating instance tags, new health check endpoint. It also contains many key bug fixes. See details below.

The release was cut from the following commit: e5c9bec and the following cherry-picks:

  • d033a11

Notable New Features

  • Tiered storage (#5793)

  • Upsert feature (#6096, #6113, #6141, #6149, #6167)

  • Pre-generate aggregation functions in QueryContext (#5805)

  • Adding controller healthcheck endpoint: /health (#5846)

  • Add pinot-spark-connector (#5787)

  • Support multi-value non-dictionary group by (#5851)

  • Support type conversion for all scalar functions (#5849)

  • Add additional datetime functionality (#5438)

  • Support post-aggregation in ORDER-BY (#5856)

  • Support post-aggregation in SELECT (#5867)

  • Add RANGE FilterKind to support merging ranges for SQL (#5898)

  • Add HAVING support (#5889)

  • Support for exact distinct count for non int data types (#5872)

  • Add max qps bucket count (#5922)

  • Add Range Indexing support for raw values (#5853)

  • Add IdSet and IdSetAggregationFunction (#5926)

  • [Deepstore by-pass]Add a Deepstore bypass integration test with minor bug fixes. (#5857)

  • Add Hadoop counters for detecting schema mismatch (#5873)

  • Add RawThetaSketchAggregationFunction (#5970)

  • Instance API to directly updateTags (#5902)

  • Add streaming query handler (#5717)

  • Add InIdSetTransformFunction (#5973)

  • Add ingestion descriptor in the header (#5995)

  • Zookeeper put api (#5949)

  • Feature/#5390 segment indexing reload status api (#5718)

  • Segment processing framework (#5934)

  • Support streaming query in QueryExecutor (#6027)

  • Add list of allowed tables for emitting table level metrics (#6037)

  • Add FilterOptimizer which supports optimizing both PQL and SQL query filter (#6056)

  • Adding push job type of segment metadata only mode (#5967)

  • Minion taskExecutor for RealtimeToOfflineSegments task (#6050, #6124)

  • Adding array transform functions: array_average, array_max, array_min, array_sum (#6084)

  • Allow modifying/removing existing star-trees during segment reload (#6100)

  • Implement off-heap bloom filter reader (#6118)

  • Support for multi-threaded Group By reducer for SQL. (#6044)

  • Add OnHeapGuavaBloomFilterReader (#6147)

  • Support using ordinals in GROUP BY and ORDER BY clause (#6152)

  • Merge common APIs for Dictionary (#6176)

  • Add table level lock for segment upload (#6165)

  • Added recursive functions validation check for group by (#6186)

  • Add StrictReplicaGroupInstanceSelector (#6208)

  • Add IN_SUBQUERY support (#6022)

  • Add IN_PARTITIONED_SUBQUERY support (#6043)

  • Some UI features (#5810, #5981, #6117, #6215)

Special notes

  • Brokers should be upgraded before servers in order to remain backward compatible:

    • Change group key delimiter from '\t' to '\0' (#5858)

    • Support for exact distinct count for non int data types (#5872)

  • Pinot Components have to be deployed in the following order:

    (PinotServiceManager -> Bootstrap services in role ServiceRole.CONTROLLER -> All remaining bootstrap services in parallel)

    • Starts Broker and Server in parallel when using ServiceManager (#5917)

  • New settings introduced and old ones deprecated:

    • Make real-time threshold property names less ambiguous (#5953)

    • Change Signature of Broker API in Controller (#6119)

  • This aggregation function is still in beta. This PR involves a change to the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version:

    • Enhance DistinctCountThetaSketchAggregationFunction (#6004)

Major Bug fixes

  • Improve performance of DistinctCountThetaSketch by eliminating empty sketches and unions. (#5798)

  • Enhance VarByteChunkSVForwardIndexReader to directly read from data buffer for uncompressed data (#5816)

  • Fixing backward-compatible issue of schema fetch call (#5885)

  • Fix race condition in MetricsHelper (#5887)

  • Fixing the race condition that segment finished before ControllerLeaderLocator created. (#5864)

  • Fix CSV and JSON converter on BYTES column (#5931)

  • Fixing the issue that transform UDFs are parsed as function name 'OTHER', not the real function names (#5940)

  • Incorporating embedded exception while trying to fetch stream offset (#5956)

  • Use query timeout for planning phase (#5990)

  • Add null check while fetching the schema (#5994)

  • Validate timeColumnName when adding/updating schema/tableConfig (#5966)

  • Handle the partitioning mismatch between table config and stream (#6031)

  • Fix built-in virtual columns for immutable segment (#6042)

  • Refresh the routing when real-time segment is committed (#6078)

  • Add support for Decimal with Precision Sum aggregation (#6053)

  • Fixing the calls to Helix to throw exception if zk connection is broken (#6069)

  • Allow modifying/removing existing star-trees during segment reload (#6100)

  • Add max length support in schema builder (#6112)

  • Enhance star-tree to skip matching-all predicate on non-star-tree dimension (#6109)

Backward Incompatible Changes

  • Make real-time threshold property names less ambiguous (#5953)

  • Enhance DistinctCountThetaSketchAggregationFunction (#6004)

  • Deep Extraction Support for ORC, Thrift, and ProtoBuf Records (#6046)

Neha Pawar from the Apache Pinot team shows you how to set up a Pinot cluster
PINOT_VERSION=1.1.0 #set to the Pinot version you decide to use

wget https://downloads.apache.org/pinot/apache-pinot-$PINOT_VERSION/apache-pinot-$PINOT_VERSION-bin.tar.gz
git clone https://github.com/apache/pinot.git
cd pinot
mvn install package -DskipTests -Pbin-dist
cd build

Quick Start Examples

This section describes quick start commands that launch all Pinot components in a single process.

Pinot ships with QuickStart commands that launch Pinot components in a single process and import pre-built datasets. These quick start examples are a good place to start if you're just getting started with Pinot. The examples begin with the Batch Processing example, after the following notes:

  • Prerequisites

    You must have either installed Pinot locally or have Docker installed if you want to use the Pinot Docker image. The examples are available for each option and work the same. The decision of which to choose depends on your installation preference and how you generally like to work. If you don't know which to choose, using Docker will make your cleanup easier after you are done with the examples.

  • Pinot versions in examples

    The Docker-based examples on this page use pinot:latest, which instructs Docker to pull and use the most recent release of Apache Pinot. If you prefer to use a specific release instead, you can designate it by replacing latest with the release number, like this: pinot:0.12.1.

    The local install-based examples that are run using the launcher scripts will use the Apache Pinot version you installed.

  • Stopping a running example

    To stop a running example, enter Ctrl+C in the same terminal where you ran the docker run command to start the example.

macOS Monterey Users

By default the Airplay receiver server runs on port 7000, which is also the port used by the Pinot Server in the Quick Start. You may see the following error when running these examples:

If you disable the Airplay receiver server and try again, you shouldn't see this error message anymore.

Batch Processing

This example demonstrates how to do batch processing with Pinot. The command:

  • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates the baseballStats table

  • Launches a standalone data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.

  • Issues sample queries to Pinot

Batch JSON

This example demonstrates how to import and query JSON documents in Pinot. The command:

  • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates the githubEvents table

  • Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.

  • Issues sample queries to Pinot

Batch with complex data types

This example demonstrates how to do batch processing in Pinot where the data items have complex fields that need to be unnested. The command:

  • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates the githubEvents table

  • Launches a standalone data ingestion job that builds one segment for a given JSON data file for the githubEvents table and pushes the segment to the Pinot Controller.

  • Issues sample queries to Pinot

Streaming

This example demonstrates how to do stream processing with Pinot. The command:

  • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates meetupRsvp table

  • Launches a meetup stream

  • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot.

  • Issues sample queries to Pinot

Streaming JSON

This example demonstrates how to do stream processing with JSON documents in Pinot. The command:

  • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates meetupRsvp table

  • Launches a meetup stream

  • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot

  • Issues sample queries to Pinot

Streaming with minion cleanup

This example demonstrates how to do stream processing in Pinot with RealtimeToOfflineSegmentsTask and MergeRollupTask minion tasks continuously optimizing segments as data gets ingested. The command:

  • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, Pinot Minion, and Pinot Server.

  • Creates githubEvents table

  • Launches a GitHub events stream

  • Publishes data to a Kafka topic githubEvents that is subscribed to by Pinot.

  • Issues sample queries to Pinot

Streaming with complex data types

This example demonstrates how to do stream processing in Pinot where the stream contains items that have complex fields that need to be unnested. The command:

  • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, Pinot Minion, and Pinot Server.

  • Creates meetupRsvp table

  • Launches a meetup stream

  • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot.

  • Issues sample queries to Pinot

Upsert

This example demonstrates how to do stream processing with upsert in Pinot. The command:

  • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates meetupRsvp table

  • Launches a meetup stream

  • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot

  • Issues sample queries to Pinot

Upsert JSON

This example demonstrates how to do stream processing with upsert on JSON documents in Pinot. The command:

  • Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  • Creates meetupRsvp table

  • Launches a meetup stream

  • Publishes data to a Kafka topic meetupRSVPEvents that is subscribed to by Pinot

  • Issues sample queries to Pinot

Hybrid

This example demonstrates how to do hybrid stream and batch processing with Pinot. The command:

  1. Starts Apache Kafka, Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server.

  2. Creates airlineStats table

  3. Launches a standalone data ingestion job that builds segments under a given directory of Avro files for the airlineStats table and pushes the segments to the Pinot Controller.

  4. Launches a stream of flights stats

  5. Publishes data to a Kafka topic airlineStatsEvents that is subscribed to by Pinot.

  6. Issues sample queries to Pinot

Join

This example demonstrates how to do joins in Pinot using the Lookup UDF. The command:

  • Starts Apache Zookeeper, Pinot Controller, Pinot Broker, and Pinot Server in the same container.

  • Creates the baseballStats table

  • Launches a data ingestion job that builds one segment for a given CSV data file for the baseballStats table and pushes the segment to the Pinot Controller.

  • Creates the dimBaseballTeams table

  • Launches a data ingestion job that builds one segment for a given CSV data file for the dimBaseballTeams table and pushes the segment to the Pinot Controller.

  • Issues sample queries to Pinot

Time Series

For production use, you should ideally implement your own Time Series Language Plugin. The one included in the Pinot distribution is only for demonstration purposes.

This example demonstrates Pinot's Time Series Engine, which supports running pluggable Time Series Query Languages via a Language Plugin architecture. The default Pinot binary includes a toy Time Series Query Language using the same name as Uber's language "m3ql". You can try the following query as an example:

Failed to start a Pinot [SERVER]
java.lang.RuntimeException: java.net.BindException: Address already in use
	at org.apache.pinot.core.transport.QueryServer.start(QueryServer.java:103) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
	at org.apache.pinot.server.starter.ServerInstance.start(ServerInstance.java:158) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da21132906]
	at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:110) ~[pinot-all-0.9.0-jar-with-dependencies.jar:0.9.0-cf8b84e8b0d6ab62374048de586ce7da2113
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type batch
./bin/pinot-admin.sh QuickStart -type batch
pinot-admin QuickStart -type batch
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type batch_json_index
./bin/pinot-admin.sh QuickStart -type batch_json_index
pinot-admin QuickStart -type batch_json_index
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type batch_complex_type
./bin/pinot-admin.sh QuickStart -type batch_complex_type
pinot-admin QuickStart -type batch_complex_type
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type stream
./bin/pinot-admin.sh QuickStart -type stream
pinot-admin QuickStart -type stream
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type stream_json_index
./bin/pinot-admin.sh QuickStart -type stream_json_index
pinot-admin QuickStart -type stream_json_index
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type realtime_minion
./bin/pinot-admin.sh QuickStart -type realtime_minion
pinot-admin QuickStart -type realtime_minion
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type stream_complex_type
./bin/pinot-admin.sh QuickStart -type stream_complex_type
pinot-admin QuickStart -type stream_complex_type
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type upsert
./bin/pinot-admin.sh QuickStart -type upsert
pinot-admin QuickStart -type upsert
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type upsert_json_index
./bin/pinot-admin.sh QuickStart -type upsert_json_index
pinot-admin QuickStart -type upsert_json_index
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type hybrid
./bin/pinot-admin.sh QuickStart -type hybrid
pinot-admin QuickStart -type hybrid
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type join
./bin/pinot-admin.sh QuickStart -type join
pinot-admin QuickStart -type join
fetch{table="meetupRsvp_REALTIME",filter="",ts_column="__metadata$recordTimestamp",ts_unit="MILLISECONDS",value="1"}
| sum{rsvp_count}
| transformNull{0}
| keepLastValue{}
docker run \
    -p 9000:9000 \
    apachepinot/pinot:latest QuickStart \
    -type time_series
./bin/pinot-admin.sh QuickStart -type time_series
pinot-admin QuickStart -type time_series

Operations FAQ

This page has a collection of frequently asked questions about operations with answers from the community.

This is a list of questions frequently asked in our troubleshooting channel on Slack. To contribute additional questions and answers, make a pull request.

Memory

How much heap should I allocate for my Pinot instances?

Typically, Apache Pinot components try to use as much off-heap memory (MMAP/DirectMemory) as possible. For example, Pinot servers load segments in memory-mapped files in MMAP mode (recommended), or in direct memory in HEAP mode. Heap memory is used mostly for query execution and storing some metadata. We have seen production deployments with high throughput and low latency work well with just 16 GB of heap for Pinot servers and brokers. The Pinot controller may also cache some metadata (table configurations, etc.) in heap, so if there are just a few tables in the Pinot cluster, a few GB of heap should suffice.

DR

Does Pinot provide any backup/restore mechanism?

Pinot relies on deep-storage for storing a backup copy of segments (offline as well as real-time). It relies on Zookeeper to store metadata (table configurations, schema, cluster state, and so on). It does not explicitly provide tools to take backups or restore these data, but relies on the deep-storage (ADLS/S3/GCP/etc), and ZK to persist these data/metadata.

Alter Table

Can I change a column name in my table, without losing data?

Changing a column name or data type is considered a backward-incompatible change. While Pinot supports schema evolution for backward-compatible changes, it does not support backward-incompatible changes like changing the name or data type of a column.

How to change number of replicas of a table?

You can change the number of replicas by updating the table configuration's segmentsConfig section. Make sure you have at least as many servers as the replication.

For offline tables, update replication:

{ 
    "tableName": "pinotTable", 
    "tableType": "OFFLINE", 
    "segmentsConfig": {
      "replication": "3", 
      ... 
    }
    ..

For real-time tables, update replicasPerPartition:

{ 
    "tableName": "pinotTable", 
    "tableType": "REALTIME", 
    "segmentsConfig": {
      "replicasPerPartition": "3", 
      ... 
    }
    ..

After changing the replication, run a table rebalance.

Note that if you are using replica groups, these configurations are expected to equal numReplicaGroups. If they do not match, Pinot will use numReplicaGroups.
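
As a sketch, the rebalance can be triggered through the controller REST API; host, table name, and table type below are placeholders (use type=REALTIME for real-time tables):

curl -X POST "{host}/tables/{tableName}/rebalance?type=OFFLINE"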

How to set or change table retention?

By default, there is no retention set for a table in Apache Pinot. You may, however, set retention by setting the following properties in the segmentsConfig section of the table config:

  • retentionTimeUnit

  • retentionTimeValue

Updating the retention value in the table config is sufficient; there is no need to rebalance the table or reload its segments.
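
For example, a minimal segmentsConfig sketch that retains data for 30 days (the values are illustrative):

"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30",
  ...
}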

Rebalance

How to run a rebalance on a table?

See Rebalance.

Why does my real-time table not use the new nodes I added to the cluster?

Likely explanation: num partitions * num replicas < num servers.

In real-time tables, segments of the same partition always remain on the same node. This sticky assignment is needed for replica groups and is critical if using upserts. For instance, if you have 3 partitions, 1 replica, and 4 nodes, only 3 of the 4 nodes will be used: all of p0's segments will be on one node, all of p1's on another, and all of p2's on a third. One server will remain unused, and will stay unused through rebalances.

There's nothing we can do about CONSUMING segments; they will continue to use only 3 nodes if you have 3 partitions. But we can rebalance such that completed segments use all nodes. If you want to force the completed segments of the table to use the new server, use this config:

"instanceAssignmentConfigMap": {
      "COMPLETED": {
        "tagPoolConfig": {
          "tag": "DefaultTenant_OFFLINE"
        },
        "replicaGroupPartitionConfig": {
        }
      }
    },

Segments

How to control the number of segments generated?

The number of segments generated depends on the number of input files. If you provide only 1 input file, you will get 1 segment. If you break up the input file into multiple files, you will get as many segments as the input files.

What are the common reasons my segment is in a BAD state?

This typically happens when the server is unable to load the segment. Possible causes: out of memory, no disk space, inability to download the segment from the deep store, and other similar errors. Check the server logs for more information.

How to reset a segment when it runs into a BAD state?

Use the segment reset controller REST API to reset the segment:

curl -X POST "{host}/segments/{tableNameWithType}/{segmentName}/reset"

How do I pause real-time ingestion?

Refer to Pause Stream Ingestion.

What's the difference between Reset, Refresh, and Reload?

  • Reset: Gets a segment in ERROR state back to ONLINE or CONSUMING state. Behind the scenes, the Pinot controller takes the segment to the OFFLINE state, waits for External View to stabilize, and then moves it back to ONLINE or CONSUMING state, thus effectively resetting segments or consumers in error states.

  • Refresh: Replaces the segment with a new one, with the same name but often different data. Under the hood, the Pinot controller sets new segment metadata in Zookeeper, and notifies brokers and servers to check their local states about this segment and update accordingly. Servers also download the new segment to replace the old one, when both have different checksums. There is no separate rest API for refreshing, and it is done as part of the SegmentUpload API.

  • Reload: Loads the segment again, often to generate a new index as updated in the table configuration. Under the hood, the Pinot server gets the new table configuration from Zookeeper and uses it to guide the segment reloading. In fact, the last step of REFRESH as explained above is to load the segment into memory to serve queries. There is a dedicated REST API for reloading. By default, it doesn't download segments, but an option is provided to force the server to download the segment to cleanly replace the local one.

In addition, RESET brings the segment OFFLINE temporarily, while REFRESH and RELOAD swap the segment on the server atomically without bringing down the segment or affecting ongoing queries.
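
For reference, a sketch of the dedicated reload endpoint mentioned above, following the same pattern as the reset example (the forceDownload flag is optional):

curl -X POST "{host}/segments/{tableNameWithType}/reload?forceDownload=false"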

Tenants

How can I make brokers/servers join the cluster without the DefaultTenant tag?

Set this property in your controller.conf file:

cluster.tenant.isolation.enable=false

Now your brokers and servers should join the cluster as broker_untagged and server_untagged. You can then directly use the POST /tenants API to create the tenants you want, as in the following:

curl -X POST "http://localhost:9000/tenants" 
-H "accept: application/json" 
-H "Content-Type: application/json" 
-d "{\"tenantRole\":\"BROKER\",\"tenantName\":\"foo\",\"numberOfInstances\":1}"

Minion

How do I tune minion task timeout and parallelism on each worker?

There are two task configurations, set as part of the cluster configuration, as in the following example. One controls the task's overall timeout (1 hour by default) and one sets how many tasks can run in parallel on a single minion worker (1 by default). The <taskType> is the task to tune, such as MergeRollupTask or RealtimeToOfflineSegmentsTask.

Using "POST /cluster/configs API" on CLUSTER tab in Swagger, with this payload:
{
	"<taskType>.timeoutMs": "600000",
	"<taskType>.numConcurrentTasksPerInstance": "4"
}
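
A sketch of the equivalent curl call against the controller, using MergeRollupTask as the task type and illustrative values:

curl -X POST "{host}/cluster/configs" \
  -H "Content-Type: application/json" \
  -d '{"MergeRollupTask.timeoutMs": "600000", "MergeRollupTask.numConcurrentTasksPerInstance": "4"}'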

How do I manually run a Periodic Task?

See Running a Periodic Task Manually.

Tuning and Optimizations

Do replica groups work for real-time?

Yes, replica groups work for real-time tables. There are two parts to enabling replica groups:

  1. Replica groups segment assignment.

  2. Replica group query routing.

Replica group segment assignment

Replica group segment assignment is achieved for real-time tables if the number of servers is a multiple of the number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups. For example, consider a table with 6 partitions, 2 replicas, and 4 servers.

     r1   r2
p1   S0   S1
p2   S2   S3
p3   S0   S1
p4   S2   S3
p5   S0   S1
p6   S2   S3

As you can see, the set (S0, S2) contains r1 of every partition, and the set (S1, S3) contains r2 of every partition. A query will be routed to only one of the sets, and will not span every server. If you are adding or removing servers from an existing table setup, you have to run a rebalance for the segment assignment changes to take effect.

Replica group query routing

Once replica group segment assignment is in effect, query routing can take advantage of it. For replica-group-based query routing, set the following in the table config's routing section, and then restart the brokers:

{
    "tableName": "pinotTable", 
    "tableType": "REALTIME",
    "routing": {
        "instanceSelectorType": "replicaGroup"
    }
    ..
}

Overwrite index configs at tier level

When using tiered storage, users may want different encoding and index types for a column on different tiers, to balance query latency and cost savings more flexibly. For example, segments in the hot tier can use dictionary encoding, bloom filters, and all relevant index types for very fast query execution. For segments in the cold tier, where cost savings matter more than low query latency, one may want to use only raw values and bloom filters.

The following two examples show how to overwrite encoding type and index configs for tiers. Similar changes are also demonstrated in the MultiDirQuickStart example.

  1. Overwriting single-column index configs using fieldConfigList. All top level fields in FieldConfig class can be overwritten, and fields not overwritten are kept intact.

{
  ...
  "fieldConfigList": [    
    {
      "name": "ArrTimeBlk",
      "encodingType": "DICTIONARY",
      "indexes": {
        "inverted": {
          "enabled": "true"
        }
      },
      "tierOverwrites": {
        "hotTier": {
          "encodingType": "DICTIONARY",
          "indexes": { // change index types for this tier
            "bloom": {
              "enabled": "true"
            }
          }
        },
        "coldTier": {
          "encodingType": "RAW", // change encoding type for this tier
          "indexes": { } // remove all indexes
        }
      }
    }
  ],
  2. Overwriting star-tree index configs using tableIndexConfig. The starTreeIndexConfigs is overwritten as a whole. In fact, all top-level fields defined in the IndexingConfig class can be overwritten, so single-column index configs defined in tableIndexConfig can also be overwritten, but doing so is less clear than using fieldConfigList.

  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "AirlineID",
          "Origin",
          "Dest"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "COUNT__*",
          "MAX__ArrDelay"
        ],
        "maxLeafRecords": 10
      }
    ],
...
    "tierOverwrites": {
      "hotTier": {
        "starTreeIndexConfigs": [ // create different STrTree index on this tier
          {
            "dimensionsSplitOrder": [
              "Carrier",
              "CancellationCode",
              "Origin",
              "Dest"
            ],
            "skipStarNodeCreationForDimensions": [],
            "functionColumnPairs": [
              "MAX__CarrierDelay",
              "AVG__CarrierDelay"
            ],
            "maxLeafRecords": 10
          }
        ]
      },
      "coldTier": {
        "starTreeIndexConfigs": [] // removes ST index for this tier
      }
    }
  },
 ...

Credential

How do I update credentials for real-time upstream without downtime?

  1. Pause the stream ingestion.

  2. Wait for the pause status to change to success.

  3. Update the credential in the table config.

  4. Resume the consumption.
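
A sketch of these steps using the controller REST API (host and table name are placeholders; exact endpoint availability depends on your Pinot version):

# 1. pause the stream ingestion
curl -X POST "{host}/tables/{tableName}/pauseConsumption"
# 2. wait for the pause status to change to success
curl -X GET "{host}/tables/{tableName}/pauseStatus"
# 3. update the credential in the table config, then resume the consumption
curl -X POST "{host}/tables/{tableName}/resumeConsumption"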

Table

Explore the table component in Apache Pinot, a fundamental building block for organizing and managing data in Pinot clusters, enabling effective data processing and analysis.

Pinot stores data in tables. A Pinot table is conceptually identical to a relational database table, with rows and columns. Every column has a name and a data type, and the set of column definitions is known as the table's schema.

Pinot schemas are defined in a JSON file. Because that schema definition is in its own file, multiple tables can share a single schema. Each table can have a unique name, indexing strategy, partitioning, data sources, and other metadata.

Pinot table types include:

  • real-time: Ingests data from a streaming source like Apache Kafka®

  • offline: Loads data from a batch source

  • hybrid: Loads data from both a batch source and a streaming source

Pinot breaks a table into multiple segments and stores these segments in a deep-store such as Hadoop Distributed File System (HDFS) as well as Pinot servers.

In the Pinot cluster, a table is modeled as a Helix resource and each segment of a table is modeled as a Helix Partition.

Table naming in Pinot follows typical naming conventions, such as starting names with a letter, not ending with an underscore, and using only alphanumeric characters.

Pinot supports the following types of tables:

Type
Description

Offline

Offline tables ingest pre-built Pinot segments from external data stores and are generally used for batch ingestion.

Real-time

Real-time tables ingest data from streams (such as Kafka) and build segments from the consumed data.

Hybrid

Hybrid Pinot tables have both real-time as well as offline tables under the hood. By default, all tables in Pinot are hybrid.

The user querying the database does not need to know the type of the table. They only need to specify the table name in the query.

For example, regardless of whether we have an offline table myTable_OFFLINE, a real-time table myTable_REALTIME, or a hybrid table containing both of these, the query will be:

select count(*)
from myTable

Table configuration is used to define the table properties, such as name, type, indexing, routing, and retention. It is written in JSON format and is stored in Zookeeper, along with the table schema.

Use the following properties to make your tables faster or leaner:

  • Segment

  • Indexing

  • Tenants

Segments

A table is comprised of small chunks of data known as segments. Learn more about how Pinot creates and manages segments here.

For offline tables, segments are built outside of Pinot and uploaded using a distributed executor such as Spark or Hadoop. For details, see Batch Ingestion.

For real-time tables, segments are built in a specific interval inside Pinot. You can tune the following for the real-time segments.

Flush

The Pinot real-time consumer ingests the data, creates the segment, and then flushes the in-memory segment to disk. Pinot allows you to configure when to flush the segment in the following ways:

  • Number of consumed rows: After consuming the specified number of rows from the stream, Pinot will persist the segment to disk.

  • Number of rows per segment: Pinot learns and then estimates the number of rows that need to be consumed. The learning phase starts by setting the number of rows to 100,000 (this value can be changed) and adjusts it to reach the appropriate segment size. Because Pinot corrects the estimate as it goes along, the segment size might go significantly over the correct size during the learning phase. You should set this value to optimize the performance of queries.

  • Max time duration to wait: Pinot consumers wait for the configured time duration after which segments are persisted to the disk.
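
As an illustrative sketch, these thresholds are set under streamConfigs in the table config (the values shown are examples only; setting the row threshold to 0 lets Pinot size segments based on the desired segment size instead):

"streamConfigs": {
  ...
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.segment.size": "150M",
  "realtime.segment.flush.threshold.time": "24h"
}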

Replicas

A segment can have multiple replicas to provide higher availability. You can configure the number of replicas for a table segment using the CLI.

Completion Mode

By default, if the in-memory segment in the non-winner server is equivalent to the committed segment, then the non-winner server builds and replaces the segment. If the available segment is not equivalent to the committed segment, the server just downloads the committed segment from the controller.

However, in certain scenarios, the segment build can get very memory-intensive. In these cases, you might want to enforce the non-committer servers to just download the segment from the controller instead of building it again. You can do this by setting completionMode: "DOWNLOAD" in the table configuration.

For details, see Completion Config.
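
A minimal sketch of this setting inside segmentsConfig:

"segmentsConfig": {
  ...
  "completionConfig": {
    "completionMode": "DOWNLOAD"
  }
}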

Download Scheme

A Pinot server might fail to download segments from the deep store, such as HDFS, after segment completion. However, you can configure servers to download these segments from peer servers instead of the deep store. Currently, only HTTP and HTTPS download schemes are supported. More methods, such as gRPC/Thrift, are planned to be added in the future.

For more details about peer segment download during real-time ingestion, refer to this design doc on bypass deep store for segment completion.
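
A sketch of enabling peer download in the table config, assuming the peerSegmentDownloadScheme setting described in that design doc:

"segmentsConfig": {
  ...
  "peerSegmentDownloadScheme": "http"
}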

Indexing

You can create multiple indices on a table to increase the performance of the queries. The following types of indices are supported:

  • Forward Index

    • Dictionary-encoded forward index with bit compression

    • Raw value forward index

    • Sorted forward index with run-length encoding

  • Inverted Index

    • Bitmap inverted index

    • Sorted inverted index

  • Star-tree Index

  • Range Index

  • Text Index

  • Geospatial

For more details on each indexing mechanism and corresponding configurations, see Indexing.

Set up Bloom filters on columns to make queries faster. You can also keep segments in off-heap memory instead of on-heap memory for faster queries.
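
For example, a sketch of a tableIndexConfig that enables a bloom filter on a hypothetical column and loads segments via memory mapping (off-heap):

"tableIndexConfig": {
  "bloomFilterColumns": ["studentID"],
  "loadMode": "MMAP",
  ...
}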

Pre-aggregation

Aggregate the real-time stream data as it is consumed to reduce segment sizes. We add the metric column values of all rows that have the same values for all dimension and time columns and create a single row in the segment. This feature is only available on REALTIME tables.

The only supported aggregation is SUM. The columns to pre-aggregate need to satisfy the following requirements:

  • All metrics should be listed in noDictionaryColumns.

  • No multi-value dimensions

  • All dimension columns are treated to have a dictionary, even if they appear as noDictionaryColumns in the config.

The following table config snippet shows an example of enabling pre-aggregation during real-time ingestion:

pinot-table-realtime.json
    "tableIndexConfig": { 
      "noDictionaryColumns": ["metric1", "metric2"],
      "aggregateMetrics": true,
      ...
    }

Tenants

Each table is associated with a tenant. A segment resides on the server, which has the same tenant as itself. For details, see Tenant.

Optionally, you can override which tenant's servers a table's segments reside on, based on segment status. The example below adds a tagOverrideConfig under the tenants section of a real-time table to override tags for consuming and completed segments.

  "broker": "brokerTenantName",
  "server": "serverTenantName",
  "tagOverrideConfig" : {
    "realtimeConsuming" : "serverTenantName_REALTIME"
    "realtimeCompleted" : "serverTenantName_OFFLINE"
  }
}

In the above example, the consuming segments will still be assigned to serverTenantName_REALTIME hosts, but once they are completed, the segments will be moved to serverTenantName_OFFLINE hosts.

You can specify the full name of any tag in this section. For example, you could decide that completed segments for this table should be on Pinot servers tagged as allTables_COMPLETED. To learn more, see the Moving Completed Segments section.

Hybrid table

A hybrid table is a table composed of two tables, one offline and one real-time, that share the same name. In a hybrid table, offline segments can be pushed periodically. The retention on the offline table can be set to a high value because segments are coming in on a periodic basis, whereas the retention on the real-time part can be small.

Once an offline segment is pushed to cover a recent time period, the brokers automatically switch to using the offline table for segments for that time period and use the real-time table only for data not available in the offline table.

To learn how time boundaries work for hybrid tables, see Broker.

A typical use case for hybrid tables is pushing deduplicated, cleaned-up data into an offline table every day while consuming real-time data as it arrives. Data can remain in offline tables for as long as a few years, while the real-time data would be cleaned every few days.

Examples

Create a table config for your data, or see examples for all possible batch/streaming tables.

Prerequisites

  • Set up the cluster

  • Create broker and server tenants

Offline table creation

docker run \
    --network=pinot-demo \
    --name pinot-batch-table-creation \
    ${PINOT_IMAGE} AddTable \
    -schemaFile examples/batch/airlineStats/airlineStats_schema.json \
    -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json \
    -controllerHost pinot-controller \
    -controllerPort 9000 \
    -exec

Sample console output

Executing command: AddTable -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json -schemaFile examples/batch/airlineStats/airlineStats_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
Sending request: http://pinot-controller:9000/schemas to controller: a413b0013806, version: Unknown
{"status":"Table airlineStats_OFFLINE succesfully added"}
bin/pinot-admin.sh AddTable \
    -schemaFile examples/batch/airlineStats/airlineStats_schema.json \
    -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json \
    -exec
# add schema
curl -F schemaName=@airlineStats_schema.json  localhost:9000/schemas

# add table
curl -i -X POST -H 'Content-Type: application/json' \
    -d @airlineStats_offline_table_config.json localhost:9000/tables

Check out the table config in the Rest API to make sure it was successfully uploaded.

Streaming table creation

Start Kafka

docker run \
    --network pinot-demo --name=kafka \
    -e KAFKA_ZOOKEEPER_CONNECT=pinot-zookeeper:2181/kafka \
    -e KAFKA_BROKER_ID=0 \
    -e KAFKA_ADVERTISED_HOST_NAME=kafka \
    -d wurstmeister/kafka:latest

Create a Kafka topic

docker exec \
  -t kafka \
  /opt/kafka/bin/kafka-topics.sh \
  --zookeeper pinot-zookeeper:2181/kafka \
  --partitions=1 --replication-factor=1 \
  --create --topic flights-realtime

Create a streaming table

docker run \
    --network=pinot-demo \
    --name pinot-streaming-table-creation \
    ${PINOT_IMAGE} AddTable \
    -schemaFile examples/stream/airlineStats/airlineStats_schema.json \
    -tableConfigFile examples/docker/table-configs/airlineStats_realtime_table_config.json \
    -controllerHost pinot-controller \
    -controllerPort 9000 \
    -exec

Sample output

Executing command: AddTable -tableConfigFile examples/docker/table-configs/airlineStats_realtime_table_config.json -schemaFile examples/stream/airlineStats/airlineStats_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
Sending request: http://pinot-controller:9000/schemas to controller: 8fbe601012f3, version: Unknown
{"status":"Table airlineStats_REALTIME succesfully added"}

Start Kafka-Zookeeper

bin/pinot-admin.sh StartZookeeper -zkPort 2191

Start Kafka

bin/pinot-admin.sh  StartKafka -zkAddress=localhost:2191/kafka -port 19092

Create stream table

bin/pinot-admin.sh AddTable \
    -schemaFile examples/stream/airlineStats/airlineStats_schema.json \
    -tableConfigFile examples/stream/airlineStats/airlineStats_realtime_table_config.json \
    -exec

Check out the table config in the Rest API to make sure it was successfully uploaded.

Hybrid table creation

To create a hybrid table, you have to create the offline and real-time tables individually. You don't need to create a separate hybrid table.

"OFFLINE": {
    "tableName": "pinotTable", 
    "tableType": "OFFLINE", 
    "segmentsConfig": {
      ... 
    }, 
    "tableIndexConfig": { 
      ... 
    },  
    "tenants": {
      "broker": "myBrokerTenant", 
      "server": "myServerTenant"
    },
    "metadata": {
      ...
    }
  },
  "REALTIME": { 
    "tableName": "pinotTable", 
    "tableType": "REALTIME", 
    "segmentsConfig": {
      ...
    }, 
    "tableIndexConfig": { 
      ... 
      "streamConfigs": {
        ...
      },  
    },  
    "tenants": {
      "broker": "myBrokerTenant", 
      "server": "myServerTenant"
    },
    "metadata": {
    ...
    }
  }
}

Batch import example

Step-by-step guide for pushing your own data into the Pinot cluster

This example assumes you have set up your cluster using Pinot in Docker.

Preparing your data

Let's gather our data files and put them in pinot-quick-start/rawdata.

mkdir -p /tmp/pinot-quick-start/rawdata

Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, ORC. If you don't have sample data, you can use this sample CSV.

/tmp/pinot-quick-start/rawdata/transcript.csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000

Creating a schema

Schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.

Columns are categorized into 3 types:

Column Type
Description

Dimensions

Typically used in filters and group by, for slicing and dicing into data

Metrics

Typically used in aggregations, represents the quantitative data

Time

Optional column, represents the timestamp associated with each row

In our example transcript-schema, the studentID,firstName,lastName,gender,subject columns are the dimensions, the score column is the metric and timestampInEpoch is the time column.

Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the following reference.

/tmp/pinot-quick-start/transcript-schema.json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "format" : "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}

Creating a table configuration

A table configuration is used to define the configuration related to the Pinot table. A detailed overview of the table can be found in Table.

Here's the table configuration for the sample CSV file. You can use this as a reference to build your own table configuration. Edit the tableName and schemaName.

/tmp/pinot-quick-start/transcript-table-offline.json
{
  "tableName": "transcript",
  "segmentsConfig" : {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication" : "1",
    "schemaName" : "transcript"
  },
  "tableIndexConfig" : {
    "invertedIndexColumns" : [],
    "loadMode"  : "MMAP"
  },
  "tenants" : {
    "broker":"DefaultTenant",
    "server":"DefaultTenant"
  },
  "tableType":"OFFLINE",
  "metadata": {}
}

Uploading your table configuration and schema

Review the directory structure so far.

$ ls /tmp/pinot-quick-start
rawdata			transcript-schema.json	transcript-table-offline.json

$ ls /tmp/pinot-quick-start/rawdata 
transcript.csv

Upload the table configuration using the following command.

docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
    -controllerHost manual-pinot-controller \
    -controllerPort 9000 -exec
bin/pinot-admin.sh AddTable \
  -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
  -schemaFile /tmp/pinot-quick-start/transcript-schema.json -exec

Use the Rest API that is running on your Pinot instance to review the table configuration and schema and make sure it was successfully uploaded. This link uses localhost as an example.

Creating a segment

Pinot table data is stored as Pinot segments. A detailed overview of segments can be found in Segment.

  1. To generate a segment, first create a job specification (JobSpec) YAML file. A JobSpec file contains all the information regarding data format, input data location, and Pinot cluster coordinates. Copy the following job specification file (taken from the Pinot quick start). If you're using your own data, be sure to do the following:

    • Replace transcript with your table name

    • Set the correct recordReaderSpec

    # /tmp/pinot-quick-start/docker-job-spec.yml or /tmp/pinot-quick-start/batch-job-spec.yml
    
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: '/tmp/pinot-quick-start/rawdata/'
    includeFileNamePattern: 'glob:**/*.csv'
    outputDirURI: '/tmp/pinot-quick-start/segments/'
    overwriteOutput: true
    pushJobSpec:
      pushFileNamePattern: 'glob:**/*.tar.gz'
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: 'transcript'
      schemaURI: 'http://localhost:9000/tables/transcript/schema'
      tableConfigURI: 'http://localhost:9000/tables/transcript'
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'
  2. Depending if you're using Docker or a launcher script, choose one of the following commands to generate a segment to upload to Pinot:

docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec.yml
bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml

Here is some sample output.

SegmentGenerationJobSpec: 
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**\/*.csv
inputDirURI: /tmp/pinot-quick-start/rawdata/
jobType: SegmentCreationAndTarPush
outputDirURI: /tmp/pinot-quick-start/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader,
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig,
  configs: null, dataFormat: csv}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/transcript/schema', tableConfigURI: 'http://localhost:9000/tables/transcript',
  tableName: transcript}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed bytes value dictionary for column: studentID, size: 9
Created dictionary for STRING column: studentID with cardinality: 3, max length in bytes: 3, range: 200 to 202
Using fixed bytes value dictionary for column: firstName, size: 12
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Using fixed bytes value dictionary for column: lastName, size: 15
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Using fixed bytes value dictionary for column: gender, size: 12
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Using fixed bytes value dictionary for column: subject, size: 21
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Created dictionary for LONG column: timestampInEpoch with cardinality: 4, range: 1570863600000 to 1572418800000
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to v3 format
v3 segment location for segment: transcript_OFFLINE_1570863600000_1572418800000_0 is /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3
Deleting files in v1 segment directory: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[org.apache.pinot.core.startree.v2.AggregationFunctionColumnPair@3a48efdc],maxLeafRecords=1]
Generated 3 star-tree records from 4 segment records
Finished constructing star-tree, got 9 tree nodes and 4 records under star-node
Finished creating aggregated documents, got 6 aggregated records
Finished building star-tree in 10ms
Finished building 1 star-trees in 27ms
Computed crc = 3454627653, based on files [/var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/columns.psf, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/index_map, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/metadata.properties, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index_map]
Driver, record read time : 0
Driver, stats collector time : 0
Driver, indexing time : 0
Tarring segment from: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz
Size for segment: transcript_OFFLINE_1570863600000_1572418800000_0, uncompressed: 6.73KB, compressed: 1.89KB
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/tmp/pinot-quick-start/segments/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz]... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@243c4f91] for table transcript
Pushing segment: transcript_OFFLINE_1570863600000_1572418800000_0 to location: http://localhost:9000 for table transcript
Sending request: http://localhost:9000/v2/segments?tableName=transcript to controller: nehas-mbp.hsd1.ca.comcast.net, version: Unknown
Response for pushing table transcript segment transcript_OFFLINE_1570863600000_1572418800000_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: transcript_OFFLINE_1570863600000_1572418800000_0 of table: transcript"}

Querying your data

If everything worked, find your table in the Query Console to run queries against it.

Architecture

Understand how the components of Apache Pinot™ work together to create a scalable OLAP database that can deliver low-latency, high-concurrency queries at scale.

Apache Pinot™ is a distributed OLAP database designed to serve real-time, user-facing use cases, which means handling large volumes of data and many concurrent queries with very low query latencies. Pinot supports the following requirements:

  • Ultra low-latency queries (as low as 10ms P95)

  • High query concurrency (as many as 100,000 queries per second)

  • High data freshness (streaming data available for query immediately upon ingestion)

  • Large data volume (up to petabytes)

Distributed design principles

To accommodate large data volumes with stringent latency and concurrency requirements, Pinot is designed as a distributed database that supports the following requirements:

  • Highly available: Pinot has no single point of failure. When tables are configured for replication, and a node goes down, the cluster is able to continue processing queries.

  • Horizontally scalable: Operators can scale a Pinot cluster by adding new nodes when the workload increases. There are even two node types (servers and brokers) to scale query volume, query complexity, and data size independently.

  • Immutable data: Pinot assumes all stored data is immutable, which helps simplify the parts of the system that handle data storage and replication. However, Pinot still supports upserts on streaming entity data and background purges of data to comply with data privacy regulations.

  • Dynamic configuration changes: Operations like adding new tables, expanding a cluster, ingesting data, modifying an existing table, and adding indexes do not impact query availability or performance.

Core components

As described in the Pinot Components, Pinot has four node types:

  • Controller

  • Broker

  • Server

  • Minion

Apache Helix and ZooKeeper

Distributed systems do not maintain themselves, and in fact require sophisticated scheduling and resource management to function. Pinot uses Apache Helix for this purpose. Helix exists as an independent project, but it was designed by the original creators of Pinot for Pinot's own cluster management purposes, so the architectures of the two systems are well-aligned. Helix takes the form of a process on the controller, plus embedded agents on the brokers and servers. It uses Apache ZooKeeper as a fault-tolerant, strongly consistent, durable state store.

Helix maintains a picture of the intended state of the cluster, including the number of servers and brokers, the configuration and schema of all tables, connections to streaming ingest sources, currently executing batch ingestion jobs, the assignment of table segments to the servers in the cluster, and more. All of these configuration items are potentially mutable quantities, since operators routinely change table schemas, add or remove streaming ingest sources, begin new batch ingestion jobs, and so on. Additionally, physical cluster state may change as servers and brokers fail or suffer network partition. Helix works constantly to drive the actual state of the cluster to match the intended state, pushing configuration changes to brokers and servers as needed.

There are three physical node types in a Helix cluster:

  • Participant: These nodes do things, like store data or perform computation. Participants host resources, which are Helix's fundamental storage abstraction. Because Pinot servers store segment data, they are participants.

  • Spectator: These nodes see things, observing the evolving state of the participants through events pushed to the spectator. Because Pinot brokers need to know which servers host which segments, they are spectators.

  • Controller: This node observes and manages the state of participant nodes. The controller is responsible for coordinating all state transitions in the cluster and ensures that state constraints are satisfied while maintaining cluster stability.

In addition, Helix defines two logical components to express its storage abstraction:

  • Partition: A unit of data storage that lives on at least one participant. Partitions may be replicated across multiple participants. A Pinot segment is a partition.

  • Resource: A logical collection of partitions, providing a single view over a potentially large set of data stored across a distributed system. A Pinot table is a resource.

In summary, the Pinot architecture maps onto Helix components as follows:

  • Segment: Helix Partition

  • Table: Helix Resource

  • Controller: Helix Controller, the Helix agent that drives the overall state of the cluster

  • Server: Helix Participant

  • Broker: Helix Spectator that observes the cluster for changes in the state of segments and servers. To support multi-tenancy, brokers are also modeled as Helix Participants.

  • Minion: Helix Participant that performs computation rather than storing data

Helix uses ZooKeeper to maintain cluster state. ZooKeeper sends Helix spectators notifications of changes in cluster state (which correspond to changes in ZNodes). Zookeeper stores the following information about the cluster:

  • Controller: the controller instance that is assigned as the current leader

  • Servers and brokers: the list of servers and brokers, the configuration of each instance, and the health status of each instance

  • Tables: the list of tables, table configurations, table schema, and the list of each table's segments

  • Segments: the exact server locations of a segment, the state of each segment (online/offline/error/consuming), and metadata about each segment

Because ZooKeeper is a first-class citizen of a Pinot cluster, you can use the well-known ZNode structure for operations and troubleshooting purposes. Be advised that this structure can change in future Pinot releases.

Pinot's Zookeeper Browser UI

Controller

The Pinot controller schedules and re-schedules resources in a Pinot cluster when metadata changes or a node fails. As an Apache Helix Controller, it schedules the resources that comprise the cluster and orchestrates connections between certain external processes and cluster components (e.g., ingest of real-time tables and offline tables). It can be deployed as a single process on its own server or as a group of redundant servers in an active/passive configuration.

Fault tolerance

Only one controller can be active at a time, so when multiple controllers are present in a cluster, they elect a leader. When that controller instance becomes unavailable, the remaining instances automatically elect a new leader. Leader election is achieved using Apache Helix. A Pinot cluster can serve queries without an active controller, but it can't perform any metadata-modifying operations, like adding a table or consuming a new segment.

Controller REST interface

The controller provides a REST interface that allows read and write access to all logical storage resources (e.g., servers, brokers, tables, and segments). See Pinot Data Explorer for more information on the web-based admin tool.
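For example, listing the tables in a cluster is a single call to the controller. The request and response below are a sketch against a local quickstart (controller assumed on port 9000); the table names are placeholders.

// GET http://localhost:9000/tables
{
  "tables": [
    "airlineStats",
    "baseballStats"
  ]
}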

Broker

The broker's responsibility is to route queries to the appropriate server instances, or in the case of multi-stage queries, to compute a complete query plan and distribute it to the servers required to execute it. The broker collects and merges the responses from all servers into a final result, then sends the result back to the requesting client. The broker exposes an HTTP endpoint that accepts SQL queries in JSON format and returns the response in JSON.

Each broker maintains a query routing table. The routing table maps segments to the servers that store them. (When replication is configured on a table, each segment is stored on more than one server.) The broker computes multiple routing tables depending on the configured routing strategy for a table. The default strategy is to balance the query load across all available servers.

Advanced routing strategies are available, such as replica-aware routing, partition-based routing, and minimal server selection routing.
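These strategies are chosen in the routing section of the table configuration. The snippet below is a sketch, not a complete table config, and assumes property names from recent Pinot versions: it opts a table into replica-group-aware instance selection and partition-based segment pruning.

// Sketch: routing section of a table config (assumes recent property names)
"routing": {
  "instanceSelectorType": "replicaGroup",
  "segmentPrunerTypes": ["partition"]
}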

//This is an example ZNode config for EXTERNAL VIEW in Helix
{
  "id" : "baseballStats_OFFLINE",
  "simpleFields" : {
    ...
  },
  "mapFields" : {
    "baseballStats_OFFLINE_0" : {
      "Server_10.1.10.82_7000" : "ONLINE"
    }
  },
  ...
}

Query processing

Every query processed by a broker uses the single-stage engine or the multi-stage engine. For single-stage queries, the broker does the following:

  • Computes query routes based on the routing strategy defined in the table configuration.

  • Computes the list of segments to query on each server. (See routing for further details on this process.)

  • Sends the query to each of those servers for local execution against their segments.

  • Receives the results from each server and merges them.

  • Sends the query result to the client.
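As an illustration of the request format, the query used in the response example below can be posted to the broker's /query/sql endpoint as a JSON payload (localhost and port 8099 are assumptions for a local quickstart):

// POST http://localhost:8099/query/sql
{
  "sql": "select count(*) from baseballStats limit 10"
}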

// Query: select count(*) from baseballStats limit 10

// RESPONSE
// ========
{
    "resultTable": {
        "dataSchema": {
            "columnDataTypes": ["LONG"],
            "columnNames": ["count(*)"]
        },
        "rows": [
            [97889]
        ]
    },
    "exceptions": [],
    "numServersQueried": 1,
    "numServersResponded": 1,
    "numSegmentsQueried": 1,
    "numSegmentsProcessed": 1,
    "numSegmentsMatched": 1,
    "numConsumingSegmentsQueried": 0,
    "numDocsScanned": 97889,
    "numEntriesScannedInFilter": 0,
    "numEntriesScannedPostFilter": 0,
    "numGroupsLimitReached": false,
    "totalDocs": 97889,
    "timeUsedMs": 5,
    "segmentStatistics": [],
    "traceInfo": {},
    "minConsumingFreshnessTimeMs": 0
}

For multi-stage queries, the broker performs the following:

  • Computes a query plan that runs on multiple sets of servers. The servers for the first stage are chosen based on the segments required to execute the query, which are determined in a process similar to single-stage queries.

  • Sends the relevant portions of the query plan to one or more servers in the cluster for each stage of the query plan.

  • The servers that received query plans each execute their part of the query. For more details on this process, read about the multi-stage engine.

  • The broker receives a complete result set from the final stage of the query, which is always a single server.

  • The broker sends the query result to the client.

Server

Servers host segments on locally attached storage and process queries on those segments. By convention, operators speak of "real-time" and "offline" servers, although there is no difference in the server process itself, or even in its configuration, that distinguishes between the two. The convention is reflected in the table assignment strategy, which confines the two kinds of workloads to separate groups of physical instances, since the performance-limiting factors differ between them. For example, offline servers might optimize for larger storage capacity, whereas real-time servers might optimize for memory and CPU cores.
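One common way to express this convention is through tenants in the table configuration, so that offline and real-time tables land on differently tagged groups of servers. The snippet below is a sketch; the tenant names are placeholders and assume the corresponding server tags already exist.

// Sketch: tenants section of a table config (tenant names are placeholders)
"tenants": {
  "broker": "DefaultTenant",
  "server": "batchWorkloadTenant"
}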

Offline servers

Offline servers host segments created by ingesting batch data. The controller assigns these segments to offline servers according to the table's replication factor and segment assignment strategy. Typically, the controller writes new segments to the deep store, and the affected servers download them from the deep store. The controller then notifies brokers that a new segment exists and is available to participate in queries.

Because offline tables tend to have long retention periods, offline servers tend to scale based on the size of the data they store.

Real-time servers

Real-time servers ingest data from streaming sources, like Apache Kafka®, Apache Pulsar®, or AWS Kinesis. Streaming data ends up in conventional segment files just like batch data, but is first accumulated in an in-memory data structure known as a consuming segment. Each message consumed from a streaming source is written immediately to the relevant consuming segment, and is available for query processing from the consuming segment immediately, since consuming segments participate in query processing as first-class citizens. Consuming segments get flushed to disk periodically based on a completion threshold, which can be calculated by row count, ingestion time, or segment size. A flushed segment on a real-time table is called a completed segment, and is functionally equivalent to a segment created during offline ingest.
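The completion threshold is controlled by flush settings inside the table's streamConfigs. The snippet below sketches the commonly used properties (names assume a recent Pinot version); typically either the row-count or the segment-size threshold is set, not both.

// Sketch: flush thresholds inside streamConfigs (set rows or segment size, not both)
"realtime.segment.flush.threshold.rows": "0",
"realtime.segment.flush.threshold.time": "6h",
"realtime.segment.flush.threshold.segment.size": "200M"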

Real-time servers tend to be scaled based on the rate at which they ingest streaming data.

Minion

A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.

Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function without minions, they are typically present to support routine tasks like ingesting batch data.

Data ingestion overview

Pinot tables exist in two varieties: offline (or batch) and real-time. Offline tables contain data from batch sources like CSV, Avro, or Parquet files, and real-time tables contain data from streaming sources like Apache Kafka®, Apache Pulsar®, or AWS Kinesis.

Offline (batch) ingest

Pinot ingests batch data using an ingestion job, which follows a process like this:

  1. The job transforms a raw data source (such as a CSV file) into segments. This is a potentially complex process resulting in a file that is typically several hundred megabytes in size.

  2. The job then transfers the file to the cluster's deep store and notifies the controller that a new segment exists.

  3. The controller (in its capacity as a Helix controller) updates the ideal state of the cluster in its cluster metadata map.

  4. The controller then assigns the segment to one or more "offline" servers (depending on replication factor) and notifies them that new segments are available.

  5. The servers then download the newly created segments directly from the deep store.

  6. The cluster's brokers, which watch for state changes as Helix spectators, detect the new segments and update their segment routing tables accordingly. The cluster is now able to query the new offline segments.
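The replication factor and assignment behavior referenced in step 4 come from the offline table's segmentsConfig. The snippet below is a sketch with placeholder values; field names may vary slightly by version.

// Sketch: segmentsConfig of an offline table (placeholder values)
"segmentsConfig": {
  "timeColumnName": "DaysSinceEpoch",
  "replication": "2",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "365"
}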

Real-time ingest

Ingestion is established at the time a real-time table is created, and continues as long as the table exists. When the controller receives the metadata update to create a new real-time table, the table configuration specifies the source of the streaming input data—often a topic in a Kafka cluster. This kicks off a process like this:

  1. The controller picks one or more servers to act as direct consumers of the streaming input source.

  2. The controller creates consuming segments for the new table. It does this by creating an entry in the global metadata map for a new consuming segment for each of the real-time servers selected in step 1.

  3. Through Helix functionality on the controller and the relevant servers, the servers proceed to create consuming segments in memory and establish a connection to the streaming input source. When this input source is Kafka, each server acts as a Kafka consumer directly, with no other components involved in the integration.

  4. Through Helix functionality on the controller and all of the cluster's brokers, the brokers become aware of the consuming segments, and begin including them in query routing immediately.

  5. The consuming servers simultaneously begin consuming messages from the streaming input source, storing them in the consuming segment.

  6. When a server decides its consuming segment is complete, it commits the in-memory consuming segment to a conventional segment file, uploads it to the deep store, and notifies the controller.

  7. The controller and the server create a new consuming segment to continue real-time ingestion.

  8. The controller marks the newly committed segment as online. Brokers then discover the new segment through the Helix notification mechanism, allowing them to route queries to it in the usual fashion.
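The streaming input source referenced above is declared in the real-time table's streamConfigs. The snippet below sketches a Kafka configuration; the topic and broker values are placeholders, and the consumer factory and decoder class names assume the Kafka 2.x plugin shipped with recent Pinot versions.

// Sketch: streamConfigs of a real-time table (placeholder topic and broker values)
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "flights-realtime",
  "stream.kafka.broker.list": "kafka:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
}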

Indexing

This page describes the indexing techniques available in Apache Pinot

Apache Pinot™ supports the following indexing techniques:

    • Dictionary-encoded forward index with bit compression

    • Raw value forward index

    • Sorted forward index with run-length encoding

    • Bitmap inverted index

    • Sorted inverted index

By default, Pinot creates a dictionary-encoded forward index for each column.

Enabling indexes

There are two ways to enable indexes for a Pinot table.

As part of ingestion, during Pinot segment generation

Indexing is enabled by specifying the column names in the table configuration. More details about how to configure each type of index can be found in the respective index's section linked above or in the table configuration reference.

Dynamically added or removed

Indexes can also be dynamically added to or removed from segments at any point. Update your table configuration with the latest set of indexes you want to have.

For example, if you have an inverted index on the foo field and now want to also include the bar field, you would update your table configuration from this:

To this:

The updated index configuration won't be picked up unless you invoke the reload API. This API sends reload messages via Helix to all servers, as part of which indexes are added or removed from the local segments. This happens without any downtime and is completely transparent to the queries.

When adding an index, only the new index is created and appended to the existing segment. When removing an index, its related states are cleaned up from Pinot servers. You can find this API under the Segments tab on Swagger:

You can also find this action in the Cluster Manager in the Pinot UI, on the specific table's page.

Not all indexes can be retrospectively applied to existing segments. For more detailed documentation on applying indexes, see the Indexing FAQ.

Tuning Index

The inverted index provides good performance for most use cases, especially if your use case doesn't have a strict low latency requirement.

You should start by using this, and if your queries aren't fast enough, switch to advanced indices like the sorted or star-tree index.
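For reference, both of those options are declared in tableIndexConfig. The snippet below is a sketch with placeholder column names, not a complete configuration: it sorts segments on one column and defines a minimal star-tree.

// Sketch: sorted column plus a minimal star-tree definition (placeholder column names)
"tableIndexConfig": {
  "sortedColumn": ["playerID"],
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": ["league", "teamID"],
    "functionColumnPairs": ["COUNT__*", "SUM__hits"],
    "maxLeafRecords": 10000
  }]
}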

Data and Index types compatibility matrix

The matrix below shows which combinations of data type, cardinality, and encoding are compatible with the various index types:

data type
bloom
fst
ifst
geo
inverted
json
native
text
range
startree
timestamp
vector

(1) Supports only dictionary-encoded columns.

(2) Supports only single value columns.

(3) Supported only if values can be parsed as long.

(4) Supported only if values can be parsed as json.

(5) Supports only multi value columns.

Running in Kubernetes

Pinot quick start in Kubernetes

Get started running Pinot in Kubernetes.

Note: The examples in this guide are sample configurations to be used as reference. For a production setup, customize them to your needs.

Prerequisites

Kubernetes

This guide assumes that you already have a running Kubernetes cluster.

If you haven't yet set up a Kubernetes cluster, see the links below for instructions:

    • Make sure to run with enough resources: minikube start --vm=true --cpus=4 --memory=8g --disk-size=50g

Pinot

Make sure that you've downloaded Apache Pinot. The scripts for the setup in this guide can be found in our open source project on GitHub.

Set up a Pinot cluster in Kubernetes

Start Pinot with Helm

The Pinot repository has pre-packaged Helm charts for Pinot and Presto. The Helm repository index file is .

Note: Specify StorageClass based on your cloud vendor. Don't mount a blob store (such as AzureFile, GoogleCloudStorage, or S3) as the data serving file system. Use only Amazon EBS/GCP Persistent Disk/Azure Disk-style disks.

  • For AWS: "gp2"

  • For GCP: "pd-ssd" or "standard"

  • For Azure: "AzureDisk"

  • For Docker-Desktop: "hostpath"

1.1.1 Update Helm dependency

1.1.2 Start Pinot with Helm

Check Pinot deployment status

Load data into Pinot using Kafka

Bring up a Kafka cluster for real-time data ingestion

Check Kafka deployment status

Ensure the Kafka deployment is ready before executing the scripts in the following steps. Run the following command:

Below is an example output showing the deployment is ready:

Create Kafka topics

Run the scripts below to create two Kafka topics for data ingestion:

Load data into Kafka and create Pinot schema/tables

The script below does the following:

  • Ingests 19492 JSON messages to Kafka topic flights-realtime at a speed of 1 msg/sec

  • Ingests 19492 Avro messages to Kafka topic flights-realtime-avro at a speed of 1 msg/sec

  • Uploads Pinot schema airlineStats

  • Creates Pinot table airlineStats to ingest data from JSON encoded Kafka topic flights-realtime

  • Creates Pinot table airlineStatsAvro to ingest data from Avro encoded Kafka topic flights-realtime-avro

Query with the Pinot Data Explorer

Pinot Data Explorer

The following script (located at ./pinot/helm/pinot) performs local port forwarding, and opens the Pinot query console in your default web browser.

Query Pinot with Superset

Bring up Superset using Helm

  1. Install the Superset Helm repository:

  2. Get the Helm values configuration file:

  3. For Superset to install Pinot dependencies, edit the /tmp/superset-values.yaml file to add a pinotdb pip dependency to the bootstrapScript field. Alternatively, build your own image with this dependency or use the image apachepinot/pinot-superset:latest instead.

  4. Replace the default admin credentials inside the init section with a meaningful user profile and a stronger password.

  5. Install Superset using Helm:

  6. Ensure your cluster is up by running:

Access the Superset UI

  1. Run the below command to port forward Superset to your localhost:18088.

  2. Navigate to Superset in your browser with the admin credentials you set in the previous section.

  3. Create a new database connection with the following URI: pinot+http://pinot-broker.pinot-quickstart:8099/query?controller=http://pinot-controller.pinot-quickstart:9000/

  4. Once the database is added, you can add more data sets and explore the dashboard options.

Access Pinot with Trino

Deploy Trino

  1. Deploy Trino with the Pinot plugin installed:

  2. See the charts in the Trino Helm chart repository:

  3. In order to connect Trino to Pinot, you'll need to add the Pinot catalog, which requires extra configurations. Run the below command to get all the configurable values.

  4. To add the Pinot catalog, edit the additionalCatalogs section by adding:

Pinot is deployed in namespace pinot-quickstart, so the controller serviceURL is pinot-controller.pinot-quickstart:9000.

  5. After modifying the /tmp/trino-values.yaml file, deploy Trino with:

  6. Once you've deployed Trino, check the deployment status:

Query Pinot with the Trino CLI

Once Trino is deployed, run the below command to get a runnable Trino CLI.

  1. Download the Trino CLI:

  2. Port forward the Trino service to your local machine if it's not already exposed:

  3. Use the Trino console client to connect to the Trino service:

  4. Query Pinot data using the Trino CLI, like in the sample queries below.

Sample queries to execute

List all catalogs

List all tables

Show schema

Count total documents

Access Pinot with Presto

Deploy Presto with the Pinot plugin

  1. First, deploy Presto with default configurations:

  2. To customize your deployment, run the below command to get all the configurable values.

  3. After modifying the /tmp/presto-values.yaml file, deploy Presto:

  4. Once you've deployed the Presto instance, check the deployment status:

Query Presto using the Presto CLI

Once Presto is deployed, you can run the pinot-presto-cli.sh script, or follow the steps below.

  1. Download the Presto CLI:

  2. Port forward presto-coordinator port 8080 to localhost port 18080:

  3. Start the Presto CLI with the Pinot catalog:

  4. Query Pinot data with the Presto CLI, like in the sample queries below.

Sample queries to execute

List all catalogs

List all tables

Show schema

Count total documents

Delete a Pinot cluster in Kubernetes

To delete your Pinot cluster in Kubernetes, run the following command:

0.7.1

This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations.

Summary

This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations and improvements.

It also adds several new APIs to better manage segments and upload data to offline tables, and it contains many key bug fixes. See details below.

The release was cut from the following commit:

and the following cherry-picks:

Notable New Features

  • Add a server metric: queriesDisabled to check if queries disabled or not. ()

  • Optimization on GroupKey to save the overhead of ser/de the group keys () ()

  • Support validation for jsonExtractKey and jsonExtractScalar functions () ()

  • Real Time Provisioning Helper tool improvement to take data characteristics as input instead of an actual segment ()

  • Add the isolation level config isolation.level to Kafka consumer (2.0) to ingest transactionally committed messages only ()

  • Enhance StarTreeIndexViewer to support multiple trees ()

  • Improves ADLSGen2PinotFS with service principal based auth, auto create container on initial run. It's backwards compatible with key based auth. ()

  • Add metrics for minion tasks status ()

  • Use minion data directory as tmp directory for SegmentGenerationAndPushTask to ensure directory is always cleaned up ()

  • Add optional HTTP basic auth to pinot broker, which enables user- and table-level authentication of incoming queries. ()

  • Add Access Control for REST endpoints of Controller ()

  • Add date_trunc to scalar functions to support date_trunc during ingestion ()

  • Allow tar gz with > 8gb size ()

  • Add Lookup UDF Join support (), (), () ()

  • Add cron scheduler metrics reporting ()

  • Support generating derived column during segment load, so that derived columns can be added on-the-fly ()

  • Support chained transform functions ()

  • Add scalar function JsonPathArray to extract arrays from json ()

  • Add a guard against multiple consuming segments for same partition ()

  • Remove the usage of deprecated range delimiter ()

  • Handle scheduler calls with proper response when it's disabled. ()

  • Simplify SegmentGenerationAndPushTask handling getting schema and table config ()

  • Add a cluster config to config number of concurrent tasks per instance for minion task: SegmentGenerationAndPushTaskGenerator ()

  • Replace BrokerRequestOptimizer with QueryOptimizer to also optimize the PinotQuery ()

  • Add additional string scalar functions ()

  • Add additional scalar functions for array type ()

  • Add CRON scheduler for Pinot tasks ()

  • Set default Data Type while setting type in Add Schema UI dialog ()

  • Add ImportData sub command in pinot admin ()

  • H3-based geospatial index () ()

  • Add JSON index support () () ()

  • Make minion tasks pluggable via reflection ()

  • Add compatibility test for segment operations upload and delete ()

  • Add segment reset API that disables and then enables the segment ()

  • Add Pinot minion segment generation and push task. ()

  • Add a version option to pinot admin to show all the component versions ()

  • Add FST index using lucene lib to speedup REGEXP_LIKE operator on text ()

  • Add APIs for uploading data to an offline table. ()

  • Allow the use of environment variables in stream configs ()

  • Enhance task schedule api for single type/table support ()

  • Add broker time range based pruner for routing. Query operators supported: RANGE, =, <, <=, >, >=, AND, OR()

  • Add json path functions to extract values from json object ()

  • Create a pluggable interface for Table config tuner ()

  • Add a Controller endpoint to return table creation time ()

  • Add tooltips, ability to enable-disable table state to the UI ()

  • Add Pinot Minion client ()

  • Add more efficient use of RoaringBitmap in OnHeapBitmapInvertedIndexCreator and OffHeapBitmapInvertedIndexCreator ()

  • Add decimal percentile support. ()

  • Add API to get status of consumption of a table ()

  • Add support to add offline and real-time tables, individually able to add schema and schema listing in UI ()

  • Improve performance for distinct queries ()

  • Allow adding custom configs during the segment creation phase ()

  • Use sorted index based filtering only for dictionary encoded column ()

  • Enhance forward index reader for better performance ()

  • Support for text index without raw ()

  • Add api for cluster manager to get table state ()

  • Perf optimization for SQL GROUP BY ORDER BY ()

  • Add support using environment variables in the format of ${VAR_NAME:DEFAULT_VALUE} in Pinot table configs. ()

Special notes

  • Pinot controller metrics prefix is fixed to add a missing dot (). This is a backward-incompatible change; JMX queries on controller metrics must be updated

  • Legacy group key delimiter (\t) was removed to be backward-compatible with release 0.5.0 ()

  • Upgrade zookeeper version to 3.5.8 to fix ZOOKEEPER-2184: Zookeeper Client should re-resolve hosts when connection attempts fail. ()

  • Add TLS-support for client-pinot and pinot-internode connections () Upgrades to a TLS-enabled cluster can be performed safely and without downtime. To achieve a live-upgrade, go through the following steps:

    • First, configure alternate ingress ports for https/netty-tls on brokers, controllers, and servers. Restart the components with a rolling strategy to avoid cluster downtime.

    • Second, verify manually that https access to controllers and brokers is live. Then, configure all components to prefer TLS-enabled connections (while still allowing unsecured access). Restart the individual components.

    • Third, disable insecure connections via configuration. You may also have to set controller.vip.protocol and controller.vip.port and update the configuration files of any ingestion jobs. Restart components a final time and verify that insecure ingress via http is not available anymore.

  • PQL endpoint on Broker is deprecated ()

    • Apache Pinot has adopted SQL syntax and semantics. Legacy PQL (Pinot Query Language) is deprecated and no longer supported. Use SQL syntax to query Pinot on broker endpoint /query/sql and controller endpoint /sql

Major Bug fixes

  • Fix the SIGSEGV for large index ()

  • Handle creation of segments with 0 rows so segment creation does not fail if data source has 0 rows. ()

  • Fix QueryRunner tool for multiple runs ()

  • Use URL encoding for the generated segment tar name to handle characters that cannot be parsed to URI. ()

  • Fix a bug of miscounting the top nodes in StarTreeIndexViewer ()

  • Fix the raw bytes column in real-time segment ()

  • Fixes a bug to allow using JSON_MATCH predicate in SQL queries ()

  • Fix the overflow issue when loading the large dictionary into the buffer ()

  • Fix empty data table for distinct query ()

  • Fix the default map return value in DictionaryBasedGroupKeyGenerator ()

  • Fix log message in ControllerPeriodicTask ()

  • Fix bug : RealtimeTableDataManager shuts down SegmentBuildTimeLeaseExtender for all tables in the host ()

  • Fix license headers and plugin checks

0.3.0

0.3.0 release of Apache Pinot introduces the concept of plugins that makes it easy to extend and integrate with other systems.

What's the big change?

The reason behind the architectural change between the previous release (0.2.0) and this release (0.3.0) is the possibility of extending Apache Pinot. The 0.2.0 release was not flexible enough to support new storage types or new stream types; adding new functionality required changing too much code. Thus, the Pinot team went through an extensive refactoring and improvement of the source code.

For instance, the picture below shows the module dependencies of the 0.2.X or previous releases. If we wanted to support a new storage type, we would have had to change several modules. Pretty bad, huh?

In order to conquer this challenge, the following major changes were made:

  • Refactored common interfaces to pinot-spi module

  • Concluded four types of modules:

    • Pinot input format: How to read records from various data/file formats: e.g. Avro/CSV/JSON/ORC/Parquet/Thrift

    • Pinot filesystem: How to operate files on various filesystems: e.g. Azure Data Lake/Google Cloud Storage/S3/HDFS

    • Pinot stream ingestion: How to ingest data stream from various upstream systems, e.g. Kafka/Kinesis/Eventhub

    • Pinot batch ingestion: How to run Pinot batch ingestion jobs in various frameworks, like Standalone, Hadoop, Spark.

  • Built shaded jars for each individual plugin

  • Added support to dynamically load pinot plugins at server startup time

Now the architecture supports a plug-and-play fashion, where new tools can be supported with small, simple extensions, without affecting big chunks of code. Integrations with new streaming services and data formats can be developed in a much simpler and more convenient way.

Notable New Features

  • SQL Support

    • Added Calcite SQL compiler

    • Added SQL response format (, )

    • Added support for GROUP BY with ORDER BY ()

    • Query console defaults to use SQL syntax ()

    • Support column alias (, )

    • Added SQL query endpoint: /query/sql ()

    • Support arithmetic operators ()

    • Support non-literal expressions for right-side operand in predicate comparison()

  • Added support for DISTINCT ()

  • Added support default value for BYTES column ()

  • JDK 11 Support

  • Added support to tune size vs accuracy for approximation aggregation functions: DistinctCountHLL, PercentileEst, PercentileTDigest ()

  • Added Data Anonymizer Tool ()

  • Deprecated pinot-hadoop and pinot-spark modules, replace with pinot-batch-ingestion-hadoop and pinot-batch-ingestion-spark

  • Support STRING and BYTES for no dictionary columns in real-time consuming segments ()

  • Make pinot-distribution to build a pinot-all jar and assemble it ()

  • Added support for PQL case insensitive ()

  • Enhanced TableRebalancer logics

    • Moved to new rebalance strategy ()

    • Supported rebalancing tables under any condition()

    • Supported reassigning completed segments along with Consuming segments for LLC real-time table ()

  • Added experimental support for Text Search‌ ()

  • Upgraded Helix to version 0.9.4, task management now works as expected ()

  • Added date_trunc transformation function. ()

  • Support schema evolution for consuming segment. ()

  • APIs Additions/Changes

    • Pinot Admin Command

      • Added -queryType option in PinotAdmin PostQuery subcommand ()

      • Added -schemaFile as option in AddTable command ()

      • Added OperateClusterConfig sub command in PinotAdmin ()

    • Pinot Controller Rest APIs

      • Get Table leader controller resource ()

      • Support HTTP POST/PUT to upload JSON encoded schema ()

      • Table rebalance API now requires both table name and type as parameters. ()

      • Refactored Segments APIs ()

      • Added segment batch deletion REST API ()

      • Update schema API to reload table on schema change when applicable ()

      • Enhance the task related REST APIs ()

      • Added PinotClusterConfig REST APIs ()

        • GET /cluster/configs

        • POST /cluster/configs

        • DELETE /cluster/configs/{configName}

  • Configurations Additions/Changes

    • Config: controller.host is now optional in Pinot Controller

    • Added instance config: queriesDisabled to disable query sending to a running server ()

    • Added broker config: pinot.broker.enable.query.limit.override configurable max query response size ()

    • Removed deprecated server configs ()

      • pinot.server.starter.enableSegmentsLoadingCheck

      • pinot.server.starter.timeoutInSeconds

      • pinot.server.instance.enable.shutdown.delay

      • pinot.server.instance.starter.maxShutdownWaitTime

      • pinot.server.instance.starter.checkIntervalTime

    • Decouple server instance id with hostname/port config. ()

    • Add FieldConfig to encapsulate encoding, indexing info for a field.()

Major Bug Fixes

  • Fixed the bug of releasing the segment when there are still threads working on it. ()

  • Fixed the bug of uneven task distribution for threads ()

  • Fixed encryption for .tar.gz segment file upload ()

  • Fixed controller rest API to download segment from non local FS. ()

  • Fixed the bug of not releasing segment lock if segment recovery throws exception ()

  • Fixed the issue of server not registering state model factory before connecting the Helix manager ()

  • Fixed the exception in server instance when Helix starts a new ZK session ()

  • Fixed ThreadLocal DocIdSet issue in ExpressionFilterOperator ()

  • Fixed the bug in default value provider classes ()

  • Fixed the bug when no segment exists in RealtimeSegmentSelector ()

Work in Progress

  • We are in the process of supporting text search query functionalities.

  • We are in the process of supporting null value (), currently limited query feature is supported

    • Added Presence Vector to represent null value ()

    • Added null predicate support for leaf predicates ()

Backward Incompatible Changes

  • It's a disruptive upgrade from version 0.1.0 to this release because of the protocol changes between Pinot Broker and Pinot Server. Ensure that you upgrade to release 0.2.0 first, then upgrade to this version.

  • If you build your own startable or war without using the scripts generated in the pinot-distribution module: for Java 8, the environment variable "plugins.dir" is required for Pinot to find where to load all the Pinot plugin jars; for Java 11, the plugins directory must be explicitly added to the classpath. See pinot-admin.sh as an example.

  • As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.

  • Kafka 0.9 is no longer included in the release distribution.

  • Pull request introduces a backward incompatible API change for segments management.

    • Removed segment toggle APIs

    • Removed list all segments in cluster APIs

    • Deprecated below APIs:

      • GET /tables/{tableName}/segments

      • GET /tables/{tableName}/segments/metadata

      • GET /tables/{tableName}/segments/crc

      • GET /tables/{tableName}/segments/{segmentName}

      • GET /tables/{tableName}/segments/{segmentName}/metadata

      • GET /tables/{tableName}/segments/{segmentName}/reload

      • POST /tables/{tableName}/segments/{segmentName}/reload

      • GET /tables/{tableName}/segments/reload

      • POST /tables/{tableName}/segments/reload

  • Pull request deprecated below task related APIs:

    • GET:

      • /tasks/taskqueues: List all task queues

      • /tasks/taskqueuestate/{taskType} -> /tasks/{taskType}/state

      • /tasks/tasks/{taskType} -> /tasks/{taskType}/tasks

      • /tasks/taskstates/{taskType} -> /tasks/{taskType}/taskstates

      • /tasks/taskstate/{taskName} -> /tasks/task/{taskName}/taskstate

      • /tasks/taskconfig/{taskName} -> /tasks/task/{taskName}/taskconfig

    • PUT:

      • /tasks/scheduletasks -> POST /tasks/schedule

      • /tasks/cleanuptasks/{taskType} -> /tasks/{taskType}/cleanup

      • /tasks/taskqueue/{taskType}: Toggle a task queue

    • DELETE:

      • /tasks/taskqueue/{taskType} -> /tasks/{taskType}

  • Deprecated modules pinot-hadoop and pinot-spark and replaced with pinot-batch-ingestion-hadoop and pinot-batch-ingestion-spark.

  • Introduced new Pinot batch ingestion jobs and yaml based job specs to define segment generation jobs and segment push jobs.

  • You may see exceptions like below in pinot-brokers during cluster upgrade, but it's safe to ignore them.

"tableIndexConfig": {
        "invertedIndexColumns": ["foo"],
        ...
    }
"tableIndexConfig": {
        "invertedIndexColumns": ["foo", "bar"],
        ...
    }
curl -X POST \
  "http://localhost:9000/segments/myTable/reload" \
  -H "accept: application/json"

boolean

❌

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

❌

int

🆗

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

❌

long

🆗

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

❌

float

🆗

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

🆗 (5)

double

🆗

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

❌

big decimal

❌

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

❌

timestamp

❌

❌

❌

🆗

❌

❌

❌

🆗

🆗 (2)

🆗

❌

string

🆗

🆗 (1)

🆗 (1)

🆗

🆗 (2) (4)

🆗

🆗

🆗

🆗 (2)

🆗 (3)

❌

json

❌

❌

❌

🆗

🆗 (2)

🆗

🆗

🆗

🆗 (2)

❌

❌

bytes

🆗

❌

🆗 (2)

🆗

❌

❌

❌

🆗

🆗 (2)

❌

❌

map

❌

❌

❌

❌

❌

❌

❌

❌

❌

❌

❌

Bloom filter
Forward index
FST index
Geospatial
Inverted index
JSON index
Range index
Star-tree index
Text search support
Timestamp index
table configuration reference
Cluster Manager in the Pinot UI
Indexing FAQ
78152cd
b527af3
84d59e3
a18dc60
4ec38f7
b48dac0
5d2bc0c
913492e
50a4531
1f21403
8dbb70b
#6586
#6593
#6559
#6246
#6594
#6546
#6580
#6569
#6531
#6549
#6560
#6552
#6507
#6538
#6533
#6530
#6465
#6383
#6286
#6502
#6494
#6495
#6490
#6483
#6475
#6474
#6469
#6468
#6423
#6458
#6446
#6451
#6452
#6396
#6409
#6306
#6408
#6216
#6346
#6395
#6382
#6336
#6340
#6380
#6120
#6354
#6373
#6352
#6259
#6347
#6255
#6331
#6327
#6339
#6320
#6323
#6322
#6296
#6285
#6299
#6288
#6262
#6284
#6211
#6225
#6271
#6499
#6589
#6558
#6418
#6607
#6577
#6466
#6582
#6571
#6569
#6574
#6535
#6476
#6363
#6712
#6709
#6671
#6682

0.8.0

This release introduced several new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins.

Summary

This release introduced several awesome new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins (AWS Kinesis, Apache Pulsar). It contains a lot of query enhancements such as new timestamp and boolean type support and flexible numerical column comparison. It also includes many key bug fixes. See details below.

The release was cut from the following commit: fe83e95aa9124ee59787c580846793ff7456eaa5

and the following cherry-picks:

  • 668b5e0

  • ee887b9

  • c2f7fcc

  • c1ac8a1

  • 4da1dae

  • 573651b

  • c6c407d

  • 0d96c7f

  • c2637d1

Notable New Features

  • Extract time handling for SegmentProcessorFramework (#7158)

  • Add Apache Pulsar low level and high level connector (#7026)

  • Enable parallel builds for compat checker (#7149)

  • Add controller/server API to fetch aggregated segment metadata (#7102)

  • Support Dictionary Based Plan For DISTINCT (#7141)

  • Provide HTTP client to kinesis builder (#7148)

  • Add datetime function with 2 arguments (#7116)

  • Adding ability to check ingestion status for Offline Pinot table (#7070)

  • Add timestamp datatype support in JDBC (#7117)

  • Allow updating controller and broker helix hostname (#7064)

  • Cancel running Kinesis consumer tasks when timeout occurs (#7109)

  • Implement Append merger for partial upsert (#7087)

  • SegmentProcessorFramework Enhancement (#7092)

  • Added TaskMetricsEmitted periodic controller job (#7091)

  • Support json path expressions in query. (#6998)

  • Support data preprocessing for AVRO and ORC formats (#7062)

  • Add partial upsert config and mergers (#6899)

  • Add support for range index rule recommendation(#7034) (#7063)

  • Allow reloading consuming segment by default (#7078)

  • Add LZ4 Compression Codec (#6804) (#7035)

  • Make Pinot JDK 11 Compilable (#6424)

  • Introduce in-Segment Trim for GroupBy OrderBy Query (#6991)

  • Produce GenericRow file in segment processing mapper (#7013)

  • Add ago() scalar transform function (#6820)

  • Add Bloom Filter support for IN predicate(#7005) (#7007)

  • Add genericRow file reader and writer (#6997)

  • Normalize LHS and RHS numerical types for >, >=, <, and <= operators. (#6927)

  • Add Kinesis Stream Ingestion Plugin (#6661)

  • feature/#6766 JSON and Startree index information in API (#6873)

  • Support null value fields in generic row ser/de (#6968)

  • Implement PassThroughTransformOperator to optimize select queries(#6972) (#6973)

  • Optimize TIME_CONVERT/DATE_TIME_CONVERT predicates (#6957)

  • Prefetch call to fetch buffers of columns seen in the query (#6967)

  • Enabling compatibility tests in the script (#6959)

  • Add collectionToJsonMode to schema inference (#6946)

  • Add the complex-type support to decoder/reader (#6945)

  • Adding a new Controller API to retrieve ingestion status for real-time… (#6890)

  • Add support for Long in Modulo partition function. (#6929)

  • Enhance PinotSegmentRecordReader to preserve null values (#6922)

  • add complex-type support to avro-to-pinot schema inference (#6928)

  • Add correct yaml files for real-time data(#6787) (#6916)

  • Add complex-type transformation to offline segment creation (#6914)

  • Add config File support(#6787) (#6901)

  • Enhance JSON index to support nested array (#6877)

  • Add debug endpoint for tables. (#6897)

  • JSON column datatype support. (#6878)

  • Allow empty string in MV column (#6879)

  • Add Zstandard compression support with JMH benchmarking(#6804) (#6876)

  • Normalize LHS and RHS numerical types for = and != operator. (#6811)

  • Change ConcatCollector implementation to use off-heap (#6847)

  • [PQL Deprecation] Clean up the old BrokerRequestOptimizer (#6859)

  • [PQL Deprecation] Do not compile PQL broker request for SQL query (#6855)

  • Add TIMESTAMP and BOOLEAN data type support (#6719)

  • Add admin endpoint for Pinot Minon. (#6822)

  • Remove the usage of PQL compiler (#6808)

  • Add endpoints in Pinot Controller, Broker and Server to get system and application configs. (#6817)

  • Support IN predicate in ColumnValue SegmentPruner(#6756) (#6776)

  • Enable adding new segments to a upsert-enabled real-time table (#6567)

  • Interface changes for Kinesis connector (#6667)

  • Pinot Minion SegmentGenerationAndPush task: PinotFS configs inside taskSpec is always temporary and has higher priority than default PinotFS created by the minion server configs (#6744)

  • DataTable V3 implementation and measure data table serialization cost on server (#6710)

  • add uploadLLCSegment endpoint in TableResource (#6653)

  • File-based SegmentWriter implementation (#6718)

  • Basic Auth for pinot-controller (#6613)

  • UI integration with Authentication API and added login page (#6686)

  • Support data ingestion for offline segment in one pass (#6479)

  • SumPrecision: support all data types and star-tree (#6668)

  • complete compatibility regression testing (#6650)

  • Kinesis implementation Part 1: Rename partitionId to partitionGroupId (#6655)

  • Make Pinot metrics pluggable (#6640)

  • Recover the segment from controller when LLC table cannot load it (#6647)

  • Adding a new API for validating specified TableConfig and Schema (#6620)

  • Introduce a metric for query/response size on broker. (#6590)

  • Adding a controller periodic task to clean up dead minion instances (#6543)

  • Adding new validation for Json, TEXT indexing (#6541)

  • Always return a response from query execution. (#6596)

Special notes

  • After the 0.8.0 release, we will officially support jdk 11, and can now safely start to use jdk 11 features. Code is still compilable with jdk 8 (#6424)

  • RealtimeToOfflineSegmentsTask config has some backward incompatible changes (#7158)

    — timeColumnTransformFunction is removed (backward-incompatible, but rollup is not supported anyway)

    — Deprecate collectorType and replace it with mergeType

    — Add roundBucketTimePeriod and partitionBucketTimePeriod to config the time bucket for round and partition

  • Regex path for pluggable MinionEventObserverFactory is changed from org.apache.pinot.*.event.* to org.apache.pinot.*.plugin.minion.tasks.* (#6980)

  • Moved all pinot built-in minion tasks to the pinot-minion-builtin-tasks module and package them into a shaded jar (#6618)

  • Reloading consuming segment flag pinot.server.instance.reload.consumingSegment will be true by default (#7078)

  • Move JSON decoder from pinot-kafka to pinot-json package. (#7021)

  • Backward incompatible schema change through controller rest API PUT /schemas/{schemaName} will be blocked. (#6737)

  • Deprecated /tables/validateTableAndSchema in favor of the new configs/validate API and introduced new APIs for /tableConfigs to operate on the real-time table config, offline table config and schema in one shot. (#6840)

Major Bug fixes

  • Fix race condition in MinionInstancesCleanupTask (#7122)

  • Fix custom instance id for controller/broker/minion (#7127)

  • Fix UpsertConfig JSON deserialization. (#7125)

  • Fix the memory issue for selection query with large limit (#7112)

  • Fix the deleted segments directory not exist warning (#7097)

  • Fixing docker build scripts by providing JDK_VERSION as parameter (#7095)

  • Misc fixes for json data type (#7057)

  • Fix handling of date time columns in query recommender(#7018) (#7031)

  • fixing pinot-hadoop and pinot-spark test (#7030)

  • Fixing HadoopPinotFS listFiles method to always contain scheme (#7027)

  • fixed GenericRow compare for different _fieldToValueMap size (#6964)

  • Fix NPE in NumericalFilterOptimizer due to IS NULL and IS NOT NULL operator. (#7001)

  • Fix the race condition in real-time text index refresh thread (#6858) (#6990)

  • Fix deep store directory structure (#6976)

  • Fix NPE issue when consumed kafka message is null or the record value is null. (#6950)

  • Mitigate calcite NPE bug. (#6908)

  • Fix the exception thrown in the case that a specified table name does not exist (#6328) (#6765)

  • Fix CAST transform function for chained transforms (#6941)

  • Fixed failing pinot-controller npm build (#6795)

helm repo add pinot https://raw.githubusercontent.com/apache/pinot/master/helm
kubectl create ns pinot-quickstart
helm install pinot pinot/pinot \
    -n pinot-quickstart \
    --set cluster.name=pinot \
    --set server.replicaCount=2
helm dependency update
kubectl create ns pinot-quickstart
helm install -n pinot-quickstart pinot ./pinot
kubectl get all -n pinot-quickstart
helm repo add kafka https://charts.bitnami.com/bitnami
helm install -n pinot-quickstart kafka kafka/kafka --set replicas=1,zookeeper.image.tag=latest,listeners.client.protocol=PLAINTEXT
kubectl get all -n pinot-quickstart | grep kafka
pod/kafka-controller-0                   1/1     Running     0          2m
pod/kafka-controller-1                   1/1     Running     0          2m
pod/kafka-controller-2                   1/1     Running     0          2m
kubectl -n pinot-quickstart exec kafka-controller-0 -- kafka-topics.sh --bootstrap-server kafka:9092 --topic flights-realtime --create --partitions 1 --replication-factor 1
kubectl -n pinot-quickstart exec kafka-controller-0 -- kafka-topics.sh --bootstrap-server kafka:9092 --topic flights-realtime-avro --create --partitions 1 --replication-factor 1
kubectl apply -f pinot/helm/pinot/pinot-realtime-quickstart.yml
./query-pinot-data.sh
helm repo add superset https://apache.github.io/superset
helm inspect values superset/superset > /tmp/superset-values.yaml
kubectl create ns superset
helm upgrade --install --values /tmp/superset-values.yaml superset superset/superset -n superset
kubectl get all -n superset
kubectl port-forward service/superset 18088:8088 -n superset
helm repo add trino https://trinodb.github.io/charts/
helm search repo trino
helm inspect values trino/trino > /tmp/trino-values.yaml
additionalCatalogs:
  pinot: |
    connector.name=pinot
    pinot.controller-urls=pinot-controller.pinot-quickstart:9000
kubectl create ns trino-quickstart
helm install my-trino trino/trino --version 0.2.0 -n trino-quickstart --values /tmp/trino-values.yaml
kubectl get pods -n trino-quickstart
curl -L https://repo1.maven.org/maven2/io/trino/trino-cli/363/trino-cli-363-executable.jar -o /tmp/trino && chmod +x /tmp/trino
echo "Visit http://127.0.0.1:18080 to use your application"
kubectl port-forward service/my-trino 18080:8080 -n trino-quickstart
/tmp/trino --server localhost:18080 --catalog pinot --schema default
trino:default> show catalogs;
  Catalog
---------
 pinot
 system
 tpcds
 tpch
(4 rows)

Query 20211025_010256_00002_mxcvx, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0.70 [0 rows, 0B] [0 rows/s, 0B/s]
trino:default> show tables;
    Table
--------------
 airlinestats
(1 row)

Query 20211025_010326_00003_mxcvx, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0.28 [1 rows, 29B] [3 rows/s, 104B/s]
trino:default> DESCRIBE airlinestats;
        Column        |      Type      | Extra | Comment
----------------------+----------------+-------+---------
 flightnum            | integer        |       |
 origin               | varchar        |       |
 quarter              | integer        |       |
 lateaircraftdelay    | integer        |       |
 divactualelapsedtime | integer        |       |
 divwheelsons         | array(integer) |       |
 divwheelsoffs        | array(integer) |       |
......

Query 20211025_010414_00006_mxcvx, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0.37 [79 rows, 5.96KB] [212 rows/s, 16KB/s]
trino:default> select count(*) as cnt from airlinestats limit 10;
 cnt
------
 9746
(1 row)

Query 20211025_015607_00009_mxcvx, FINISHED, 2 nodes
Splits: 17 total, 17 done (100.00%)
0.24 [1 rows, 9B] [4 rows/s, 38B/s]
helm install presto pinot/presto -n pinot-quickstart
kubectl apply -f presto-coordinator.yaml
helm inspect values pinot/presto > /tmp/presto-values.yaml
helm install presto pinot/presto -n pinot-quickstart --values /tmp/presto-values.yaml
kubectl get pods -n pinot-quickstart
./pinot-presto-cli.sh
curl -L https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.246/presto-cli-0.246-executable.jar -o /tmp/presto-cli && chmod +x /tmp/presto-cli
kubectl port-forward service/presto-coordinator 18080:8080 -n pinot-quickstart> /dev/null &
/tmp/presto-cli --server localhost:18080 --catalog pinot --schema default
presto:default> show catalogs;
 Catalog
---------
 pinot
 system
(2 rows)

Query 20191112_050827_00003_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> show tables;
    Table
--------------
 airlinestats
(1 row)

Query 20191112_050907_00004_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [1 rows, 29B] [1 rows/s, 41B/s]
presto:default> DESCRIBE pinot.dontcare.airlinestats;
        Column        |  Type   | Extra | Comment
----------------------+---------+-------+---------
 flightnum            | integer |       |
 origin               | varchar |       |
 quarter              | integer |       |
 lateaircraftdelay    | integer |       |
 divactualelapsedtime | integer |       |
......

Query 20191112_051021_00005_xkm4g, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:02 [80 rows, 6.06KB] [35 rows/s, 2.66KB/s]
presto:default> select count(*) as cnt from pinot.dontcare.airlinestats limit 10;
 cnt
------
 9745
(1 row)

Query 20191112_051114_00006_xkm4g, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [1 rows, 8B] [2 rows/s, 19B/s]
kubectl delete ns pinot-quickstart
Enable Kubernetes on Docker-Desktop
Install Minikube for local setup
Set up a Kubernetes Cluster using Amazon Elastic Kubernetes Service (Amazon EKS)
Set up a Kubernetes Cluster using Google Kubernetes Engine (GKE)
Set up a Kubernetes Cluster using Azure Kubernetes Service (AKS)
open source project on GitHub
here
here
Sample Output of K8s Deployment Status
# checkout pinot
git clone https://github.com/apache/pinot.git
cd pinot/helm/pinot
2020/03/09 23:37:19.879 ERROR [HelixTaskExecutor] [CallbackProcessor@b808af5-pinot] [pinot-broker] [] Message cannot be processed: 78816abe-5288-4f08-88c0-f8aa596114fe, {CREATE_TIMESTAMP=1583797034542, MSG_ID=78816abe-5288-4f08-88c0-f8aa596114fe, MSG_STATE=unprocessable, MSG_SUBTYPE=REFRESH_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=fooBar_OFFLINE, RESOURCE_NAME=brokerResource, RETRY_COUNT=0, SRC_CLUSTER=pinot, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_hostname.domain,com_9000, TGT_NAME=Broker_hostname,domain.com_6998, TGT_SESSION_ID=f6e19a457b80db5, TIMEOUT=-1, segmentName=fooBar_559, tableName=fooBar_OFFLINE}{}{}
java.lang.UnsupportedOperationException: Unsupported user defined message sub type: REFRESH_SEGMENT
      at org.apache.pinot.broker.broker.helix.TimeboundaryRefreshMessageHandlerFactory.createHandler(TimeboundaryRefreshMessageHandlerFactory.java:68) ~[pinot-broker-0.2.1172.jar:0.3.0-SNAPSHOT-c9d88e47e02d799dc334d7dd1446a38d9ce161a3]
      at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:1096) ~[helix-core-0.9.1.509.jar:0.9.1.509]
      at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:866) [helix-core-0.9.1.509.jar:0.9.1.509]
#4694
#4877
#4602
#4994
#5016
#5033
#4964
#5018
#5070
#4535
#4583
#4666
#4747
#4791
#4977
#4983
#4695
#4990
#5015
#4993
#5020
#4740
#4954
#4726
#4959
#5073
#4545
#4639
#4824
#4806
#4828
#4838
#5054
#5073
#4767
#5040
#4903
#4995
#5006
#4764
#4793
#4855
#4808
#4882
#4929
#4976
#5114
#5137
#5138
#4230
#4585
#4943
#4806
#5054
0.2.0 and before Pinot Module Dependency Diagram
Dependency graph after introducing pinot-plugin in 0.3.0

Minion

Explore the minion component in Apache Pinot, empowering efficient data movement and segment generation within Pinot clusters.

A Pinot minion is an optional cluster component that executes background tasks on table data apart from the query processes performed by brokers and servers. Minions run on independent hardware resources, and are responsible for executing minion tasks as directed by the controller. Examples of minion tasks include converting batch data from a standard format like Avro or JSON into segment files to be loaded into an offline table, and rewriting existing segment files to purge records as required by data privacy laws like GDPR. Minion tasks can run once or be scheduled to run periodically.

Minions isolate the computational burden of out-of-band data processing from the servers. Although a Pinot cluster can function with or without minions, they are typically present to support routine tasks like batch data ingest.

Starting a minion

Make sure you've set up Zookeeper. If you're using Docker, make sure to pull the Pinot Docker image. To start a minion:

Usage: StartMinion
    -help                                                   : Print this message. (required=false)
    -minionHost               <String>                      : Host name for minion. (required=false)
    -minionPort               <int>                         : Port number to start the minion at. (required=false)
    -zkAddress                <http>                        : HTTP address of Zookeeper. (required=false)
    -clusterName              <String>                      : Pinot cluster name. (required=false)
    -configFileName           <Config File Name>            : Minion Starter Config file. (required=false)
docker run \
    --network=pinot-demo \
    --name pinot-minion \
    -d ${PINOT_IMAGE} StartMinion \
    -zkAddress pinot-zookeeper:2181
bin/pinot-admin.sh StartMinion \
    -zkAddress localhost:2181

Interfaces

Pinot task generator

The Pinot task generator interface defines the APIs for the controller to generate tasks for minions to execute.

public interface PinotTaskGenerator {

  /**
   * Initializes the task generator.
   */
  void init(ClusterInfoAccessor clusterInfoAccessor);

  /**
   * Returns the task type of the generator.
   */
  String getTaskType();

  /**
   * Generates a list of tasks to schedule based on the given table configs.
   */
  List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs);

  /**
   * Returns the timeout in milliseconds for each task, 3600000 (1 hour) by default.
   */
  default long getTaskTimeoutMs() {
    return JobConfig.DEFAULT_TIMEOUT_PER_TASK;
  }

  /**
   * Returns the maximum number of concurrent tasks allowed per instance, 1 by default.
   */
  default int getNumConcurrentTasksPerInstance() {
    return JobConfig.DEFAULT_NUM_CONCURRENT_TASKS_PER_INSTANCE;
  }

  /**
   * Performs necessary cleanups (e.g. remove metrics) when the controller leadership changes.
   */
  default void nonLeaderCleanUp() {
  }
}

PinotTaskExecutorFactory

Factory for PinotTaskExecutor which defines the APIs for Minion to execute the tasks.

public interface PinotTaskExecutorFactory {

  /**
   * Initializes the task executor factory.
   */
  void init(MinionTaskZkMetadataManager zkMetadataManager);

  /**
   * Returns the task type of the executor.
   */
  String getTaskType();

  /**
   * Creates a new task executor.
   */
  PinotTaskExecutor create();
}
public interface PinotTaskExecutor {

  /**
   * Executes the task based on the given task config and returns the execution result.
   */
  Object executeTask(PinotTaskConfig pinotTaskConfig)
      throws Exception;

  /**
   * Tries to cancel the task.
   */
  void cancel();
}

MinionEventObserverFactory

Factory for MinionEventObserver which defines the APIs for task event callbacks on minion.

public interface MinionEventObserverFactory {

  /**
   * Initializes the task executor factory.
   */
  void init(MinionTaskZkMetadataManager zkMetadataManager);

  /**
   * Returns the task type of the event observer.
   */
  String getTaskType();

  /**
   * Creates a new task event observer.
   */
  MinionEventObserver create();
}
public interface MinionEventObserver {

  /**
   * Invoked when a minion task starts.
   *
   * @param pinotTaskConfig Pinot task config
   */
  void notifyTaskStart(PinotTaskConfig pinotTaskConfig);

  /**
   * Invoked when a minion task succeeds.
   *
   * @param pinotTaskConfig Pinot task config
   * @param executionResult Execution result
   */
  void notifyTaskSuccess(PinotTaskConfig pinotTaskConfig, @Nullable Object executionResult);

  /**
   * Invoked when a minion task gets cancelled.
   *
   * @param pinotTaskConfig Pinot task config
   */
  void notifyTaskCancelled(PinotTaskConfig pinotTaskConfig);

  /**
   * Invoked when a minion task encounters exception.
   *
   * @param pinotTaskConfig Pinot task config
   * @param exception Exception encountered during execution
   */
  void notifyTaskError(PinotTaskConfig pinotTaskConfig, Exception exception);
}

Built-in tasks

SegmentGenerationAndPushTask

The PushTask fetches files from an input folder (e.g., an S3 bucket) and converts them into segments. The task converts one file into one segment and keeps the file name in the segment metadata to avoid duplicate ingestion. Below is an example task config to put in the TableConfig to enable this task. The task is scheduled every 10 minutes to keep ingesting remaining files, with at most 10 parallel tasks and 1 file per task.

NOTE: You may want to simply omit "tableMaxNumTasks" due to this caveat: the task generates one segment per file and derives the segment name from the time column of the file. If two files happen to have the same time range and are ingested by tasks from different schedules, there might be a segment name conflict. To avoid this for now, you can omit "tableMaxNumTasks"; it then defaults to Integer.MAX_VALUE, meaning as many tasks as possible are scheduled to ingest all input files in a single batch. Within one batch, a sequence number suffix ensures there are no segment name conflicts. Because the sequence number suffix is scoped to one batch, tasks from different batches might still encounter the segment name conflict described above.

When performing ingestion at scale, remember that Pinot will list all of the files contained in the `inputDirURI` every time a `SegmentGenerationAndPushTask` job gets scheduled. This can become a bottleneck when fetching files from a cloud bucket like GCS. To prevent this, make `inputDirURI` point to as few files as possible.

  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY",
      "batchConfigMaps": [
        {
          "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
          "input.fs.prop.region": "us-west-2",
          "input.fs.prop.secretKey": "....",
          "input.fs.prop.accessKey": "....",
          "inputDirURI": "s3://my.s3.bucket/batch/airlineStats/rawdata/",
          "includeFileNamePattern": "glob:**/*.avro",
          "excludeFileNamePattern": "glob:**/*.tmp",
          "inputFormat": "avro"
        }
      ]
    }
  },
  "task": {
    "taskTypeConfigsMap": {
      "SegmentGenerationAndPushTask": {
        "schedule": "0 */10 * * * ?",
        "tableMaxNumTasks": "10"
      }
    }
  }

RealtimeToOfflineSegmentsTask

See Pinot managed Offline flows for details.

MergeRollupTask

See Minion merge rollup task for details.

Enable tasks

Tasks are enabled on a per-table basis. To enable a certain task type (e.g. myTask) on a table, update the table config to include the task type:

{
  ...
  "task": {
    "taskTypeConfigsMap": {
      "myTask": {
        "myProperty1": "value1",
        "myProperty2": "value2"
      }
    }
  }
}

Under each enabled task type, custom properties can be configured for that task type.

There are also two task configs that can be set as part of the cluster configs, as shown below. One controls the task's overall timeout (1 hour by default) and the other controls how many tasks can run concurrently on a single minion worker (1 by default).

Use the "POST /cluster/configs" API on the CLUSTER tab in Swagger with the following payload:
{
	"RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
	"RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
}
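
The same settings can also be applied from the command line with curl (assuming the controller is reachable at localhost:9000):

curl -X POST "http://localhost:9000/cluster/configs" \
  -H "Content-Type: application/json" \
  -d '{
    "RealtimeToOfflineSegmentsTask.timeoutMs": "600000",
    "RealtimeToOfflineSegmentsTask.numConcurrentTasksPerInstance": "4"
  }'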

Schedule tasks

Auto-schedule

There are 2 ways to enable task scheduling:

Controller level schedule for all minion tasks

Tasks can be scheduled periodically for all task types on all enabled tables. Enable auto task scheduling by configuring the schedule frequency in the controller config with the key controller.task.frequencyPeriod. This takes period strings as values, e.g. 2h, 30m, 1d.
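
For example, to have the controller schedule all enabled minion task types every hour, the controller configuration could contain:

controller.task.frequencyPeriod=1h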

Per table and task level schedule

Tasks can also be scheduled based on cron expressions. The cron expression is set in the schedule config for each task type separately. The controller config controller.task.scheduler.enabled must be set to true to enable cron scheduling.
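
For example, the controller configuration would contain:

controller.task.scheduler.enabled=true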

As shown below, the RealtimeToOfflineSegmentsTask will be scheduled at the first second of every minute (following the syntax defined here).

  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1h",
        "bufferTimePeriod": "1h",
        "schedule": "0 * * * * ?"
      }
    }
  },

Manual schedule

Tasks can be manually scheduled using the following controller rest APIs:

Rest API
Description

POST /tasks/schedule

Schedule tasks for all task types on all enabled tables

POST /tasks/schedule?taskType=myTask

Schedule tasks for the given task type on all enabled tables

POST /tasks/schedule?tableName=myTable_OFFLINE

Schedule tasks for all task types on the given table

POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE

Schedule tasks for the given task type on the given table
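
For example, the last variant can be invoked with curl against a controller running at localhost:9000 (the task type and table name are illustrative):

curl -X POST \
  "http://localhost:9000/tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE" \
  -H "accept: application/json"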

Schedule task on specific instances

Tasks can be scheduled on specific instances using the following config at task level:

  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1h",
        "bufferTimePeriod": "1h",
        "schedule": "0 * * * * ?",
        "minionInstanceTag": "tag1_MINION"
      }
    }
  },

By default, the value is minion_untagged for backward compatibility. This allows users to schedule tasks on specific nodes and to isolate tasks across tables and task types.

Rest API
Description

POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE&minionInstanceTag=tag1_MINION

Schedule tasks for the given task type of the given table on the minion nodes tagged as tag1_MINION.

Task level advanced configs

allowDownloadFromServer

When a task is executed on a segment, the minion node fetches the segment from deepstore. If the deepstore is not accessible, the minion node can download the segment from the server node. This is controlled by the allowDownloadFromServer config in the task config. By default, this is set to false.

This config can also be set at the minion instance level with pinot.minion.task.allow.download.from.server (default is false). The instance-level config helps enforce the behavior when the number of tables or tasks is large and you want to enable it for all of them. Note: the task-level config overrides the instance-level config value.
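
For example, a task-level config that enables this fallback might look like the following sketch (the task type shown is illustrative):

  "task": {
    "taskTypeConfigsMap": {
      "MergeRollupTask": {
        "allowDownloadFromServer": "true"
      }
    }
  }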

Plug-in custom tasks

To plug in a custom task, implement PinotTaskGenerator, PinotTaskExecutorFactory and MinionEventObserverFactory (optional) for the task type (all of them should return the same string for getTaskType()), and annotate them with the following annotations:

Implementation
Annotation

PinotTaskGenerator

@TaskGenerator

PinotTaskExecutorFactory

@TaskExecutorFactory

MinionEventObserverFactory

@EventObserverFactory

After annotating the classes, place them in a package whose name matches org.apache.pinot.*.plugin.minion.tasks.*; they will then be auto-registered by the controller and minion.
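
For illustration, a minimal generator for a custom task type called myTask might look like the sketch below. The interface methods follow the PinotTaskGenerator definition shown earlier; the class name, package suffix, and task-config contents are hypothetical, and import statements are omitted because the exact package names depend on the Pinot version.

// Sketch of a custom task generator; the @TaskGenerator annotation enables auto-registration.
package org.apache.pinot.example.plugin.minion.tasks.mytask;

@TaskGenerator
public class MyTaskGenerator implements PinotTaskGenerator {
  public static final String TASK_TYPE = "myTask";

  private ClusterInfoAccessor _clusterInfoAccessor;

  @Override
  public void init(ClusterInfoAccessor clusterInfoAccessor) {
    _clusterInfoAccessor = clusterInfoAccessor;
  }

  @Override
  public String getTaskType() {
    return TASK_TYPE;
  }

  @Override
  public List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs) {
    List<PinotTaskConfig> tasks = new ArrayList<>();
    for (TableConfig tableConfig : tableConfigs) {
      // Generate one task per table. A real generator would inspect the table's task
      // configs and existing segments to decide how much work to schedule.
      Map<String, String> taskConfigs = new HashMap<>();
      taskConfigs.put("tableName", tableConfig.getTableName());
      tasks.add(new PinotTaskConfig(TASK_TYPE, taskConfigs));
    }
    return tasks;
  }
}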

Example

See SimpleMinionClusterIntegrationTest, where the TestTask is plugged in.

Task Manager UI

In the Pinot UI, there is a Minion Task Manager tab under the Cluster Manager page. From that tab, you can find task-related info useful for troubleshooting. This info is mainly collected from the Pinot controller that schedules tasks and from Helix, which tracks task runtime status. There are also buttons to schedule tasks in an ad hoc way. Below are brief introductions to some pages under the Minion Task Manager tab.

The landing page shows which minion task types have been used, i.e. which task types have created their task queues in Helix.

Clicking into a task type shows the tables using that task, along with buttons to stop the task queue, clean up ended tasks, and so on.

Clicking into any table in this list shows how the task is configured for that table, as well as the task metadata if any exists in ZK. For example, MergeRollupTask tracks a watermark in ZK. If the task is cron scheduled, the current and next schedules are also shown on this page.

At the bottom of this page is a list of tasks generated for this table for this specific task type. In this example, one MergeRollup task has been generated and completed.

Clicking into a task from that list shows its start/end time and the subtasks generated for it (as context, one minion task can have multiple subtasks to process data in parallel). In this example, there happens to be a single subtask, and the page shows when it started and stopped and which minion worker it ran on.

Clicking into the subtask shows more details, such as the input task configs and error info if the task failed.

Task-related metrics

There is a controller job that runs every 5 minutes by default and emits metrics about Minion tasks scheduled in Pinot. The following metrics are emitted for each task type:

  • NumMinionTasksInProgress: Number of running tasks

  • NumMinionSubtasksRunning: Number of running sub-tasks

  • NumMinionSubtasksWaiting: Number of waiting sub-tasks (unassigned to a minion as yet)

  • NumMinionSubtasksError: Number of error sub-tasks (completed with an error/exception)

  • PercentMinionSubtasksInQueue: Percent of sub-tasks in waiting or running states

  • PercentMinionSubtasksInError: Percent of sub-tasks in error

The controller also emits metrics about how tasks are cron scheduled:

  • cronSchedulerJobScheduled: Number of cron schedules currently registered to be triggered according to their cron expressions, as a Gauge.

  • cronSchedulerJobTrigger: Number of cron schedules triggered, as a Meter.

  • cronSchedulerJobSkipped: Number of late cron schedules skipped, as a Meter.

  • cronSchedulerJobExecutionTimeMs: Time used to complete task generation, as a Timer.

For each task, the minion will emit these metrics:

  • TASK_QUEUEING: Task queueing time (task_dequeue_time - task_inqueue_time), assuming the time drift between the Helix controller and the Pinot minion is minor; otherwise the value may be negative

  • TASK_EXECUTION: Task execution time, which is the time spent on executing the task

  • NUMBER_OF_TASKS: Number of tasks in progress on that minion, as a Gauge. It is increased by 1 whenever a minion starts a task and decreased by 1 whenever a minion completes a task (either succeeded or failed)

  • NUMBER_TASKS_EXECUTED: Number of tasks executed, as a Meter.

  • NUMBER_TASKS_COMPLETED: Number of tasks completed, as a Meter.

  • NUMBER_TASKS_CANCELLED: Number of tasks cancelled, as a Meter.

  • NUMBER_TASKS_FAILED: Number of tasks failed, as a Meter. Unlike a fatal failure, the task encountered an error that cannot be recovered in this run, but it may still succeed when the task is retried.

  • NUMBER_TASKS_FATAL_FAILED: Number of tasks fatally failed, as a Meter. Unlike a failure, the task encountered an error that will not be recoverable even if the task is retried.

Forward index

The forward index is the mechanism Pinot employs to store the values of each column. At a conceptual level, the forward index can be thought of as a mapping from document IDs (also known as row indices) to the actual column values of each row.

Forward indexes are enabled by default, meaning that columns will have a forward index unless explicitly disabled. Disabling the forward index can save storage space when other indexes sufficiently cover the required data patterns. For information on how to disable the forward index and its implications, refer to Disabling the Forward Index.

Dictionary encoded vs raw value

How forward indexes are implemented depends on the index encoding and whether the column is sorted.

When the encoding is set to RAW, the forward index is implemented as an array, where the indices correspond to document IDs and the values represent the actual row values. For more details, refer to the raw value forward index section.

In the case of DICTIONARY encoding, the forward index doesn't store the actual row values but instead stores dictionary IDs. This introduces an additional level of indirection when reading values, but it allows for more efficient physical layouts when the number of unique values in the column is significantly smaller than the number of rows.

The DICTIONARY encoding can be even more efficient if the segment is sorted by the indexed column. You can learn more about the dictionary encoded forward index and the sorted forward index in their respective sections.

When deciding whether a column should use dictionary or raw value encoding, the following comparison may help:

Dictionary

  • Provides compression when cardinality is low to medium.

  • Allows for indexing (especially the inverted index).

  • Adds one level of dereferencing, so it can increase disk seeks.

  • For strings, adds padding to make all values in the dictionary equal length.

Raw Value

  • Eliminates padding overhead.

  • No inverted index (only JSON/Text/FST indexes).

  • Eliminates the additional dereferencing, so it works well when all docs of interest are contiguous.

  • Incurs chunk decompression overhead when the selected docs don't have spatial locality.

Dictionary-encoded forward index with bit compression (default)

In this approach, each unique value in a column is assigned an ID, and a dictionary is constructed to map these IDs back to their corresponding values. Instead of storing the actual values, the default forward index stores these bit-compressed IDs. This method is particularly effective when dealing with columns containing few unique values, as it significantly improves space efficiency.

The diagram below illustrates dictionary encoding for two columns with different data types (integer and string). For colA, dictionary encoding leads to significant space savings due to duplicated values. However, for colB, which contains mostly unique values, the compression effect is limited, and padding overhead may be high.

To know more about dictionary encoding, see Dictionary index.

When using the dictionary-encoded forward index for a multi-value column, you can further compress the forward index for repeated multi-value entries by enabling the MV_ENTRY_DICT compression type, which adds another level of dictionary encoding on the multi-value entries. This may be useful, for example, when you pre-join a fact table with a dimension table, where the multi-value entries from the dimension table are repeated after joining with the fact table.

It can be enabled with parameter:

Parameter
Default
Description

dictIdCompressionType

null

The compression type used for the dictionary-encoded forward index; set it to MV_ENTRY_DICT to enable the multi-value entry dictionary encoding described above.
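
For example, assuming the parameter is set in the forward index config of fieldConfigList (mirroring the raw-format configuration shown later on this page; the column name is illustrative), enabling it could look like this sketch:

{
  "fieldConfigList": [
    {
      "name": "myMultiValueColumn",
      "encodingType": "DICTIONARY",
      "indexes": {
        "forward": {
          "dictIdCompressionType": "MV_ENTRY_DICT"
        }
      }
    }
  ]
}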

Sorted forward index with run-length encoding

When a column is physically sorted, Pinot employs a sorted forward index with run-length encoding, which builds upon dictionary encoding. Instead of storing dictionary IDs for each document ID, this approach stores pairs of start and end document IDs for each unique value.

Sorted forward index

(For simplicity, this diagram does not include the dictionary encoding layer.)

Sorted forward indexes offer the benefits of efficient compression and data locality and can also serve as an inverted index. They are active when two conditions are met: the segment is sorted by the column, and the dictionary is enabled for that column. Refer to the dictionary documentation for details on enabling the dictionary.

When dealing with multiple segments, it's crucial to ensure that data is sorted within each segment. Sorting across segments is not necessary.

To guarantee that a segment is sorted by a particular column, follow these steps:

  • For real-time tables, use the tableIndexConfig.sortedColumn property. If there is exactly one column specified in that array, Pinot will sort the segment by that column upon committing.

  • For offline tables, you must pre-sort the data by the specified column before ingesting it into Pinot.

Note that for offline tables, the tableIndexConfig.sortedColumn property is ignored.

Additionally, for real-time tables, even though this property is specified as a JSON array, at most one column should be included. Using an array with more than one column is incorrect and will not result in segments being sorted by all the columns listed in the array.

When a real-time segment is committed, rows will be sorted by the sorting column and it will be transformed into an offline segment.

During the creation of an offline segment, which also applies when a real-time segment is committed, Pinot scans the data in each column. If it detects that all values within a column are sorted in ascending order, Pinot concludes that the segment is sorted based on that particular column. In case this happens on more than one column, all of them are considered as sorting columns. Consequently, whether a segment is sorted by a column or not solely depends on the actual data distribution within the segment and entirely disregards the value of the sortedColumn property. This approach also implies that two segments belonging to the same table may have a different number of sorting columns. In the extreme scenario where a segment contains only one row, Pinot will consider all columns within that segment as sorting columns.

Here is an example of a table configuration that illustrates these concepts:

Part of a tableConfig
{
    "tableIndexConfig": {
        "sortedColumn": [
            "column_name"
        ],
        ...
    }
}

Checking sort status

You can check the sorted status of a column in a segment by running the following:

$ grep memberId <segment_name>/v3/metadata.properties | grep isSorted
column.memberId.isSorted = true

Alternatively, for offline tables and for committed segments in real-time tables, you can retrieve the sorted status from the getServerMetadata endpoint. The following example is based on the Batch Quick Start:

curl -X GET \
  "http://localhost:9000/segments/baseballStats/metadata?columns=playerID&columns=teamID" \
  -H "accept: application/json" 2>/dev/null | \
  jq -c  '.[] | . as $parent |  
          .columns[] | 
          [$parent .segmentName, .columnName, .sorted]'
["baseballStats_OFFLINE_0","teamID",false]
["baseballStats_OFFLINE_0","playerID",false]

Raw value forward index

The raw value forward index stores actual values instead of IDs, which eliminates the need for dictionary lookups when fetching values and can result in improved query performance. The raw forward index is particularly effective for columns with a large number of unique values, where dictionary encoding doesn't provide significant compression benefits.

As shown in the diagram below, dictionary encoding can lead to numerous random memory accesses for dictionary lookups. In contrast, the raw value forward index allows for sequential value scanning, which can enhance query performance when applied appropriately.

Note: The raw value forward index currently does not support the inverted index (all other index types, such as JSON, text, and range, are supported). Also, since reading a value from this index requires reading the entire chunk into memory and decompressing it, it is not suitable for heavy random reads.

The raw format is applied when the dictionary is disabled for a column and the encoding is explicitly set to RAW. For more details, refer to the dictionary documentation and the field config list.

Note: Both configurations must be enabled together for the raw format to take effect. Setting only the encodingType to RAW in the field config is not sufficient.

When using the raw format, you can configure the following parameters:

Parameter
Default
Description

chunkCompressionType

null

The compression that will be used. Replaced by compressionCodec since release 1.2.0

compressionCodec

null

The compression that will be used. Introduced in release 1.2.0

deriveNumDocsPerChunk

false

Modifies the behavior when storing variable length values (like string or bytes)

rawIndexWriterVersion

2

The forward index writer version used to create the index

targetDocsPerChunk

1000

The target number of docs per chunk

targetMaxChunkSize

1MB

The target max chunk size

The compressionCodec parameter has the following valid values:

  • PASS_THROUGH

  • SNAPPY

  • ZSTANDARD

  • LZ4

  • GZIP (Introduced in release 1.2.0)

  • null (the JSON null value, not "null"), which is the default. In this case, PASS_THROUGH will be used for metrics and LZ4 for other columns.

deriveNumDocsPerChunk is only used when the datatype may have a variable length, such as with string, big decimal, bytes, etc. By default, Pinot uses a fixed number of elements that was chosen empirically. If changed to true, Pinot will use a heuristic value that depends on the column data.

rawIndexWriterVersion changes the algorithm used to create the index. This changes the actual data layout, but modern versions of Pinot can read indexes written in older versions. The latest version right now is 4.

targetDocsPerChunk changes the target number of docs to store in a chunk. For rawIndexWriterVersion versions 2 and 3, this will store exactly targetDocsPerChunk per chunk. For rawIndexWriterVersion version 4, this config is used in conjunction with targetMaxChunkSize and chunk size is determined with the formula min(lengthOfLongestDocumentInSegment * targetDocsPerChunk, targetMaxChunkSize). A negative value will disable dynamic chunk sizing and use the static targetMaxChunkSize.

targetMaxChunkSize changes the target max chunk size. For rawIndexWriterVersion versions 2 and 3, this can only be used with deriveNumDocsPerChunk. For rawIndexWriterVersion version 4, this sets the upper bound for the dynamically calculated chunk size. Documents larger than targetMaxChunkSize are given their own 'huge' chunk; therefore, it is recommended to size this so that huge chunks are avoided.

Raw forward index configuration

The recommended way to configure the forward index using raw format is by including the parameters explained above in the indexes.forward object. For example:

Configured in tableConfig fieldConfigList
{
  "tableName": "somePinotTable",
  "fieldConfigList": [
    {
      "name": "playerID",
      "encodingType": "RAW",
      "indexes": {
        "forward": {
          "compressionCodec": "PASS_THROUGH", // or "SNAPPY", "ZSTANDARD", "LZ4" or "GZIP"
          "deriveNumDocsPerChunk": false,
          "rawIndexWriterVersion": 2
        }
      }
    },
    ...
  ],
...
}
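
Similarly, a hypothetical fieldConfigList entry that uses writer version 4 together with the dynamic chunk sizing parameters described above might look like this sketch (the column name and values are illustrative):

    {
      "name": "playerName",
      "encodingType": "RAW",
      "indexes": {
        "forward": {
          "compressionCodec": "LZ4",
          "rawIndexWriterVersion": 4,
          "targetDocsPerChunk": 1000,
          "targetMaxChunkSize": "1MB"
        }
      }
    }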

Deprecated

An alternative method to configure the raw format parameters is available. This older approach can still be used, although it is not recommended. Here are the details of this older method:

  • chunkCompressionType: This parameter can be defined as a sibling of name and encodingType in the fieldConfigList section.

  • deriveNumDocsPerChunk: You can configure this parameter with the property deriveNumDocsPerChunkForRawIndex. Note that in properties, all values must be strings, so valid values for this property are "true" and "false".

  • rawIndexWriterVersion: This parameter can be configured using the property rawIndexWriterVersion. Again, in properties, all values must be strings, so valid values for this property are "2", "3", and so on.

For example:

Configured in tableConfig fieldConfigList
{
  "tableName": "somePinotTable",
  "fieldConfigList": [
    {
      "name": "playerID",
      "encodingType": "RAW",
      "chunkCompressionType": "PASS_THROUGH", // it can also be defined here
      "properties": {
        "deriveNumDocsPerChunkForRawIndex": "false", // here the string value has to be used
        "rawIndexWriterVersion": "2" // here the string value has to be used
      }
    },
    ...
  ],
...
}

While this older method is still supported, it is not the recommended way to configure these parameters. There are no plans to remove support for this older method, but keep in mind that any new parameters added in the future may only be configurable in the forward JSON object.

Disabling the forward index

Traditionally, the forward index has been a mandatory index for all columns in the on-disk segment file format.

However, certain columns may only ever be used as a filter in the WHERE clause. In such scenarios the forward index is not necessary, because other indexes and structures in the segments can provide the required SQL query functionality. For such columns the forward index only takes up extra storage space that could ideally be freed up.

Thus, to give users an option to save storage space, a knob to disable the forward index is available.

The forward index on one or more column(s) in your Pinot table can be disabled, with the following limitations:

  • Only supported for immutable (offline) segments.

  • If the column has a range index then the column must be of single-value type and use range index version 2.

  • MV columns with duplicates within a row will lose the duplicated entries on forward index regeneration. The ordering of data with an MV row may also change on regeneration. A backfill is required in such scenarios (to preserve duplicates or ordering).

  • If forward index regeneration support on reload (i.e. re-enabling the forward index for a forward index disabled column) is required then the dictionary and inverted index must be enabled on that particular column.

Sorted columns will allow the forward index to be disabled, but this operation will be treated as a no-op and the index (which acts as both a forward index and inverted index) will be created.

To disable the forward index, in table config under fieldConfigList, set the disabled property to true as shown below:

Configured in tableConfig fieldConfigList
{
  "tableName": "somePinotTable",
  "fieldConfigList": [
    {
      "name":"columnA",
      "indexes": {
        "forward": {
          "disabled": true
        }
      }
    },
    ...
  ],
  ...
}

The older way to do so is still supported, but not recommended.

Configured in tableConfig fieldConfigList
"fieldConfigList":[
  {
     "name":"columnA",
     "properties": {
        "forwardIndexDisabled": "true"
      }
  }
]

A table reload operation must be performed for the above config to take effect. Enabling / disabling other indexes on the column can be done via the usual table config options.
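
For example, a reload can be triggered through the controller's segments API (assuming a controller at localhost:9000 and the table name from the example above):

curl -X POST "http://localhost:9000/segments/somePinotTable/reload?type=OFFLINE" \
  -H "accept: application/json"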

The forward index can also be regenerated for a column where it is disabled by enabling the index and reloading the segment. The forward index can only be regenerated if the dictionary and inverted index have been enabled for the column. If either have been disabled then the only way to get the forward index back is to regenerate the segments via the offline jobs and re-push / refresh the data.

Warning:

For multi-value (MV) columns the following invariants cannot be maintained after regenerating the forward index for a forward index disabled column:

  • Ordering guarantees of the MV values within a row

  • If entries within an MV row are duplicated, the duplicates will be lost. Regenerate the segments via your offline jobs and re-push / refresh the data to get back the original MV data with duplicates.

We will work on removing the second invariant in the future.

Examples of queries which will fail after disabling the forward index for an example column, columnA, can be found below:

Select

Forward index disabled columns cannot be present in the SELECT clause even if filters are added on it.

SELECT columnA
FROM myTable
    WHERE columnA = 10
SELECT *
FROM myTable

Group By Order By

Forward index disabled columns cannot be present in the GROUP BY and ORDER BY clauses. They also cannot be part of the HAVING clause.

SELECT SUM(columnB)
FROM myTable
GROUP BY columnA
SELECT SUM(columnB), columnA
FROM myTable
GROUP BY columnA
ORDER BY columnA
SELECT MIN(columnA)
FROM myTable
GROUP BY columnB
HAVING MIN(columnA) > 100
ORDER BY columnB

Aggregation Queries

A subset of aggregation functions, such as MIN, MAX, DISTINCTCOUNT, and DISTINCTCOUNTHLL, work when the forward index is disabled. Other aggregation functions will not work, such as the ones below:

SELECT SUM(columnA), AVG(columnA)
FROM myTable
SELECT MAX(ADD(columnA, columnB))
FROM myTable

Distinct

Forward index disabled columns cannot be present in the SELECT DISTINCT clause.

SELECT DISTINCT columnA
FROM myTable

Range Queries

To run queries on single-value columns where the filter clause contains operators such as >, <, >=, or <=, a version 2 range index must be present. Without the range index, such queries will fail, as shown below:

SELECT columnB
FROM myTable
    WHERE columnA > 1000

Connect to Streamlit

In this Apache Pinot guide, we'll learn how to visualize data using the Streamlit web framework.

In this guide you'll learn how to visualize data from Apache Pinot using Streamlit. Streamlit is a Python library that makes it easy to build interactive, data-driven web applications.

We're going to use Streamlit to build a real-time dashboard to visualize the changes being made to Wikimedia properties.

Real-Time Dashboard Architecture

Startup components

We're going to use the following Docker compose file, which spins up instances of Zookeeper, Kafka, along with a Pinot controller, broker, and server:

version: '3.7'
services:
  zookeeper:
    image: zookeeper:3.5.6
    container_name: "zookeeper-wiki"
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: wurstmeister/kafka:latest
    restart: unless-stopped
    container_name: "kafka-wiki"
    ports:
      - "9092:9092"
    expose:
      - "9093"
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper-wiki:2181/kafka
      KAFKA_BROKER_ID: 0
      KAFKA_ADVERTISED_HOST_NAME: kafka-wiki
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-wiki:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,OUTSIDE:PLAINTEXT
  pinot-controller:
    image: apachepinot/pinot:0.10.0
    command: "StartController -zkAddress zookeeper-wiki:2181 -dataDir /data"
    container_name: "pinot-controller-wiki"
    volumes:
      - ./config:/config
      - ./data:/data
    restart: unless-stopped
    ports:
      - "9000:9000"
    depends_on:
      - zookeeper
  pinot-broker:
    image: apachepinot/pinot:0.10.0
    command: "StartBroker -zkAddress zookeeper-wiki:2181"
    restart: unless-stopped
    container_name: "pinot-broker-wiki"
    volumes:
      - ./config:/config
    ports:
      - "8099:8099"
    depends_on:
      - pinot-controller
  pinot-server:
    image: apachepinot/pinot:0.10.0
    command: "StartServer -zkAddress zookeeper-wiki:2181"
    restart: unless-stopped
    container_name: "pinot-server-wiki"
    volumes:
      - ./config:/config
    depends_on:
      - pinot-broker

docker-compose.yml

Run the following command to launch all the components:

docker-compose up

Wikimedia recent changes stream

Wikimedia provides a continuous stream of structured event data describing changes made to various Wikimedia properties. The events are published over HTTP using the Server-Sent Events (SSE) protocol.

You can find the endpoint at: stream.wikimedia.org/v2/stream/recentchange

We'll need to install the SSE client library to consume this data:

pip install sseclient-py

Next, create a file called wiki.py that contains the following:

import json
import pprint
import sseclient
import requests

def with_requests(url, headers):
    """Get a streaming response for the given event feed using requests."""    
    return requests.get(url, stream=True, headers=headers)

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
headers = {'Accept': 'text/event-stream'}
response = with_requests(url, headers)
client = sseclient.SSEClient(response)

for event in client.events():
    stream = json.loads(event.data)
    pprint.pprint(stream)

wiki.py

This script connects to the recent changes feed using the SSE client library and pretty-prints each event it receives.

Let's run this script as shown below:

python wiki.py

We'll see the following (truncated) output:

Output

{'$schema': '/mediawiki/recentchange/1.0.0',
 'bot': False,
 'comment': '[[:File:Storemyr-Fagerbakken landskapsvernområde HVASSER '
            'Oslofjorden Norway (Protected coastal forest Recreational area '
            'hiking trails) Rituell-kultisk steinstreng sørøst i skogen (small '
            'archeological stone string) Vår (spring) 2021-04-24.jpg]] removed '
            'from category',
 'id': 1923506287,
 'meta': {'domain': 'commons.wikimedia.org',
          'dt': '2022-05-12T09:57:00Z',
          'id': '3800228e-43d8-440d-8034-c68977742653',
          'offset': 3855767440,
          'partition': 0,
          'request_id': '930b17cc-f14a-4656-afa1-d15b79a8f666',
          'stream': 'mediawiki.recentchange',
          'topic': 'eqiad.mediawiki.recentchange',
          'uri': 'https://commons.wikimedia.org/wiki/Category:Iron_Age_in_Norway'},
 'namespace': 14,
 'parsedcomment': '<a '
                  'href="/wiki/File:Storemyr-Fagerbakken_landskapsvernomr%C3%A5de_HVASSER_Oslofjorden_Norway_(Protected_coastal_forest_Recreational_area_hiking_trails)_Rituell-kultisk_steinstreng_s%C3%B8r%C3%B8st_i_skogen_(small_archeological_stone_string)_V%C3%A5r_(spring)_2021-04-24.jpg" '
                  'title="File:Storemyr-Fagerbakken landskapsvernområde '
                  'HVASSER Oslofjorden Norway (Protected coastal forest '
                  'Recreational area hiking trails) Rituell-kultisk '
                  'steinstreng sørøst i skogen (small archeological stone '
                  'string) Vår (spring) '
                  '2021-04-24.jpg">File:Storemyr-Fagerbakken '
                  'landskapsvernområde HVASSER Oslofjorden Norway (Protected '
                  'coastal forest Recreational area hiking trails) '
                  'Rituell-kultisk steinstreng sørøst i skogen (small '
                  'archeological stone string) Vår (spring) 2021-04-24.jpg</a> '
                  'removed from category',
 'server_name': 'commons.wikimedia.org',
 'server_script_path': '/w',
 'server_url': 'https://commons.wikimedia.org',
 'timestamp': 1652349420,
 'title': 'Category:Iron Age in Norway',
 'type': 'categorize',
 'user': 'Krg',
 'wiki': 'commonswiki'}
{'$schema': '/mediawiki/recentchange/1.0.0',
 'bot': False,
 'comment': '[[:File:Storemyr-Fagerbakken landskapsvernområde HVASSER '
            'Oslofjorden Norway (Protected coastal forest Recreational area '
            'hiking trails) Rituell-kultisk steinstreng sørøst i skogen (small '
            'archeological stone string) Vår (spring) 2021-04-24.jpg]] removed '
            'from category',
 'id': 1923506289,
 'meta': {'domain': 'commons.wikimedia.org',
          'dt': '2022-05-12T09:57:00Z',
          'id': '2b819d20-beca-46a5-8ce3-b2f3b73d2cbe',
          'offset': 3855767441,
          'partition': 0,
          'request_id': '930b17cc-f14a-4656-afa1-d15b79a8f666',
          'stream': 'mediawiki.recentchange',
          'topic': 'eqiad.mediawiki.recentchange',
          'uri': 'https://commons.wikimedia.org/wiki/Category:Cultural_heritage_monuments_in_F%C3%A6rder'},
 'namespace': 14,
 'parsedcomment': '<a '
                  'href="/wiki/File:Storemyr-Fagerbakken_landskapsvernomr%C3%A5de_HVASSER_Oslofjorden_Norway_(Protected_coastal_forest_Recreational_area_hiking_trails)_Rituell-kultisk_steinstreng_s%C3%B8r%C3%B8st_i_skogen_(small_archeological_stone_string)_V%C3%A5r_(spring)_2021-04-24.jpg" '
                  'title="File:Storemyr-Fagerbakken landskapsvernområde '
                  'HVASSER Oslofjorden Norway (Protected coastal forest '
                  'Recreational area hiking trails) Rituell-kultisk '
                  'steinstreng sørøst i skogen (small archeological stone '
                  'string) Vår (spring) '
                  '2021-04-24.jpg">File:Storemyr-Fagerbakken '
                  'landskapsvernområde HVASSER Oslofjorden Norway (Protected '
                  'coastal forest Recreational area hiking trails) '
                  'Rituell-kultisk steinstreng sørøst i skogen (small '
                  'archeological stone string) Vår (spring) 2021-04-24.jpg</a> '
                  'removed from category',
 'server_name': 'commons.wikimedia.org',
 'server_script_path': '/w',
 'server_url': 'https://commons.wikimedia.org',
 'timestamp': 1652349420,
 'title': 'Category:Cultural heritage monuments in Færder',
 'type': 'categorize',
 'user': 'Krg',
 'wiki': 'commonswiki'}

Ingest recent changes into Kafka

Now we're going to import each of the events into Apache Kafka. First let's create a Kafka topic called wiki_events with 5 partitions:

docker exec -it kafka-wiki kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create \
  --topic wiki_events \
  --partitions 5

Create a new file called wiki_to_kafka.py and import the following libraries:

import json
import sseclient
import datetime
import requests
import time
from confluent_kafka import Producer

wiki_to_kafka.py

Add these functions:

def with_requests(url, headers):
    """Get a streaming response for the given event feed using requests."""    
    return requests.get(url, stream=True, headers=headers)

def acked(err, msg):
    if err is not None:
        print("Failed to deliver message: {0}: {1}"
              .format(msg.value(), err.str()))

def json_serializer(obj):
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    # Raise a proper exception type; raising a bare string is invalid in Python 3
    raise TypeError("Type %s not serializable" % type(obj))

wiki_to_kafka.py

And now let's add the code that calls the recent changes API and imports events into the wiki_events topic:

producer = Producer({'bootstrap.servers': 'localhost:9092'})

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
headers = {'Accept': 'text/event-stream'}
response = with_requests(url, headers) 
client = sseclient.SSEClient(response)

events_processed = 0
while True:
    try: 
        for event in client.events():
            stream = json.loads(event.data)
            payload = json.dumps(stream, default=json_serializer, ensure_ascii=False).encode('utf-8')
            producer.produce(topic='wiki_events', 
              key=str(stream['meta']['id']), value=payload, callback=acked)

            events_processed += 1
            if events_processed % 100 == 0:
                print(f"{str(datetime.datetime.now())} Flushing after {events_processed} events")
                producer.flush()
    except Exception as ex:
        print(f"{str(datetime.datetime.now())} Got error:" + str(ex))
        response = with_requests(url, headers) 
        client = sseclient.SSEClient(response)
        time.sleep(2)

wiki_to_kafka.py

The calls to producer.produce and producer.flush in this script are where events are sent to Kafka and the producer is flushed.

If we run this script:

python wiki_to_kafka.py

We'll see a message every time 100 messages are pushed to Kafka, as shown below:

Output

2022-05-12 10:58:34.449326 Flushing after 100 events
2022-05-12 10:58:39.151599 Flushing after 200 events
2022-05-12 10:58:43.399528 Flushing after 300 events
2022-05-12 10:58:47.350277 Flushing after 400 events
2022-05-12 10:58:50.847959 Flushing after 500 events
2022-05-12 10:58:54.768228 Flushing after 600 events

Explore Kafka

Let's check that the data has made its way into Kafka.

The following command returns the message offset for each partition in the wiki_events topic:

docker exec -it kafka-wiki kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic wiki_events

Output

wiki_events:0:42
wiki_events:1:61
wiki_events:2:52
wiki_events:3:56
wiki_events:4:58

Looks good. We can also stream all the messages in this topic by running the following command:

docker exec -it kafka-wiki kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic wiki_events \
  --from-beginning

Output

...
{"$schema": "/mediawiki/recentchange/1.0.0", "meta": {"uri": "https://en.wikipedia.org/wiki/Super_Wings", "request_id": "6f82e64d-220f-41f4-88c3-2e15f03ae504", "id": "c30cd735-1ead-405e-94d1-49fbe7c40411", "dt": "2022-05-12T10:05:36Z", "domain": "en.wikipedia.org", "stream": "mediawiki.recentchange", "topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 3855779703}, "type": "log", "namespace": 0, "title": "Super Wings", "comment": "", "timestamp": 1652349936, "user": "2001:448A:50E0:885B:FD1D:2D04:233E:7647", "bot": false, "log_id": 0, "log_type": "abusefilter", "log_action": "hit", "log_params": {"action": "edit", "filter": "550", "actions": "tag", "log": 32575794}, "log_action_comment": "2001:448A:50E0:885B:FD1D:2D04:233E:7647 triggered [[Special:AbuseFilter/550|filter 550]], performing the action \"edit\" on [[Super Wings]]. Actions taken: Tag ([[Special:AbuseLog/32575794|details]])", "server_url": "https://en.wikipedia.org", "server_name": "en.wikipedia.org", "server_script_path": "/w", "wiki": "enwiki", "parsedcomment": ""}
{"$schema": "/mediawiki/recentchange/1.0.0", "meta": {"uri": "https://no.wikipedia.org/wiki/Brukerdiskusjon:Haros", "request_id": "a20c9692-f301-4faf-9373-669bebbffff4", "id": "566ee63e-8e86-4a7e-a1f3-562704306509", "dt": "2022-05-12T10:05:36Z", "domain": "no.wikipedia.org", "stream": "mediawiki.recentchange", "topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 3855779714}, "id": 84572581, "type": "edit", "namespace": 3, "title": "Brukerdiskusjon:Haros", "comment": "/* Stor forbokstav / ucfirst */", "timestamp": 1652349936, "user": "Asav", "bot": false, "minor": false, "patrolled": true, "length": {"old": 110378, "new": 110380}, "revision": {"old": 22579494, "new": 22579495}, "server_url": "https://no.wikipedia.org", "server_name": "no.wikipedia.org", "server_script_path": "/w", "wiki": "nowiki", "parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\"><a href=\"/wiki/Brukerdiskusjon:Haros#Stor_forbokstav_/_ucfirst\" title=\"Brukerdiskusjon:Haros\">→‎Stor forbokstav / ucfirst</a></span></span>"}
{"$schema": "/mediawiki/recentchange/1.0.0", "meta": {"uri": "https://es.wikipedia.org/wiki/Campo_de_la_calle_Industria", "request_id": "d45bd9af-3e2c-4aac-ae8f-e16d3340da76", "id": "7fb3956e-9bd2-4fa5-8659-72b266cdb45b", "dt": "2022-05-12T10:05:35Z", "domain": "es.wikipedia.org", "stream": "mediawiki.recentchange", "topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 3855779718}, "id": 266270269, "type": "edit", "namespace": 0, "title": "Campo de la calle Industria", "comment": "/* Historia */", "timestamp": 1652349935, "user": "Raimon will", "bot": false, "minor": false, "length": {"old": 7566, "new": 7566}, "revision": {"old": 143485393, "new": 143485422}, "server_url": "https://es.wikipedia.org", "server_name": "es.wikipedia.org", "server_script_path": "/w", "wiki": "eswiki", "parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\"><a href=\"/wiki/Campo_de_la_calle_Industria#Historia\" title=\"Campo de la calle Industria\">→‎Historia</a></span></span>"}
^CProcessed a total of 269 messages

Configure Pinot

Now let's configure Pinot to consume the data from Kafka.

We'll have the following schema:

{
    "schemaName": "wikipedia",
    "dimensionFieldSpecs": [
      {
        "name": "id",
        "dataType": "STRING"
      },
      {
        "name": "wiki",
        "dataType": "STRING"
      },
      {
        "name": "user",
        "dataType": "STRING"
      },
      {
        "name": "title",
        "dataType": "STRING"
      },
      {
        "name": "comment",
        "dataType": "STRING"
      },
      {
        "name": "stream",
        "dataType": "STRING"
      },
      {
        "name": "domain",
        "dataType": "STRING"
      },
      {
        "name": "topic",
        "dataType": "STRING"
      },
      {
        "name": "type",
        "dataType": "STRING"
      },
      {
        "name": "uri",
        "dataType": "STRING"
      },
      {
        "name": "bot",
        "dataType": "BOOLEAN"
      },
      {
        "name": "metaJson",
        "dataType": "STRING"
      }
    ],
    "dateTimeFieldSpecs": [
      {
        "name": "ts",
        "dataType": "TIMESTAMP",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }
    ]
  }

schema.json

And the following table config:

{
    "tableName": "wikievents",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "timeColumnName": "ts",
      "schemaName": "wikipedia",
      "replication": "1",
      "replicasPerPartition": "1"
    },

    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "rangeIndexColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "wiki_events",
        "stream.kafka.broker.list": "kafka-wiki:9093",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        "realtime.segment.flush.threshold.rows": "1000",
        "realtime.segment.flush.threshold.time": "24h",
        "realtime.segment.flush.segment.size": "100M"
      },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant",
      "tagOverrideConfig": {}
    },
      "noDictionaryColumns": [],
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "metaJson",
          "transformFunction": "JSONFORMAT(meta)"
        },
        {
          "columnName": "id",
          "transformFunction": "JSONPATH(metaJson, '$.id')"
        },
        {
          "columnName": "stream",
          "transformFunction": "JSONPATH(metaJson, '$.stream')"
        },
        {
          "columnName": "domain",
          "transformFunction": "JSONPATH(metaJson, '$.domain')"
        },
        {
          "columnName": "topic",
          "transformFunction": "JSONPATH(metaJson, '$.topic')"
        },
        {
          "columnName": "uri",
          "transformFunction": "JSONPATH(metaJson, '$.uri')"
        },
        {
          "columnName": "ts",
          "transformFunction": "\"timestamp\" * 1000"
        }
      ]
    },
    "isDimTable": false
  }

table.json

The streamConfigs section is where we connect Pinot to the Kafka topic that contains the events. Create the schema and table by running the following command:

docker exec -it pinot-controller-wiki bin/pinot-admin.sh AddTable \
  -tableConfigFile /config/table.json \
  -schemaFile /config/schema.json \
  -exec

Once you've done that, navigate to the Pinot UI and run the following query to check that the data has made its way into Pinot:

select domain, count(*) 
from wikievents 
group by domain
order by count(*) DESC
limit 10

As long as you see some records, everything is working as expected.

Building a Streamlit Dashboard

Now let's write some more queries against Pinot and display the results in Streamlit.

First, install the following libraries:

pip install streamlit pinotdb plotly pandas

Create a file called app.py and import libraries and write a header for the page:

import pandas as pd
import streamlit as st
from pinotdb import connect
import plotly.express as px

st.set_page_config(layout="wide")
st.header("Wikipedia Recent Changes")

app.py

Connect to Pinot and write a query that returns recent changes, along with the users who made the changes, and domains where they were made:

conn = connect(host='localhost', port=8099, path='/query/sql', scheme='http')

query = """select
  count(*) FILTER(WHERE  ts > ago('PT1M')) AS events1Min,
  count(*) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS events1Min2Min,     
  distinctcount(user) FILTER(WHERE  ts > ago('PT1M')) AS users1Min,
  distinctcount(user) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS users1Min2Min,
  distinctcount(domain) FILTER(WHERE  ts > ago('PT1M')) AS domains1Min,
  distinctcount(domain) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS domains1Min2Min
from wikievents 
where ts > ago('PT2M')
limit 1
"""

curs = conn.cursor()

curs.execute(query)
df_summary = pd.DataFrame(curs, columns=[item[0] for item in curs.description])

app.py

The FILTER clauses in this query count the number of events from the last minute and from the minute before that. We then do a similar thing to count the number of unique users and domains.

Metrics

Now let's create some metrics based on that data:

metric1, metric2, metric3 = st.columns(3)
metric1.metric(label="Changes", value=df_summary['events1Min'].values[0],
    delta=float(df_summary['events1Min'].values[0] - df_summary['events1Min2Min'].values[0]))

metric2.metric(label="Users", value=df_summary['users1Min'].values[0],
    delta=float(df_summary['users1Min'].values[0] - df_summary['users1Min2Min'].values[0]))

metric3.metric(label="Domains", value=df_summary['domains1Min'].values[0],
    delta=float(df_summary['domains1Min'].values[0] - df_summary['domains1Min2Min'].values[0]))

app.py

Go back to the terminal and run the following command:

streamlit run app.py

Navigate to localhost:8501 to see the Streamlit app. You should see something like the following:

Streamlit Metrics

Changes per minute

Next, let's add a line chart that shows the number of changes being done to Wikimedia per minute. Add the following code to app.py:

query = """
select ToDateTime(DATETRUNC('minute', ts), 'yyyy-MM-dd hh:mm:ss') AS dateMin, count(*) AS changes, 
       distinctcount(user) AS users,
       distinctcount(domain) AS domains
from wikievents 
where ts > ago('PT1H')
group by dateMin
order by dateMin desc
LIMIT 30
"""

curs.execute(query)
df_ts = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
df_ts_melt = pd.melt(df_ts, id_vars=['dateMin'], value_vars=['changes', 'users', 'domains'])

fig = px.line(df_ts_melt, x='dateMin', y="value", color='variable', color_discrete_sequence =['blue', 'red', 'green'])
fig['layout'].update(margin=dict(l=0,r=0,b=0,t=40), title="Changes/Users/Domains per minute")
fig.update_yaxes(range=[0, df_ts["changes"].max() * 1.1])
st.plotly_chart(fig, use_container_width=True)

app.py

Go back to the web browser and you should see something like this:

Streamlit Time Series

Auto Refresh

At the moment we need to refresh our web browser to update the metrics and line chart, but it would be much better if that happened automatically. Let's now add auto refresh functionality.

Add the following code just under the header at the top of app.py:

if not "sleep_time" in st.session_state:
    st.session_state.sleep_time = 2

if not "auto_refresh" in st.session_state:
    st.session_state.auto_refresh = True

auto_refresh = st.checkbox('Auto Refresh?', st.session_state.auto_refresh)

if auto_refresh:
    number = st.number_input('Refresh rate in seconds', value=st.session_state.sleep_time)
    st.session_state.sleep_time = number

app.py

And the following code at the very end (it uses the time module, so make sure import time is present at the top of the file):

if auto_refresh:
    time.sleep(number)
    st.experimental_rerun()

app.py

If we navigate back to our web browser, we'll see the following:

Streamlit Auto Refresh

The full script used in this example is shown below:

import pandas as pd
import streamlit as st
from pinotdb import connect
from datetime import datetime
import plotly.express as px
import time

st.set_page_config(layout="wide")

conn = connect(host='localhost', port=8099, path='/query/sql', scheme='http')

st.header("Wikipedia Recent Changes")

now = datetime.now()
dt_string = now.strftime("%d %B %Y %H:%M:%S")
st.write(f"Last update: {dt_string}")

# Use session state to keep track of whether we need to auto refresh the page and the refresh frequency

if not "sleep_time" in st.session_state:
    st.session_state.sleep_time = 2

if not "auto_refresh" in st.session_state:
    st.session_state.auto_refresh = True

auto_refresh = st.checkbox('Auto Refresh?', st.session_state.auto_refresh)

if auto_refresh:
    number = st.number_input('Refresh rate in seconds', value=st.session_state.sleep_time)
    st.session_state.sleep_time = number

# Find changes that happened in the last 1 minute
# Find changes that happened between 1 and 2 minutes ago

query = """
select count(*) FILTER(WHERE  ts > ago('PT1M')) AS events1Min,
        count(*) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS events1Min2Min,
        distinctcount(user) FILTER(WHERE  ts > ago('PT1M')) AS users1Min,
        distinctcount(user) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS users1Min2Min,
        distinctcount(domain) FILTER(WHERE  ts > ago('PT1M')) AS domains1Min,
        distinctcount(domain) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS domains1Min2Min
from wikievents 
where ts > ago('PT2M')
limit 1
"""

curs = conn.cursor()

curs.execute(query)
df_summary = pd.DataFrame(curs, columns=[item[0] for item in curs.description])


metric1, metric2, metric3 = st.columns(3)

metric1.metric(
    label="Changes",
    value=df_summary['events1Min'].values[0],
    delta=float(df_summary['events1Min'].values[0] - df_summary['events1Min2Min'].values[0])
)

metric2.metric(
    label="Users",
    value=df_summary['users1Min'].values[0],
    delta=float(df_summary['users1Min'].values[0] - df_summary['users1Min2Min'].values[0])
)

metric3.metric(
    label="Domains",
    value=df_summary['domains1Min'].values[0],
    delta=float(df_summary['domains1Min'].values[0] - df_summary['domains1Min2Min'].values[0])
)

# Find all the changes by minute in the last hour

query = """
select ToDateTime(DATETRUNC('minute', ts), 'yyyy-MM-dd hh:mm:ss') AS dateMin, count(*) AS changes, 
       distinctcount(user) AS users,
       distinctcount(domain) AS domains
from wikievents 
where ts > ago('PT10M')
group by dateMin
order by dateMin desc
LIMIT 30
"""

curs.execute(query)
df_ts = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
df_ts_melt = pd.melt(df_ts, id_vars=['dateMin'], value_vars=['changes', 'users', 'domains'])

fig = px.line(df_ts_melt, x='dateMin', y="value", color='variable', color_discrete_sequence=['blue', 'red', 'green'])
fig['layout'].update(margin=dict(l=0,r=0,b=0,t=40), title="Changes/Users/Domains per minute")
fig.update_yaxes(range=[0, df_ts["changes"].max() * 1.1])
st.plotly_chart(fig, use_container_width=True)

# Refresh the page
if auto_refresh:
    time.sleep(number)
    st.experimental_rerun()

app.py

Summary

In this guide we've learnt how to publish data into Kafka from Wikimedia's event stream, ingest it from there into Pinot, and finally make sense of the data using SQL queries run from Streamlit.

Text search support

This page talks about support for text search in Pinot.

This text index method is recommended over the experimental native text index.

Click to skip the background info and go straight to the procedure to enable this text index.

Why do we need text search?

Pinot supports super-fast query processing through its indexes on non-BLOB-like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index.

This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:

SELECT COUNT(*) 
FROM Foo 
WHERE STRING_COL = 'ABCDCD' 
AND INT_COL > 2000

For arbitrary text data that falls into the BLOB/CLOB territory, we need more than exact matches. This often involves using regex, phrase, or fuzzy queries on BLOB-like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large BLOB of text, using the TEXT_MATCH function like this:

SELECT COUNT(*) 
FROM Foo 
WHERE TEXT_MATCH (<column_name>, '<search_expression>')

where <column_name> is the column the text index is created on and <search_expression> conforms to one of the following:

| Search Expression Type | Example |
| --- | --- |
| Phrase query | TEXT_MATCH (<column_name>, '"distributed system"') |
| Term Query | TEXT_MATCH (<column_name>, 'Java') |
| Boolean Query | TEXT_MATCH (<column_name>, 'Java AND c++') |
| Prefix Query | TEXT_MATCH (<column_name>, 'stream*') |
| Regex Query | TEXT_MATCH (<column_name>, '/Exception.*/') |
| Not Query | TEXT_MATCH (<column_name>, ': NOT c%') or NOT TEXT_MATCH (<column_name>, 'c%') |

Current restrictions

Pinot supports text search with the following requirements:

  • The column type should be STRING, or stored as STRING (e.g. JSON).

Sample Datasets

Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.

Apache Access Log

Consider the following snippet from an Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.

Let's say the following snippet of data is stored in the ACCESS_LOG_COL column in a Pinot table.

109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
83.167.113.100 - - [12/Dec/2015:18:31:25 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
83.167.113.100 - - [12/Dec/2015:18:31:25 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
95.29.198.15 - - [12/Dec/2015:18:32:10 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
95.29.198.15 - - [12/Dec/2015:18:32:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
91.227.29.79 - - [12/Dec/2015:18:33:51 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"

Here are some examples of search queries on this data:

Count the number of GET requests.

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'GET')

Count the number of POST requests that have administrator in the URL (administrator/index)

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index')

Count the number of POST requests that have a particular URL and are handled by the Firefox browser

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index AND firefox')
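TEXT_MATCH can also be combined with filters on regular columns (covered in more detail later). The following is a sketch that assumes the same table with a text index on ACCESS_LOG_COL plus a hypothetical timestampMillis time column, and counts the administrator POST requests in a given time range:

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(ACCESS_LOG_COL, 'post AND administrator AND index')
  AND timestampMillis > 1450000000000  -- timestampMillis is a hypothetical epoch-millis column, not part of the sample data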

Resume text

Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.

This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.

Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,

Here are some examples of search queries on this data:

Count the number of candidates that have "machine learning" and "gpu processing": This is a phrase search (more on this further in the document) where we are looking for exact match of phrases "machine learning" and "gpu processing", not necessarily in the same order in the original data.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "gpu processing"')

Count the number of candidates that have "distributed systems" and either 'Java' or 'C++': This is a combination of searching for exact phrase "distributed systems" along with other terms.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')

Query Log

Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY_LOG_COL column in a Pinot table.

SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560988800000 AND 1568764800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1545436800000 AND 1553212800000 GROUP BY dimensionCol3 TOP 2500
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1537228800000 AND 1537660800000 GROUP BY dimensionCol3 TOP 2500
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1561366800000 AND 1561370399999 AND dimensionCol3 = 2019062409 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563807600000 AND 1563811199999 AND dimensionCol3 = 2019072215 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1563811200000 AND 1563814799999 AND dimensionCol3 = 2019072216 LIMIT 10000
SELECT dimensionCol2, dimensionCol4, timestamp, dimensionCol5, dimensionCol6 FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1566327600000 AND 1566329400000 AND dimensionCol3 = 2019082019 LIMIT 10000
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560834000000 AND 1560837599999 AND dimensionCol3 = 2019061805 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560870000000 AND 1560871800000 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560871800001 AND 1560873599999 AND dimensionCol3 = 2019061815 LIMIT 0
SELECT count(dimensionCol2) FROM FOO WHERE dimensionCol1 = 18616904 AND timestamp BETWEEN 1560873600000 AND 1560877199999 AND dimensionCol3 = 2019061816 LIMIT 0

Here are some examples of search queries on this data:

Count the number of queries that have GROUP BY

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(QUERY_LOG_COL, '"group by"')

Count the number of queries that have the SELECT count... pattern

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(QUERY_LOG_COL, '"select count"')

Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(QUERY_LOG_COL, '"timestamp between" AND "group by"')

Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.

A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.

The text index is an addition to the types of per-column indexes users can create in Pinot. However, a text index can only be created on a RAW column, not on a dictionary-encoded column.

Multi-column text index

Since version 1.4.0, Pinot offers two types of text indexes:

  • per-column / single-column text index - stores data separately for each indexed column. This is the type used prior to version 1.4.0.

  • per-segment / multi-column text index - stores all indexed columns' data together. Doing so reduces both RAM and disk usage and speeds up index creation, allowing efficient indexing of tens or hundreds of columns.

Aside from configuration, the new index type behaves the same as the per-column index at query time.

When choosing between the two index types, consider the following:

| Property \ Type | Per-Column | Per-segment |
| --- | --- | --- |
| Querying speed | slower - especially when querying multiple columns | faster |
| Disk and memory usage | higher - each column uses a separate set of Lucene files and document id mapping | lower - Lucene file size is smaller; only one document id mapping is used for all columns |
| Initial build time | higher - because each column uses separate Lucene files | lower - one set of Lucene files and one document id mapping is generated |
| Rebuild time | lower - rebuild affected columns only, other indexes are copied | higher - removes all files and rebuilds from scratch |

Enable a per-column text index

Enable a text index on a column in the table configuration by adding a new section with the name "fieldConfigList".

"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexTypes":["TEXT"]
  },
  {
     "name":"text_col_2",
     "encodingType":"RAW",
     "indexTypes":["TEXT"]
  }
]

Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:

"tableIndexConfig": {
   "noDictionaryColumns": [
     "text_col_1",
     "text_col_2"
 ]}

You can configure text indexes in the following scenarios:

  • Adding a new table with text index enabled on one or more columns.

  • Adding a new column with text index enabled to an existing table.

  • Enabling a text index on an existing column.

When you're using a text index, add the indexed column to the noDictionaryColumns columns list to reduce unnecessary storage overhead.

For instructions on that configuration property, see the Raw value forward index documentation.

Enable a per-segment text index

Unlike the per-column text index, the per-segment text index can only be configured once per table, by adding the multiColumnTextIndexConfig element to the table index configuration:

"tableIndexConfig": {
   "multiColumnTextIndexConfig": {
      "columns": ["hobbies", "skills", "titles" ],
      "properties": {
         "caseSensitive": "false"
       }
       "perColumnProperties": {
          "titles": {
             "caseSensitive": "true"
          }
       }
 },

The config contains a list of columns to index - columns, settings that apply to all columns - properties, and settings applied to a particular column - perColumnProperties.

As shown in the example above, the index configuration allows for both:

  • setting shared index properties that apply to all columns with "properties". Allowed keys are: enableQueryCacheForTextIndex, luceneUseCompoundFile, luceneMaxBufferSizeMB, reuseMutableIndex, and all keys allowed in perColumnProperties.

  • setting column-specific properties (overriding shared ones) with perColumnProperties. Allowed keys: useANDForMultiTermTextIndexQueries, enablePrefixSuffixMatchingInPhraseQueries, stopWordInclude, stopWordExclude, caseSensitive, luceneAnalyzerClass, luceneAnalyzerClassArgs, luceneAnalyzerClassArgTypes, luceneQueryParserClass.

Shared-only properties, e.g. luceneMaxBufferSizeMB, set in perColumnProperties have no effect and will be ignored.

Text index creation

Once the text index is enabled on one or more columns through a table configuration, segment generation code will automatically create the text index (per column).

Text index is supported for both offline and real-time segments.

Text parsing and tokenization

The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.

Pinot's text index is built on top of Lucene. Lucene's standard English text tokenizer generally works well for most classes of text. The parser and tokenizer can be made configurable so that users can specify a custom one, suited to their requirements, on a per-column text-index basis.

There is a default set of "stop words" built into Pinot's text index. This is a set of high-frequency English words that are excluded to improve search efficiency and reduce index size, including:

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", 
"they", "this", "to", "was", "will", "with", "those"

Any occurrence of these words will be ignored by the tokenizer during index creation and search.

In some cases, users might want to customize the set. A good example is when IT (Information Technology) appears in the text and collides with the stop word "it", or when some context-specific words are not informative in the search. To do this, configure the words to include in or exclude from the default stop words in fieldConfig:

"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexType":"TEXT",
     "properties": {
        "stopWordInclude": "incl1, incl2, incl3",
        "stopWordExclude": "it"
     }
  }
]

The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.
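As a sketch of the effect, assuming a table MyTable with the text_col_1 configuration above: because "it" has been removed from the stop words, a term query for IT now matches documents that mention it (matching is still case-insensitive), whereas without the stopWordExclude entry the word would be dropped at both index and query time and the query would match nothing.

SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(text_col_1, 'IT')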

Writing text search queries

The TEXT_MATCH function enables using text search in SQL/PQL.

TEXT_MATCH(text_column_name, search_expression)

  • text_column_name - name of the column to do text search on.

  • search_expression - search query

You can use the TEXT_MATCH function as part of queries in the WHERE clause, like this:

SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...)
SELECT * FROM Foo WHERE TEXT_MATCH(...)

You can also use the TEXT_MATCH filter clause with other filter operators. For example:

SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(...) AND some_other_column_1 > 20000 AND some_other_column_2 < 100000

You can combine multiple TEXT_MATCH filter clauses:

SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(text_col_1, ....) AND TEXT_MATCH(text_col_2, ...)

TEXT_MATCH can be used in the WHERE clause of all kinds of queries supported by Pinot, as sketched after this list:

  • Selection query which projects one or more columns

    • The text column itself can also be included in the select list

  • Aggregation query

  • Aggregation GROUP BY query
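Here is a minimal sketch of all three query shapes, assuming the resume example table MyTable with a text index on SKILLS_COL; candidateLocation is a hypothetical dimension column used only to illustrate the GROUP BY:

-- Selection query that also projects the text column
SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems"') 
LIMIT 10

-- Aggregation query
SELECT COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, 'Java AND C++')

-- Aggregation GROUP BY query (candidateLocation is hypothetical, not part of the sample data)
SELECT candidateLocation, COUNT(*) 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"machine learning"') 
GROUP BY candidateLocation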

The search expression (the second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.

TEXT_MATCH Query Options

The TEXT_MATCH function supports an optional third parameter for specifying Lucene query parser options at query time. This allows for flexible and advanced text search without changing table configuration.

Function Signature:

TEXT_MATCH(text_column_name, search_expression [, options])

  • text_column_name: Name of the column to perform text search on.

  • search_expression: The query string for text search.

  • options (optional): Comma-separated string of key-value pairs to control query parsing and search behavior.

Available Options:

| Option | Values | Description |
| --- | --- | --- |
| parser | CLASSIC, STANDARD, COMPLEX | Selects the Lucene query parser to use. Default is CLASSIC. |
| allowLeadingWildcard | true, false | Allows queries to start with a wildcard (e.g., *term). Default is false. |
| defaultOperator | AND, OR | Sets the default boolean operator for multi-term queries. Default is OR. |

Examples:

-- Use CLASSIC parser with leading wildcard support
SELECT * FROM myTable WHERE TEXT_MATCH(myCol, '*search*', 'parser=CLASSIC, allowLeadingWildcard=true')

-- Use STANDARD parser with AND operator
SELECT * FROM myTable WHERE TEXT_MATCH(myCol, 'term1 term2', 'parser=STANDARD, defaultOperator=AND')

-- Use COMPLEX parser for advanced queries
SELECT * FROM myTable WHERE TEXT_MATCH(myCol, 'complex query', 'parser=COMPLEX')

Phrase query

This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.

The following example walks through queries against an expanded version of the resume text data shown earlier. In this section, "document" means the column value. The data is stored in the SKILLS_COL column and we have created a text index on this column.

Java, C++, worked on open source projects, coursera machine learning
Machine learning, Tensor flow, Java, Stanford university,
Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Java, Python, C++, Machine learning, building and deploying large scale production systems, concurrency, multi-threading, CPU processing
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
Amazon EC2, AWS, hadoop, big data, spark, building high performance scalable systems, building and deploying large scale production systems, concurrency, multi-threading, Java, C++, CPU processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Kubernetes, cluster management, operating systems, concurrency, multi-threading, apache airflow, Apache Spark,
Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution
Database engine, OLAP systems, OLTP transaction processing at large scale, concurrency, multi-threading, GO, building large scale systems

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "Distributed systems":

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Distributed systems"')

The search expression is '\"Distributed systems\"'

  • The search expression is always specified within single quotes '<your expression>'

  • Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped

    • '\"<your phrase>\"'

The above query will match the following documents:

Distributed systems, Java, C++, Go, distributed query engines for analytics and data warehouses, Machine learning, spark, Kubernetes, transaction processing
Distributed systems, database development, columnar query engine, database kernel, storage, indexing and transaction processing, building large scale systems
Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Distributed systems, Java, database engine, cluster management, docker image building and distribution
Distributed systems, Apache Kafka, publish-subscribe, building and deploying large scale production systems, concurrency, multi-threading, C++, CPU processing, Java
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution

But it won't match the following document:

Distributed data processing, systems design experience

This is because the phrase query looks for the phrase occurring in the original document "as is". The terms specified in the phrase must appear in exactly the same order in the original document for it to be considered a match.

NOTE: Matching is always done in a case-insensitive manner.

The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "query processing":

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"query processing"')

The above query will match the following documents:

Apache spark, Java, C++, query processing, transaction processing, distributed storage, concurrency, multi-threading, apache airflow
Databases, columnar query processing, Apache Arrow, distributed systems, Machine learning, cluster management, docker image building and distribution

Term query

Term queries are used to search for individual terms.

This example will query the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.

As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, 'Java')

Composite query using Boolean operators

The Boolean operators AND and OR are supported, and we can use them to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrases "machine learning" and "tensor flow". This combines two phrases using the AND Boolean operator.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND "Tensor Flow"')

The above query will match the following documents:

Machine learning, Tensor flow, Java, Stanford university,
C++, Python, Tensor flow, database kernel, storage, indexing and transaction processing, building large scale systems, Machine learning
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems

This example queries the SKILLS_COL column to look for documents where each document MUST contain the phrase "machine learning" and the terms 'gpu' and 'python'. This combines a phrase and two terms using Boolean operators.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"Machine learning" AND gpu AND python')

The above query will match the following documents:

CUDA, GPU, Python, Machine learning, database kernel, storage, indexing and transaction processing, building large scale systems
CUDA, GPU processing, Tensor flow, Pandas, Python, Jupyter notebook, spark, Machine learning, building high performance scalable systems

When using Boolean operators to combine term(s) and phrase(s) or both, note that:

  • The matching document can contain the terms and phrases in any order.

  • The matching document may not have the terms adjacent to each other (if this is needed, use an appropriate phrase query).

Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using the AND operator in the search expression, the OR operator is used by default:

This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:

  • phrase "distributed systems" OR

  • term 'java' OR

  • term 'C++'.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" Java C++')

Grouping using parentheses is supported:

This example queries the SKILLS_COL column to look for documents where each document MUST contain

  • phrase "distributed systems" AND

  • at least one of the terms Java or C++

Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND is used to combine this with the phrase "distributed systems":

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, '"distributed systems" AND (Java C++)')

Prefix query

Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.

This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams, etc.

SELECT SKILLS_COL 
FROM MyTable 
WHERE TEXT_MATCH(SKILLS_COL, 'stream*')

The above query will match the following documents:

Distributed systems, Java, realtime streaming systems, Machine learning, spark, Kubernetes, distributed storage, concurrency, multi-threading
Big data stream processing, Apache Flink, Apache Beam, database kernel, distributed query engines for analytics and data warehouses
Realtime stream processing, publish subscribe, columnar processing for data warehouses, concurrency, Java, multi-threading, C++,
C++, Java, Python, realtime streaming systems, Machine learning, spark, Kubernetes, transaction processing, distributed storage, concurrency, multi-threading, apache airflow

Regular Expression Query

Phrase and term queries work on the fundamental logic of looking up the terms in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized, and individual "indexable" terms are extracted. These terms are inserted into the index.

Based on the nature of the original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.

Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.

Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).

SELECT SKILLS_COL 
FROM MyTable 
WHERE text_match(SKILLS_COL, '/.*Exception/')

The above query will match any text document containing "exception".

Phrase search with wildcard term matching

Phrase search with wildcard and prefix term matching can match patterns like "pache pino" to the text "Apache Pinot" directly. This kind of query is very common in use cases like log search, where the user needs to search for substrings across term boundaries in long text. To enable such search (which can be more costly, because by default Lucene does not allow * at the start of a pattern, in order to avoid expensive term matching), add a new config key to the column text index config:

"fieldConfigList":[
  {
     "name":"text_col_1",
     "encodingType":"RAW",
     "indexType":"TEXT",
     "properties": {
        "enablePrefixSuffixMatchingInPhraseQueries": "true"
     }
  }
]

With this config enabled, you can now perform the phrase wildcard search using syntax like the following:

SELECT SKILLS_COL 
FROM MyTable 
WHERE text_match(SKILLS_COL, '*pache pino*')

to match the string "Apache pinot" in SKILLS_COL. Boolean expressions like 'pache pino AND apche luce' are also supported.

Deciding Query Types

Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.

The key thing to remember is that phrases should be used when the order of terms in the document is important and when separating the phrase into individual terms doesn't make sense from the end user's perspective.

An example would be phrase "machine learning".

TEXT_MATCH(column, '"machine learning"')

However, if we are searching for documents matching the terms Java and C++, using the phrase query "Java C++" will actually result in partial results (possibly empty), since now we are relying on the user specifying these skills in exactly the same order (adjacent to each other) in the resume text.

TEXT_MATCH(column, '"Java C++"')

A term query using the Boolean AND operator is more appropriate for such cases:

TEXT_MATCH(column, 'Java AND C++')

Text Index Tuning

To improve Lucene index creation time, some configs have been provided. The field config properties luceneUseCompoundFile and luceneMaxBufferSizeMB can provide faster index writing, but may increase file descriptor usage and/or memory pressure.

Cluster Configuration for Text Search

When text search queries contain too many terms or clauses, Lucene may throw TooManyClauses exceptions, causing query failures. This commonly occurs with:

  • Complex boolean queries with many OR conditions

  • Wildcard queries that expand to many terms

  • Queries with large numbers of search terms

To handle such cases, you can increase the maximum clause count at the cluster level. See the cluster configuration reference for the pinot.lucene.max.clause.count setting.

JSON index

This page describes configuring the JSON index for Apache Pinot.

The JSON index can be applied to JSON string columns to accelerate value lookups and filtering for the column.

When to use JSON index

JSON strings can be used to represent arrays, maps, and nested fields without forcing a fixed schema. While JSON strings are flexible, filtering on JSON string columns is expensive, so consider the use case.

Suppose we have some JSON records similar to the following sample record stored in the person column:

{
  "name": "adam",
  "age": 30,
  "country": "us",
  "addresses":
  [
    {
      "number" : 112,
      "street" : "main st",
      "country" : "us"
    },
    {
      "number" : 2,
      "street" : "second st",
      "country" : "us"
    },
    {
      "number" : 3,
      "street" : "third st",
      "country" : "ca"
    }
  ]
}

Without an index, to look up the key and filter records based on the value, Pinot must scan and reconstruct the JSON object from the JSON string for every record, look up the key and then compare the value.

For example, in order to find all persons whose name is "adam", the query will look like:

SELECT *
FROM mytable
WHERE JSON_EXTRACT_SCALAR(person, '$.name', 'STRING') = 'adam'

The JSON index is designed to accelerate the filtering on JSON string columns without scanning and reconstructing all the JSON objects.
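With a JSON index in place, the same lookup can instead be expressed with the JSON_MATCH predicate (described in detail below), which is answered from the index rather than by scanning and reconstructing every JSON object:

SELECT *
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')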

Enable and configure a JSON index

To enable the JSON index, you can configure the following options in the table configuration:

| Config Key | Description | Type | Default |
| --- | --- | --- | --- |
| maxLevels | Max levels to flatten the JSON object (an array is also counted as one level) | int | -1 (unlimited) |
| excludeArray | Whether to exclude arrays when flattening the object | boolean | false (include arrays) |
| disableCrossArrayUnnest | Whether to not unnest multiple arrays (unique combination of all elements in those arrays). If a document contains two arrays holding M and N elements respectively, flattening produces M*N documents. If the number of such combinations reaches 100k, an error with the "Got too many combinations" message is thrown. | boolean | false (calculate unique combinations of all elements) |
| includePaths | Only include the given paths, e.g. "$.a.b", "$.a.c[*]" (mutually exclusive with excludePaths). Paths under the included paths will be included, e.g. "$.a.b.c" will be included when "$.a.b" is configured to be included. | Set<String> | null (include all paths) |
| excludePaths | Exclude the given paths, e.g. "$.a.b", "$.a.c[*]" (mutually exclusive with includePaths). Paths under the excluded paths will also be excluded, e.g. "$.a.b.c" will be excluded when "$.a.b" is configured to be excluded. | Set<String> | null (include all paths) |
| excludeFields | Exclude the given fields, e.g. "b", "c", even if they are under the included paths. | Set<String> | null (include all fields) |
| indexPaths | Index the given paths, e.g. *.*, a.**. Paths matching the indexed paths will be indexed, e.g. a.** indexes everything whose first layer is "a", and *.* indexes everything with maxLevels=2. This config can work together with other configs such as includePaths, excludePaths, and maxLevels, but usually does not have to because it is flexible enough to cover most scenarios. | Set<String> | null (equivalent to **, i.e. include all fields) |
| maxValueLength | If the value of a JSON node (not the whole document) is longer than this value, it is replaced with $SKIPPED$ before indexing. | int | 0 (disabled) |
| skipInvalidJson | If set, while adding JSON to the index, ill-formed JSON is replaced with an empty key/path and a $SKIPPED$ value instead of throwing an exception. | boolean | false (disabled) |

Recommended way to configure

The recommended way to configure a JSON index is in the fieldConfigList.indexes object, within the json key.

json index defined in tableConfig
{
  "fieldConfigList": [
    {
      "name": "person",
      "indexes": {
        "json": {
          "maxLevels": 2,
          "excludeArray": false,
          "disableCrossArrayUnnest": true,
          "includePaths": null,
          "excludePaths": null,
          "excludeFields": null,
          "indexPaths": null
        }
      }
    }
  ],
  ...
}

All options are optional, so the following is a valid configuration that uses the default parameter values:

json index defined in tableConfig
{
  "fieldConfigList": [
    {
      "name": "person",
      "indexes": {
        "json": {}
      }
    }
  ],
  ...
}

Deprecated ways to configure JSON indexes

There are two older ways to configure the index; both are set in the tableIndexConfig section inside the table config.

The first one uses the same JSON explained above, but it is defined inside tableIndexConfig.jsonIndexConfigs.<column name>:

older way to configure json indexes in table config
{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "person": {
        "maxLevels": 2,
        "excludeArray": false,
        "disableCrossArrayUnnest": true,
        "includePaths": null,
        "excludePaths": null,
        "excludeFields": null,
        "indexPaths": null
      },
      ...
    },
    ...
  }
}

Like in the previous case, all parameters are optional, so the following is also valid:

json index with default config
{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "person": {},
      ...
    },
    ...
  }
}

The last option does not support configuring any parameters. To use it, add the name of the column to tableIndexConfig.jsonIndexColumns, as in this example:

json index with default config
{
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "person",
      ...
    ],
    ...
  }
}

Example:

With the following JSON document:

{
  "name": "adam",
  "age": 20,
  "addresses": [
    {
      "country": "us",
      "street": "main st",
      "number": 1
    },
    {
      "country": "ca",
      "street": "second st",
      "number": 2
    }
  ],
  "skills": [
    "english",
    "programming"
  ]
}

Using the default setting, we will flatten the document into the following records:

{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[1]": "programming"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[1]": "programming"
}

With maxValueLength set to 9:

{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[1]": "$SKIPPED$"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[1]": "$SKIPPED$"
}

With maxLevels set to 1:

{
  "name": "adam",
  "age": 20
}

With maxLevels set to 2:

{
  "name": "adam",
  "age": 20,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "skills[1]": "programming"
}

With excludeArray set to true:

{
  "name": "adam",
  "age": 20
}

With disableCrossArrayUnnest set to true:

{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1
},
{
  "name": "adam",
  "age": 20,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "skills[1]": "programming"
}

When cross array un-nesting is disabled, the number of documents produced during JSON flattening is the sum of all array sizes, e.g. 2+2 = 4 in the example above.

With disableCrossArrayUnnest set to false:

{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "addresses[0].street": "main st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "addresses[0].street": "main st",
  "skills[1]": "programming"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "addresses[1].street": "second st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "addresses[1].street": "second st",
  "skills[1]": "programming"
}

When cross array un-nesting is enabled, the number of documents produced during JSON flattening is the product of all array sizes, e.g. 2*2 = 4 in the example above. If the JSON contains multiple large nested arrays, it might be necessary to disable cross array un-nesting (disableCrossArrayUnnest=true) to avoid hitting the 100k flattened documents limit and triggering the 'Got too many combinations' error.

With includePaths set to ["$.name", "$.addresses[*].country"]:

{
  "name": "adam",
  "addresses[0].country": "us"
},
{
  "name": "adam",
  "addresses[1].country": "ca"
}

With excludePaths set to ["$.age", "$.addresses[*].number"]:

{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "skills[1]": "programming"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "skills[1]": "programming"
}

With excludeFields set to ["age", "street"]:

{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "skills[1]": "programming"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "skills[1]": "programming"
}

With indexPaths set to ["*", "address..country"]:

{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca"
}

With skipInvalidJson set to true, if we corrupt the original JSON, e.g. to

{ _invalid_json_
  "name": "adam",
  "age": 20,
  "addresses": [...]
  "skills": [...]
}

then flattening will produce:

{ "": "$SKIPPED$" }

Note that the JSON index can only be applied to STRING/JSON columns whose values are JSON strings.

To reduce unnecessary storage overhead when using a JSON index, we recommend that you add the indexed column to the noDictionaryColumns columns list.

For instructions on that configuration property, see the Raw value forward index documentation.

How to use the JSON index

The JSON index can be used via the JSON_MATCH predicate for filtering: JSON_MATCH(<column>, '<filterExpression>'). For example, to find every entry with the name "adam":

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')

Note that the quotes within the filter expression need to be escaped.

The JSON index can also be used via the JSON_EXTRACT_INDEX predicate for value extraction (optionally with filtering): JSON_EXTRACT_INDEX(<column>, '<jsonPath>', ['resultsType'], ['filter']). For example, to extract every value for path $.name when the path $.id is less than 10:

SELECT jsonextractindex(repo, '$.name', 'STRING', 'dummyValue', '"$.id" < 10')
FROM mytable

More in-depth examples can be found in the JSON_EXTRACT_INDEX function documentation.

Supported filter expressions

Simple key lookup

Find all persons whose name is "adam":

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')

or

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name" IN (''adam'')')

Chained key lookup

Find all persons who have an address (one of the addresses) with number 112:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].number"=112')

Find all persons who have at least one address that is not in the US:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country" != ''us''')

or

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country" NOT IN (''us'') ')

Regex based lookup

Find all persons who have an address (one of the addresses) where the street contains the term 'st':

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, 'REGEXP_LIKE("$.addresses[*].street", ''.*st.*'')')

Range lookup

Find all persons whose age is greater than 18:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.age" > 18')

Find all persons whose age is between 20 and 40 (inclusive):

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.age" BETWEEN 20 AND 40')

Nested filter expression

Find all persons whose name is "adam" and also have an address (one of the addresses) with number 112:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam'' AND "$.addresses[*].number"=112')

NOT IN and != can't be used in nested filter expressions in Pinot versions older than 1.2.0. Note that IS NULL cannot be used in nested filter expressions currently.
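On those older versions, one possible workaround (a sketch, not the only option) is to split the nested expression into separate JSON_MATCH predicates so that != appears on its own. Note the predicates are then evaluated independently, which gives the same result here because $.name is a top-level field rather than another element of the same array:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')
  AND JSON_MATCH(person, '"$.addresses[*].country" != ''us''')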

Array access

Find all persons whose first address has number 112:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].number"=112')

Since the JSON index works on flattened JSON documents, if cross array un-nesting is disabled (disableCrossArrayUnnest = true), then querying more than one array in a single JSON_MATCH function call returns an empty result, e.g.

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us'' AND "$.skills[*]"=''english''')

In such cases, the expression should be split into multiple JSON_MATCH calls, e.g.

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us''')
AND   JSON_MATCH(person, '"$.skills[*]"=''english''')

Existence check

Find all persons who have a phone field within the JSON:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.phone" IS NOT NULL')

Find all persons whose first address does not contain floor field within the JSON:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].floor" IS NULL')

JSON context is maintained

The JSON context is maintained for object elements within an array, meaning the filter won't cross-match different objects in the array.

To find all persons who live on "main st" in "ca":

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st'' AND "$.addresses[*].country"=''ca''')

This query won't match "adam" because none of his addresses matches both the street and the country.

If you don't want JSON context, use multiple separate JSON_MATCH predicates. For example, to find all persons who have addresses on "main st" and have addresses in "ca" (matches need not have the same address):

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st''') 
  AND JSON_MATCH(person, '"$.addresses[*].country"=''ca''')

This query will match "adam" because one of his addresses matches the street and another one matches the country.

The array index is maintained as a separate entry within the element, so in order to query different elements within an array, multiple JSON_MATCH predicates are required. For example, to find all persons who have first address on "main st" and second address on "second st":

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].street"=''main st''') 
  AND JSON_MATCH(person, '"$.addresses[1].street"=''second st''')

Supported JSON values

Object

See examples above.

Array

["item1", "item2", "item3"]

To find the records with array element "item1" in "arrayCol":

SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[*]"=''item1''')

To find the records with second array element "item2" in "arrayCol":

SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[1]"=''item2''')

Value

123
1.23
"Hello World"

To find the records with value 123 in "valueCol":

SELECT ...
FROM mytable
WHERE JSON_MATCH(valueCol, '"$"=123')

Null

null

To find the records with null in "nullableCol":

SELECT ...
FROM mytable
WHERE JSON_MATCH(nullableCol, '"$" IS NULL')

Limitations

  1. The key (left-hand side) of the filter expression must be the leaf level of the JSON object, for example, "$.addresses[*]"='main st' won't work.
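For example, to match on an address's street value, reference the leaf field explicitly, as in the earlier examples; a filter on the non-leaf "$.addresses[*]" won't work:

-- Won't work: "$.addresses[*]" is not a leaf key
-- JSON_MATCH(person, '"$.addresses[*]"=''main st''')

-- Works: reference the leaf field
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st''')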

GitHub Events Stream

Steps for setting up a Pinot cluster and a real-time table which consumes from the GitHub events stream.

In this recipe you will set up an Apache Pinot cluster and a real-time table which consumes data flowing from a GitHub events stream. The stream is based on GitHub pull requests and uses Kafka.

In this recipe you will perform the following steps:

  1. Set up a Pinot cluster, to do which you will:

    a. Start zookeeper.

    b. Start the controller.

    c. Start the broker.

    d. Start the server.

  2. Set up a Kafka cluster.

  3. Create a Kafka topic, which will be called pullRequestMergedEvents.

  4. Create a real-time table called pullRequestMergedEvents and a schema.

  5. Start a task which reads from the GitHub events API and publishes events about merged pull requests to the topic.

  6. Query the real-time data.

Steps

Use either Docker images or launcher scripts

Pull the Docker image

Get the latest Docker image.

export PINOT_VERSION=latest
export PINOT_IMAGE=apachepinot/pinot:${PINOT_VERSION}
docker pull ${PINOT_IMAGE}

Long version

Set up the Pinot cluster

Follow the instructions in Advanced Pinot Setup to set up a Pinot cluster with the components:

  • Zookeeper

  • Controller

  • Broker

  • Server

  • Kafka

Create a Kafka topic

Create a Kafka topic called pullRequestMergedEvents for the demo.

docker exec \
  -it kafka \
  /opt/bitnami/kafka/bin/kafka-topics.sh \
  --bootstrap-server kafka:9092 --partitions=1 \
  --replication-factor=1 --create \
  --topic pullRequestMergedEvents

Add a Pinot table and schema

The schema is present at examples/stream/githubEvents/pullRequestMergedEvents_schema.json and is also pasted below.

pullRequestMergedEvents_schema.json
{
  "schemaName": "pullRequestMergedEvents",
  "dimensionFieldSpecs": [
    {
      "name": "title",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "labels",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "userId",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "userType",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "authorAssociation",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "mergedBy",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "assignees",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "authors",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "committers",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "requestedReviewers",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "requestedTeams",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "reviewers",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "commenters",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "repo",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "organization",
      "dataType": "STRING",
      "defaultNullValue": ""
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "count",
      "dataType": "LONG",
      "defaultNullValue": 1
    },
    {
      "name": "numComments",
      "dataType": "LONG"
    },
    {
      "name": "numReviewComments",
      "dataType": "LONG"
    },
    {
      "name": "numCommits",
      "dataType": "LONG"
    },
    {
      "name": "numLinesAdded",
      "dataType": "LONG"
    },
    {
      "name": "numLinesDeleted",
      "dataType": "LONG"
    },
    {
      "name": "numFilesChanged",
      "dataType": "LONG"
    },
    {
      "name": "numAuthors",
      "dataType": "LONG"
    },
    {
      "name": "numCommitters",
      "dataType": "LONG"
    },
    {
      "name": "numReviewers",
      "dataType": "LONG"
    },
    {
      "name": "numCommenters",
      "dataType": "LONG"
    },
    {
      "name": "createdTimeMillis",
      "dataType": "LONG"
    },
    {
      "name": "elapsedTimeMillis",
      "dataType": "LONG"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "mergedTimeMillis",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:TIMESTAMP",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

The table config is present at examples/stream/githubEvents/docker/pullRequestMergedEvents_realtime_table_config.json and is also pasted below.

Note: If you're setting this up on a pre-configured cluster, set the properties stream.kafka.zk.broker.url and stream.kafka.broker.list correctly, depending on the configuration of your Kafka cluster.

pullRequestMergedEvents_realtime_table_config.json
{
  "tableName": "pullRequestMergedEvents",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "mergedTimeMillis",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "schemaName": "pullRequestMergedEvents",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": [
      "organization",
      "repo"
    ],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "pullRequestMergedEvents",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "pinot-zookeeper:2181/kafka",
      "stream.kafka.broker.list": "kafka:9092",
      "realtime.segment.flush.threshold.time": "12h",
      "realtime.segment.flush.threshold.rows": "100000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

Add the table and schema using the following command:

$ docker run \
    --network=pinot-demo \
    --name pinot-streaming-table-creation \
    ${PINOT_IMAGE} AddTable \
    -schemaFile examples/stream/pullRequestMergedEvents/pullRequestMergedEvents_schema.json \
    -tableConfigFile examples/stream/pullRequestMergedEvents/docker/pullRequestMergedEvents_realtime_table_config.json \
    -controllerHost pinot-controller \
    -controllerPort 9000 \
    -exec
Executing command: AddTable -tableConfigFile examples/stream/pullRequestMergedEvents/docker/pullRequestMergedEvents_realtime_table_config.json -schemaFile examples/stream/pullRequestMergedEvents/pullRequestMergedEvents_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
Sending request: http://pinot-controller:9000/schemas to controller: 20c241022a96, version: Unknown
{"status":"Table pullRequestMergedEvents_REALTIME succesfully added"}

Publish events

Start streaming GitHub events into the Kafka topic:

Prerequisites

Generate a personal access token on GitHub.

$ docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-github-events-into-kafka \
    -d ${PINOT_IMAGE} StreamGitHubEvents \
    -schemaFile examples/stream/pullRequestMergedEvents/pullRequestMergedEvents_schema.json \
    -topic pullRequestMergedEvents \
    -personalAccessToken <your_github_personal_access_token> \
    -kafkaBrokerList kafka:9092

Short version

The short method of setting things up is to use the following command. Make sure to stop any previously running Pinot services.

$ docker run --rm -ti \
    --network=pinot-demo \
    --name pinot-github-events-quick-start \
     ${PINOT_IMAGE} GitHubEventsQuickStart \
    -personalAccessToken <your_github_personal_access_token> 

Get Pinot

Follow the instructions in Build from source to get the latest Pinot code.

Long version

Set up the Pinot cluster

Follow the instructions in Advanced Pinot Setup to set up the Pinot cluster with the components:

  • Zookeeper

  • Controller

  • Broker

  • Server

  • Kafka

Create a Kafka topic

Download Apache Kafka.

Create a Kafka topic called pullRequestMergedEvents for the demo.

$ bin/kafka-topics.sh \
  --create \
  --bootstrap-server localhost:19092 \
  --replication-factor 1 \
  --partitions 1 \
  --topic pullRequestMergedEvents

Add a Pinot table and schema

The schema can be found at /examples/stream/githubevents/ in the release, and is also pasted below:

{
  "schemaName": "pullRequestMergedEvents",
  "dimensionFieldSpecs": [
    {
      "name": "title",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "labels",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "userId",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "userType",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "authorAssociation",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "mergedBy",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "assignees",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "authors",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "committers",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "requestedReviewers",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "requestedTeams",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "reviewers",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "commenters",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": ""
    },
    {
      "name": "repo",
      "dataType": "STRING",
      "defaultNullValue": ""
    },
    {
      "name": "organization",
      "dataType": "STRING",
      "defaultNullValue": ""
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "count",
      "dataType": "LONG",
      "defaultNullValue": 1
    },
    {
      "name": "numComments",
      "dataType": "LONG"
    },
    {
      "name": "numReviewComments",
      "dataType": "LONG"
    },
    {
      "name": "numCommits",
      "dataType": "LONG"
    },
    {
      "name": "numLinesAdded",
      "dataType": "LONG"
    },
    {
      "name": "numLinesDeleted",
      "dataType": "LONG"
    },
    {
      "name": "numFilesChanged",
      "dataType": "LONG"
    },
    {
      "name": "numAuthors",
      "dataType": "LONG"
    },
    {
      "name": "numCommitters",
      "dataType": "LONG"
    },
    {
      "name": "numReviewers",
      "dataType": "LONG"
    },
    {
      "name": "numCommenters",
      "dataType": "LONG"
    },
    {
      "name": "createdTimeMillis",
      "dataType": "LONG"
    },
    {
      "name": "elapsedTimeMillis",
      "dataType": "LONG"
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "timeType": "MILLISECONDS",
      "timeFormat": "EPOCH",
      "dataType": "LONG",
      "name": "mergedTimeMillis"
    }
  }
}

The table config can be found at /examples/stream/githubevents/ in the release, and is also pasted below.

Note

If you're setting this up on a pre-configured cluster, set the properties stream.kafka.zk.broker.url and stream.kafka.broker.list correctly, depending on the configuration of your Kafka cluster.

{
  "tableName": "pullRequestMergedEvents",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "mergedTimeMillis",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "schemaName": "pullRequestMergedEvents",
    "replication": "1",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": [
      "organization",
      "repo"
    ],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "pullRequestMergedEvents",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "localhost:2191/kafka",
      "stream.kafka.broker.list": "localhost:19092",
      "realtime.segment.flush.threshold.time": "12h",
      "realtime.segment.flush.threshold.rows": "100000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

Add the table and schema using the command:

$ bin/pinot-admin.sh AddTable \
  -tableConfigFile $PATH_TO_CONFIGS/examples/stream/githubEvents/pullRequestMergedEvents_realtime_table_config.json \
  -schemaFile $PATH_TO_CONFIGS/examples/stream/githubEvents/pullRequestMergedEvents_schema.json \
  -exec

Publish events

Start streaming GitHub events into the Kafka topic

Prerequisites

Generate a personal access token on GitHub.

$ bin/pinot-admin.sh StreamGitHubEvents \
  -topic pullRequestMergedEvents \
  -personalAccessToken <your_github_personal_access_token> \
  -kafkaBrokerList localhost:19092 \
  -schemaFile $PATH_TO_CONFIGS/examples/stream/githubEvents/pullRequestMergedEvents_schema.json

Short version

To set up all of the above steps with a single command:

$ bin/pinot-admin.sh GitHubEventsQuickStart \
  -personalAccessToken <your_github_personal_access_token>

Kubernetes cluster

If you already have a Kubernetes cluster with Pinot and Kafka (see Running Pinot in Kubernetes), first create the topic, then set up the table and streaming using

$ cd kubernetes/helm
$ kubectl apply -f pinot-github-realtime-events.yml

Query

Browse to the Query Console to view the data.

Visualize with SuperSet

You can use SuperSet to visualize this data. Some of the interesting insights we captured were:

List the most active organizations during the lockdown

Repositories by number of commits in the Apache organization

To integrate with SuperSet you can check out the SuperSet Integrations page.

0.9.0

Summary

This release introduces a new feature, Segment Merge and Rollup, to simplify users' day-to-day operational work. A new metrics plugin is added to support Dropwizard. As usual, there are new functionalities and many UI and performance improvements.

The release was cut from commit 13c9ee9, with the following cherry-picks: 668b5e0, ee887b9.

Support Segment Merge and Roll-up

LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This was leading to slow operations in Helix/Zookeeper, long running queries due to having too many tasks to process, as well as using more space because of a lack of compression.

To solve this problem they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.

At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.

Major Changes:

  • Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor ()

  • Merge/Rollup task scheduler for offline tables. ()

  • Fix MergeRollupTask uploading segments not updating their metadata ()

  • MergeRollupTask integration tests ()

  • Add mergeRollupTask delay metrics ()

  • MergeRollupTaskGenerator enhancement: enable parallel buckets scheduling ()

  • Use maxEndTimeMs for merge/roll-up delay metrics. ()

UI Improvement

This release also sees improvements to Pinot’s query console UI.

  • Cmd+Enter shortcut to run query in query console ()

  • Showing tooltip in SQL Editor ()

  • Make the SQL Editor box expandable ()

  • Fix tables ordering by number of segments ()

SQL Improvements

There have also been improvements and additions to Pinot’s SQL implementation.

New functions:

  • IN ()

  • LASTWITHTIME ()

  • ID_SET on MV columns ()

  • Raw results for Percentile TDigest and Est (),

  • Add timezone as argument in function toDateTime ()

New predicates are supported:

  • LIKE()

  • REGEXP_EXTRACT()

  • FILTER()

Query compatibility improvements:

  • Infer data type for Literal ()

  • Support logical identifier in predicate ()

  • Support JSON queries with top-level array path expression. ()

  • Support configurable group by trim size to improve results accuracy ()

Performance Improvements

This release contains many performance improvements; you may notice them in your day-to-day queries. Thanks to all the great contributions listed below:

  • Reduce the disk usage for segment conversion task ()

  • Simplify association between Java Class and PinotDataType for faster mapping ()

  • Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation ()

  • Replace MINUS with STRCMP ()

  • Bit-sliced range index for int, long, float, double, dictionarized SV columns ()

  • Use MethodHandle to access vectorized unsigned comparison on JDK9+ ()

  • Add option to limit thread usage per query ()

  • Improved range queries ()

  • Faster bitmap scans ()

  • Optimize EmptySegmentPruner to skip pruning when there is no empty segments ()

  • Map bitmaps through a bounded window to avoid excessive disk pressure ()

  • Allow RLE compression of bitmaps for smaller file sizes ()

  • Support raw index properties for columns with JSON and RANGE indexes ()

  • Enhance BloomFilter rule to include IN predicate() ()

  • Introduce LZ4_WITH_LENGTH chunk compression type ()

  • Enhance ColumnValueSegmentPruner and support bloom filter prefetch ()

  • Apply the optimization on dictIds within the segment to DistinctCountHLL aggregation func ()

  • During segment pruning, release the bloom filter after each segment is processed ()

  • Fix JSONPath cache inefficient issue ()

  • Optimize getUnpaddedString with SWAR padding search ()

  • Lighter weight LiteralTransformFunction, avoid excessive array fills ()

  • Inline binary comparison ops to prevent function call overhead ()

  • Memoize literals in query context in order to deduplicate them ()

Other Notable New Features and Changes

  • Human Readable Controller Configs ()

  • Add the support of geoToH3 function ()

  • Add Apache Pulsar as Pinot Plugin () ()

  • Add dropwizard metrics plugin ()

  • Introduce OR Predicate Execution On Star Tree Index ()

  • Allow to extract values from array of objects with jsonPathArray ()

  • Add Realtime table metadata and indexes API. ()

  • Support array with mixing data types ()

  • Support force download segment in reload API ()

  • Show uncompressed znRecord from zk api ()

  • Add debug endpoint to get minion task status. ()

  • Validate CSV Header For Configured Delimiter ()

  • Add auth tokens and user/password support to ingestion job command ()

  • Add option to store the hash of the upsert primary key ()

  • Add null support for time column ()

  • Add mode aggregation function ()

  • Support disable swagger in Pinot servers ()

  • Delete metadata properly on table deletion ()

  • Add basic Obfuscator Support ()

  • Add AWS sts dependency to enable auth using web identity token. ()()

  • Mask credentials in debug endpoint /appconfigs ()

  • Fix /sql query endpoint now compatible with auth ()

  • Fix case sensitive issue in BasicAuthPrincipal permission check ()

  • Fix auth token injection in SegmentGenerationAndPushTaskExecutor ()

  • Add segmentNameGeneratorType config to IndexingConfig ()

  • Support trigger PeriodicTask manually ()

  • Add endpoint to check minion task status for a single task. ()

  • Showing partial status of segment and counting CONSUMING state as good segment status ()

  • Add "num rows in segments" and "num segments queried per host" to the output of Realtime Provisioning Rule ()

  • Check schema backward-compatibility when updating schema through addSchema with override ()

  • Optimize IndexedTable ()

  • Support indices remove in V3 segment format ()

  • Optimize TableResizer ()

  • Introduce resultSize in IndexedTable ()

  • Offset based real-time consumption status checker ()

  • Add causes to stack trace return ()

  • Create controller resource packages config key ()

  • Enhance TableCache to support schema name different from table name ()

  • Add validation for realtimeToOffline task ()

  • Unify CombineOperator multi-threading logic ()

  • Support no downtime rebalance for table with 1 replica in TableRebalancer ()

  • Introduce MinionConf, move END_REPLACE_SEGMENTS_TIMEOUT_MS to minion config instead of task config. ()

  • Adjust tuner api ()

  • Adding config for metrics library ()

  • Add geo type conversion scalar functions ()

  • Add BOOLEAN_ARRAY and TIMESTAMP_ARRAY types ()

  • Add MV raw forward index and MV BYTES data type ()

  • Enhance TableRebalancer to offload the segments from most loaded instances first ()

  • Improve get tenant API to differentiate offline and real-time tenants ()

  • Refactor query rewriter to interfaces and implementations to allow customization ()

  • In ServiceStartable, apply global cluster config in ZK to instance config ()

  • Make dimension tables creation bypass tenant validation ()

  • Allow Metadata and Dictionary Based Plans for No Op Filters ()

  • Reject query with identifiers not in schema ()

  • Round Robin IP addresses when retry uploading/downloading segments ()

  • Support multi-value derived column in offline table reload ()

  • Support segmentNamePostfix in segment name ()

  • Add select segments API ()

  • Controller getTableInstance() call now returns the list of live brokers of a table. ()

  • Allow MV Field Support For Raw Columns in Text Indices ()

  • Allow override distinctCount to segmentPartitionedDistinctCount ()

  • Add a quick start with both UPSERT and JSON index ()

  • Add revertSegmentReplacement API ()

  • Smooth segment reloading with non blocking semantic ()

  • Clear the reused record in PartitionUpsertMetadataManager ()

  • Replace args4j with picocli ()

  • Handle datetime column consistently ()()

  • Allow to carry headers with query requests () ()

  • Allow adding JSON data type for dimension column types ()

  • Separate SegmentDirectoryLoader and tierBackend concepts ()

  • Implement size balanced V4 raw chunk format ()

  • Add presto-pinot-driver lib ()

Major Bug fixes

  • Fix null pointer exception for non-existed metric columns in schema for JDBC driver ()

  • Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD ()

  • Fixed pinot java client to add zkClient close ()

  • Ignore query json parse errors ()

  • Fix shutdown hook for PinotServiceManager () ()

  • Make STRING to BOOLEAN data type change as backward compatible schema change ()

  • Replace gcp hardcoded values with generic annotations ()

  • Fix segment conversion executor for in-place conversion ()

  • Fix reporting consuming rate when the Kafka partition level consumer isn't stopped ()

  • Fix the issue with concurrent modification for segment lineage ()

  • Fix TableNotFound error message in PinotHelixResourceManager ()

  • Fix upload LLC segment endpoint truncated download URL ()

  • Fix task scheduling on table update ()

  • Fix metric method for ONLINE_MINION_INSTANCES metric ()

  • Fix JsonToPinotSchema behavior to be consistent with AvroSchemaToPinotSchema ()

  • Fix currentOffset volatility in consuming segment()

  • Fix misleading error msg for missing URI ()

  • Fix the correctness of getColumnIndices method ()

  • Fix SegmentZKMetadta time handling ()

  • Fix retention for cleaning up segment lineage ()

  • Fix segment generator to not return illegal filenames ()

  • Fix missing LLC segments in segment store by adding controller periodic task to upload them ()

  • Fix parsing error messages returned to FileUploadDownloadClient ()

  • Fix manifest scan which drives /version endpoint ()

  • Fix missing rate limiter if brokerResourceEV becomes null due to ZK connection ()

  • Fix race conditions between segment merge/roll-up and purge (or convertToRawIndex) tasks: ()

  • Fix pql double quote checker exception ()

  • Fix minion metrics exporter config ()

  • Fix segment unable to retry issue by catching timeout exception during segment replace ()

  • Add Exception to Broker Response When Not All Segments Are Available (Partial Response) ()

  • Fix segment generation commands ()

  • Return non zero from main with exception ()

  • Fix parquet plugin shading error ()

  • Fix the lowest partition id is not 0 for LLC ()

  • Fix star-tree index map when column name contains '.' ()

  • Fix cluster manager URLs encoding issue()

  • Fix fieldConfig nullable validation ()

  • Fix verifyHostname issue in FileUploadDownloadClient ()

  • Fix TableCache schema to include the built-in virtual columns ()

  • Fix DISTINCT with AS function ()

  • Fix SDF pattern in DataPreprocessingHelper ()

  • Fix fields missing issue in the source in ParquetNativeRecordReader ()


Star-tree index

This page describes the indexing techniques available in Apache Pinot.

In this page you will learn what a star-tree index is and gain a conceptual understanding of how one works.

Unlike other index techniques which work on a single column, the star-tree index is built on multiple columns and utilizes pre-aggregated results to significantly reduce the number of values to be processed, resulting in improved query performance.

One of the biggest challenges in real-time OLAP systems is achieving and maintaining tight SLAs on latency and throughput on large data sets. Existing techniques such as sorted index or inverted index help improve query latencies, but speed-ups are still limited by the number of documents that need to be processed to compute results. On the other hand, pre-aggregating the results ensures a constant upper bound on query latencies, but can lead to storage space explosion.

Use the star-tree index to utilize pre-aggregated documents to achieve both low query latencies and efficient use of storage space for aggregation and group-by queries.

Existing solutions

Consider the following data set, which is used here as an example to discuss these indexes:

| Country | Browser | Locale | Impressions |
| ------- | ------- | ------ | ----------- |
| CA      | Chrome  | en     | 400         |
| CA      | Firefox | fr     | 200         |
| MX      | Safari  | es     | 300         |
| MX      | Safari  | en     | 100         |
| USA     | Chrome  | en     | 600         |
| USA     | Firefox | es     | 200         |
| USA     | Firefox | en     | 400         |

Sorted index

In this approach, data is sorted on a primary key, which is likely to appear as a filter in most queries in the query set.

This reduces the time to search the documents for a given primary key value from a linear scan O(n) to a binary search O(log n), and also keeps good locality for the documents selected.

While this is a significant improvement over linear scan, there are still a few issues with this approach:

  • While sorting on one column does not require additional space, sorting on additional columns requires additional storage space to re-index the records for the various sort orders.

  • While search time is reduced from O(n) to O(logn), overall latency is still a function of the total number of documents that need to be processed to answer a query.

Inverted index

In this approach, for each value of a given column, we maintain a list of document id’s where this value appears.

Below are the inverted indexes for columns ‘Browser’ and ‘Locale’ for our example data set:

| Browser | Doc Id |
| ------- | ------ |
| Firefox | 1,5,6  |
| Chrome  | 0,4    |
| Safari  | 2,3    |

| Locale | Doc Id  |
| ------ | ------- |
| en     | 0,3,4,6 |
| es     | 2,5     |
| fr     | 1       |

For example, if we want to get all the documents where ‘Browser’ is ‘Firefox’, we can look up the inverted index for ‘Browser’ and identify that it appears in documents [1, 5, 6].

Using an inverted index, we can reduce the search time to constant time O(1). The query latency, however, is still a function of the selectivity of the query: it increases with the number of documents that need to be processed to answer the query.
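For example, if the data set above is stored in a table called myTable (the same hypothetical name used in the star-tree example later on this page), the following query only needs to read documents 1, 5, and 6 identified by the inverted index on Browser, instead of scanning all seven documents:

SELECT SUM(Impressions)
FROM myTable
WHERE Browser = 'Firefox'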

Pre-aggregation

In this technique, we pre-compute the answer for a given query set upfront.

In the example below, we have pre-aggregated the total impressions for each country:

| Country | Impressions |
| ------- | ----------- |
| CA      | 600         |
| MX      | 400         |
| USA     | 1200        |

With this approach, answering queries about total impressions for a country is a value lookup, because we have eliminated the need to process a large number of documents. However, to be able to answer queries that have multiple predicates means we would need to pre-aggregate for various combinations of different dimensions, which leads to an exponential increase in storage space.
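As a sketch, if the rolled-up table above were stored as a hypothetical table named myTable_agg, the per-country question becomes a single-value lookup:

SELECT Impressions
FROM myTable_agg
WHERE Country = 'USA'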

Star-tree solution

On one end of the spectrum we have indexing techniques that improve search times with a limited increase in space, but don't guarantee a hard upper bound on query latencies. On the other end of the spectrum, we have pre-aggregation techniques that offer a hard upper bound on query latencies, but suffer from an exponential explosion of storage space.

The star-tree data structure offers a configurable trade-off between space and time and lets us achieve a hard upper bound for query latencies for a given use case. The following sections cover the star-tree data structure, and explain how Pinot uses this structure to achieve low latencies with high throughput.

Definitions

Tree structure

The star-tree index stores data in a structure that consists of the following properties:

Star-tree index structure
  • Root node (Orange): Single root node, from which the rest of the tree can be traversed.

  • Leaf node (Blue): A leaf node contains at most T records, where T is configurable.

  • Non-leaf node (Green): Nodes with more than T records are further split into children nodes.

  • Star node (Yellow): Non-leaf nodes can also have a special child node called the star node. This node contains the pre-aggregated records after removing the dimension on which the data was split for this level.

  • Dimensions split order ([D1, D2]): Nodes at a given level in the tree are split into children nodes on all values of a particular dimension. The dimensions split order is an ordered list of dimensions that is used to determine the dimension to split on for a given level in the tree.

Node properties

The properties stored in each node are as follows:

  • Dimension: The dimension that the node is split on

  • Start/End Document Id: The range of documents this node points to

  • Aggregated Document Id: One single document that is the aggregation result of all documents pointed by this node

Index generation

The star-tree index is generated in the following steps:

  • The data is first projected as per the dimensionsSplitOrder. Only the dimensions from the split order are retained; the others are dropped. For each unique combination of the retained dimensions, metrics are aggregated per configuration. The aggregated documents are written to a file and serve as the initial star-tree documents (separate from the original documents).

  • Sort the star-tree documents based on the dimensionsSplitOrder. It is primary-sorted on the first dimension in this list, and then secondary sorted on the rest of the dimensions based on their order in the list. Each node in the tree points to a range in the sorted documents.

  • The tree structure can be created recursively (starting at root node) as follows:

    • If a node has more than T records, it is split into multiple children nodes, one for each value of the dimension in the split order corresponding to current level in the tree.

    • A star node can be created (per configuration) for the current node, by dropping the dimension being split on, and aggregating the metrics for rows containing dimensions with identical values. These aggregated documents are appended to the end of the star-tree documents.

      If there is only one value for the current dimension, a star node won’t be created because the documents under the star node are identical to the single node.

  • The above step is repeated recursively until there are no more nodes to split.

  • Multiple star-trees can be generated based on different configurations (dimensionsSplitOrder, aggregations, T)

Aggregation

Aggregation is configured as a pair of an aggregation function and the column to which it is applied.

All types of aggregation function that have a bounded-sized intermediate result are supported.

Supported functions

  • COUNT

  • MIN

  • MAX

  • SUM

  • SUM_PRECISION

    • The maximum precision can be optionally configured in functionParameters using the key precision. For example: {"precision": 20}.

  • AVG

  • MIN_MAX_RANGE

  • PERCENTILE_EST

  • PERCENTILE_RAW_EST

  • PERCENTILE_TDIGEST

    • The compression factor for the TDigest histogram can be optionally configured in functionParameters using the key compressionFactor. For example: {"compressionFactor": 200}. If not configured, the default value of 100 will be used.

  • PERCENTILE_RAW_TDIGEST

    • The compression factor for the TDigest histogram can be optionally configured in functionParameters using the key compressionFactor. For example: {"compressionFactor": 200}. If not configured, the default value of 100 will be used.

  • DISTINCT_COUNT_BITMAP

    • NOTE: The intermediate result RoaringBitmap is not bounded-sized, use carefully on high cardinality columns.

  • DISTINCT_COUNT_HLL

    • The log2m value for the HyperLogLog structure can be optionally configured in functionParameters , for example: {"log2m": 16}. If not configured, the default value of 8 will be used. Remember that a larger log2m value leads to better accuracy but also a larger memory footprint.

  • DISTINCT_COUNT_RAW_HLL

    • The log2m value for the HyperLogLog structure can be optionally configured in functionParameters , for example: {"log2m": 16}. If not configured, the default value of 8 will be used. Remember that a larger log2m value leads to better accuracy but also a larger memory footprint.

  • DISTINCT_COUNT_HLL_PLUS

    • The p (precision value of normal set) and sp (precision value of sparse set) values for the HyperLogLogPlus structure can be optionally configured in functionParameters, for example: {"p": 16, "sp": 32}. If not configured, p will have the default value of 14 and sp will have the default value of 0.

  • DISTINCT_COUNT_RAW_HLL_PLUS

    • The p (precision value of normal set) and sp (precision value of sparse set) values for the HyperLogLogPlus structure can be optionally configured in functionParameters, for example: {"p": 16, "sp": 32}. If not configured, p will have the default value of 14 and sp will have the default value of 0.

  • DISTINCT_COUNT_THETA_SKETCH

    • The nominalEntries value for the Theta Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with {"nominalEntries": 8192} can be used with DISTINCT_COUNT_THETA_SKETCH having nominalEntries=8192 or less for any power of 2.

  • DISTINCT_COUNT_RAW_THETA_SKETCH

    • The nominalEntries value for the Theta Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with {"nominalEntries": 8192} can be used with DISTINCT_COUNT_RAW_THETA_SKETCH having nominalEntries=8192 or less for any power of 2.

  • DISTINCT_COUNT_TUPLE_SKETCH

    • The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with {"nominalEntries": 8192} can be used with DISTINCT_COUNT_TUPLE_SKETCH having nominalEntries=8192 or less for any power of 2.

  • DISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH

    • The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with {"nominalEntries": 8192} can be used with DISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH having nominalEntries=8192 or less for any power of 2.

  • SUM_VALUES_INTEGER_SUM_TUPLE_SKETCH

    • The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with {"nominalEntries": 8192} can be used with SUM_VALUES_INTEGER_SUM_TUPLE_SKETCH having nominalEntries=8192 or less for any power of 2.

  • AVG_VALUE_INTEGER_SUM_TUPLE_SKETCH

    • The nominalEntries value for the Tuple Sketch can be optionally configured in functionParameters, for example: {"nominalEntries": 4096}. If not configured, the default value of 16384 will be used. Note that the nominalEntries provided at query time should be less than or equal to the value used to construct the star-tree index. For instance, a star-tree index with {"nominalEntries": 8192} can be used with AVG_VALUE_INTEGER_SUM_TUPLE_SKETCH having nominalEntries=8192 or less for any power of 2.

  • DISTINCT_COUNT_CPC_SKETCH

    • The lgK value for the CPC Sketch can be optionally configured in functionParameters, for example: {"lgK": 13}. If not configured, the default value of 12 will be used. Note that the nominalEntries provided at query time should be 2 ^ lgK in order for a star-tree index to be used. For instance, a star-tree index with {"lgK": 13} can be used with DISTINCTCOUNTCPCSKETCH having nominalEntries=8192.

  • DISTINCT_COUNT_RAW_CPC_SKETCH

  • DISTINCT_COUNT_ULL

    • The p value (precision parameter) for the UltraLogLog structure can be optionally configured in functionParameters, for example: {"p": 20}. If not configured, the default value of 12 will be used.

  • DISTINCT_COUNT_RAW_ULL

    • The p value (precision parameter) for the UltraLogLog structure can be optionally configured in functionParameters, for example: {"p": 20}. If not configured, the default value of 12 will be used.

Unsupported functions

  • DISTINCT_COUNT

    • Intermediate result Set is unbounded.

  • SEGMENT_PARTITIONED_DISTINCT_COUNT:

    • Intermediate result Set is unbounded.

  • PERCENTILE

    • Intermediate result List is unbounded.

Functions to be supported

  • ST_UNION

Index generation configuration

Multiple index generation configurations can be provided to generate multiple star-trees. Each configuration should contain the following properties:

| Property | Description |
| -------- | ----------- |
| dimensionsSplitOrder | An ordered list of dimension names used to configure the split order. Only the dimensions in this list are retained in the aggregated documents. Nodes are split based on the order of this list: the split at level i is performed on the values of the dimension at index i in the list. The star-tree dimension does not have to be a dimension column in the table; it can also be a time column, date-time column, or metric column if necessary. The star-tree dimension column should be dictionary encoded in order to generate the star-tree index. All columns in the filter and group-by clauses of a query should be included in this list in order to use the star-tree index. |
| skipStarNodeCreationForDimensions | (Optional, default empty) A list of dimension names for which not to create the star node. |
| functionColumnPairs | A list of aggregation function and column pairs, separated by a double underscore "__". E.g. SUM__Impressions (SUM of column Impressions) or COUNT__*. |
| aggregationConfigs | See the AggregationConfigs section below. |
| maxLeafRecords | (Optional, default 10000) The threshold T used to determine whether to further split each node. |

`functionColumnPairs` and `aggregationConfigs` are interchangeable. Consider using `aggregationConfigs` since it supports additional parameters like compression.

AggregationConfigs

All aggregations of a query should be included in `aggregationConfigs` or in `functionColumnPairs` in order to use the star-tree index.

| Property | Description |
| -------- | ----------- |
| columnName | (Required) Name of the column to aggregate. The column can be either dictionary encoded or raw. |
| aggregationFunction | (Required) Name of the aggregation function to use. |
| compressionCodec | (Optional, default PASS_THROUGH, introduced in release 1.1.0) Configures the compression applied to the star-tree index. Useful when aggregating on columns that contain big values, for example a BYTES column containing serialised HLL counters used to calculate DISTINCTCOUNTHLL. In this case, setting "compressionCodec": "LZ4" can significantly reduce the space used by the index. Equivalent to compressionCodec in the raw value forward index config. |
| deriveNumDocsPerChunk | (Optional, introduced in release 1.2.0) Equivalent to deriveNumDocsPerChunk in the raw value forward index config. |
| indexVersion | (Optional, introduced in release 1.2.0) Equivalent to rawIndexWriterVersion in the raw value forward index config. |
| targetMaxChunkSize | (Optional, introduced in release 1.2.0) Equivalent to targetMaxChunkSize in the raw value forward index config. |
| targetDocsPerChunk | (Optional, introduced in release 1.2.0) Equivalent to targetDocsPerChunk in the raw value forward index config. |
| functionParameters | (Optional) A configuration map used to pass additional configuration to the aggregation function. For example, for DISTINCTCOUNTHLL this could be {"log2m": 16} in order to build the star-tree index using DISTINCTCOUNTHLL with a non-default value for log2m. Note that the index will only be used for queries using the same value for log2m with DISTINCTCOUNTHLL. |

Default index generation configuration

A default star-tree index can be added to a segment by using the boolean config enableDefaultStarTree under the tableIndexConfig.
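A minimal sketch of that snippet inside the table config (the surrounding fields follow the examples later on this page):

"tableIndexConfig": {
  "enableDefaultStarTree": true,
  ...
}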

A default star-tree will have the following configuration:

  • All dictionary-encoded single-value dimensions with cardinality smaller than or equal to a threshold (10000) will be included in the dimensionsSplitOrder, sorted by their cardinality in descending order.

  • All dictionary-encoded Time/DateTime columns will be appended to the dimensionsSplitOrder following the dimensions, sorted by their cardinality in descending order. Here we assume that time columns will be included in most queries as the range filter column and/or the group-by column, so for better performance, we always include them as the last elements in the dimensionsSplitOrder.

  • Include COUNT(*) and SUM for all numeric metrics in the functionColumnPairs.

  • Use default maxLeafRecords (10000).

Example

For our example data set, in order to solve the following query efficiently:

SELECT SUM(Impressions) 
FROM myTable 
WHERE Country = 'USA' 
AND Browser = 'Chrome' 
GROUP BY Locale

We may configure the star-tree index as follows:

"tableIndexConfig": {
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": [
      "Country",
      "Browser",
      "Locale"
    ],
    "skipStarNodeCreationForDimensions": [
    ],
    "functionColumnPairs": [
      "SUM__Impressions"
    ],
    "maxLeafRecords": 1
  }],
  ...
}

Alternatively using aggregationConfigs instead of functionColumnPairs and enabling compression on the aggregation:

"tableIndexConfig": {
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": [
      "Country",
      "Browser",
      "Locale"
    ],
    "skipStarNodeCreationForDimensions": [
    ],
    "aggregationConfigs": [
      {
        "columnName": "Impressions",
        "aggregationFunction": "SUM",
        "compressionCodec": "LZ4"
      }
    ],
    "maxLeafRecords": 1
  }],
  ...
}

Note: In the above example configs, maxLeafRecords is set to 1 so that all of the dimension combinations are pre-aggregated, for clarity in the visual below.

The resulting star-tree and star-tree documents look like the following:

Tree structure

The values in the parentheses are the aggregated sum of Impressions for all the documents under the node.

Star-tree documents

| Country | Browser | Locale | SUM__Impressions |
| ------- | ------- | ------ | ---------------- |
| CA      | Chrome  | en     | 400  |
| CA      | Firefox | fr     | 200  |
| MX      | Safari  | en     | 100  |
| MX      | Safari  | es     | 300  |
| USA     | Chrome  | en     | 600  |
| USA     | Firefox | en     | 400  |
| USA     | Firefox | es     | 200  |
| CA      | *       | en     | 400  |
| CA      | *       | fr     | 200  |
| CA      | *       | *      | 600  |
| MX      | Safari  | *      | 400  |
| USA     | Firefox | *      | 600  |
| USA     | *       | en     | 1000 |
| USA     | *       | es     | 200  |
| USA     | *       | *      | 1200 |
| *       | Chrome  | en     | 1000 |
| *       | Firefox | en     | 400  |
| *       | Firefox | es     | 200  |
| *       | Firefox | fr     | 200  |
| *       | Firefox | *      | 800  |
| *       | Safari  | en     | 100  |
| *       | Safari  | es     | 300  |
| *       | Safari  | *      | 400  |
| *       | *       | en     | 1500 |
| *       | *       | es     | 500  |
| *       | *       | fr     | 200  |
| *       | *       | *      | 2200 |

Query execution

For query execution, the idea is to first check metadata to determine whether the query can be solved with the star-tree documents, then traverse the Star-Tree to identify documents that satisfy all the predicates. After applying any remaining predicates that were missed while traversing the star-tree to the identified documents, apply aggregation/group-by on the qualified documents.

The algorithm to traverse the tree can be described as follows (a worked example follows this list):

  • Start from root node.

  • For each level, what child node(s) to select depends on whether there are any predicates/group-by on the split dimension for the level in the query.

    • If there is no predicate or group-by on the split dimension, select the Star-Node if it exists, or all child nodes to traverse further.

    • If there are predicate(s) on the split dimension, select the child node(s) that satisfy the predicate(s).

    • If there is no predicate, but there is a group-by on the split dimension, select all child nodes except Star-Node.

  • Recursively repeat the previous step until all leaf nodes are reached, or all predicates are satisfied.

  • Collect all the documents pointed by the selected nodes.

    • If all predicates and group-by's are satisfied, pick the single aggregated document from each selected node.

    • Otherwise, collect all the documents in the document range of each selected node.
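To make the algorithm concrete, consider the following query against the example star-tree built above (dimensionsSplitOrder = [Country, Browser, Locale]). Traversal selects the USA child at the Country level, every Browser child except the star node (Browser is a group-by column), and the Locale star node where one exists, so only the pre-aggregated documents for USA/Chrome and USA/Firefox/* need to be read:

SELECT Browser, SUM(Impressions)
FROM myTable
WHERE Country = 'USA'
GROUP BY Browser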

Predicates

Supported Predicates

  • EQ (=)

  • NOT EQ (!=)

  • IN

  • NOT IN

  • RANGE (>, >=, <, <=, BETWEEN)

  • AND

Unsupported Predicates

  • REGEXP_LIKE: It is intentionally left unsupported because it requires scanning the entire dictionary.

  • IS NULL: Currently NULL value info is not stored in star-tree index, and the dimension will be indexed as default value. A workaround is to do col = <default> instead.

  • IS NOT NULL: Same as IS NULL. A workaround is to do col != <default>.

Limited Support Predicates

  • OR

    • It can be applied to predicates on the same dimension, e.g. WHERE d1 < 10 OR d1 > 50

    • It CANNOT be applied to predicates on multiple dimensions, because the star-tree index would double count the pre-aggregated results.

  • NOT (Added since 1.2.0)

    • It can be applied to a simple predicate, or to another NOT predicate.

    • It CANNOT be applied on top of AND/OR, because the star-tree index would double count the pre-aggregated results.

If a transform is applied to a column that is in the dimension split order (which should include all columns used as a predicate or group-by column in the target queries) and the transformed expression is only used in a group-by, the star-tree index is applied automatically. If a transform is applied to a column that is used in a predicate (WHERE clause), the star-tree index won't apply.

For example, if a query selects round(colA, 600) AS roundedValue from tableA and groups by roundedValue, and colA is included in the dimensionsSplitOrder, then Pinot will use the pre-aggregated records to first scan the matching records and then apply the round() transform to derive roundedValue.
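A sketch of the shape of such a query, reusing the hypothetical tableA and colA names from above and assuming the star-tree was built with a matching aggregation (for example COUNT__*):

SELECT round(colA, 600) AS roundedValue, COUNT(*)
FROM tableA
GROUP BY roundedValue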


0.11.0

Summary

Apache Pinot 0.11.0 introduces many new features that extend its query abilities: the multi-stage query engine enables Pinot to do distributed joins, and more SQL syntax (DML support), query functions, and indexes (text index, timestamp index) are supported for new use cases. And as always, there are more integrations with other systems (e.g. Spark 3, Flink).

Note: there is a major upgrade for Apache Helix to 1.0.4, so make sure you upgrade the system in the order of:

Helix Controller -> Pinot Controller -> Pinot Broker -> Pinot server

Multi-Stage Query Engine

The new multi-stage query engine (a.k.a. the V2 query engine) is designed to support more complex SQL semantics such as JOIN, OVER window, and MATCH_RECOGNIZE, and to eventually bring Pinot closer to full ANSI SQL semantics. More to read:

Pause Stream Consumption on Apache Pinot

Pinot operators can pause real-time consumption of events while queries are being executed, and then resume consumption when ready to do so again.

More to read:

Gap-filling function

The gapfilling functions allow users to interpolate data and perform powerful aggregations and data processing over time series data. More to read:

Add support for Spark 3.x ()

A long-awaited feature: segment generation on Spark 3.x.

Add Flink Pinot connector ()

Similar to the Spark Pinot connector, this allows Flink users to write data from a Flink application to Pinot.

Show running queries and cancel query by id ()

This feature allows finer-grained control over Pinot queries.

Timestamp Index ()

This gives users better query performance on the timestamp column when querying at coarser granularities. See:

Native Text Indices ()

Want to search text in real time? The new text indexing engine in Pinot supports the following capabilities:

  1. New operator: LIKE

  2. New operator: CONTAINS

  3. Native text index, built from the ground up, focusing on Pinot's time series use cases and utilizing existing Pinot indices and structures (inverted index, bitmap storage).

  4. Real-time text index

Read more:

Adding DML definition and parse SQL InsertFile ()

Now you can use INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]* to load data into Pinot from a file using Minion. See:
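For example, an ingestion statement following that grammar might look like the sketch below; the table name comes from the examples earlier in this document, while the file URI and option value are placeholders:

INSERT INTO "pullRequestMergedEvents"
FROM FILE 's3://my-bucket/path/to/rawdata/'
OPTION(taskName=myIngestionTask)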

Deduplication ()

This feature supports enabling deduplication for real-time tables, via a top-level table config. At a high level, primaryKey (as defined in the table schema) hashes are stored into in-memory data structures, and each incoming row is validated against them. Duplicate rows are dropped.

The expectation while using this feature is for the stream to be partitioned by the primary key, strictReplicaGroup routing to be enabled, and the configured stream consumer type to be low level. These requirements are therefore mandated via table config API's input validations.
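A minimal sketch of that top-level table config block (this is only the dedup portion of a real-time table config; hashFunction is shown with its documented default of NONE, which stores the raw primary key):

"dedupConfig": {
  "dedupEnabled": true,
  "hashFunction": "NONE"
}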

Functions support and changes:

  • Add support for functions arrayConcatLong, arrayConcatFloat, arrayConcatDouble ()

  • Add support for regexpReplace scalar function ()

  • Add support for Base64 Encode/Decode Scalar Functions ()

  • Optimize like to regexp conversion to do not include unnecessary ^.* and .*$ ()

  • Support DISTINCT on multiple MV columns ()

  • Support DISTINCT on single MV column ()

  • Add histogram aggregation function ()

  • Optimize dateTimeConvert scalar function to only parse the format once ()

  • Support conjugates for scalar functions, add more scalar functions ()

  • add FIRSTWITHTIME aggregate function support ()

  • Add PercentileSmartTDigestAggregationFunction ()

  • Simplify the parameters for DistinctCountSmartHLLAggregationFunction ()

  • add scalar function for cast so it can be calculated at compile time ()

  • Scalable Gapfill Implementation for Avg/Count/Sum ()

  • Add commonly used math, string and date scalar functions in Pinot ()

  • Datetime transform functions ()

  • Scalar function for url encoding and decoding ()

  • Add support for IS NULL and NOT IS NULL in transform functions ()

  • Support st_contains using H3 index ()

The full list of features introduced in this release

  • add query cancel APIs on controller backed by those on brokers ()

  • Add an option to search input files recursively in ingestion job. The default is set to true to be backward compatible. ()

  • Adding endpoint to download local log files for each component ()

  • Add metrics to track controller segment download and upload requests in progress ()

  • add a freshness based consumption status checker ()

  • Force commit consuming segments ()

  • Adding kafka offset support for period and timestamp ()

  • Make upsert metadata manager pluggable ()

  • Adding logger utils and allow change logger level at runtime ()

  • Proper null handling in equality, inequality and membership operators for all SV column data types ()

  • support to show running queries and cancel query by id ()

  • Enhance upsert metadata handling ()

  • Proper null handling in Aggregation functions for SV data types ()

  • Add support for IAM role based credentials in Kinesis Plugin ()

  • Task generator debug API ()

  • Add Segment Lineage List API ()

  • [colocated-join] Adds Support for instancePartitionsMap in Table Config ()

  • Support pause/resume consumption of real-time tables ()

  • Minion tab in Pinot UI ()

  • Add Protocol Buffer Stream Decoder ()

  • Update minion task metadata ZNode path ()

  • add /tasks/{taskType}/{tableNameWithType}/debug API ()

  • Defined a new broker metric for total query processing time ()

  • Proper null handling in SELECT, ORDER BY, DISTINCT, and GROUP BY ()

  • fixing REGEX OPTION parser ()

  • Enable key value byte stitching in PulsarMessageBatch ()

  • Add property to skip adding hadoop jars to package ()

  • Support DISTINCT on multiple MV columns ()

  • Implement Mutable FST Index ()

  • Support DISTINCT on single MV column ()

  • Add controller API for reload segment task status ()

  • Spark Connector, support for TIMESTAMP and BOOLEAN fields ()

  • Allow moveToFinalLocation in METADATA push based on config () ()

  • allow up to 4GB per bitmap index ()

  • Deprecate debug options and always use query options ()

  • Streamed segment download & untar with rate limiter to control disk usage ()

  • Improve the Explain Plan accuracy ()

  • allow to set https as the default scheme ()

  • Add histogram aggregation function ()

  • Allow table name with dots by a PinotConfiguration switch ()

  • Disable Groovy function by default ()

  • Deduplication ()

  • Add pluggable client auth provider ()

  • Adding pinot file system command ()

  • Allow broker to automatically rewrite expensive function to its approximate counterpart ()

  • allow to take data outside the time window by negating the window filter ()

  • Support BigDecimal raw value forward index; Support BigDecimal in many transforms and operators ()

  • Ingestion Aggregation Feature ()

  • Enable uploading segments to real-time tables ()

  • Package kafka 0.9 shaded jar to pinot-distribution ()

  • Simplify the parameters for DistinctCountSmartHLLAggregationFunction ()

  • Add PercentileSmartTDigestAggregationFunction ()

  • Add support for Spark 3.x ()

  • Adding DML definition and parse SQL InsertFile ()

  • endpoints to get and delete minion task metadata ()

  • Add query option to use more replica groups ()

  • Only discover public methods annotated with @ScalarFunction ()

  • Support single-valued BigDecimal in schema, type conversion, SQL statements and minimum set of transforms. ()

  • Add connection based FailureDetector ()

  • Add endpoints for some finer control on minion tasks ()

  • Add adhoc minion task creation endpoint ()

  • Rewrite PinotQuery based on expression hints at instance/segment level ()

  • Allow disabling dict generation for High cardinality columns ()

  • add segment size metric on segment push ()

  • Implement Native Text Operator ()

  • Change default memory allocation for consuming segments from on-heap to off-heap ()

  • New Pinot storage metrics for compressed tar.gz and table size w/o replicas ()

  • add an experimental API for upsert heap memory estimation ()

  • Timestamp type index ()

  • Upgrade Helix to 1.0.4 in Pinot ()

  • Allow overriding expression in query through query config ()

  • Always handle null time values ()

  • Add prefixesToRename config for renaming fields upon ingestion ()

  • Added multi column partitioning for offline table ()

  • Automatically update broker resource on broker changes ()

Vulnerability fixes

Pinot has resolved all of the high-severity vulnerability issues:

  • Add a new workflow to check vulnerabilities using trivy ()

  • Disable Groovy function by default ()

  • Upgrade netty due to security vulnerability ()

  • Upgrade protobuf as the current version has security vulnerability ()

  • Upgrade to hadoop 2.10.1 due to cves ()

  • Upgrade Helix to 1.0.4 ()

  • Upgrade thrift to 0.15.0 ()

  • Upgrade jetty due to security issue ()

  • Upgrade netty ()

  • Upgrade snappy version ()

Bug fixes

  • Nested arrays and map not handled correctly for complex types ()

  • Fix empty data block not returning schema ()

  • Allow mvn build with development webpack; fix instances default value ()

  • Fix the race condition of reflection scanning classes ()

  • Fix ingress manifest for controller and broker ()

  • Fix jvm processors count ()

  • Fix grpc query server not setting max inbound msg size ()

  • Fix upsert replace ()

  • Fix the race condition for partial upsert record read ()

  • Fix log msg, as it missed one param value ()

  • Fix authentication issue when auth annotation is not required ()

  • Fix segment pruning that can break server subquery ()

  • Fix the NPE for ADLSGen2PinotFS ()

  • Fix cross merge ()

  • Fix LaunchDataIngestionJobCommand auth header ()

  • Fix catalog skipping ()

  • Fix adding util for getting URL from InstanceConfig ()

  • Fix string length in MutableColumnStatistics ()

  • Fix instance details page loading table for tenant ()

  • Fix thread safety issue with java client ()

  • Fix allSegmentLoaded check ()

  • Fix bug in segmentDetails table name parsing; style the new indexes table ()

  • Fix pulsar close bug ()

  • Fix REGEX OPTION parser ()

  • Avoid reporting negative values for server latency. ()

  • Fix getConfigOverrides in MinionQuickstart ()

  • Fix segment generation error handling ()

  • Fix multi stage engine serde ()

  • Fix server discovery ()

  • Fix Upsert config validation to check for metrics aggregation ()

  • Fix multi value column index creation ()

  • Fix grpc port assignment in multiple server quickstart ()

  • Spark Connector GRPC reader fix for reading real-time tables ()

  • Fix auth provider for minion ()

  • Fix metadata push mode in IngestionUtils ()

  • Misc fixes on segment validation for uploaded real-time segments ()

  • Fix a typo in ServerInstance.startQueryServer() ()

  • Fix the issue of server opening up query server prematurely ()

  • Fix regression where case order was reversed, add regression test ()

  • Fix dimension table load when server restart or reload table ()

  • Fix when there're two index filter operator h3 inclusion index throw exception ()

  • Fix the race condition of reading time boundary info ()

  • Fix pruning in expressions by max/min/bloom ()

  • Fix GcsPinotFs listFiles by using bucket directly ()

  • Fix column data type store for data table ()

  • Fix the potential NPE for timestamp index rewrite ()

  • Fix on timeout string format in KinesisDataProducer ()

  • Fix bug in segment rebalance with replica group segment assignment ()

  • Fix the upsert metadata bug when adding segment with same comparison value ()

  • Fix the deadlock in ClusterChangeMediator ()

  • Fix BigDecimal ser/de on negative scale ()

  • Fix table creation bug for invalid real-time consumer props ()

  • Fix the bug of missing dot to extract sub props from ingestion job filesystem spec and minion segmentNameGeneratorSpec ()

  • Fix to query inconsistencies under heavy upsert load (resolves ) ()

  • Fix ChildTraceId when using multiple child threads, make them unique ()

  • Fix the group-by reduce handling when query times out ()

  • Fix a typo in BaseBrokerRequestHandler ()

  • Fix TIMESTAMP data type usage during segment creation ()

  • Fix async-profiler install ()

  • Fix ingestion transform config bugs. ()

  • Fix upsert inconsistency by snapshotting the validDocIds before reading the numDocs ()

  • Fix bug when importing files with the same name in different directories ()

  • Fix the missing NOT handling ()

  • Fix setting of metrics compression type in RealtimeSegmentConverter ()

  • Fix segment status checker to skip push in-progress segments ()

  • Fix datetime truncate for multi-day ()

  • Fix redirections for routes with access-token ()

  • Fix CSV files surrounding space issue ()

  • Fix suppressed exceptions in GrpcBrokerRequestHandler()

Connect to Dash

In this Apache Pinot guide, we'll learn how to visualize data using the Dash web framework.

In this guide you'll learn how to visualize data from Apache Pinot using Plotly's web framework. Dash is the most downloaded, trusted Python framework for building ML & data science web apps.

We're going to use Dash to build a real-time dashboard to visualize the changes being made to Wikimedia properties.

Real-Time Dashboard Architecture

Startup components

We're going to use the following Docker compose file, which spins up instances of Zookeeper, Kafka, along with a Pinot controller, broker, and server:

docker-compose.yml

Run the following command to launch all the components:

Wikimedia recent changes stream

Wikimedia provides a continuous stream of structured event data describing changes made to various Wikimedia properties. The events are published over HTTP using the Server-Sent Events (SSE) protocol.

You can find the endpoint at stream.wikimedia.org/v2/stream/recentchange.

We'll need to install the SSE client library to consume this data:

Next, create a file called wiki.py that contains the following:

wiki.py

The highlighted section shows how we connect to the recent changes feed using the SSE client library.

Let's run this script as shown below:

We'll see the following (truncated) output:

Output

Ingest recent changes into Kafka

Now we're going to import each of the events into Apache Kafka. First let's create a Kafka topic called wiki_events with 5 partitions:

Create a new file called wiki_to_kafka.py and import the following libraries:

wiki_to_kafka.py

Add these functions:

wiki_to_kafka.py

And now let's add the code that calls the recent changes API and imports events into the wiki_events topic:

wiki_to_kafka.py

The highlighted parts of this script indicate where events are ingested into Kafka and where the producer is periodically flushed.

If we run this script:

We'll see a message every time 100 messages are pushed to Kafka, as shown below:

Output

Explore Kafka

Let's check that the data has made its way into Kafka.

The following command returns the message offset for each partition in the wiki_events topic:

Output

Looks good. We can also stream all the messages in this topic by running the following command:

Output

Configure Pinot

Now let's configure Pinot to consume the data from Kafka.

We'll have the following schema:

schema.json

And the following table config:

table.json

The highlighted lines show how we connect Pinot to the Kafka topic that contains the events. Create the schema and table by running the following command:

Once you've done that, navigate to the Pinot UI (localhost:9000) and run the following query to check that the data has made its way into Pinot:

As long as you see some records, everything is working as expected.

Building a Dash Dashboard

Now let's write some more queries against Pinot and display the results in Dash.

First, install the following libraries:

Create a file called dashboard.py and import libraries and write a header for the page:

app.py

Connect to Pinot and write a query that returns recent changes, along with the users who made the changes, and domains where they were made:

app.py

The highlighted part of the query shows how to count the number of events from the last minute and the minute before that. We then do a similar thing to count the number of unique users and domains.

Metrics

Now let's create some metrics based on that data.

First, let's create a couple of helper functions for creating these metrics:

dash_utils.py

And now let's add the following import to app.py:

app.py

And the following code at the end of the file:

app.py

Go back to the terminal and run the following command:

Navigate to the Dash app in your web browser (Dash serves on localhost:8050 by default). You should see something like the following:

Dash Metrics

Changes per minute

Next, let's add a line chart that shows the number of changes being done to Wikimedia per minute. Update app.py as follows:

app.py

Go back to the web browser and you should see something like this:

Dash Time Series

Auto Refresh

At the moment we need to refresh our web browser to update the metrics and line chart, but it would be much better if that happened automatically. Let's now add auto refresh functionality.

This will require some restructuring of our application so that each component is rendered from a function annotated with a callback that causes the function to be called on an interval.

The app layout now looks like this:

app.py

  • interval-component is configured to fire a callback every 1,000 milliseconds.

  • latest-timestamp is a container that will contain the latest timestamp.

  • indicators will contain indicators with the latest counts of users, domains, and changes.

  • time-series will contain the time series line chart.

The timestamp is refreshed by the following callback function:

app.py

The indicators are refreshed by this function:

app.py

And finally, the following function refreshes the line chart:

app.py

If we navigate back to our web browser, we'll see the following:

Dash Auto Refresh

The full script used in this example is shown below:

dashboard.py

Summary

In this guide we've learnt how to publish data into Kafka from Wikimedia's event stream, ingest it from there into Pinot, and finally make sense of the data using SQL queries run from Dash.

version: '3.7'
services:
  zookeeper:
    image: zookeeper:3.5.6
    container_name: "zookeeper-wiki"
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: wurstmeister/kafka:latest
    restart: unless-stopped
    container_name: "kafka-wiki"
    ports:
      - "9092:9092"
    expose:
      - "9093"
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper-wiki:2181/kafka
      KAFKA_BROKER_ID: 0
      KAFKA_ADVERTISED_HOST_NAME: kafka-wiki
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-wiki:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,OUTSIDE:PLAINTEXT
  pinot-controller:
    image: apachepinot/pinot:0.10.0
    command: "StartController -zkAddress zookeeper-wiki:2181 -dataDir /data"
    container_name: "pinot-controller-wiki"
    volumes:
      - ./config:/config
      - ./data:/data
    restart: unless-stopped
    ports:
      - "9000:9000"
    depends_on:
      - zookeeper
  pinot-broker:
    image: apachepinot/pinot:0.10.0
    command: "StartBroker -zkAddress zookeeper-wiki:2181"
    restart: unless-stopped
    container_name: "pinot-broker-wiki"
    volumes:
      - ./config:/config
    ports:
      - "8099:8099"
    depends_on:
      - pinot-controller
  pinot-server:
    image: apachepinot/pinot:0.10.0
    command: "StartServer -zkAddress zookeeper-wiki:2181"
    restart: unless-stopped
    container_name: "pinot-server-wiki"
    volumes:
      - ./config:/config
    depends_on:
      - pinot-broker
docker-compose up
pip install sseclient-py
import json
import pprint
import sseclient
import requests

def with_requests(url, headers):
    """Get a streaming response for the given event feed using requests."""    
    return requests.get(url, stream=True, headers=headers)

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
headers = {'Accept': 'text/event-stream'}
response = with_requests(url, headers)
client = sseclient.SSEClient(response)

for event in client.events():
    stream = json.loads(event.data)
    pprint.pprint(stream)
python wiki.py
{'$schema': '/mediawiki/recentchange/1.0.0',
 'bot': False,
 'comment': '[[:File:Storemyr-Fagerbakken landskapsvernområde HVASSER '
            'Oslofjorden Norway (Protected coastal forest Recreational area '
            'hiking trails) Rituell-kultisk steinstreng sørøst i skogen (small '
            'archeological stone string) Vår (spring) 2021-04-24.jpg]] removed '
            'from category',
 'id': 1923506287,
 'meta': {'domain': 'commons.wikimedia.org',
          'dt': '2022-05-12T09:57:00Z',
          'id': '3800228e-43d8-440d-8034-c68977742653',
          'offset': 3855767440,
          'partition': 0,
          'request_id': '930b17cc-f14a-4656-afa1-d15b79a8f666',
          'stream': 'mediawiki.recentchange',
          'topic': 'eqiad.mediawiki.recentchange',
          'uri': 'https://commons.wikimedia.org/wiki/Category:Iron_Age_in_Norway'},
 'namespace': 14,
 'parsedcomment': '<a '
                  'href="/wiki/File:Storemyr-Fagerbakken_landskapsvernomr%C3%A5de_HVASSER_Oslofjorden_Norway_(Protected_coastal_forest_Recreational_area_hiking_trails)_Rituell-kultisk_steinstreng_s%C3%B8r%C3%B8st_i_skogen_(small_archeological_stone_string)_V%C3%A5r_(spring)_2021-04-24.jpg" '
                  'title="File:Storemyr-Fagerbakken landskapsvernområde '
                  'HVASSER Oslofjorden Norway (Protected coastal forest '
                  'Recreational area hiking trails) Rituell-kultisk '
                  'steinstreng sørøst i skogen (small archeological stone '
                  'string) Vår (spring) '
                  '2021-04-24.jpg">File:Storemyr-Fagerbakken '
                  'landskapsvernområde HVASSER Oslofjorden Norway (Protected '
                  'coastal forest Recreational area hiking trails) '
                  'Rituell-kultisk steinstreng sørøst i skogen (small '
                  'archeological stone string) Vår (spring) 2021-04-24.jpg</a> '
                  'removed from category',
 'server_name': 'commons.wikimedia.org',
 'server_script_path': '/w',
 'server_url': 'https://commons.wikimedia.org',
 'timestamp': 1652349420,
 'title': 'Category:Iron Age in Norway',
 'type': 'categorize',
 'user': 'Krg',
 'wiki': 'commonswiki'}
{'$schema': '/mediawiki/recentchange/1.0.0',
 'bot': False,
 'comment': '[[:File:Storemyr-Fagerbakken landskapsvernområde HVASSER '
            'Oslofjorden Norway (Protected coastal forest Recreational area '
            'hiking trails) Rituell-kultisk steinstreng sørøst i skogen (small '
            'archeological stone string) Vår (spring) 2021-04-24.jpg]] removed '
            'from category',
 'id': 1923506289,
 'meta': {'domain': 'commons.wikimedia.org',
          'dt': '2022-05-12T09:57:00Z',
          'id': '2b819d20-beca-46a5-8ce3-b2f3b73d2cbe',
          'offset': 3855767441,
          'partition': 0,
          'request_id': '930b17cc-f14a-4656-afa1-d15b79a8f666',
          'stream': 'mediawiki.recentchange',
          'topic': 'eqiad.mediawiki.recentchange',
          'uri': 'https://commons.wikimedia.org/wiki/Category:Cultural_heritage_monuments_in_F%C3%A6rder'},
 'namespace': 14,
 'parsedcomment': '<a '
                  'href="/wiki/File:Storemyr-Fagerbakken_landskapsvernomr%C3%A5de_HVASSER_Oslofjorden_Norway_(Protected_coastal_forest_Recreational_area_hiking_trails)_Rituell-kultisk_steinstreng_s%C3%B8r%C3%B8st_i_skogen_(small_archeological_stone_string)_V%C3%A5r_(spring)_2021-04-24.jpg" '
                  'title="File:Storemyr-Fagerbakken landskapsvernområde '
                  'HVASSER Oslofjorden Norway (Protected coastal forest '
                  'Recreational area hiking trails) Rituell-kultisk '
                  'steinstreng sørøst i skogen (small archeological stone '
                  'string) Vår (spring) '
                  '2021-04-24.jpg">File:Storemyr-Fagerbakken '
                  'landskapsvernområde HVASSER Oslofjorden Norway (Protected '
                  'coastal forest Recreational area hiking trails) '
                  'Rituell-kultisk steinstreng sørøst i skogen (small '
                  'archeological stone string) Vår (spring) 2021-04-24.jpg</a> '
                  'removed from category',
 'server_name': 'commons.wikimedia.org',
 'server_script_path': '/w',
 'server_url': 'https://commons.wikimedia.org',
 'timestamp': 1652349420,
 'title': 'Category:Cultural heritage monuments in Færder',
 'type': 'categorize',
 'user': 'Krg',
 'wiki': 'commonswiki'}
docker exec -it kafka-wiki kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create \
  --topic wiki_events \
  --partitions 5
import json
import sseclient
import datetime
import requests
import time
from confluent_kafka import Producer
def with_requests(url, headers):
    """Get a streaming response for the given event feed using requests."""    
    return requests.get(url, stream=True, headers=headers)

def acked(err, msg):
    if err is not None:
        print("Failed to deliver message: {0}: {1}"
              .format(msg.value(), err.str()))

def json_serializer(obj):
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    raise "Type %s not serializable" % type(obj)
producer = Producer({'bootstrap.servers': 'localhost:9092'})

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
headers = {'Accept': 'text/event-stream'}
response = with_requests(url, headers) 
client = sseclient.SSEClient(response)

events_processed = 0
while True:
    try: 
        for event in client.events():
            stream = json.loads(event.data)
            payload = json.dumps(stream, default=json_serializer, ensure_ascii=False).encode('utf-8')
            producer.produce(topic='wiki_events', 
              key=str(stream['meta']['id']), value=payload, callback=acked)

            events_processed += 1
            if events_processed % 100 == 0:
                print(f"{str(datetime.datetime.now())} Flushing after {events_processed} events")
                producer.flush()
    except Exception as ex:
        print(f"{str(datetime.datetime.now())} Got error:" + str(ex))
        response = with_requests(url, headers) 
        client = sseclient.SSEClient(response)
        time.sleep(2)
python wiki_to_kafka.py
2022-05-12 10:58:34.449326 Flushing after 100 events
2022-05-12 10:58:39.151599 Flushing after 200 events
2022-05-12 10:58:43.399528 Flushing after 300 events
2022-05-12 10:58:47.350277 Flushing after 400 events
2022-05-12 10:58:50.847959 Flushing after 500 events
2022-05-12 10:58:54.768228 Flushing after 600 events
docker exec -it kafka-wiki kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic wiki_events
wiki_events:0:42
wiki_events:1:61
wiki_events:2:52
wiki_events:3:56
wiki_events:4:58
docker exec -it kafka-wiki kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic wiki_events \
  --from-beginning
...
{"$schema": "/mediawiki/recentchange/1.0.0", "meta": {"uri": "https://en.wikipedia.org/wiki/Super_Wings", "request_id": "6f82e64d-220f-41f4-88c3-2e15f03ae504", "id": "c30cd735-1ead-405e-94d1-49fbe7c40411", "dt": "2022-05-12T10:05:36Z", "domain": "en.wikipedia.org", "stream": "mediawiki.recentchange", "topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 3855779703}, "type": "log", "namespace": 0, "title": "Super Wings", "comment": "", "timestamp": 1652349936, "user": "2001:448A:50E0:885B:FD1D:2D04:233E:7647", "bot": false, "log_id": 0, "log_type": "abusefilter", "log_action": "hit", "log_params": {"action": "edit", "filter": "550", "actions": "tag", "log": 32575794}, "log_action_comment": "2001:448A:50E0:885B:FD1D:2D04:233E:7647 triggered [[Special:AbuseFilter/550|filter 550]], performing the action \"edit\" on [[Super Wings]]. Actions taken: Tag ([[Special:AbuseLog/32575794|details]])", "server_url": "https://en.wikipedia.org", "server_name": "en.wikipedia.org", "server_script_path": "/w", "wiki": "enwiki", "parsedcomment": ""}
{"$schema": "/mediawiki/recentchange/1.0.0", "meta": {"uri": "https://no.wikipedia.org/wiki/Brukerdiskusjon:Haros", "request_id": "a20c9692-f301-4faf-9373-669bebbffff4", "id": "566ee63e-8e86-4a7e-a1f3-562704306509", "dt": "2022-05-12T10:05:36Z", "domain": "no.wikipedia.org", "stream": "mediawiki.recentchange", "topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 3855779714}, "id": 84572581, "type": "edit", "namespace": 3, "title": "Brukerdiskusjon:Haros", "comment": "/* Stor forbokstav / ucfirst */", "timestamp": 1652349936, "user": "Asav", "bot": false, "minor": false, "patrolled": true, "length": {"old": 110378, "new": 110380}, "revision": {"old": 22579494, "new": 22579495}, "server_url": "https://no.wikipedia.org", "server_name": "no.wikipedia.org", "server_script_path": "/w", "wiki": "nowiki", "parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\"><a href=\"/wiki/Brukerdiskusjon:Haros#Stor_forbokstav_/_ucfirst\" title=\"Brukerdiskusjon:Haros\">→‎Stor forbokstav / ucfirst</a></span></span>"}
{"$schema": "/mediawiki/recentchange/1.0.0", "meta": {"uri": "https://es.wikipedia.org/wiki/Campo_de_la_calle_Industria", "request_id": "d45bd9af-3e2c-4aac-ae8f-e16d3340da76", "id": "7fb3956e-9bd2-4fa5-8659-72b266cdb45b", "dt": "2022-05-12T10:05:35Z", "domain": "es.wikipedia.org", "stream": "mediawiki.recentchange", "topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 3855779718}, "id": 266270269, "type": "edit", "namespace": 0, "title": "Campo de la calle Industria", "comment": "/* Historia */", "timestamp": 1652349935, "user": "Raimon will", "bot": false, "minor": false, "length": {"old": 7566, "new": 7566}, "revision": {"old": 143485393, "new": 143485422}, "server_url": "https://es.wikipedia.org", "server_name": "es.wikipedia.org", "server_script_path": "/w", "wiki": "eswiki", "parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\"><a href=\"/wiki/Campo_de_la_calle_Industria#Historia\" title=\"Campo de la calle Industria\">→‎Historia</a></span></span>"}
^CProcessed a total of 269 messages
{
    "schemaName": "wikipedia",
    "dimensionFieldSpecs": [
      {
        "name": "id",
        "dataType": "STRING"
      },
      {
        "name": "wiki",
        "dataType": "STRING"
      },
      {
        "name": "user",
        "dataType": "STRING"
      },
      {
        "name": "title",
        "dataType": "STRING"
      },
      {
        "name": "comment",
        "dataType": "STRING"
      },
      {
        "name": "stream",
        "dataType": "STRING"
      },
      {
        "name": "domain",
        "dataType": "STRING"
      },
      {
        "name": "topic",
        "dataType": "STRING"
      },
      {
        "name": "type",
        "dataType": "STRING"
      },
      {
        "name": "uri",
        "dataType": "STRING"
      },
      {
        "name": "bot",
        "dataType": "BOOLEAN"
      },
      {
        "name": "metaJson",
        "dataType": "STRING"
      }
    ],
    "dateTimeFieldSpecs": [
      {
        "name": "ts",
        "dataType": "TIMESTAMP",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }
    ]
  }
{
    "tableName": "wikievents",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "timeColumnName": "ts",
      "schemaName": "wikipedia",
      "replication": "1",
      "replicasPerPartition": "1"
    },

    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "rangeIndexColumns": [],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "wiki_events",
        "stream.kafka.broker.list": "kafka-wiki:9093",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        "realtime.segment.flush.threshold.rows": "1000",
        "realtime.segment.flush.threshold.time": "24h",
        "realtime.segment.flush.segment.size": "100M"
      },
      "noDictionaryColumns": [],
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant",
      "tagOverrideConfig": {}
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "metaJson",
          "transformFunction": "JSONFORMAT(meta)"
        },
        {
          "columnName": "id",
          "transformFunction": "JSONPATH(metaJson, '$.id')"
        },
        {
          "columnName": "stream",
          "transformFunction": "JSONPATH(metaJson, '$.stream')"
        },
        {
          "columnName": "domain",
          "transformFunction": "JSONPATH(metaJson, '$.domain')"
        },
        {
          "columnName": "topic",
          "transformFunction": "JSONPATH(metaJson, '$.topic')"
        },
        {
          "columnName": "uri",
          "transformFunction": "JSONPATH(metaJson, '$.uri')"
        },
        {
          "columnName": "ts",
          "transformFunction": "\"timestamp\" * 1000"
        }
      ]
    },
    "isDimTable": false
  }
docker exec -it pinot-controller-wiki bin/pinot-admin.sh AddTable \
  -tableConfigFile /config/table.json \
  -schemaFile /config/schema.json \
  -exec
select domain, count(*) 
from wikievents 
group by domain
order by count(*) DESC
limit 10
pip install dash pinotdb plotly pandas
import pandas as pd
from dash import Dash, html, dcc
import plotly.graph_objects as go
from pinotdb import connect
import plotly.express as px

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(__name__, external_stylesheets=external_stylesheets)
app.title = "Wiki Recent Changes Dashboard"
conn = connect(host='localhost', port=8099, path='/query/sql', scheme='http')

query = """select
  count(*) FILTER(WHERE  ts > ago('PT1M')) AS events1Min,
  count(*) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS events1Min2Min,  
  distinctcount(user) FILTER(WHERE  ts > ago('PT1M')) AS users1Min,
  distinctcount(user) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS users1Min2Min,
  distinctcount(domain) FILTER(WHERE  ts > ago('PT1M')) AS domains1Min,
  distinctcount(domain) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS domains1Min2Min
from wikievents 
where ts > ago('PT2M')
limit 1
"""

curs = conn.cursor()

curs.execute(query)
df_summary = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
from dash import html, dash_table
import plotly.graph_objects as go

def add_delta_trace(fig, title, value, last_value, row, column):
    fig.add_trace(go.Indicator(
        mode = "number+delta",
        title= {'text': title},
        value = value,
        delta = {'reference': last_value, 'relative': True},
        domain = {'row': row, 'column': column})
    )

def add_trace(fig, title, value, row, column):
    fig.add_trace(go.Indicator(
        mode = "number",
        title= {'text': title},
        value = value,
        domain = {'row': row, 'column': column})
    )
from dash_utils import add_delta_trace, add_trace
fig = go.Figure(layout=go.Layout(height=300))
if df_summary["events1Min"][0] > 0:
    # Show deltas only when the previous minute has data to compare against
    if df_summary["events1Min2Min"][0] > 0:
        add_delta_trace(fig, "Changes", df_summary["events1Min"][0], df_summary["events1Min2Min"][0], 0, 0)
        add_delta_trace(fig, "Users", df_summary["users1Min"][0], df_summary["users1Min2Min"][0], 0, 1)
        add_delta_trace(fig, "Domains", df_summary["domains1Min"][0], df_summary["domains1Min2Min"][0], 0, 2)
    else:
        # No previous-minute data, so show the current values without a delta
        add_trace(fig, "Changes", df_summary["events1Min"][0], 0, 0)
        add_trace(fig, "Users", df_summary["users1Min"][0], 0, 1)
        add_trace(fig, "Domains", df_summary["domains1Min"][0], 0, 2)
    fig.update_layout(grid = {"rows": 1, "columns": 3,  'pattern': "independent"},) 
else:
    fig.update_layout(annotations = [{"text": "No events found", "xref": "paper", "yref": "paper", "showarrow": False, "font": {"size": 28}}])

app.layout = html.Div([
    html.H1("Wiki Recent Changes Dashboard", style={'text-align': 'center'}),
    html.Div(id='content', children=[
        dcc.Graph(figure=fig)
    ])
])

if __name__ == '__main__':
    app.run_server(debug=True)    
python dashboard.py
query = """
select ToDateTime(DATETRUNC('minute', ts), 'yyyy-MM-dd HH:mm:ss') AS dateMin, count(*) AS changes, 
    distinctcount(user) AS users,
    distinctcount(domain) AS domains
from wikievents 
where ts > ago('PT2M')
group by dateMin
order by dateMin desc
LIMIT 30
"""

curs.execute(query)
df_ts = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
df_ts_melt = pd.melt(df_ts, id_vars=['dateMin'], value_vars=['changes', 'users', 'domains'])

line_chart = px.line(df_ts_melt, x='dateMin', y="value", color='variable', color_discrete_sequence =['blue', 'red', 'green'])
line_chart['layout'].update(margin=dict(l=0,r=0,b=0,t=40), title="Changes/Users/Domains per minute")
line_chart.update_yaxes(range=[0, df_ts["changes"].max() * 1.1])

app.layout = html.Div([
    html.H1("Wiki Recent Changes Dashboard", style={'text-align': 'center'}),
    html.Div(id='content', children=[
        dcc.Graph(figure=fig),
        dcc.Graph(figure=line_chart),
    ])
])
app.layout = html.Div([
    html.H1("Wiki Recent Changes Dashboard", style={'text-align': 'center'}),
    html.Div(id='latest-timestamp', style={"padding": "5px 0", "text-align": "center"}),
    dcc.Interval(
            id='interval-component',
            interval=1 * 1000,
            n_intervals=0
        ),
    html.Div(id='content', children=[
        dcc.Graph(id="indicators"),
        dcc.Graph(id="time-series"),
    ])
])
@app.callback(
    Output(component_id='latest-timestamp', component_property='children'),
    Input('interval-component', 'n_intervals'))
def timestamp(n):
    return html.Span(f"Last updated: {datetime.datetime.now()}")
@app.callback(Output(component_id='indicators', component_property='figure'),
              Input('interval-component', 'n_intervals'))
def indicators(n):
    query = """
    select count(*) FILTER(WHERE  ts > ago('PT1M')) AS events1Min,
        count(*) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS events1Min2Min,
        distinctcount(user) FILTER(WHERE  ts > ago('PT1M')) AS users1Min,
        distinctcount(user) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS users1Min2Min,
        distinctcount(domain) FILTER(WHERE  ts > ago('PT1M')) AS domains1Min,
        distinctcount(domain) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS domains1Min2Min
    from wikievents 
    where ts > ago('PT2M')
    limit 1
    """

    curs = connection.cursor()
    curs.execute(query)
    df_summary = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
    curs.close()

    fig = go.Figure(layout=go.Layout(height=300))
    if df_summary["events1Min"][0] > 0:
        # Show deltas only when the previous minute has data to compare against
        if df_summary["events1Min2Min"][0] > 0:
            add_delta_trace(fig, "Changes", df_summary["events1Min"][0], df_summary["events1Min2Min"][0], 0, 0)
            add_delta_trace(fig, "Users", df_summary["users1Min"][0], df_summary["users1Min2Min"][0], 0, 1)
            add_delta_trace(fig, "Domains", df_summary["domains1Min"][0], df_summary["domains1Min2Min"][0], 0, 2)
        else:
            # No previous-minute data, so show the current values without a delta
            add_trace(fig, "Changes", df_summary["events1Min"][0], 0, 0)
            add_trace(fig, "Users", df_summary["users1Min"][0], 0, 1)
            add_trace(fig, "Domains", df_summary["domains1Min"][0], 0, 2)
        fig.update_layout(grid = {"rows": 1, "columns": 3,  'pattern': "independent"},) 
    else:
        fig.update_layout(annotations = [{"text": "No events found", "xref": "paper", "yref": "paper", "showarrow": False, "font": {"size": 28}}])
    return fig
@app.callback(Output(component_id='time-series', component_property='figure'),
    Input('interval-component', 'n_intervals'))
def time_series(n):
    query = """
    select ToDateTime(DATETRUNC('minute', ts), 'yyyy-MM-dd HH:mm:ss') AS dateMin, count(*) AS changes, 
        distinctcount(user) AS users,
        distinctcount(domain) AS domains
    from wikievents 
    where ts > ago('PT1H')
    group by dateMin
    order by dateMin desc
    LIMIT 30
    """

    curs = connection.cursor()
    curs.execute(query)
    df_ts = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
    curs.close()

    df_ts_melt = pd.melt(df_ts, id_vars=['dateMin'], value_vars=['changes', 'users', 'domains'])

    line_chart = px.line(df_ts_melt, x='dateMin', y="value", color='variable', color_discrete_sequence =['blue', 'red', 'green'])
    line_chart['layout'].update(margin=dict(l=0,r=0,b=0,t=40), title="Changes/Users/Domains per minute")
    line_chart.update_yaxes(range=[0, df_ts["changes"].max() * 1.1])
    return line_chart
import pandas as pd
from dash import Dash, html, dash_table, dcc, Input, Output
import plotly.graph_objects as go
from pinotdb import connect
from dash_utils import add_delta_trace, add_trace
import plotly.express as px
import datetime

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(__name__, external_stylesheets=external_stylesheets)
app.title = "Wiki Recent Changes Dashboard"

connection = connect(host="localhost", port="8099", path="/query/sql", scheme=( "http"))


@app.callback(Output(component_id='indicators', component_property='figure'),
              Input('interval-component', 'n_intervals'))
def indicators(n):
    query = """
    select count(*) FILTER(WHERE  ts > ago('PT1M')) AS events1Min,
        count(*) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS events1Min2Min,
        distinctcount(user) FILTER(WHERE  ts > ago('PT1M')) AS users1Min,
        distinctcount(user) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS users1Min2Min,
        distinctcount(domain) FILTER(WHERE  ts > ago('PT1M')) AS domains1Min,
        distinctcount(domain) FILTER(WHERE  ts <= ago('PT1M') AND ts > ago('PT2M')) AS domains1Min2Min
    from wikievents 
    where ts > ago('PT2M')
    limit 1
    """

    curs = connection.cursor()
    curs.execute(query)
    df_summary = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
    curs.close()

    fig = go.Figure(layout=go.Layout(height=300))
    if df_summary["events1Min"][0] > 0:
        # Show deltas only when the previous minute has data to compare against
        if df_summary["events1Min2Min"][0] > 0:
            add_delta_trace(fig, "Changes", df_summary["events1Min"][0], df_summary["events1Min2Min"][0], 0, 0)
            add_delta_trace(fig, "Users", df_summary["users1Min"][0], df_summary["users1Min2Min"][0], 0, 1)
            add_delta_trace(fig, "Domains", df_summary["domains1Min"][0], df_summary["domains1Min2Min"][0], 0, 2)
        else:
            # No previous-minute data, so show the current values without a delta
            add_trace(fig, "Changes", df_summary["events1Min"][0], 0, 0)
            add_trace(fig, "Users", df_summary["users1Min"][0], 0, 1)
            add_trace(fig, "Domains", df_summary["domains1Min"][0], 0, 2)
        fig.update_layout(grid = {"rows": 1, "columns": 3,  'pattern': "independent"},) 
    else:
        fig.update_layout(annotations = [{"text": "No events found", "xref": "paper", "yref": "paper", "showarrow": False, "font": {"size": 28}}])
    return fig

@app.callback(Output(component_id='time-series', component_property='figure'),
    Input('interval-component', 'n_intervals'))
def time_series(n):
    query = """
    select ToDateTime(DATETRUNC('minute', ts), 'yyyy-MM-dd HH:mm:ss') AS dateMin, count(*) AS changes, 
        distinctcount(user) AS users,
        distinctcount(domain) AS domains
    from wikievents 
    where ts > ago('PT1H')
    group by dateMin
    order by dateMin desc
    LIMIT 30
    """

    curs = connection.cursor()
    curs.execute(query)
    df_ts = pd.DataFrame(curs, columns=[item[0] for item in curs.description])
    curs.close()

    df_ts_melt = pd.melt(df_ts, id_vars=['dateMin'], value_vars=['changes', 'users', 'domains'])

    line_chart = px.line(df_ts_melt, x='dateMin', y="value", color='variable', color_discrete_sequence =['blue', 'red', 'green'])
    line_chart['layout'].update(margin=dict(l=0,r=0,b=0,t=40), title="Changes/Users/Domains per minute")
    line_chart.update_yaxes(range=[0, df_ts["changes"].max() * 1.1])
    return line_chart

@app.callback(
    Output(component_id='latest-timestamp', component_property='children'),
    Input('interval-component', 'n_intervals'))
def timestamp(n):
    return html.Span(f"Last updated: {datetime.datetime.now()}")

app.layout = html.Div([
    html.H1("Wiki Recent Changes Dashboard", style={'text-align': 'center'}),
    html.Div(id='latest-timestamp', style={"padding": "5px 0", "text-align": "center"}),
    dcc.Interval(
            id='interval-component',
            interval=1 * 1000,
            n_intervals=0
        ),
    html.Div(id='content', children=[
        dcc.Graph(id="indicators"),
        dcc.Graph(id="time-series"),
    ])
])

if __name__ == '__main__':
    app.run_server(debug=True)

0.10.0

Summary

This release introduces some great new features, performance enhancements, UI improvements, and bug fixes, which are described in detail in the following sections. The release was cut from commit fd9c58a.

Dependency Graph

The dependency graph for the plug-and-play architecture introduced in release 0.3.0 has been extended and now contains new nodes for the Pinot Segment SPI.

Dependency graph after introducing pinot-segment-api.

SQL Improvements

  • Implement NOT Operator (#8148)

  • Add DistinctCountSmartHLLAggregationFunction which automatically stores distinct values in Set or HyperLogLog based on cardinality (#8189)

  • Add LEAST and GREATEST functions (#8100)

  • Handle SELECT * with extra columns (#7959)

  • Add FILTER clauses for aggregates (#7916)

  • Add ST_Within function (#7990)

  • Handle semicolon in query (#7861)

  • Add EXPLAIN PLAN (#7568)
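
As a quick, hedged illustration of the new FILTER clause for aggregates, here is a small example query (it assumes the wikievents table, with its bot and domain columns, from the Dash guide elsewhere in these docs):

SELECT domain,
       count(*) FILTER(WHERE bot = true) AS botChanges,
       count(*) FILTER(WHERE bot = false) AS humanChanges
FROM wikievents
GROUP BY domain
LIMIT 10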

UI Enhancements

  • Show Reported Size and Estimated Size in human readable format in UI (#8199)

  • Make query console state URL based (#8194)

  • Improve query console to not show query result when multiple columns have the same name (#8131)

  • Improve Pinot dashboard tenant view to show correct amount of servers and brokers (#8115)

  • Fix issue with opening new tabs from Pinot Dashboard (#8021)

  • Fix issue with Query console going blank on syntax error (#8006)

  • Make query stats always show even there's error (#7981)

  • Implement OIDC auth workflow in UI (#7121)

  • Add tooltip and modal for table status (#7899)

  • Add option to wrap lines in custom code mirror (#7857)

  • Add ability to comment out queries with cmd + / (#7841)

  • Return exception when unavailable segments on empty broker response (#7823)

  • Properly handle the case where segments are missing in externalview (#7803)

  • Add TIMESTAMP to datetime column Type (#7746)

Performance Improvements

  • Reuse regex matcher in dictionary based LIKE queries (#8261)

  • Early terminate orderby when columns already sorted (#8228)

  • Do not do another pass of Query Automaton Minimization (#8237)

  • Improve RangeBitmap by upgrading RoaringBitmap (#8206)

  • Optimize geometry serializer usage when literal is available (#8167)

  • Improve performance of no-dictionary group by (#8195)

  • Allocation free DataBlockCache lookups (#8140)

  • Prune unselected THEN statements in CaseTransformFunction (#8138)

  • Aggregation delay conversion to double (#8139)

  • Reduce object allocation rate in ExpressionContext or FunctionContext (#8124)

  • Lock free DimensionDataTableManager (#8102)

  • Improve json path performance during ingestion by upgrading JsonPath (#7819)

  • Reduce allocations and speed up StringUtil.sanitizeString (#8013)

  • Faster metric scans - ForwardIndexReader (#7920)

  • Unpeel group by 3 ways to enable vectorization (#7949)

  • Power of 2 fixed size chunks (#7934)

  • Don't use mmap for compression except for huge chunks (#7931)

  • Exit group-by marking loop early (#7935)

  • Improve performance of base chunk forward index write (#7930)

  • Cache JsonPaths to prevent compilation per segment (#7826)

  • Use LZ4 as default compression mode (#7797)

  • Peel off special case for 1 dimensional groupby (#7777)

  • Bump roaringbitmap version to improve range queries performance (#7734)

Other Notable Features

  • Adding NoopPinotMetricFactory and corresponding changes (#8270)

  • Allow to specify fixed segment name for SegmentProcessorFramework (#8269)

  • Move all prestodb dependencies into a separated module (#8266)

  • Include docIds in Projection and Transform block (#8262)

  • Automatically update broker resource on broker changes (#8249)

  • Update ScalarFunction annotation from name to names to support function alias. (#8252)

  • Implemented BoundedColumnValue partition function (#8224)

  • Add copy recursive API to pinotFS (#8200)

  • Add Support for Getting Live Brokers for a Table (without type suffix) (#8188)

  • Pinot docker image - cache prometheus rules (#8241)

  • In BrokerRequestToQueryContextConverter, remove unused filterExpressionContext (#8238)

  • Adding retention period to segment delete REST API (#8122)

  • Pinot docker image - upgrade prometheus and scope rulesets to components (#8227)

  • Allow segment name postfix for SegmentProcessorFramework (#8230)

  • Superset docker image - update pinotdb version in superset image (#8231)

  • Add retention period to deleted segment files and allow table level overrides (#8176)

  • Remove incubator from pinot and superset (#8223)

  • Adding table config overrides for disabling groovy (#8196)

  • Optimise sorted docId iteration order in mutable segments (#8213)

  • Adding secure grpc query server support (#8207)

  • Move Tls configs and utils from pinot-core to pinot-common (#8210)

  • Reduce allocation rate in LookupTransformFunction (#8204)

  • Allow subclass to customize what happens pre/post segment uploading (#8203)

  • Enable controller service auto-discovery in Jersey framework (#8193)

  • Add support for pushFileNamePattern in pushJobSpec (#8191)

  • Add additionalMatchLabels to helm chart (#7177)

  • Simulate rsvps after meetup.com retired the feed (#8180)

  • Adding more checkstyle rules (#8197)

  • Add persistence.extraVolumeMounts and persistence.extraVolumes to Kubernetes statefulsets (#7486)

  • Adding scala profile for kafka 2.x build and remove root pom scala dependencies (#8174)

  • Allow real-time data providers to accept non-kafka producers (#8190)

  • Enhance revertReplaceSegments api (#8166)

  • Adding broker level config for disabling Pinot queries with Groovy (#8159)

  • Make presto driver query pinot server with SQL (#8186)

  • Adding controller config for disabling Groovy in ingestionConfig (#8169)

  • Adding main method for LaunchDataIngestionJobCommand for spark-submit command (#8168)

  • Add auth token for segment replace rest APIs (#8146)

  • Add allowRefresh option to UploadSegment (#8125)

  • Add Ingress to Broker and Controller helm charts (#7997)

  • Improve progress reporter in SegmentCreationMapper (#8129)

  • St_* function error messages + support literal transform functions (#8001)

  • Add schema and segment crc to SegmentDirectoryContext (#8127)

  • Extend enableParallePushProtection support in UploadSegment API (#8110)

  • Support BOOLEAN type in Config Recommendation Engine (#8055)

  • Add a broker metric to distinguish exception happens when acquire channel lock or when send request to server (#8105)

  • Add pinot.minion prefix on minion configs for consistency (#8109)

  • Enable broker service auto-discovery in Jersey framework (#8107)

  • Timeout if waiting server channel lock takes a long time (#8083)

  • Wire EmptySegmentPruner to routing config (#8067)

  • Support for TIMESTAMP data type in Config Recommendation Engine (#8087)

  • Listener TLS customization (#8082)

  • Add consumption rate limiter for LLConsumer (#6291)

  • Implement Real Time Mutable FST (#8016)

  • Allow quickstart to get table files from filesystem (#8093)

  • Add support for instant segment deletion (#8077)

  • Add a config file to override quickstart configs (#8059)

  • Add pinot server grpc metadata acl (#8030)

  • Move compatibility verifier to a separate module (#8049)

  • Move hadoop and spark ingestion libs from plugins directory to external-plugins (#8048)

  • Add global strategy for partial upsert (#7906)

  • Upgrade kafka to 2.8.1 (#7883)

  • Created EmptyQuickstart command (#8024)

  • Allow SegmentPushUtil to push real-time segment (#8032)

  • Add ignoreMerger for partial upsert (#7907)

  • Make task timeout and concurrency configurable (#8028)

  • Return 503 response from health check on shut down (#7892)

  • Pinot-druid-benchmark: set the multiValueDelimiterEnabled to false when importing TPC-H data (#8012)

  • Cleanup: Remove remaining occurrences of incubator. (#8023)

  • Refactor segment loading logic in BaseTableDataManager to decouple it with local segment directory (#7969)

  • Improving segment replacement/revert protocol (#7995)

  • PinotConfigProvider interface (#7984)

  • Enhance listSegments API to exclude the provided segments from the output (#7878)

  • Remove outdated broker metric definitions (#7962)

  • Add skip key for realtimeToOffline job validation (#7921)

  • Upgrade async-http-client (#7968)

  • Allow Reloading Segments with Multiple Threads (#7893)

  • Ignore query options in commented out queries (#7894)

  • Remove TableConfigCache which does not listen on ZK changes (#7943)

  • Switch to zookeeper of helm 3.0x (#7955)

  • Use a single react hook for table status modal (#7952)

  • Add debug logging for real-time ingestion (#7946)

  • Separate the exception for transform and indexing for consuming records (#7926)

  • Disable JsonStatementOptimizer (#7919)

  • Make index readers/loaders pluggable (#7897)

  • Make index creator provision pluggable (#7885)

  • Support loading plugins from multiple directories (#7871)

  • Update helm charts to honour readinessEnabled probes flags on the Controller, Broker, Server and Minion StatefulSets (#7891)

  • Support non-selection-only GRPC server request handler (#7839)

  • GRPC broker request handler (#7838)

  • Add validator for SDF (#7804)

  • Support large payload in zk put API (#7364)

  • Push JSON Path evaluation down to storage layer (#7820)

  • When upserting new record, index the record before updating the upsert metadata (#7860)

  • Add Post-Aggregation Gapfilling functionality. (#7781)

  • Clean up deprecated fields from segment metadata (#7853)

  • Remove deprecated method from StreamMetadataProvider (#7852)

  • Obtain replication factor from tenant configuration in case of dimension table (#7848)

  • Use valid bucket end time instead of segment end time for merge/rollup delay metrics (#7827)

  • Make pinot start components command extensible (#7847)

  • Make upsert inner segment update atomic (#7844)

  • Clean up deprecated ZK metadata keys and methods (#7846)

  • Add extraEnv, envFrom to statefulset help template (#7833)

  • Make openjdk image name configurable (#7832)

  • Add getPredicate() to PredicateEvaluator interface (#7840)

  • Make split commit the default commit protocol (#7780)

  • Pass Pinot connection properties from JDBC driver (#7822)

  • Add Pinot client connection config to allow skip fail on broker response exception (#7816)

  • Change default range index version to v2 (#7815)

  • Put thread timer measuring inside of wall clock timer measuring (#7809)

  • Add getRevertReplaceSegmentRequest method in FileUploadDownloadClient (#7796)

  • Add JAVA_OPTS env var in docker image (#7799)

  • Split thread cpu time into three metrics (#7724)

  • Add config for enabling real-time offset based consumption status checker (#7753)

  • Add timeColumn, timeUnit and totalDocs to the json segment metadata (#7765)

  • Set default Dockerfile CMD to -help (#7767)

  • Add getName() to PartitionFunction interface (#7760)

  • Support Native FST As An Index Subtype for FST Indices (#7729)

  • Add forceCleanup option for 'startReplaceSegments' API (#7744)

  • Add config for keystore types, switch tls to native implementation, and add authorization for server-broker tls channel (#7653)

  • Extend FileUploadDownloadClient to send post request with json body (#7751)

Major Bug Fixes

  • Fix string comparisons (#8253)

  • Bugfix for order-by all sorted optimization (#8263)

  • Fix dockerfile (#8239)

  • Ensure partition function never return negative partition (#8221)

  • Handle indexing failures without corrupting inverted indexes (#8211)

  • Fixed broken HashCode partitioning (#8216)

  • Fix segment replace test (#8209)

  • Fix filtered aggregation when it is mixed with regular aggregation (#8172)

  • Fix FST Like query benchmark to remove SQL parsing from the measurement (#8097)

  • Do not identify function types by throwing exceptions (#8137)

  • Fix regression bug caused by sharing TSerializer across multiple threads (#8160)

  • Fix validation before creating a table (#8103)

  • Check cron schedules from table configs after subscribing child changes (#8113)

  • Disallow duplicate segment name in tar file (#8119)

  • Fix storage quota checker NPE for Dimension Tables (#8132)

  • Fix TraceContext NPE issue (#8126)

  • Update gcloud libraries to fix underlying issue with api's with CMEK (#8121)

  • Fix error handling in jsonPathArray (#8120)

  • Fix error handling in json functions with default values (#8111)

  • Fix controller config validation failure for customized TLS listeners (#8106)

  • Validate the numbers of input and output files in HadoopSegmentCreationJob (#8098)

  • Broker Side validation for the query with aggregation and col but without group by (#7972)

  • Improve the proactive segment clean-up for REVERTED (#8071)

  • Allow JSON forward indexes (#8073)

  • Fix the PinotLLCRealtimeSegmentManager on segment name check (#8058)

  • Always use smallest offset for new partitionGroups (#8053)

  • Fix RealtimeToOfflineSegmentsTaskExecutor to handle time gap (#8054)

  • Refine segment consistency checks during segment load (#8035)

  • Fixes for various JDBC issues (#7784)

  • Delete tmp- segment directories on server startup (#7961)

  • Fix ByteArray datatype column metadata getMaxValue NPE bug and expose maxNumMultiValues (#7918)

  • Fix the issues that Pinot upsert table's uploaded segments get deleted when a server restarts. (#7979)

  • Fixed segment upload error return (#7957)

  • Fix QuerySchedulerFactory to plug in custom scheduler (#7945)

  • Fix the issue with grpc broker request handler not started correctly (#7950)

  • Fix real-time ingestion when an entire batch of messages is filtered out (#7927)

  • Move decode method before calling acquireSegment to avoid reference count leak (#7938)

  • Fix semaphore issue in consuming segments (#7886)

  • Add bootstrap mode for PinotServiceManager to avoid glitch for health check (#7880)

  • Fix the broker routing when segment is deleted (#7817)

  • Fix obfuscator not capturing secretkey and keytab (#7794)

  • Fix segment merge delay metric when there is empty bucket (#7761)

  • Fix QuickStart by adding types for invalid/missing type (#7768)

  • Use oldest offset on newly detected partitions (#7756)

  • Fix javadoc to compatible with jdk8 source (#7754)

  • Handle null segment lineage ZNRecord for getSelectedSegments API (#7752)

  • Handle fields missing in the source in ParquetNativeRecordReader (#7742)

Backward Incompatible Changes

  • Fix the issue with HashCode partitioning function (#8216)

  • Fix the issue with validation on table creation (#8103)

  • Change PinotFS API's (#8603)

GapFill Function For Time-Series Dataset

The GapFill function is experimental and has limited support, validation, and error reporting.

The GapFill function is only supported with the single-stage query engine (v1).

Many datasets are time series in nature, tracking the state changes of an entity over time. The recorded data points might be sparse, or events might be missing due to network and other device issues in an IoT environment. But analytics applications that track the state changes of these entities over time might query for values at a lower granularity than the interval at which the metrics were recorded.

Here is a sample dataset tracking the status of parking lots in a parking space.

lotId | event_time              | is_occupied
P1    | 2021-10-01 09:01:00.000 | 1
P2    | 2021-10-01 09:17:00.000 | 1
P1    | 2021-10-01 09:33:00.000 | 0
P1    | 2021-10-01 09:47:00.000 | 1
P3    | 2021-10-01 10:05:00.000 | 1
P2    | 2021-10-01 10:06:00.000 | 0
P2    | 2021-10-01 10:16:00.000 | 1
P2    | 2021-10-01 10:31:00.000 | 0
P3    | 2021-10-01 11:17:00.000 | 0
P1    | 2021-10-01 11:54:00.000 | 0

We want to find out the total number of parking lots that are occupied over a period of time which would be a common use case for a company that manages parking spaces.

Let us take 30 minutes' time bucket as an example:

timeBucket/lotId        | P1  | P2  | P3
2021-10-01 09:00:00.000 | 1   | 1   |
2021-10-01 09:30:00.000 | 0,1 |     |
2021-10-01 10:00:00.000 |     | 0,1 | 1
2021-10-01 10:30:00.000 |     | 0   |
2021-10-01 11:00:00.000 |     |     | 0
2021-10-01 11:30:00.000 | 0   |     |

If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied parking lots per time bucket, we need to gap fill the missing data.

The Ways of Gap Filling the Data

There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.

FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity (in this case, the parking lot), if a previous value exists. Otherwise, it will be filled with the default value.

FILL_DEFAULT_VALUE means the missing data will be filled with the default value. For numeric columns, the default value is 0. For BOOLEAN columns, the default value is false. For TIMESTAMP, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON, and BYTES, it is the empty string. For array columns, it is an empty array.
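
As a minimal sketch, the two modes differ only in the FILL argument passed to GAPFILL (status is the column used in the queries below):

-- carry the last observed value for each lotId forward into an empty bucket
FILL(status, 'FILL_PREVIOUS_VALUE')

-- fall back to the type default (0 for the INT status column) in an empty bucket
FILL(status, 'FILL_DEFAULT_VALUE')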

We will leverage the following query to calculate the total number of occupied parking lots per time bucket.

Aggregation/Gapfill/Aggregation

Query Syntax

SELECT time_col, SUM(status) AS occupied_slots_count
FROM (
    SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
                   '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
                    TIMESERIESON(lotId)), lotId, status
    FROM (
        SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
               lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
        FROM parking_data
        WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
        GROUP BY 1, 2
        ORDER BY 1
        LIMIT 100)
    LIMIT 100)
GROUP BY 1
LIMIT 100
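
A hedged reading of the GAPFILL arguments used above, based on the values in this example:

-- GAPFILL(
--   time_col,                                                     -- bucketed time expression from the inner query
--   '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS',  -- format of time_col
--   '2021-10-01 09:00:00.000',                                    -- start of the window to fill
--   '2021-10-01 12:00:00.000',                                    -- end of the window to fill
--   '30:MINUTES',                                                 -- bucket size
--   FILL(status, 'FILL_PREVIOUS_VALUE'),                          -- fill strategy for the status column
--   TIMESERIESON(lotId))                                          -- column identifying each individual time series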

In the example above, the TIMESERIESON(column_name) element is obligatory, and column_name must point to an actual table column. It can't be a literal or an expression.

Moreover, if the innermost query contains a GROUP BY clause then (contrary to regular queries) it must also contain an aggregate function; otherwise the Select and Gapfill should be in the same sql statement error is returned.

Workflow

The innermost SQL will convert the raw event table to the following table.

lotId | event_time              | is_occupied
P1    | 2021-10-01 09:00:00.000 | 1
P2    | 2021-10-01 09:00:00.000 | 1
P1    | 2021-10-01 09:30:00.000 | 1
P3    | 2021-10-01 10:00:00.000 | 1
P2    | 2021-10-01 10:00:00.000 | 1
P2    | 2021-10-01 10:30:00.000 | 0
P3    | 2021-10-01 11:00:00.000 | 0
P1    | 2021-10-01 11:30:00.000 | 0

The second-most-nested SQL will gap fill the returned data as follows:

timeBucket/lotId        | P1 | P2 | P3
2021-10-01 09:00:00.000 | 1  | 1  | 0
2021-10-01 09:30:00.000 | 1  | 1  | 0
2021-10-01 10:00:00.000 | 1  | 1  | 1
2021-10-01 10:30:00.000 | 1  | 0  | 1
2021-10-01 11:00:00.000 | 1  | 0  | 0
2021-10-01 11:30:00.000 | 0  | 0  | 0

The outermost query will aggregate the gapfilled data as follows:

timeBucket              | totalNumOfOccuppiedSlots
2021-10-01 09:00:00.000 | 2
2021-10-01 09:30:00.000 | 2
2021-10-01 10:00:00.000 | 3
2021-10-01 10:30:00.000 | 2
2021-10-01 11:00:00.000 | 1
2021-10-01 11:30:00.000 | 0

There is one assumption made here: the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation will not sort the data.

The above example just shows the use case where the three steps happen:

  1. The raw data will be aggregated;

  2. The aggregated data will be gapfilled;

  3. The gapfilled data will be aggregated.

There are three more scenarios we can support.

Select/Gapfill

If we want to gap fill the missing data per half-hour time bucket, here is the query:

Query Syntax

SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
FROM parking_data
WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
ORDER BY 1
LIMIT 100

Workflow

At first the raw data will be transformed as follows:

lotId | event_time              | is_occupied
P1    | 2021-10-01 09:00:00.000 | 1
P2    | 2021-10-01 09:00:00.000 | 1
P1    | 2021-10-01 09:30:00.000 | 0
P1    | 2021-10-01 09:30:00.000 | 1
P3    | 2021-10-01 10:00:00.000 | 1
P2    | 2021-10-01 10:00:00.000 | 0
P2    | 2021-10-01 10:00:00.000 | 1
P2    | 2021-10-01 10:30:00.000 | 0
P3    | 2021-10-01 11:00:00.000 | 0
P1    | 2021-10-01 11:30:00.000 | 0

Then it will be gapfilled as follows:

lotId | event_time              | is_occupied
P1    | 2021-10-01 09:00:00.000 | 1
P2    | 2021-10-01 09:00:00.000 | 1
P3    | 2021-10-01 09:00:00.000 | 0
P1    | 2021-10-01 09:30:00.000 | 0
P1    | 2021-10-01 09:30:00.000 | 1
P2    | 2021-10-01 09:30:00.000 | 1
P3    | 2021-10-01 09:30:00.000 | 0
P1    | 2021-10-01 10:00:00.000 | 1
P3    | 2021-10-01 10:00:00.000 | 1
P2    | 2021-10-01 10:00:00.000 | 0
P2    | 2021-10-01 10:00:00.000 | 1
P1    | 2021-10-01 10:30:00.000 | 1
P2    | 2021-10-01 10:30:00.000 | 0
P3    | 2021-10-01 10:30:00.000 | 1
P1    | 2021-10-01 11:00:00.000 | 1
P2    | 2021-10-01 11:00:00.000 | 0
P3    | 2021-10-01 11:00:00.000 | 0
P1    | 2021-10-01 11:30:00.000 | 0
P2    | 2021-10-01 11:30:00.000 | 0
P3    | 2021-10-01 11:30:00.000 | 0

Aggregate/Gapfill

Query Syntax

SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)), lotId, status
FROM (
    SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
           lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
    FROM parking_data
    WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
    GROUP BY 1, 2
    ORDER BY 1
    LIMIT 100)
LIMIT 100

Workflow

The nested sql will convert the raw event table to the following table.

| lotId | event_time | is_occupied |
| ----- | ---------- | ----------- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |

The outer SQL will gapfill the returned data as follows:

| timeBucket/lotId | P1 | P2 | P3 |
| ---------------- | -- | -- | -- |
| 2021-10-01 09:00:00.000 | 1 | 1 | 0 |
| 2021-10-01 09:30:00.000 | 1 | 1 | 0 |
| 2021-10-01 10:00:00.000 | 1 | 1 | 1 |
| 2021-10-01 10:30:00.000 | 1 | 0 | 1 |
| 2021-10-01 11:00:00.000 | 1 | 0 | 0 |
| 2021-10-01 11:30:00.000 | 0 | 0 | 0 |

Gapfill/Aggregate

Query Syntax

SELECT time_col, SUM(is_occupied) AS occupied_slots_count
FROM (
    SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
           '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
           TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
    FROM parking_data
    WHERE event_time >= 1633078800000 AND  event_time <= 1633089600000
    ORDER BY 1
    LIMIT 100)
GROUP BY 1
LIMIT 100

Workflow

At first, the raw data will be transformed as follows:

| lotId | event_time | is_occupied |
| ----- | ---------- | ----------- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |

The transformed data will be gap filled as follows:

| lotId | event_time | is_occupied |
| ----- | ---------- | ----------- |
| P1 | 2021-10-01 09:00:00.000 | 1 |
| P2 | 2021-10-01 09:00:00.000 | 1 |
| P3 | 2021-10-01 09:00:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 09:30:00.000 | 1 |
| P2 | 2021-10-01 09:30:00.000 | 1 |
| P3 | 2021-10-01 09:30:00.000 | 0 |
| P1 | 2021-10-01 10:00:00.000 | 1 |
| P3 | 2021-10-01 10:00:00.000 | 1 |
| P2 | 2021-10-01 10:00:00.000 | 0 |
| P2 | 2021-10-01 10:00:00.000 | 1 |
| P1 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P3 | 2021-10-01 10:30:00.000 | 1 |
| P2 | 2021-10-01 10:30:00.000 | 0 |
| P1 | 2021-10-01 11:00:00.000 | 1 |
| P2 | 2021-10-01 11:00:00.000 | 0 |
| P3 | 2021-10-01 11:00:00.000 | 0 |
| P1 | 2021-10-01 11:30:00.000 | 0 |
| P2 | 2021-10-01 11:30:00.000 | 0 |
| P3 | 2021-10-01 11:30:00.000 | 0 |

The aggregation will generate the following table:

| timeBucket | totalNumOfOccupiedSlots |
| ---------- | ----------------------- |
| 2021-10-01 09:00:00.000 | 2 |
| 2021-10-01 09:30:00.000 | 2 |
| 2021-10-01 10:00:00.000 | 3 |
| 2021-10-01 10:30:00.000 | 2 |
| 2021-10-01 11:00:00.000 | 1 |
| 2021-10-01 11:30:00.000 | 0 |

1.3.0

Release Notes for 1.3.0

This release brings significant improvements, including enhancements to the multistage query engine and the introduction of an experimental time series query engine for efficient analysis. Key features include database query quotas, cursor-based pagination for large result sets, multi-stream ingestion, and new function support for URL and GeoJSON processing. Security vulnerabilities have been addressed, and several bug fixes and performance enhancements make the platform more robust and versatile.

Multistage Engine Improvements

Reuse common expressions in a query (spool)

Refines query plan reuse in Apache Pinot by allowing reuse across stages instead of subtrees. Stages are natural boundaries in the query plan, divided into pull-based operators. To execute queries, Pinot introduces stages connected by MailboxSendOperator and MailboxReceiveOperator. The proposal modifies MailboxSendOperator to send data to multiple stages, transforming stage connections into a Directed Acyclic Graph (DAG) for greater efficiency and flexibility.

Segment Plan for MultiStage Queries

This feature focuses on providing comprehensive execution plans, including physical operator details. The new explain mode aligns with Calcite terminology and uses a broker-server communication flow to analyze and transform query plans into explained physical plans without executing them. A new ExplainedPlanNode is introduced to enrich query execution plans with physical details, ensuring better transparency and debugging capabilities for users.

DataBlock Serde Performance Improvements

Improve the performance of DataBlock building, serialization, and deserialization by reducing memory allocation and copies without altering the binary format. Benchmarks show 1x to 3x throughput gains, with significant reductions in memory allocation, minimizing GC-related latency issues in production. The improvement is achieved by changes to the buffers and the addition of a couple of stream classes.

Notable Improvements and Bug Fixes

  • Allow adding and subtracting timestamp types.

  • Remove PinotAggregateToSemiJoinRule to avoid mistakenly removing DISTINCT from the IN clause.

  • Support the use of timestamp indexes.

  • Support for polymorphic scalar comparison functions (=, !=, >, >=, <, <=).

  • Optimized MergeEqInFilterOptimizer by reducing the hash computation of expressions.

  • Add support for is_enable_group_trim aggregate option.

  • Add support for is_leaf_return_final_result aggregate option.

  • Override the return type from NOW to TIMESTAMP.

  • Fix broken BIG_DECIMAL aggregations (MIN / MAX / SUM / AVG).

  • Add cluster configuration to limit the number of multi-stage queries running concurrently.

  • Allow filter for lookup JOIN.

  • Fix the bug where the query option is completely overridden when generating a leaf stage query.

  • Fix timestamp literal handling in the multi-stage query engine.

  • Add TLS support to mailboxes used in the multi-stage engine.

  • Allow configuring TLS between brokers and servers.

  • Add tablesQueried metadata to BrokerResponse for multi-stage queries.

  • Apply filter reduce expressions Calcite rule at the end to prevent literal-only filter pushdown to leaf stage.

  • Add support for all data types in return type inference from string literals for JSON extract functions.

  • Add support for the IGNORE NULLS option for the FIRST_VALUE and LAST_VALUE window functions.

  • Fix for window frame upper bound offset extraction in PinotWindowExchangeNodeInsertRule.

  • Add support for defining custom window frame bounds for window functions.

  • Improvements to allow using DISTINCTCOUNTTHETASKETCH with filter arguments.

  • Fix for ROUND scalar function in the multi-stage query engine.

  • Support for enabling LogicalProject pushdown optimizations to eliminate the exchange of unused columns.

  • Support for COALESCE as a variadic scalar function.

  • Support for lookup join.

  • Add NULLIF scalar function.

  • Compute all groups for the group by queries with only filtered aggregations.

  • Add broker API to run a query on both query engines and compare results.

  • Database handling improvement in the multi-stage engine.

  • Adds per-block row tracking for CROSS JOINs to prevent OOM while allowing large joins to function within memory limits.

  • OOM Protection Support for Multi-Stage Queries.

  • Refactor function registry for multi-stage engine.

  • Enforce max rows in join limit on joined rows with left input.

  • Argument type is used to look up the function for the literal-only query.

  • Ensure broker queries fail when the multi-stage engine is disabled, aligning behaviour with the controller to improve user experience.

Timeseries Engine Support in Pinot

Introduction of a Generic Time Series Query Engine in Apache Pinot, enabling native support for various time-series query languages (e.g., PromQL, M3QL) through a pluggable framework. This enhancement addresses limitations in Pinot’s current SQL-based query engines for time-series analysis, providing optimized performance and usability for observability use cases, especially those requiring high-cardinality metrics.

NOTE: Timeseries Engine support in Pinot is currently in an Experimental state.

Key Features

Pluggable Time-Series Query Language:

  • Pinot will support multiple time-series query languages, such as PromQL and Uber’s M3QL, via plugins like pinot-m3ql.

  • Example queries:

    • Plot hourly order counts for specific merchants.

    • Perform week-over-week analysis of order counts.

  • These plugins will leverage a new SPI module to enable seamless integration of custom query languages.

Pluggable Time-Series Operators:

  • Custom operators specific to each query language (e.g., nonNegativeDerivative or holt_winters) can be implemented within language-specific plugins without modifying Pinot’s core code.

  • Extensible operator abstractions will allow stakeholders to define unique time-series analysis functions.

Advantages of the New Engine:

  • Optimized for Time-Series Data: Processes data in series rather than rows, improving performance and simplifying the addition of complex analysis functions.

  • Reduced Complexity in Pinot Core: The engine reuses existing components like the Multi-Stage Engine (MSE) Query Scheduler, Query Dispatcher, and Mailbox. At the same time, language parsers and planners remain modular in plugins.

  • Improved Usability: Users can run concise and powerful time-series queries in their preferred language, avoiding the verbosity and limitations of SQL.

Impact on Observability Use Cases:

This new engine significantly enhances Pinot’s ability to handle complex time-series analyses efficiently, making it an ideal database for high-cardinality metrics and observability workloads.

The improvement is a step forward in transforming Pinot into a robust and versatile platform for time-series analytics, enabling seamless integration of diverse query languages and custom operators.

Here are some of the key PRs that have been merged as part of this feature:

  • Pinot time series engine SPI.

  • Add combine and segment level operators for time series.

  • Working E2E quickstart for time series engine.

  • Handling NULL cases in sum, min, max series builders.

  • Remove unnecessary time series materialization and minor cleanups.

  • Fix offset handling and effective time filter and enable Group-By expressions.

  • Enabling JSON column for Group-By in time series.

  • Fix bug in handling empty filters in time series.

  • Minor time series engine improvements.

  • Fix time series query correctness issue.

  • Define time series ID and broker response name tag semantics.

  • Use num docs from the value block in the time series aggregation operator.

  • Make time buckets half open on the left.

  • Fix Server Selection Bug + Enforce Timeout.

  • Response Size Limit, Metrics and Series Limit.

  • Refactor to enable Broker reduction.

  • Enable streaming response for time series.

  • Add time series exchange operator, plan node and serde.

  • Add support for partial aggregate and complex intermediate type.

  • Complete support for multi-server queries.

Database Query Quota

Introduces the ability to impose query rate limits at the database level, covering all queries made to tables within a database. A database-level rate limiter is implemented, and a new method, acquireDatabase(databaseName), is added to the QueryQuotaManager interface to check database query quotas.

Database Query Quota Configuration

  • Query and storage quotas are now provisioned similarly to table quotas but managed separately in a DatabaseConfig znode.

  • Details about the DatabaseConfig znode:

    • It does not represent a logical database entity.

    • Its absence does not prevent table creation under a database.

    • Deletion does not remove tables within the database.

Default and Override Quotas

  • A default query quota (databaseMaxQueriesPerSecond: 1000) is provided in ClusterConfig.

  • Overrides for specific databases can be configured via znodes (e.g., PROPERTYSTORE/CONFIGS/DATABASE/).

APIs for Configuration

| Method | Path | Description |
| ------ | ---- | ----------- |
| POST | /databases/{databaseName}/quotas?maxQueriesPerSecond= | Sets the database query quota |
| GET | /databases/{databaseName}/quotas | Get the database query quota |

Dynamic Quota Updates

  • Quotas are determined by a combination of default cluster-level quotas and database-specific overrides.

  • Per-broker quotas are adjusted dynamically based on the number of live brokers.

  • Updates are handled via:

    • A custom DatabaseConfigRefreshMessage is sent to brokers upon database config changes.

    • A ClusterConfigChangeListener in ClusterChangeMediator to process updates in cluster configs.

    • Adjustments to per-broker quotas upon broker resource changes.

    • Creation of database rate limiters during the OFFLINE -> ONLINE state transition of tables in BrokerResourceOnlineOfflineStateModel.

This feature provides fine-grained control over query rate limits, ensuring scalability and efficient resource management for databases within Pinot.

Binary Workload Scheduler for Constrained Execution

Introduction of the BinaryWorkloadScheduler, which categorizes queries into two distinct workloads to ensure cluster stability and prioritize critical operations:

Workload Categories:

1. Primary Workload:

  • Default category for all production traffic.

  • Queries are executed using an unbounded FCFS (First-Come, First-Served) scheduler.

  • Designed for high-priority, critical queries to maintain consistent availability and performance.

2. Secondary Workload:

  • Reserved for ad-hoc queries, debugging tools, dashboards/notebooks, development environments, and one-off tests.

  • Imposes several constraints to minimize impact on the primary workload:

    • Limited concurrent queries: Caps the number of in-progress queries, with excess queries queued.

    • Thread restrictions: Limits the number of worker threads per query and across all queries in the secondary workload.

    • Queue pruning: Queries stuck in the queue too long are pruned based on time or queue length.

Key Benefits:

  • Prioritization: Guarantees the primary workload remains unaffected by resource-intensive or long-running secondary queries.

  • Stability: Protects cluster availability by preventing incidents caused by poorly optimized or excessive ad-hoc queries.

  • Scalability: Efficiently manages traffic in multi-tenant clusters, maintaining service reliability across workloads.

Cursors Support

Cursor support allows Pinot clients to consume query results in smaller chunks. This lets clients work with fewer resources, especially memory, and makes application logic more straightforward; for example, an app UI can paginate through results in a table or a graph. Cursor support has been implemented using the APIs listed below.

API

| Method | Path | Description |
| ------ | ---- | ----------- |
| POST | /query/sql | New broker API parameter has been added to trigger pagination. |
| GET | /resultStore/{requestId}/results | Broker API that can be used to iterate over the result set of a query submitted using the above API. |
| GET | /resultStore/{requestId}/ | Returns the BrokerResponse metadata of the query. |
| GET | /resultStore | Lists all the requestIds of all the query results available in the response store. |
| DELETE | /resultStore/{requestId}/ | Delete the results of a query. |

SPI

The feature provides two SPIs so that alternative implementations can be plugged in:

  • ResponseSerde: Serializes/deserializes the response.

  • ResponseStore: Stores responses in a storage system.

Both SPIs use the Java SPI mechanism with the default ServiceLoader to discover implementations. All implementations should be annotated with AutoService to help generate the files used for discovering the implementations.

URL Functions Support

Implemented various URL functions to handle multiple aspects of URL processing, including extraction, encoding/decoding, and manipulation, making them useful for tasks involving URL parsing and modification. A usage sketch follows the method lists below.

URL Extraction Methods

  • urlProtocol(String url): Extracts the protocol (scheme) from the URL.

  • urlDomain(String url): Extracts the domain from the URL.

  • urlDomainWithoutWWW(String url): Extracts the domain without the leading "www." if present.

  • urlTopLevelDomain(String url): Extracts the top-level domain (TLD) from the URL.

  • urlFirstSignificantSubdomain(String url): Extracts the first significant subdomain from the URL.

  • cutToFirstSignificantSubdomain(String url): Extracts the first significant subdomain and the top-level domain from the URL.

  • cutToFirstSignificantSubdomainWithWWW(String url): Returns the part of the domain that includes top-level subdomains up to the "first significant subdomain", without stripping "www.".

  • urlPort(String url): Extracts the port from the URL.

  • urlPath(String url): Extracts the path from the URL without the query string.

  • urlPathWithQuery(String url): Extracts the path from the URL with the query string.

  • urlQuery(String url): Extracts the query string without the initial question mark (?) and excludes the fragment (#) and everything after it.

  • urlFragment(String url): Extracts the fragment identifier (without the hash symbol) from the URL.

  • urlQueryStringAndFragment(String url): Extracts the query string and fragment identifier from the URL.

  • extractURLParameter(String url, String name): Extracts the value of a specific query parameter from the URL.

  • extractURLParameters(String url): Extracts all query parameters from the URL as an array of name=value pairs.

  • extractURLParameterNames(String url): Extracts all parameter names from the URL query string.

  • urlHierarchy(String url): Generates a hierarchy of URLs truncated at path and query separators.

  • urlPathHierarchy(String url): Generates a hierarchy of path elements from the URL, excluding the protocol and host.

URL Manipulation Methods

  • urlEncode(String url): Encodes a string into a URL-safe format.

  • urlDecode(String url) Decodes a URL-encoded string.

  • urlEncodeFormComponent(String url): Encodes the URL string following RFC-1866 standards, with spaces encoded as +.

  • urlDecodeFormComponent(String url): Decodes the URL string following RFC-1866 standards, with + decoded as a space.

  • urlNetloc(String url): Extracts the network locality (username:password@host:port) from the URL.

  • cutWWW(String url): Removes the leading "www." from a URL’s domain.

  • cutQueryString(String url): Removes the query string, including the question mark.

  • cutFragment(String url): Removes the fragment identifier, including the number sign.

  • cutQueryStringAndFragment(String url): Removes both the query string and fragment identifier.

  • cutURLParameter(String url, String name): Removes a specific query parameter from a URL.

  • cutURLParameters(String url, String[] names): Removes multiple specific query parameters from a URL.
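
As an illustrative sketch of how a few of these functions might be combined in a query (the web_events table and page_url column are hypothetical):

SELECT urlDomain(page_url) AS domain,
       urlPath(page_url) AS path,
       extractURLParameter(page_url, 'utm_source') AS utm_source,
       cutQueryStringAndFragment(page_url) AS canonical_url
FROM web_events
LIMIT 10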

Multi Stream Ingestion Support ,

  • Add support to ingest from multiple sources into a single table

  • Use the existing interface (TableConfig) to define multiple streams

  • Separate the partition id definition between the stream and the Pinot segment

  • Compatible with existing stream partition auto-expansion logic. The feature does not change any existing interfaces: users can define the table config in the same way and combine it with any other transform functions or instance assignment strategies.

New Scalar Functions Support.

  • intDiv and intDivOrZero: Perform integer division, with intDivOrZero returning zero for division by zero or when dividing a minimal negative number by minus one.

  • isFinite, isInfinite, and isNaN: Check if a double value is finite, infinite, or NaN, respectively.

  • ifNotFinite: Returns a default value if the given value is not finite.

  • moduloOrZero and positiveModulo: Variants of the modulo operation, with moduloOrZero returning zero for division by zero or when dividing a minimal negative number by minus one.

  • negate: Returns the negation of a double value.

  • gcd and lcm: Calculate the greatest common divisor and least common multiple of two long values, respectively.

  • hypot: Computes the hypotenuse of a right-angled triangle given the lengths of the other two sides.

  • byteswapInt and byteswapLong: Perform byte swapping on integer and long values.
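
For illustration, several of these functions could appear in a query as follows (the numeric_metrics table and its columns are hypothetical):

SELECT intDivOrZero(total_bytes, num_requests) AS avg_bytes_per_request,
       gcd(window_millis, bucket_millis) AS common_divisor,
       isNaN(score) AS score_is_nan,
       ifNotFinite(score, 0.0) AS safe_score
FROM numeric_metrics
LIMIT 10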

GeoJSON Support

Add support for GeoJSON scalar functions:

ST_GeomFromGeoJson(string) -> binary
ST_GeogFromGeoJson(string) -> binary
ST_AsGeoJson(binary) -> string

Supported data types:

  • Point

  • LineString

  • Polygon

  • MultiPoint

  • MultiLineString

  • MultiPolygon

  • GeometryCollection

  • Feature

  • FeatureCollection
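
As a small usage sketch of these functions (the places table is hypothetical and the GeoJSON literal is illustrative):

SELECT ST_AsGeoJson(ST_GeomFromGeoJson('{"type":"Point","coordinates":[-122.29,37.52]}')) AS point_geojson
FROM places
LIMIT 1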

Improved Implementation of Distinct Operators.

Main optimizations:

  • Add per data type DistinctTable and utilize primitive type if possible

  • Specialize single-column case to reduce overhead

  • Allow processing null values with dictionary based operators

  • Specialize unlimited LIMIT case

  • Do not create priority queue before collecting LIMIT values

  • Add support for null ordering

Upsert Improvements

Features and Improvements

Track New Segments for Upsert Tables

  • Improvement for addressing a race condition where newly uploaded segments may be processed by the server before brokers add them to the routing table, potentially causing queries to miss valid documents.

  • Introduce a configurable newSegmentTrackingTimeMs (default 10s) to track new segments on the server side, allowing them to be accessed as optional segments until brokers update their routing tables.

Ensure Upsert Deletion Consistency with Compaction Flow Enabled

This enhancement addresses inconsistencies in upsert-compaction by introducing a mechanism to track the distinct segment count for primary keys. By ensuring a record exists in only one segment before compacting deleted records, it prevents older non-deleted records from being incorrectly revived during server restarts, ensuring a consistent table state.

Consistent Segments Tracking for Consistent Upsert View

This improves consistent upsert view handling by addressing segment tracking and query inconsistencies. Key changes include:

  • Complete and Consistent Segment Tracking: Introduced a new Set to track segments before registration to the table manager, ensuring synchronized segment membership and validDocIds access.

  • Improved Segment Replacement: Added DuoSegmentDataManager to register both mutable and immutable segments during replacement, allowing queries to access a complete data view without blocking ingestion.

  • Query Handling Enhancements: Queries now acquire the latest consuming segments to avoid missing newly ingested data if the broker's routing table isn't updated.

  • Misc Fixes: Addressed edge cases, such as updating _numDocsIndexed before metadata updates, returning empty bitmaps instead of null, and preventing bitmap re-acquisition outside locking logic. These changes, gated by the new feature flag upsertConfig.consistencyMode, are tested with unit and stress tests in a staging environment to ensure reliability.

Other Notable Improvements and Bug Fixes

  • Config for max output segment size in UpsertCompactMerge task.

  • Add config for ignoreCrcMismatch for upsert-compaction task.

  • Upsert small segment merger task in minions.

  • Fix to acquire segmentLock before taking segment snapshot.

  • Update upsert TTL watermark in replaceSegment.

  • Fix checks on largest comparison value for upsert ttl and allow to add segments out of ttl.

  • More observability and metrics to track the upsert rate of deletion.

Lucene and Text Search Improvements

  • Store index metadata file for Lucene text indexes.

  • Runtime configurability for Lucene analyzers and query parsers, enabling dynamic text tokenization and advanced log search capabilities like case-sensitive/insensitive searches.

Security Improvements and Vulnerability Fixes

  • Force SSL cert reload daily using the scheduled thread.

  • Allow configuring TLS between brokers and servers for the multi-stage engine.

  • Strip Matrix parameter from BasePath checking.

  • Disable replacing environment variables and system properties in get table configs REST API.

  • Dependencies upgrade for vulnerabilities.

  • TLS Configuration Support for QueryServer and Dispatch Clients.

  • Return table names failing authorization in the exception for multi-stage engine queries.

  • TLS Port support for Minion.

  • Upgrade the hadoop version to 3.3.6 to fix vulnerabilities.

  • Fix vulnerabilities for msopenjdk 11 pinot-base-runtime image.

Miscellaneous Improvements

  • Allow setting ForwardIndexConfig default settings via cluster config.

  • Extend Merge Rollup Capabilities for Datasketches.

  • Skip task validation during table creation with schema.

  • Add capability to configure sketch precision / accuracy for different rollup buckets. Helpful for saving space in use cases where historical data does not require high accuracy.

  • Add support for application-level query quota.

  • Improvement to allow setting ForwardIndexConfig default settings via cluster config.

  • Enhanced the mutable index class to be pluggable.

  • Improvement to allow configurable initial capacity for IndexedTable.

  • Add a new segment reload API for flexible control, allowing specific segments to be reloaded on designated servers and enabling workload management through batch processing and replica group targeting.

  • Add a server API to list segments that need to be refreshed for a table.

  • Introduced the ability to erase dimension values before rollup in merged segments, reducing cardinality and optimizing space for less critical historical data.

  • Add support for immutable CLPForwardIndex creator and related classes.

  • Add support for Minion Task to support automatic Segment Refresh.

  • Add support for S3A Connector.

  • Add support for hex decimal to long scalar functions.

  • Remove emitting null value fields during data transformation for SchemaConformingTransformer.

  • Improved CSV record reader to skip unparseable lines.

  • Add the ability to specify a target instance for segment reloading and improve API response messages when segments are not found on the target instances.

  • Add support for JSON Path Exists function.

  • Improvement for MSQ explain and stageStats when dealing with empty tables.

  • Improvement for dynamically adjusting GroupByResultHolder's initial capacity based on filter predicates to optimize resource allocation and improve performance for filtered group-by queries.

  • Add support for the isEqualSet Function.

  • Improvement to ensure consistent index configuration by constructing IndexLoadingConfig and SegmentGeneratorConfig from table config and schema, fixing inconsistencies and honouring FieldConfig.EncodingType.

  • Add usage of CLPMutableForwardIndexV2 by default to improve ingestion performance and efficiency.

  • Add null handling support for aggregations grouped by MV columns.

  • Add support to enable the capability to specify zstd and lz4 segment compression via config.

  • Add support for map data type on UI.

  • Add support for ComplexType in SchemaInfo to render Complex Column count in UI.

  • Introduced raw fwd index version V5 containing implicit num doc length, improving space efficiency.

  • Improvement for colocated Joins without hints.

  • Enhanced optimizeDictionary to optionally optimize var-width type columns.

  • Add support for BETWEEN in NumericalFilterOptimizer.

  • Add support for NULLIF scalar function.

  • Improvement for allowing usage of star-tree index with null handling enabled when no null values in segment columns.

  • Improvement for avoiding using setters in IndexLoadingConfig for consuming segments.

  • Implement consistent data push for Spark3 segment generation and metadata push jobs.

  • Improvement in addressing ingestion delays in real-time tables with many partitions by mitigating simultaneous segment commits across consumers.

  • Improve query options validation and error handling.

  • Add support for an arbitrary number of WHEN THEN clauses in the scalar CASE function.

  • Add support for configuring Theta and Tuple aggregation functions.

  • Add support for Map type in complex schema.

  • Add TTL watermark storage/loading for the dedup feature to prevent stale metadata from being added to the store when loading segments.

  • Polymorphic scalar function implementation for BETWEEN.

  • Polymorphic binary arithmetic scalar functions.

  • Improvement for Adaptive Server Selection to penalize servers returning server-side exceptions.

  • Add a server-level configuration for the segment server upload to the deep store.

  • Add support to upload segments in batch mode with METADATA upload type.

  • Remove recreateDeletedConsumingSegment flag from RealtimeSegmentValidationManager.

  • Kafka3 support for realtime ingestion.

  • Allow the building of an index on the preserved field in SchemaConformingTransformer.

  • Add support to differentiate null and emptyLists for multi-value columns in avro decoder.

  • Broker config to set default query null handling behavior.

  • Moves the untarring method to BaseTaskExecutor to enable downloading and untarring from a peer server if deepstore untarring fails and allows DownloadFromServer to be enabled.

  • Optimize Adaptive Server Selection.

  • New SPI to support custom executor services, providing default implementations for cached and fixed thread pools.

  • Introduction of shared IdealStateUpdaterLock for PinotLLCRealtimeSegmentManager to prevent race conditions and timeouts during large segment updates.

  • Support for configuring aggregation function parameters in the star-tree index.

  • Write support for creating Pinot segments in the Pinot Spark connector.

  • Array flattening support in SchemaConformingTransformer.

  • Allow table names in TableConfigs with or without database name when database context is passed.

  • Improvement in null handling performance for nullable single input aggregation functions.

  • Improvement in column-based null handling by refining method naming, adding documentation and updating validation and constructor logic to support column-specific null strategies.

  • UI load time improvements.

  • Enhanced the noRawDataForTextIndex config to skip writing raw data when re-using the mutable index is enabled, fixing a global disable issue and improving ingestion performance.

  • Improvements to polymorphic scalar comparison functions for better backward compatibility.

  • Add TablePauseStatus to track the pause details.

  • Check stale dedup metadata when adding new records/segments.

  • Improve error messages with star-tree indexes creation.

  • Adds support for ZStandard and LZ4 compression in tar archives, enhancing efficiency and reducing CPU bottlenecks for large-scale data operations.

  • Support for IPv6 in Net Utils.

  • Optimize NullableSingleInputAggregationFunction when the entire block is null based on the null bitmap’s cardinality.

  • Support extra headers in the request so that the database can be taken into account when routing requests.

  • Adds routing policy details to query error messages for unavailable segments, providing context to ease confusion and expedite issue triage.

  • Refactoring and cleanup for permissions and access.

  • Prevent 500 error for non-existent tasktype in /tasks/{taskType}/tasks API.

  • Changed STREAM_DATA_LOSS from a Meter to a Gauge to accurately reflect data loss detection and ensure proper cleanup.

Bug Fixes

  • Fix typo in RefreshSegmentTaskExecutor logger.

  • Fix to avoid handling JSON_ARRAY as multi-value JSON during transformation.

  • Fix for partition-enabled instance assignment with minimized movement.

  • Fix v1 query engine behaviour for aggregations without group by where the limit is zero.

  • Fix metadata fetch by increasing timeout for the Kafka client connection.

  • Fix integer overflow in GroupByUtils.

  • Fix for using PropertiesWriter to escape index_map keys properly.

  • Fix query option validation for group-by queries.

  • Fix for making RecordExtractor preserve empty array/map and map entries with empty values.

  • Fix CRC mismatch during deep store upload retry task.

  • Fix for allowing reload for UploadedRealtimeSegmentName segments.

  • Fix default value handling in REGEXP_EXTRACT transform function.

  • Fix for Spark upsert table backfill support.

  • Fix long value parsing in jsonextractscalar.

  • Fix deep store upload retry for infinite retention tables.

  • Fix to ensure deterministic index processing order across server replicas and runs to prevent inconsistent segment data file layouts and unnecessary synchronization.

  • Fix for real-time validation NPE when stream partition is no longer available.

  • Fix for handling NULL values encountered in CLPDecodeTransformFunction.

  • Fix for TextMatchFilterOptimizer grouping for the inner compound query.

  • Fix for removing redundant API calls on the home page.

  • Fix the missing precondition check for the V5 writer version in BaseChunkForwardIndexWriter.

  • Fix for computing all groups for the group by queries with only filtered aggregations.

  • Fix for race condition in IdealStateGroupCommit.

  • Fix default column handling when the forward index is disabled.

  • Fix bug with server return final aggregation result when null handling is enabled.

  • Fix Kubernetes Routing Issue in Helm chart.

  • Fix raw index conversion from v4.

  • Fix for consuming segments cleanup on server startup.

  • Fix for making S3PinotFS listFiles return directories when non-recursive.

  • Fix for rebalancer EV converge check for low disk mode.

  • Fix for copying native text index during format conversion.

  • Fix for enforcing removeSegment flow with _enableDeletedKeysCompactionConsistency.

  • Fix for Init BrokerQueryEventListener.

  • Fix for supporting ComplexFieldSpec in Schema and column metadata.

  • Fix race condition in shared literal transform functions.

  • Fix for honouring the column max length property while populating min/max values for column metadata.

  • Fix for skipping encoding the path URL for the Azure deep store.

  • Fix for handling DUAL SQL queries in the JDBC client.

  • Fix TLS configuration for HTTP clients.

  • Fix bugs in DynamicBrokerSelection.

  • Fix literal type handling in LiteralValueExtractor.

  • Fix for handling NULL values appropriately during segment reload for newly derived columns.

  • Fix filtered aggregate with ordering.

  • Fix implementing a table-level lock to prevent parallel updates to the SegmentLineage ZK record and align real-time table ideal state updates with minion task locking for consistency.

  • Fix INT overflow issue for FixedByteSVMutableForwardIndex with large segment size.

  • Fix preload enablement checks to consider the preload executor and refine numMissedSegments logging to exclude unchanged segments, preventing incorrect missing segment reports.

  • Fix a bug in resource status evaluation during service startup, ensuring resources return GOOD when servers have no assigned segments, addressing issues with small tables and segment redistribution.

  • Fix RealtimeProvisioningHelperCommand to allow using just schemaFile along with sampleCompletedSegmentDir.


1.2.0

Release Notes for 1.2.0

This release comes with several Improvements and Bug Fixes for the Multistage Engine, Upserts and Compaction. There are a ton of other small features and general bug fixes.

Multistage Engine Improvements

Features

New Window Functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE #12878 #13340

  • LEAD allows you to access values after the current row in a frame.

  • LAG allows you to access values before the current row in a frame.

  • FIRST_VALUE and LAST_VALUE return the respective extremal values in the frame.

Support for Logical Database in V2 Engine #12591 #12695

  • V2 Engine now supports a "database" construct, enabling table namespace isolation within the same Pinot cluster.

  • Improves user experience when multiple users are using the same Pinot Cluster.

  • Access control policies can be set at the database level.

  • Database can be selected in a query using a SET statement, such as SET database=my_db;.

Improved Multi-Value (MV) and Array Function Support

  • Added array sum aggregation functions for point-wise array operations #13324.

  • Added support for valueIn MV transform function #13443.

  • Fixed bug in numeric casts for MV columns in filters #13425.

  • Fixed NPE in ArrayAgg when a column contains no data #13358.

  • Fixed array literal handling #13345.

Support for WITHIN GROUP Clause and ListAgg #13146

  • WITHIN GROUP Clause can be used to process rows in a given order within a group.

  • One of the most common use cases for this is the ListAgg function, which, when combined with WITHIN GROUP, can be used to concatenate strings in a given order (see the sketch after this list).
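
A minimal sketch of this pattern, using standard LISTAGG / WITHIN GROUP syntax (the orders table and its columns are hypothetical; check the function reference for the exact signature supported by Pinot):

SELECT customer_id,
       LISTAGG(item_name, ',') WITHIN GROUP (ORDER BY order_time) AS items_in_order
FROM orders
GROUP BY customer_id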

Scalar/Transform Function and Set Operation Improvements

  • Added Geospatial Scalar Function support for use in intermediate stage in the v2 query engine #13457.

  • Fix 'WEEK' transform function #13483.

  • Support EXTRACT as a scalar function #13463.

  • Added support for ALL modifier for INTERSECT and EXCEPT Set Operations #13151 #13166.

Improved Literal Handling Support

  • Fixed bug in handling literal arguments in aggregation functions like Percentile #13282.

  • Allow INT and FLOAT literals #13078.

  • Fixed literal handling for all types #13344 #13345.

  • Fixed null literal handling for null intolerant functions #13255.

Metrics Improvements

  • Added new metrics for tracking queries executed globally and at the table level #12982.

  • New metrics to track join counts and window function counts #13032.

  • Multiple meters and timers to track Multistage Engine Internals #13035.

Notable Improvements and Bug Fixes

  • Improved Window operators resiliency, with new checks to make sure the window doesn't grow too large #13180 #13428 #13441.

  • Optimized Group Key generation #12394.

  • Fixed SortedMailboxReceiveOperator to honor convention of pulling at most 1 EOS block #12406.

  • Improvement in how execution stats are handled #12517 #12704 #13136.

  • Use Protobuf instead of Reflection for Plan Serialization #13221.

Upsert Compaction and Minion Improvements

Features and Improvements

Minion Resource Isolation #12459 #12786

  • Minions now support resource isolation based on an instance tag.

  • Instance tag is configured at table level, and can be set for each task on a table.

  • This enables you to implement arbitrary resource isolation strategies, i.e. you can use a set of Minion Nodes for running any set of tasks across any set of tables.

Greedy Upsert Compaction Scheduling #12461

  • Upsert compaction now schedules segments for compaction based on the number of invalid docs.

  • This helps the compaction task to handle arbitrary temporal distribution of invalid docs.

Notable Improvements

  • Minions can now download segments from servers when deepstore copy is missing. This feature is enabled via a cluster level config allowDownloadFromServer #12960 #13247.

  • Added support for TLS Port in Minions #12943.

  • New metrics added for Minions to track segment/record processing information #12710.

Bug Fixes

  • Minions can now handle invalid instance tags in Task Configs gracefully. Prior to this change, Minions would be stuck in IN_PROGRESS state until task timeout #13092.

  • Fix bug to return validDocIDsMetadata from all servers #12431.

  • Fixed issue where upsert compaction didn't retain maxLength information and trimmed string fields #13157.

Upsert Improvements

Features and Improvements

Consistent Table View for Upsert Tables #12976

  • Adds different modes of consistency guarantees for Upsert tables.

  • Adds a new UpsertConfig called consistencyMode which can be set to NONE, SYNC, SNAPSHOT.

  • SYNC is optimized for data freshness but can lead to elevated query latencies and is best for low-qps use-cases. In this mode, the ingestion threads will take a WLock when updating validDocID bitmaps.

  • SNAPSHOT mode can handle high-qps/high-ingestion use-cases by getting the list of valid docs from a snapshot of validDocID. The snapshot can be refreshed every few seconds and the tolerance can be set via a query option upsertViewFreshnessMs.

Pluggable Partial Upsert Merger #11983

  • Partial Upsert merges the old record and the new incoming record to generate the final ingested record.

  • Pinot now allows users to customize how this merge of an old row and the new row is computed.

  • This allows a column value in the new row to be an arbitrary function of the old and the new row.

Support for Uploading Externally Partitioned Segments for Upsert Backfill #13107

  • Segments uploaded for Upsert Backfill can now explicitly specify the Kafka partition they belong to.

  • This enables backfilling an Upsert table where the externally generated segments are partitioned using an arbitrary hash function on an arbitrary primary key.

Misc Improvements and Bug Fixes

  • Fixed a Bug in Handling Equal Comparison Column Values in Upsert, which could lead to data inconsistency (#12395)

  • Upsert snapshot will now snapshot only those segments which have updates. #13285.

Notable Features

JSON Support Improvements

  • JSON Index can now be used for evaluating Regex and Range Predicates. #12568

  • jsonExtractIndex now supports contextual array filters. #12683 #12531.

  • JSON column type now supports filter predicates like =, !=, IN and NOT IN. This is convenient for scenarios where the JSON values are very small. #13283.

  • JSON_MATCH now supports exclusive predicates correctly. For instance, you can use predicates such as JSON_MATCH(person, '"$.addresses[*].country" != ''us''' to find all people who have at least one address that is not in the US. #13139.

  • jsonExtractIndex supports extracting Multi-Value JSON Fields, and also supports providing any default value when the key doesn't exist. #12748.

  • Added isJson UDF which increases your options to handle invalid JSONs. This can be used in queries and for filtering invalid json column values in ingestion. #12603.

  • Fix ArrayIndexOutOfBoundsException in jsonExtractIndex. #13479.

Lucene and Text Search Improvements

  • Improved Segment Build Time for Lucene Text Index by 40-60%. This improvement is realized when a consuming segment commits and changes to an ImmutableSegment. This significantly helps in lowering ingestion lag at commit time due to a large text index #12744 #13094 #13050.

  • Phrase Search can run 3x faster when the Lucene Index Config enablePrefixSuffixMatchingInPhraseQueries is set to true. This is achieved by rewriting phrase search query to a wildcard and prefix matching query #12680.

  • Fixed bug in TextMatchFilterOptimizer that was not applying precedence to the filter expressions properly, which could lead to incorrect results. #13009.

  • Fixed bug in handling NOT text_match which could have returned incorrect results. #12372.

  • Added SchemaConformingTranformerV2 to enhance text search abilities. #12788.

  • Added metrics to track Lucene NRT Refresh Delay #13307.

  • Switched to NRTCachingDirectory for Realtime segments and prevented duplicates in the Realtime Lucene Index to avoid IndexOutOfBounds query time exceptions. #13308.

  • Lucene Version is upgraded to 9.11.1. #13505.

New Funnel Functions #13176 #13231 #13228

  • Added funnelMaxStep function which can be used to calculate max funnel steps for a given sliding window.

  • Added funnelCompleteCount to calculate the number of completed funnels, and funnelMatchStep to get the funnel match array.

Support for Interning for OnHeapByteDictionary #12342

  • This can reduce the heap usage of a dictionary encoded byte column, for a certain distribution of duplicate values. See #12223 for details.

Column Major Builder On By Default for New Tables #12770

  • Prior to this feature, on a segment commit, Pinot would convert all the columnar data from the Mutable Segment to row-major, and then re-build column major Immutable Segments.

  • This feature skips the row-major conversion and is expected to be both space and time efficient.

  • It can help lower ingestion lag from segment commits, especially helpful when your segments are large.

Support for SQL Formatting in Query Editor #11725

  • You can now prettify SQL right in the Controller UI!

Hash Function for UUID Primary Keys #12538

  • Added a new lossless hash-function for Upsert Primary Keys optimized for UUIDs.

  • The hash function can reduce Old Gen by up to 30%.

  • It maps a UUID to a 16 byte array, vs encoding it in a UTF string which would take 36 bytes.

Column Level Index Skip Query Option #12414

  • Convenient for debugging impact of indexes on query performance or results.

  • You can add the skipIndexes option to your query to skip any number of indexes. e.g. SET skipIndexes=inverted,range;

New UDFs and Scalar Functions

  • New GeoHash functions: encodeGeoHash, decodeGeoHash, decodeGeoHashLatitude and decodeGeoHashLongitude.

  • dateBin can be used to align a timestamp to the nearest time bucket.

  • prefixes, suffixes and uniqueNgrams UDFs for generating all respective string subsequences from a string input. #12392.

  • Added isJson UDF which increases your options to handle invalid JSONs. This can be used in queries and for filtering invalid json column values in ingestion. #12603.

  • splitPart UDF has minor improvements. #12437.

CLP Compression Codec in Forward Indexes #12504

  • CLP is a compressed log processor which has really high compression ratio for certain log types.

  • To enable this, you can set the compressionCodec in the fieldConfigList of the column you want to target.

Misc. Improvements

  • Enable segment preloading at partition level #12451.

  • Use Temurin instead of AdoptOpenJdk #12533

  • Adding record reader config/context param to record transformer #12520

  • Removing legacy commons-lang dependency #13480

  • 12508: Feature add segment rows flush config #12681

  • ADSS Race Condition and update to client error codes #13104

  • Add ExceptionMapper to convert Exception to Response Object for Broker REST API's #13292

  • Add FunnelMaxStepAggregationFunction and FunnelCompleteCountAggregationFunction #13231

  • Add GZIP Compression Codec (#11434) #12668

  • Add PodDisruptionBudgets to the Pinot Helm chart #13153

  • Add Postgres compliant name aliasing for String Functions. #12795

  • Add SchemaConformingTransformerV2 to enhance text search abilities #12788

  • Add a benchmark to measure multi-stage block serde cost #13336

  • Add a plan version field to QueryRequest Protobuf Message #13267

  • Add a post-validator visitor that verifies there are no cast to bytes #12475

  • Add a safe version of CLStaticHttpHandler that disallows path traversal. #13124

  • Add ability to track filtered messages offset #12602

  • Add back 'numRowsResultSet' to BrokerResponse, and retain it when result table is hidden #13198

  • Add back profile for shade #12979

  • Add back some exclude deps from hadoop-mapreduce-client-core #12638

  • Add backward compatibility regression test suite for multi-stage query engine #13193

  • Add base class for custom object accumulator #12685

  • Add clickstream example table for funnel analysis #13379

  • Add config option for timezone #12386

  • Add config to skip record ingestion on string column length exceeding configured max schema length #13103

  • Add controller API to get allLiveInstances #12498

  • Add isJson UDF #12603

  • Add list of collaborators to asf.yaml #13346

  • Add locking logic to get consistent table view for upsert tables #12976

  • Add metric to track number of segments missed in upsert-snapshot #12581

  • Add metrics for SEGMENTS_WITH_LESS_REPLICAS monitoring #12336

  • Add mode to allow adding dummy events for non-matching steps #13382

  • Add offset based lag metrics #13298

  • Add protobuf codegen decoder #12980

  • Add retry policy to wait for job id to persist during rebalancing #13372

  • Add round-robin logic during downloadSegmentFromPeer #12353

  • Add schema as input to the decoder. #12981

  • Add splitPartWithLimit and splitPartFromEnd UDFs #12437

  • Add support for creating raw derived columns during segment reload #13037

  • Add support for raw JSON filter predicates #13283

  • Add the possibility of configuring ForwardIndexes with compressionCodec #12218

  • Add upsert-snapshot timer metric #12383

  • Add validation check for forward index disabled if it's a REALTIME table #12838

  • Added PR compatibility test against release 1.1.0 #12921

  • Added kafka partition number to metadata. #13447

  • Added pinot-error-code header in query response #12338

  • Added tests for additional data types in SegmentPreProcessorTest.java #12755

  • Adding a cluster config to enable instance pool and replica group configuration in table config #13131

  • Adding batch api support for WindowFunction #12993

  • Adding bytes string data type integration tests #12387

  • Adding registerExtraComponents to allow registering additional components in various services #13465

  • Adding support of insecure TLS #12416

  • Adding support to insecure TLS when creating SSLFactory #12425

  • Adds AGGREGATE_CASE_TO_FILTER rule #12643

  • Adds per-column, query-time index skip option #12414

  • Allow Aggregations in Case Expressions #12613

  • Allow PinotHelixResourceManager subclasses to be used in the controller starter by providing an overridable PinotHelixResourceManager object creator function #13495

  • Allow RequestContext to consider http-headers case-insensitivity #13169

  • Allow Server throttling just before executing queries on server to allow max CPU and disk utilization #12930

  • Allow all raw index config in star-tree index #13225

  • Allow applying both environment variables and system properties to user and table configs; environment variables take precedence over system properties #13011

  • Allow configurable queryWorkerThreads in Pinot server side GrpcQueryServer #13404

  • Allow dynamically setting the log level even for loggers that aren't already explicitly configured #13156

  • Allow passing custom record reader to be inited/closed in SegmentProcessorFramework #12529

  • Allow passing database context through database http header #12417

  • Allow stop to interrupt the consumer thread and safely release the resource #13418

  • Allow user configurable regex library for queries #13005

  • Allow using 'serverReturnFinalResult' to optimize server partitioned table #13208

  • Assign default value to newly added derived column upon reload #12648

  • Avoid port conflict in integration tests #13390

  • Better handling of null tableNames #12654

  • CLP as a compressionCodec #12504

  • Change helm app version to 1.0.0 for Apache Pinot latest release version #12436

  • Clean Google Dependencies #13297

  • Clean up BrokerRequestHandler and BrokerResponse #13179

  • Clean up arbitrary sleep in /GrpcBrokerClusterIntegrationTest #12379

  • Cleaning up vector index comments and exceptions #13150

  • Cleanup HTTP components dependencies and upgrade Thrift #12905

  • Cleanup Javax and Jakarta dependencies #12760

  • Cleanup deprecated query options #13040

  • Cleanup the consumer interfaces and legacy code #12697

  • Cleanup unnecessary dependencies under pinot-s3 #12904

  • Cleanup unused aggregate internal hint #13295

  • Consistency in API response for live broker #12201

  • Consolidate bouncycastle libraries #12706

  • Consolidate nimbus-jose-jwt version to 9.37.3 #12609

  • ControllerRequestClient accepts headers. Useful for authN tests #13481

  • Custom configuration property reader for segment metadata files #12440

  • Delete database API #12765

  • Deprecate PinotHelixResourceManager#getAllTables() in favour of getAllTables(String databaseName) #12782

  • Detect expired messages in Kafka. Log and set a gauge. #12608

  • Do not hard code resource class in BaseClusterIntegrationTest #13400

  • Do not pause ingestion when upsert snapshot flow errors out #13257

  • Don't drop original field during flatten #13490

  • Don't enforce -realTimeInstanceCount and -offlineInstanceCount options when creating broker tenants #13236

  • Egalpin/skip indexes minor changes #12514

  • Emit Metrics for Broker Adaptive Server Selector type #12482

  • Emit table size related metrics only in lead controller #12747

  • Enable complexType handling in SegmentProcessFramework #12942

  • Enable more integration tests to run on the v2 multi-stage query engine #13467

  • Enabling avroParquet to read Int96 as bytes #12484

  • Enhance Kinesis consumer #12806

  • Enhance Parquet Test #13082

  • Enhance ProtoSerializationUtils to handle class move #12946

  • Enhance Pulsar consumer #12812

  • Enhance PulsarConsumerTest #12948

  • Enhance commit threshold to accept size threshold without setting rows to 0 #12684

  • Enhance json index to support regexp and range predicate evaluation #12568

  • Enhancement: Sketch value aggregator performance #13020

  • Ensure FieldConfig.getEncodingType() is never null #12430

  • Ensure all the lists used in PinotQuery are ArrayList #13017

  • Ensure brokerId and requestId are always set in BrokerResponse #13200

  • Enter segment preloading at partition level #12451

  • Exclude dimensions from star-tree index stored type check #13355

  • Expose more helper API in TableDataManager #13147

  • Extend compatibility verifier operation timeout from 1m to 2m to reduce flakiness #13338

  • Extract json individual array elements from json index for the transform function jsonExtractIndex #12466

  • Fetch query quota capacity utilization rate metric in a callback function #12767

  • First with time #12235

  • GitHub Actions checkout v4 #12550

  • Gzip compression, ensure uncompressed size can be calculated from compressed buffer #12802

  • Handle errors gracefully during multi-stage stats collection in the broker #13496

  • Handle shaded classes in all methods of kafka factory #13087

  • Hash Function for UUID Primary Keys #12538

  • Ignore case when checking for Direct Memory OOM #12657

  • Improve Retention Manager Segment Lineage Clean Up #13232

  • Improve error message for max rows in join limit breach #13394

  • Improve exception logging when we fail to index / transform message #12594

  • Improve logging in range index handler for index updates #13381

  • Improve upsert compaction threshold validations #13424

  • Improve warn logs for requesting validDocID snapshots #13280

  • Improved metrics for server grpc query #13177

  • Improved null check for varargs #12673

  • Improved segment build time for Lucene text index realtime to offline conversion #12744

  • In ClusterTest, make start port higher to avoid potential conflict with Kafka #13402

  • Introduce PinotLogicalAggregate and remove internal hint #13291

  • Introduce retries while creating stream message decoder for more robustness #13036

  • Isolate bad server configs during broker startup phase #12931

  • Issue #12367 #12922

  • Json extract index filter support #12683

  • Json extract index mv #12532

  • Keep get tables API with and without database #12804

  • Lint failure #12294

  • Logging a warn message instead of throwing exception #12546

  • Made the error message around dimension table size clearer #13163

  • Make Helix state transition handling idempotent #12886

  • Make KafkaConsumerFactory method less restrictive to avoid incompatibility #12815

  • Make task manager APIs database aware #12766

  • Metric for count of tables configured with various tier backends #12940

  • Metric for upsert tables count #12505

  • Metrics for Realtime Rows Fetched and Stream Consumer Create Exceptions #12522

  • Minmaxrange null #12252

  • Modify consumingSegmentsInfo endpoint to indicate how many servers failed #12523

  • Move offset validation logic to consumer classes #13015

  • Move package org.apache.calcite to org.apache.pinot.calcite #12837

  • Move resolveComparisonTies from addOrReplaceSegment to base class #13396

  • Move some mispositioned tests under pinot-core #12884

  • Move wildfly-openssl dependency management to root pom #12597

  • Moving deleteSegment call from POST to DELETE call #12663

  • Optimize unnecessary extra array allocation and conversion for raw derived column during segment reload #13115

  • Pass explicit TypeRef when evaluating MV jsonPath #12524

  • Percentile operations supporting null #12271

  • Prepare for next development iteration #12530

  • Propagate Disable User Agent Config to Http Client #12479

  • Properly handle complex type transformer in segment processor framework #13258

  • Properly return response if SegmentCompletion is aborted #13206

  • Publish helm 0.2.8 #12465

  • Publish helm 0.2.9 #13230

  • Pull janino dependency to root pom #12724

  • Pull pulsar version definition into root POM #13002

  • Query response opt #13420

  • Re-enable the Spotless plugin for Java 21 #12992

  • Readme - How to setup Pinot UI for development #12408

  • Record enricher #12243

  • Refactor PinotTaskManager class #12964

  • Refactored CommonsConfigurationUtils for loading properties configuration. #13201

  • Refactored compatibility-verifier module #13359

  • Refactoring removeSegment flow in upsert #13449

  • Refine PeerServerSegmentFinder #12933

  • Refine SegmentFetcherFactory #12936

  • Replace custom fmpp plugin with fmpp-maven-plugin #12737

  • Reposition query submission spot for adaptive server selection #13327

  • Reset controller port when stopping the controller in ControllerTest #13399

  • Rest Endpoint to Create ZNode #12497

  • Return clear error message when no common broker found for multi-stage query with tables from different tenants #13235

  • Returning tables names failing authorization in Exception of Multi State Engine Queries #13195

  • Revert " Adding record reader config/context param to record transformer (#12520)" #12526

  • Revert "Using local copy of segment instead of downloading from remote (#12863)" #13114

  • Short circuit SubPlanFragmenter because we don't support multiple sub-plans yet #13306

  • Simplify Google dependencies by importing BOM #12456

  • Specify version for commons-validator #12935

  • Support NOT in StarTree Index #12988

  • Support empty strings as json nodes #12555

  • Supporting human-readable format when configuring broker response size #12510

  • Use ArrayList instead of LinkedList in SortOperator #12783

  • Use a two server setup for multi-stage query engine backward compatibility regression test suite #13371

  • Use more efficient variants of URLEncoder::encode and URLDecoder::decode #13030

  • Use parameterized log messages instead of string concatenation #13145

  • Use separate action for /tasks/scheduler/jobDetails API #13054

  • Use try-with-resources to close file walk stream in LocalPinotFS #13029

  • Using local copy of segment instead of downloading from remote #12863

  • [Adaptive Server Selector] Add metrics for Stats Manager Queue Size #12340

  • [Cleanup] Move classes in pinot-common to the correct package #13478

  • [Feature] Add Support for SQL Formatting in Query Editor #11725

  • [HELM]: Added additional probes options and startup probe. #13165

  • [HELM]: Added checksum config annotation in stateful set for broker, controller and server #13059

  • [HELM]: Added namespace support in K8s deployment. #13380

  • [HELM]: zookeeper chart upgrade to version 13.2.0 #13083

  • [Minor] Add Nullable annotation to HttpHeaders in BrokerRequestHandler #12816

  • [Minor] Small refactor of raw index creator constructor to be more clear #13093

  • [Multi-stage] Clean up RelNode to Operator handling #13325

  • [null-aggr] Add null handling support in mode aggregation #12227

  • [partial-upsert] configure early release of _partitionGroupConsumerSemaphore in RealtimeSegmentDataManager #13256

  • [spark-connector] Add option to fail read when there are invalid segments #13080

  • add Netty arm64 dependencies #12493

  • add Netty unit test #12486

  • add SegmentContext to collect validDocIds bitmaps for many segments together #12694

  • add skipUnavailableServers query option #13387

  • add insecure mode when Pinot uses TLS connections #12525

  • add instrumentation to json index getMatchingFlattenedDocsMap() #13164

  • add jmx to prometheus metric exporting rule for realtimeRowsFiltered #12759

  • add metrics for IdealState update #13266

  • add some metrics for upsert table preloading #12722

  • add some tests on jsonPathString #12954

  • add test cases in RequestUtilsTest #12557

  • add unit test for JsonAsyncHttpPinotClientTransport #12633

  • add unit test for QueryServer #12599

  • add unit test for ServerChannels #12616

  • add unit test for StringFunctions encodeUrl #13391

  • add unit tests for pinot-jdbc-client #13137

  • add url assertion to SegmentCompletionProtocolTest #13373

  • adjust the llc partition consuming metric reporting logic #12627

  • allow passing null http headers object to translateTableName #12764

  • allow to set segment when use SegmentProcessorFramework #13341

  • auto renew jvm default sslcontext when it's loaded from files #12462

  • avoid useless intermediate byte array allocation for VarChunkV4Reader's getStringMV #12978

  • aws sdk 2.25.3 #12562

  • build-helper-maven-plugin 3.5.0 #12548

  • cache ssl contexts and reuse them #12404

  • clean up jetbrain nullable annotation #13427

  • cleanup: maven no transfer progress #12444

  • close JDBC connections #12494

  • do not fail on duplicate relaxed vars #13214

  • dropwizard metrics 4.2.25 #12600

  • dynamic chunk sizing for v4 raw forward index #12945

  • enable Netty leak detection #12483

  • enable parallel Maven in pinot linter script #12751

  • ensure inverse And/OrFilterOperator implementations match the query #13199

  • exclude .mvn directory from source assembly #12558

  • extend CompactedPinotSegmentRecordReader so that it can skip deleteRecord #13352

  • get startTime outside the executor task to avoid flaky time checks #13250

  • handle absent segments so that catchup checker doesn't get stuck on them #12883

  • handle overflow for MutableOffHeapByteArrayStore buffer starting size #13215

  • handle segments not tracked by partition mgr and add skipUpsertView query option #13415

  • handle table name translation on missed api resources #12792

  • hash4j version upgrade to 0.17.0 #12968

  • including the underlying exception in the logging output #13248

  • int96 parity with native parquet reader #12496

  • jsonExtractIndex support array of default values #12748

  • log the log rate limiter rate for dropped broker logs #13041

  • make http listener ssl config swappable #12455

  • make reflection calls compatible with 0.9.11 #12958

  • maven: no transfer progress #12528

  • missed to delete the temp dir #12637

  • move shouldReplaceOnComparisonTie to base class to be more reusable #13353

  • reduce Java enum .values() usage in TimerContext #12579

  • reduce logging for SpecialValueTransformer #12970

  • reduce regex pattern compilation in Pinot jdbc #13138

  • refactor TlsUtils class #12515

  • refine when to registerSegment while doing addSegment and replaceSegment for upsert tables for better data consistency #12709

  • reformat AdminConsoleIntegrationTest.java #12552

  • reformat ClusterTest.java #12531

  • release segment mgrs more reliably #13216

  • replaced getServer with getServers #12545

  • report rebalance job status for the early returns like noops #13281

  • require noDictionaryColumns with aggregationConfigs #12464

  • share the same table config object #12463

  • track segments for snapshotting even if they lost all comparisons #13388

  • untrack the segment out of TTL #12449

  • update ControllerJobType from enum to string #12518

  • update RewriterConstants so that expr min max would not collide with columns start with "parent" #13357

  • update access control check error handling to catch throwable and log errors #13209

Bug Fixes

  • Use gte(lte) to replace between() which has a bug #12595

  • Fix the ConcurrentModificationException for And/Or DocIdSet #12611

  • Upgrade RoaringBitmap to 1.0.5 to pick up the fix for RangeBitmap.between() #12604

  • bugfix: do not move src ByteBuffer position for LZ4 length prefixed decompress #12539

  • Bug Fix createDictionaryForColumn does not take into account inverted index #13048

  • fix Cluster Manager error #12632

  • fix for quick start Cluster Manager issue #12610

  • Adding config for having suffix for client ID for realtime consumer #13168

  • Addressed comments and fixed tests from pull request 12389. /uptime and /start-time endpoints working all components #12512

  • Bugfix. Added missing paramName #13060

  • Bug fix: Do not ignore scheme property #12332

  • Bug fix: Handle missing shade config overwrites for Kafka #13437

  • BugFix: Fix merge result from more than one server #12778

  • Bugfix. Allow tenant rebalance with downtime as true #13246

  • Bugfix. Avoid passing null table name input to translation util #12726

  • Bugfix. Correct wrong method call from scheduleTask() to scheduleTaskForDatabase() #12791

  • Bugfix. Maintain literal data type during function evaluation #12607

  • Cleanup: Fix grammar in error message, also improve readability. #13451

  • Fix Bug in Handling Equal Comparison Column Values in Upsert #12395

  • Fix ColumnMinMaxValueGenerator #12502

  • Fix JavaEE related dependencies #13058

  • Fix Logging Location for CPU-Based Query Killing #13318

  • Fix PulsarUtils to not share buffer #12671

  • Fix URI construction so that AddSchema command line tool works when override flag is set to true #13320

  • Fix [Type]ArrayList elements() method usage #13354

  • Fix a typo when calculating query freshness #12947

  • Fix an overflow in PinotDataBuffer.readFrom #13152

  • Fix bug in logging in UpsertCompaction task #12419

  • Fix bug to return validDocIDsMetadata from all servers #12431

  • Fix connection issues if using JDBC and Hikari (#12267) #12411

  • Fix controller host / port / protocol CLI option description for admin commands #13237

  • Fix environment variables not applied when creating table #12560

  • Fix error message for insufficient number of untagged brokers during tenant creation #13234

  • Fix few metric rules which were affected by the database prefix handling #13290

  • Fix file handle leaks in Pinot Driver (apache#12263) #12356

  • Fix flakiness of ControllerPeriodicTasksIntegrationTest #13337

  • Fix issue with startree index metadata loading for columns with '__' in name #12554

  • Fix metric rule pattern regex #12856

  • Fix pinot-parquet NoClassFound issue #12615

  • Fix segment size check in OfflineClusterIntegrationTest #13389

  • Fix some resource leak in tests #12794

  • Fix the NPE from IS update metrics #13313

  • Fix the NPE when metadataTTL is enabled without delete column #13262

  • Fix the ServletConfig loading issue with swagger. #13122

  • Fix the issue that map flatten shouldn't remove the map field from the record #13243

  • Fix the race condition for H3InclusionIndexFilterOperator #12487

  • Fix the time segment pruner on TIMESTAMP data type #12789

  • Fix time stats in SegmentIndexCreationDriverImpl #13429

  • Fixed infer logical type name from avro union schema #13224

  • Fixing instance type to resolve #12677 and #12678

  • Helm: bug fix for chart rendering issue. #13264

  • Try to amend kafka common package with pinot shaded package prefix #13056

  • Update getValidDocIdsMetadataFromServer to make call in batches to servers and other bug fixes #13314

  • Upgrade com.microsoft.azure:msal4j from 1.3.5 to 1.3.10 for CVE fixing #12580

  • [bugfix] Handling null value for kafka client id suffix #13279

  • bugfix: fixing jdbc client sql feature not supported exception #12480

  • bugfix: re-add support for not text_match #12372

  • bugfix: reduce enum array allocation in QueryLogger #12478

  • bugfix: use consumerDir during lucene realtime segment conversion #13094

  • cleanup: fix apache rat violation #12476

  • fix GuavaRateLimiter acquire method #12500

  • fix fieldsToRead class not in decoder #13186

  • fix flakey test, avoid early finalization #13095

  • fix merging null multi value in partial upsert #13031

  • fix race condition in ScalingThreadPoolExecutor #13360

  • fix shared buffer, tests #12587

  • fix(build): update node version to 16 #12924

  • fixing CVE critical issues by resolving kerby/jline and wildfly libraries #12566

  • fixing pinot-adls high severity CVEs #12571

  • fixing swagger setup using localhost as host name #13254

  • swagger-ui upgrade to 5.15.0 Fixes #12908

  • upgrade jettison version to fix CVE #12567

1.0.0

This page covers the latest changes included in the Apache Pinot™ 1.0.0 release, including new features, enhancements, and bug fixes.

1.0.0 (2023-09-19)

This release includes several new features, enhancements, and bug fixes, including the following highlights:

  • Multi-stage query engine: new features, enhancements, and bug fixes. Learn how to enable and use the multi-stage query engine, or read more about how it works.

Multi-stage query engine new features

  • Support for window functions

    • Initial (phase 1) Query runtime for window functions with ORDER BY within the OVER() clause (#10449)

    • Support for the ranking ROW_NUMBER() window function (#10527, #10587)

  • Set operations support:

    • Support SetOperations (UNION, INTERSECT, MINUS) compilation in query planner (#10535)

  • Timestamp and Date Operations

  • Support TIMESTAMP type and date ops functions (#11350)

  • Aggregate functions

    • Support more aggregation functions that are currently implementable (#11208)

    • Support multi-value aggregation functions (#11216)

  • Support Sketch based functions (#11153), (#11517)

  • Make Intermediate Stage Worker Assignment Tenant Aware (#10617)

  • Evaluate literal expressions during query parsing, enabling more efficient query execution (#11438)

  • Added support for partition parallelism in partitioned table scans, allowing for more efficient data retrieval (#11266)

  • [multistage] Adding more tuple sketch scalar functions and integration tests (#11517)

Multi-stage query engine enhancements

  • Turn on v2 engine by default (#10543)

  • Introduced the ability to stream leaf stage blocks for more efficient data processing (#11472).

  • Early terminate SortOperator if there is a limit (#11334)

  • Implement ordering for SortExchange (#10408)

  • Table level Access Validation, QPS Quota, Phase Metrics for multistage queries (#10534)

  • Support partition based leaf stage processing (#11234)

  • Populate queryOption down to leaf (#10626)

  • Pushdown explain plan queries from the controller to the broker (#10505)

  • Enhanced the multi-stage group-by executor to support limiting the number of groups, improving query performance and resource utilization (#11424).

  • Improved resilience and reliability of the multi-stage join operator, now with added support for hash join right table protection (#11401).

Multi-stage query engine bug fixes

  • Fix Predicate Pushdown by Using Rule Collection (#10409)

  • Try fixing mailbox cancel race condition (#10432)

  • Catch Throwable to Propagate Proper Error Message (#10438)

  • Fix tenant detection issues (#10546)

  • Handle Integer.MIN_VALUE in hashCode based FieldSelectionKeySelector (#10596)

  • Improve error message in case of non-existent table queried from the controller (#10599)

  • Derive SUM return type to be PostgreSQL compatible (#11151)

Index SPI

  • Adds the ability to include new index types at runtime in Apache Pinot. This makes it possible to add third-party indexes, including proprietary indexes. More details here.

Null value support for pinot queries

  • NULL support for ORDER BY, DISTINCT, GROUP BY, value transform functions and filtering.
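
A minimal sketch of a query that uses this, assuming null handling is enabled via the enableNullHandling query option referenced elsewhere in these notes; the table and column names are hypothetical:

SET enableNullHandling = true;
SELECT playerName, score
FROM baseballStats
WHERE score IS NOT NULL
ORDER BY score DESC NULLS LAST
LIMIT 10;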

Upsert enhancements

Delete support in upsert enabled tables (#10703)

Support added to extend upserts and allow deleting records from a realtime table. The design details can be found here.

Preload segments with upsert snapshots to speedup table loading (#11020)

Adds a feature to preload segments from a table that uses the upsert snapshot feature. The segments with validDocIds snapshots can be preloaded in a more efficient manner to speed up table loading (and thus server restarts).

TTL configs for upsert primary keys (#10915)

Adds support for specifying expiry TTL for upsert primary key metadata cleanup.
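
A minimal sketch of what this might look like in the table config, assuming the TTL is specified through the metadataTTL field of the upsertConfig section; the surrounding fields and the value shown are illustrative, not prescriptive:

"upsertConfig": {
  "mode": "FULL",
  "metadataTTL": 86400
}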

Segment compaction for upsert real-time tables (#10463)

Adds a new minion task to compact segments belonging to a real-time table with upserts.

Pinot Spark Connector for Spark3 (#10394)

  • Added spark3 support for Pinot Spark Connector (#10394)

  • Also added support to pass pinot query options to spark connector (#10443)

PinotDataBufferFactory and new PinotDataBuffer implementations (#10528)

Adds new implementations of PinotDataBuffer that use Unsafe Java APIs and foreign memory APIs. Also added support for PinotDataBufferFactory to allow plugging in custom PinotDataBuffer implementations.

Query functions enhancements

  • Add PercentileKLL aggregation function (#10643)

  • Support for ARG_MIN and ARG_MAX Functions (#10636)

  • refactor argmin/max to exprmin/max and make it calcite compliant (#11296)

  • Integer Tuple Sketch support (#10427)

  • Adding vector scalar functions (#11222)

  • [feature] multi-value datetime transform variants (#10841)

  • FUNNEL_COUNT Aggregation Function (#10867)

  • [multistage] Add support for RANK and DENSE_RANK ranking window functions (#10700)

  • add theta sketch scalar (#11153)

  • Register dateTimeConverter,timeConvert,dateTrunc, regexpReplace to v2 functions (#11097)

  • Add extract(quarter/dow/doy) support (#11388)

  • Funnel Count - Multiple Strategies (no partitioning requisites) (#11092)

  • Add Boolean assertion transform functions. (#11547)

JSON and CLP encoded message ingestion and querying

  • Add clpDecode transform function for decoding CLP-encoded fields. (#10885)

  • Add CLPDecodeRewriter to make it easier to call clpDecode with a column-group name rather than the individual columns. (#11006)

  • Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. (#11210)

Tier level index config override (#10553)

  • Allows overriding index configs at tier level, allowing for more flexible index configurations for different tiers.

Ingestion connectors and features

  • Kinesis stream header extraction (#9713)

  • Extract record keys, headers and metadata from Pulsar sources (#10995)

  • Realtime pre-aggregation for Distinct Count HLL & Big Decimal (#10926)

  • Added support to skip unparseable records in the csv record reader (#11487)

  • Null support for protobuf ingestion. (#11553)

UI enhancements

  • Adds persistence of authentication details in the browser session. This means that even if you refresh the app, you will still be logged in until the authentication session expires (#10389)

  • AuthProvider logic updated to decode the access token and extract user name and email. This information will now be available in the app for features to consume. (#10925)

Pinot docker image improvements and enhancements

  • Make Pinot base build and runtime images support Amazon Corretto and MS OpenJDK (#10422)

  • Support multi-arch pinot docker image (#10429)

  • Update dockerfile with recent jdk distro changes (#10963)

Operational improvements

Rebalance

  • Rebalance status API (#10359)

  • Tenant-level rebalance API: tenant rebalance and status tracking APIs (#11128)

Config to use customized broker query thread pool (#10614)

Added the new configuration options below, which allow the use of a bounded thread pool and let you allocate capacity for it.

pinot.broker.enable.bounded.http.async.executor
pinot.broker.http.async.executor.max.pool.size
pinot.broker.http.async.executor.core.pool.size
pinot.broker.http.async.executor.queue.size

This feature allows better management of broker resources.
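
A minimal sketch of how these properties might appear in the broker configuration; the values are illustrative assumptions, not recommendations:

pinot.broker.enable.bounded.http.async.executor=true
pinot.broker.http.async.executor.max.pool.size=40
pinot.broker.http.async.executor.core.pool.size=20
pinot.broker.http.async.executor.queue.size=1000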

Drop results support (#10419)

Adds a parameter to queryOptions to drop the resultTable from the response. This mode can be used to troubleshoot a query (which may have sensitive data in the result) using metadata only.
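
A sketch of how this might be used, assuming the query option is named dropResults (treat the option name as an assumption and verify it against the query options documentation):

SET dropResults = true;
SELECT playerName, count(*) FROM baseballStats GROUP BY playerName;

With the option set, only the response metadata (query stats such as the number of documents scanned) is returned and the resultTable is omitted.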

Make column order deterministic in segment (#10468)

In segment metadata and the index map, store columns in alphabetical order so that the result is deterministic. Segments generated before/after this PR will have different CRCs, so during the upgrade, we might get segments with different CRCs from old and new consuming servers. For segments consumed during the upgrade, some downloads might be needed.

Allow configuring helix timeouts for EV dropped in Instance manager (#10510)

Adds options to configure Helix timeouts:

  • external.view.dropped.max.wait.ms - The duration of time in milliseconds to wait for the external view to be dropped. Default: 20 minutes.

  • external.view.check.interval.ms - The period in milliseconds at which to ping ZK for the latest EV state.
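
For example, these could be set alongside other cluster configuration properties; 1200000 ms matches the 20-minute default mentioned above, while the check interval value is an illustrative assumption:

external.view.dropped.max.wait.ms=1200000
external.view.check.interval.ms=10000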

Enable case insensitivity by default (#10771)

This PR makes Pinot case insensitive by default and removes the deprecated property enable.case.insensitive.pql.

Newly added APIs and client methods

  • Add Server API to get tenant pools (#11273)

  • Add new broker query point for querying multi-stage engine (#11341)

  • Add a new controller endpoint for segment deletion with a time window (#10758)

  • New API to get tenant tags (#10937)

  • Instance retag validation check api (#11077)

  • Use PUT request to enable/disable table/instance (#11109)

  • Update the pinot tenants tables api to support returning broker tagged tables (#11184)

  • Add requestId for BrokerResponse in pinot-broker and java-client (#10943)

  • Provide results in CompletableFuture for java clients and expose metrics (#10326)

Cleanup and backward incompatible changes

High level consumers are no longer supported

  • Cleanup HLC code (#11326)

  • Remove support for High level consumers in Apache Pinot (#11017)

Type information preservation of query literals

  • [feature] [backward-incompat] [null support # 2] Preserve null literal information in literal context and literal transform (#10380). String versions of numerical values are no longer accepted; for example, "123" won't be treated as a numeric value anymore.
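
For example, against a numeric column (intCol and myTable are hypothetical), the first filter below behaves as before, while the second now compares against a STRING literal rather than being silently coerced to a number:

SELECT * FROM myTable WHERE intCol = 123;
SELECT * FROM myTable WHERE intCol = '123';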

Controller job status ZNode path update

  • Moving Zk updates for reload, force_commit to their own Znodes which … (#10451) The status of previously completed reload jobs will not be available after this change is deployed.

Metric names for mutable indexes to change

  • Implement mutable index using index SPI (#10687) Due to a change in the IndexType enum used for some logs and metrics in mutable indexes, the metric names may change slightly.

Update in controller API to enable / disable / drop instances

  • Update getTenantInstances call for controller and separate POST operations on it (#10993)

Change in substring query function definition

  • Change substring to comply with standard sql definition (#11502)

Full list of features added

  • Allow queries on multiple tables of same tenant to be executed from controller UI #10336

  • Encapsulate changes in IndexLoadingConfig and SegmentGeneratorConfig #10352

  • [Index SPI] IndexType (#10191)

  • Simplify filtered aggregate transform operator creation (#10410)

  • Introduce BaseProjectOperator and ValueBlock (#10405)

  • Add support to create realtime segment in local (#10433)

  • Refactor: Pass context instead on individual arguments to operator (#10413)

  • Add "processAll" mode for MergeRollupTask (#10387)

  • Upgrade h2 version from 1.x to 2.x (#10456)

  • Added optional force param to the table configs update API (#10441)

  • Enhance broker reduce to handle different column names from server response (#10454)

  • Adding fields to enable/disable dictionary optimization. (#10484)

  • Remove converted H2 type NUMERIC(200, 100) from BIG_DECIMAL (#10483)

  • Add JOIN support to PinotQuery (#10421)

  • Add testng on verifier (#10491)

  • Clean up temp consuming segment files during server start (#10489)

  • make pinot k8s sts and deployment start command configurable (#10509)

  • Fix Bottleneck for Server Bootstrap by Making maxConnsPerRoute Configurable (#10487)

  • Type match between resultType and function's dataType (#10472)

  • create segment zk metadata cache (#10455)

  • Allow ValueBlock length to increase in TransformFunction (#10515)

  • Allow configuring helix timeouts for EV dropped in Instance manager (#10510)

  • Enhance error reporting (#10531)

  • Combine "GET /segments" API & "GET /segments/{tableName}/select" (#10412)

  • Exposed the CSV header map as part of CSVRecordReader (#10542)

  • Moving Zk updates for reload,force_commit to their own Znodes which will spread out Zk write load across jobTypes (#10451)

  • Enabling dictionary override optimization on the segment reload path as well. (#10557)

  • Make broker's rest resource packages configurable (#10588)

  • Check EV not exist before allowing creating the table (#10593)

  • Adding a parameter (toSegments) to the endSegmentReplacement API (#10630)

  • update target tier for segments if tierConfigs is provided (#10642)

  • Add support for custom compression factor for Percentile TDigest aggregation functions (#10649)

  • Utility to convert table config into updated format (#10623)

  • Segment lifecycle event listener support (#10536)

  • Add server metrics to capture gRPC activity (#10678)

  • Separate and parallelize BloomFilter based segment pruner (#10660)

  • API to expose the contract/rules imposed by pinot on tableConfig #10655

  • Add description field to metrics in Pinot (#10744)

  • changing the dedup store to become pluggable #10639

  • Make the TimeUnit in the DATETRUNC function case insensitive. (#10750)

  • [feature] Consider tierConfigs when assigning new offline segment #10746

  • Compress idealstate according to estimated size #10766

  • 10689: Update for pinot helm release version 0.2.7 (#10723)

  • Fail the query if a filter's rhs contains NULL. (#11188)

  • Support Off Heap for Native Text Indices (#10842)

  • refine segment reload executor to avoid creating threads unbounded #10837

  • compress nullvector bitmap upon seal (#10852)

  • Enable case insensitivity by default (#10771)

  • Push out-of-order events metrics for full upsert (#10944)

  • [feature] add requestId for BrokerResponse in pinot-broker and java-client #10943

  • Provide results in CompletableFuture for java clients and expose metrics #10326

  • Add minion observability for segment upload/download failures (#10978)

  • Enhance early terminate for combine operator (#10988)

  • Add fromController method that accepts a PinotClientTransport (#11013)

  • Ensure min/max value generation in the segment metadata. (#10891)

  • Apply some allocation optimizations on GrpcSendingMailbox (#11015)

  • When case-insensitivity is enabled, don't allow adding a new column name that has the same lowercase name as an existing column. (#10991)

  • Replace Long attributes with primitive values to reduce boxing (#11059)

  • retry KafkaConsumer creation in KafkaPartitionLevelConnectionHandler.java (#253) (#11040)

  • Support for new dateTime format in DateTimeGranularitySpec without explicitly setting size (#11057)

  • Returning 403 status code in case of authorization failures (#11136)

  • Simplify compatible test to avoid test against itself (#11163)

  • Updated code for setting value of segment min/max property. (#10990)

  • Add stat to track number of segments that have valid doc id snapshots (#11110)

  • Add brokerId and brokerReduceTimeMs to the broker response stats (#11142)

  • safely multiply integers to prevent overflow (#11186)

  • Move largest comparison value update logic out of map access (#11157)

  • Optimize DimensionTableDataManager to abort unnecessary loading (#11192)

  • Refine isNullsLast and isAsc functions. (#11199)

  • Update the pinot tenants tables api to support returning broker tagged tables (#11184)

  • add multi-value support for native text index (#11204)

  • Add percentiles report in QuerySummary (#11299)

  • Add meter for broker responses with unavailable segments (#11301)

  • Enhance Minion task management (#11315)

  • add additional lucene index configs (#11354)

  • Add DECIMAL data type to orc record reader (#11377)

  • add configuration to fail server startup on non-good status checker (#11347)

  • allow passing freshness checker after an idle threshold (#11345)

  • Add broker validation for hybrid tableConfig creation (#7908)

  • Support partition parallelism for partitioned table scan (#11266)

Vulnerability fixes, bugfixes, cleanups and deprecations

  • Remove support for High level consumers in Apache Pinot (#11017)

  • Fix JDBC driver check for username (#10416)

  • [Clean up] Remove getColumnName() from AggregationFunction interface (#10431)

  • fix jersey TerminalWriterInterceptor MessageBodyWriter not found issue (#10462)

  • Bug fix: Start counting operator execution time from first NoOp block (#10450)

  • Fix unavailable instances issues for StrictReplicaGroup (#10466)

  • Change shell to bash (#10469)

  • Fix the double destroy of segment data manager during server shutdown (#10475)

  • Remove "isSorted()" precondition check in the ForwardIndexHandler (#10476)

  • Fix null handling in streaming selection operator (#10453)

  • Fix jackson dependencies (#10477)

  • Startree index build enhancement (#10905)

  • optimize queries where lhs and rhs of predicate are equal (#10444)

  • Trivial fix on a warning detected by static checker (#10492)

  • wait for full segment commit protocol on force commit (#10479)

  • Fix bug and add test for noDict -> Dict conversion for sorted column (#10497)

  • Make column order deterministic in segment (#10468)

  • Type match between resultType and function's dataType (#10472)

  • Allow empty segmentsTo for segment replacement protocol (#10511)

  • Use string as default compatible type for coalesce (#10516)

  • Use threadlocal variable for genericRow to make the MemoryOptimizedTable threadsafe (#10502)

  • Fix shading in spark2 connector pom file (#10490)

  • Fix ramping delay caused by long lasting sequence of unfiltered messa… (#10418)

  • Do not serialize metrics in each Operator (#10473)

  • Make pinot-controller apply webpack production mode when bin-dist profile is used. (#10525)

  • Fix FS props handling when using /ingestFromUri (#10480)

  • Clean up v0_deprecated batch ingestion jobs (#10532)

  • Deprecate kafka 0.9 support (#10522)

  • safely multiply integers to prevent overflow (#11186)

  • Reduce timeout for codecov and not fail the job in any case (#10547)

  • Fix DataTableV3 serde bug for empty array (#10583)

  • Do not record operator stats when tracing is enabled (#10447)

  • Forward auth token for logger APIs from controller to other controllers and brokers (#10590)

  • Bug fix: Partial upsert default strategy is null (#10610)

  • Fix flaky test caused by EV check during table creation (#10616)

  • Fix withDissabledTrue typo (#10624)

  • Cleanup unnecessary mailbox id ser/de (#10629)

  • no error metric for queries where all segments are pruned (#10589)

  • bug fix: to keep QueryParser thread safe when handling many read requests on class RealtimeLuceneTextIndex (#10620)

  • Fix static DictionaryIndexConfig.DEFAULT_OFFHEAP being actually onheap (#10632)

  • 10567: [cleanup pinot-integration-test-base], clean query generations and some other refactoring. (#10648)

  • Fixes backward incompatibility with SegmentGenerationJobSpec for segment push job runners (#10645)

  • Bug fix to get the toSegments list correctly (#10659)

  • 10661: Fix for failing numeric comparison in where clause for IllegalStateException. (#10662)

  • Fixes partial upsert not reflecting multiple comparison column values (#10693)

  • Fix Bug in Reporting Timer Value for Min Consuming Freshness (#10690)

  • Fix typo of rowSize -> columnSize (#10699)

  • update segment target tier before table rebalance (#10695)

  • Fix a bug in star-tree filter operator which can incorrectly filter documents (#10707)

  • Enhance the instrumentation for a corner case where the query doesn't go through DocIdSetOp (#10729)

  • bug fix: add missing properties when edit instance config (#10741)

  • Making segmentMapper do the init and cleanup of RecordReader (#10874)

  • Fix githubEvents table for quickstart recipes (#10716)

  • Minor Realtime Segment Commit Upload Improvements (#10725)

  • Return 503 for all interrupted queries. Refactor the query killing code. (#10683)

  • Add decoder initialization error to the server's error cache (#10773)

  • bug fix: add @JsonProperty to SegmentAssignmentConfig (#10759)

  • ensure we wait the full no query timeout before shutting down (#10784)

  • Clean up KLL functions with deprecated convention (#10795)

  • Redefine the semantics of SEGMENT_STREAMED_DOWNLOAD_UNTAR_FAILURES metric to count individual segment fetch failures. (#10777)

  • fix exception during exchange routing that causes a stuck pipeline (#10802)

  • [bugfix] fix floating point and integral type backward incompatible issue (#10650)

  • [pinot-core] Start consumption after creating segment data manager (#11227)

  • Fix IndexOutOfBoundException in filtered aggregation group-by (#11231)

  • Fix null pointer exception in segment debug endpoint #11228

  • Clean up RangeIndexBasedFilterOperator. (#11219)

  • Fix the escape/unescape issue for property value in metadata (#11223)

  • Fix a bug in the order by comparator (#10818)

  • Keeps nullness attributes of merged in comparison column values (#10704)

  • Add required JSON annotation in H3IndexResolution (#10792)

  • Fix a bug in SELECT DISTINCT ORDER BY. (#10827)

  • jsonPathString should return null instead of string literal "null" (#10855)

  • Bug Fix: Segment Purger cannot purge old segments after schema evolution (#10869)

  • Fix #10713 by giving metainfo more priority than config (#10851)

  • Close PinotFS after Data Manager Shutdowns (#10888)

  • bump awssdk version for a bugfix on http conn leakage (#10898)

  • Fix MultiNodesOfflineClusterIntegrationTest.testServerHardFailure() (#10909)

  • Fix a bug in SELECT DISTINCT ORDER BY LIMIT. (#10887)

  • Fix an integer overflow bug. (#10940)

  • Return true when _resultSet is not null (#10899)

  • Fixing table name extraction for lateral join queries (#10933)

  • Fix casting when prefetching mmap'd segment larger than 2GB (#10936)

  • Null check before closing reader (#10954)

  • Fixes SQL wildcard escaping in LIKE queries (#10897)

  • [Clean up] Do not count DISTINCT as aggregation (#10985)

  • do not readd lucene readers to queue if segment is destroyed #10989

  • Message batch ingestion lag fix (#10983)

  • Fix a typo in snapshot lock (#11007)

  • When extracting root-level field name for complex type handling, use the whole delimiter (#11005)

  • update jersey to fix Denial of Service (DoS) (#11021)

  • Update getTenantInstances call for controller and separate POST operations on it (#10993)

  • update freemaker to fix Server-side Template Injection (#11019)

  • format double 0 properly to compare with h2 results (#11049)

  • Fix double-checked locking in ConnectionFactory (#11014)

  • Remove presto-pinot-driver and pinot-java-client-jdk8 module (#11051)

  • Make RequestUtils always return a string array when getTableNames (#11069)

  • Fix BOOL_AND and BOOL_OR result type (#11033)

  • [cleanup] Consolidate some query and controller/broker methods in integration tests (#11064)

  • Fix grpc regression on multi-stage engine (#11086)

  • Delete an obsolete TODO. (#11080)

  • Minor fix on AddTableCommand.toString() (#11082)

  • Allow using Lucene text indexes on mutable MV columns. (#11093)

  • Allow offloading multiple segments from same table in parallel (#11107)

  • Added serviceAccount to minion-stateless (#11095)

  • Bug fix: TableUpsertMetadataManager is null (#11129)

  • Fix reload bug (#11131)

  • Allow extra aggregation types in RealtimeToOfflineSegmentsTask (#10982)

  • Fix a bug when use range index to solve EQ predicate (#11146)

  • Sanitise API inputs used as file path variables (#11132)

  • Fix NPE when nested query doesn't have gapfill (#11155)

  • Fix the NPE when query response error stream is null (#11154)

  • Make interface methods non private, for java 8 compatibility (#11164)

  • Increment nextDocId even if geo indexing fails (#11158)

  • Fix the issue of consuming segment entering ERROR state due to stream connection errors (#11166)

  • In TableRebalancer, remove instance partitions only when reassigning instances (#11169)

  • Remove JDK 8 unsupported code (#11176)

  • Fix compat test by adding -am flag to build pinot-integration-tests (#11181)

  • dont duplicate register scalar function in CalciteSchema (#11190)

  • Fix the storage quota check for metadata push (#11193)

  • Delete filtering NULL support dead code paths. (#11198)

  • [bugfix] Do not move real-time segments to working dir on restart (#11226)

  • Fix a bug in ExpressionScanDocIdIterator for multi-value. (#11253)

  • Exclude NULLs when PredicateEvaluator::isAlwaysTrue is true. (#11261)

  • UI: fix sql query options separator (#10770)

  • Fix a NullPointerException bug in ScalarTransformFunctionWrapper. (#11309)

  • [refactor] improve disk read for partial upsert handler (#10927)

  • Fix the wrong query time when the response is empty (#11349)

  • getMessageAtIndex should actually return the value in the streamMessage for compatibility (#11355)

  • Remove presto jdk8 related dependencies (#11285)

  • Remove special routing handling for multiple consuming segments (#11371)

  • Properly handle shutdown of TableDataManager (#11380)

  • Fixing the stale pinot ServerInstance in _tableTenantServersMap (#11386)

  • Fix the thread safety issue for mutable forward index (#11392)

  • Fix RawStringDistinctExecutor integer overflow (#11403)

  • [logging] fix consume rate logging bug to respect 1 minute threshold (#11421)

0.12.0

Multi-Stage Query Engine

New join semantics support

  • Left join (#9466)

  • In-equi join (#9448)

  • Full join (#9907)

  • Right join (#9907)

  • Semi join (#9367)

  • Using keyword (#9373)

New sql semantics support:

  • Having (#9274)

  • Order by (#9279)

  • In/NotIn clause (#9374)

  • Cast (#9384)

  • Like/Regexp_like (#9654)

  • Range predicate (#9445)

Performance enhancement

  • Thread safe query planning (#9344)

  • Partial query execution and round robin scheduling (#9753)

  • Improve data table serde (#9731)

Major updates

  • Force commit consuming segments by @sajjad-moradi in #9197

  • add a freshness based consumption status checker by @jadami10 in #9244

  • Add metrics to track controller segment download and upload requests in progress by @gviedma in #9258

  • Adding endpoint to download local log files for each component by @xiangfu0 in #9259

  • [Feature] Add an option to search input files recursively in ingestion job. The default is set to true to be backward compatible. by @61yao in #9265

  • add query cancel APIs on controller backed by those on brokers by @klsince in #9276

  • Add Spark Job Launcher tool by @KKcorps in #9288

  • Enable Consistent Data Push for Standalone Segment Push Job Runners by @yuanbenson in #9295

  • Allow server to directly return the final aggregation result by @Jackie-Jiang in #9304

  • TierBasedSegmentDirectoryLoader to keep segments in multi-datadir by @klsince in #9306

  • Adaptive Server Selection by @vvivekiyer in #9311

  • [Feature] Support IsDistinctFrom and IsNotDistinctFrom by @61yao in #9312

  • Allow ingestion of errored records with incorrect datatype by @KKcorps in #9320

  • Allow setting custom time boundary for hybrid table queries by @saurabhd336 in #9356

  • skip late cron job with max allowed delay by @klsince in #9372

  • Do not allow implicit cast for BOOLEAN and TIMESTAMP by @Jackie-Jiang in #9385

  • Add missing properties in CSV plugin by @KKcorps in #9399

  • set MDC so that one can route minion task logs to separate files cleanly by @klsince in #9400

  • Add a new API to fix segment date time in metadata by @KKcorps in #9413

  • Update get bytes to return raw bytes of string and support getBytesMV by @61yao in #9441

  • Exposing consumer's record lag in /consumingSegmentsInfo by @navina in #9515

  • Do not create dictionary for high-cardinality columns by @KKcorps in #9527

  • get task runtime configs tracked in Helix by @klsince in #9540

  • Add more options to json index by @Jackie-Jiang in #9543

  • add SegmentTierAssigner and refine restful APIs to get segment tier info by @klsince in #9598

  • Add segment level debug API by @saurabhd336 in #9609

  • Add record availability lag for Kafka connector by @navina in #9621

  • notify servers that need to move segments to new tiers via SegmentReloadMessage by @klsince in #9624

  • Allow to configure multi-datadirs as instance configs and a Quickstart example about them by @klsince in #9705

  • Customize stopword for Lucene Index by @jasperjiaguo in #9708

  • Add memory optimized dimension table by @KKcorps in #9802

  • ADLS file system upgrade by @xiangfu0 in #9855

  • Added Delete Schema/Table pinot admin commands by @bagipriyank in #9857

  • Adding new ADLSPinotFS auth type: DEFAULT by @xiangfu0 in #9860

  • Add rate limit to Kinesis requests by @KKcorps in #9863

  • Adding configs for zk client timeout by @xiangfu0 in #9975

Other features/changes

  • Show most recent scheduling errors by @satishwaghela in #9161

  • Do not use aggregation result for distinct query in IntermediateResultsBlock by @Jackie-Jiang in #9262

  • Emit metrics for ratio of actual consumption rate to rate limit in real-time tables by @sajjad-moradi in #9201

  • add metrics entry offlineTableCount by @walterddr in #9270

  • refine query cancel resp msg by @klsince in #9242

  • add @ManualAuthorization annotation for non-standard endpoints by @apucher in #9252

  • Optimize ser/de to avoid using output stream by @Jackie-Jiang in #9278

  • Add Support for Covariance Function by @SabrinaZhaozyf in #9236

  • Throw an exception when MV columns are present in the order-by expression list in selection order-by only queries by @somandal in #9078

  • Improve server query cancellation and timeout checking during execution by @jasperjiaguo in #9286

  • Add capabilities to ingest from another stream without disabling the real-time table by @sajjad-moradi in #9289

  • Add minMaxInvalid flag to avoid unnecessary needPreprocess by @npawar in #9238

  • Add array cardinality function by @walterddr in #9300

  • TierBasedSegmentDirectoryLoader to keep segments in multi-datadir by @klsince in #9306

  • Add support for custom null values in CSV record reader by @KKcorps in #9318

  • Infer parquet reader type based on file metadata by @saurabhd336 in #9294

  • Add Support for Cast Function on MV Columns by @SabrinaZhaozyf in #9296

  • Allow ingestion of errored records with incorrect datatype by @KKcorps in #9320

  • [Feature] Not Operator Transformation by @61yao in #9330

  • Handle null string in CSV decoder by @KKcorps in #9340

  • [Feature] Not scalar function by @61yao in #9338

  • Add support for EXTRACT syntax and converts it to appropriate Pinot expression by @tanmesh in #9184

  • Add support for Auth in controller requests in java query client by @KKcorps in #9230

  • delete all related minion task metadata when deleting a table by @zhtaoxiang in #9339

  • BloomFilterRule should only recommend for supported column type by @yuanbenson in #9364

  • Support all the types in ParquetNativeRecordReader by @xiangfu0 in #9352

  • Improve segment name check in metadata push by @zhtaoxiang in #9359

  • Allow expression transformer to continue on error by @xiangfu0 in #9376

  • skip late cron job with max allowed delay by @klsince in #9372

  • Enhance and filter predicate evaluation efficiency by @jasperjiaguo in #9336

  • Deprecate instanceId Config For Broker/Minion Specific Configs by @ankitsultana in #9308

  • Optimize combine operator to fully utilize threads by @Jackie-Jiang in #9387

  • Terminate the query after plan generation if timeout by @jasperjiaguo in #9386

  • [Feature] Support IsDistinctFrom and IsNotDistinctFrom by @61yao in #9312

  • [Feature] Support Coalesce for Column Names by @61yao in #9327

  • Disable logging for interrupted exceptions in kinesis by @KKcorps in #9405

  • Benchmark thread cpu time by @jasperjiaguo in #9408

  • Use ISODateTimeFormat as default for SIMPLE_DATE_FORMAT by @KKcorps in #9378

  • Extract the common logic for upsert metadata manager by @Jackie-Jiang in #9435

  • Make minion task metadata manager methods more generic by @saurabhd336 in #9436

  • Always pass clientId to kafka's consumer properties by @navina in #9444

  • Adaptive Server Selection by @vvivekiyer in #9311

  • Refine IndexHandler methods a bit to make them reentrant by @klsince in #9440

  • use MinionEventObserver to track finer grained task progress status on worker by @klsince in #9432

  • Allow spaces in input file paths by @KKcorps in #9426

  • Add support for gracefully handling the errors while transformations by @KKcorps in #9377

  • Cache Deleted Segment Names in Server to Avoid SegmentMissingError by @ankitsultana in #9423

  • Handle Invalid timestamps by @KKcorps in #9355

  • refine minion worker event observer to track finer grained progress for tasks by @klsince in #9449

  • spark-connector should use v2/brokers endpoint by @itschrispeck in #9451

  • Remove netty server query support from presto-pinot-driver to remove pinot-core and pinot-segment-local dependencies by @xiangfu0 in #9455

  • Adaptive Server Selection: Address pending review comments by @vvivekiyer in #9462

  • track progress from within segment processor framework by @klsince in #9457

  • Decouple ser/de from DataTable by @Jackie-Jiang in #9468

  • collect file info like mtime, length while listing files for free by @klsince in #9466

  • Extract record keys, headers and metadata from Stream sources by @navina in #9224

  • [pinot-spark-connector] Bump spark connector max inbound message size by @cbalci in #9475

  • refine the minion task progress api a bit by @klsince in #9482

  • add parsing for AT TIME ZONE by @agavra in #9477

  • Eliminate explosion of metrics due to gapfill queries by @elonazoulay in #9490

  • ForwardIndexHandler: Change compressionType during segmentReload by @vvivekiyer in #9454

  • Introduce Segment AssignmentStrategy Interface by @GSharayu in #9309

  • Add query interruption flag check to broker groupby reduction by @jasperjiaguo in #9499

  • adding optional client payload by @walterddr in #9465

  • [feature] distinct from scalar functions by @61yao in #9486

  • Check data table version on server only for null handling by @Jackie-Jiang in #9508

  • Add docId and column name to segment read exception by @KKcorps in #9512

  • Sort scanning based operators by cardinality in AndDocIdSet evaluation by @jasperjiaguo in #9420

  • Do not fail CI when codecov upload fails by @Jackie-Jiang in #9522

  • [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization by @deemoliu in #9062

  • broker filter by @dongxiaoman in #9391

  • [feature] coalesce scalar by @61yao in #9487

  • Allow setting custom time boundary for hybrid table queries by @saurabhd336 in #9356

  • [GHA] add cache timeout by @walterddr in #9524

  • Optimize PinotHelixResourceManager.hasTable() by @Jackie-Jiang in #9526

  • Include exception when upsert metadata manager cannot be created by @Jackie-Jiang in #9532

  • allow to config task expire time by @klsince in #9530

  • expose task finish time via debug API by @klsince in #9534

  • Remove the wrong warning log in KafkaPartitionLevelConsumer by @Jackie-Jiang in #9536

  • starting http server for minion worker conditionally by @klsince in #9542

  • Make StreamMessage generic and a bug fix by @vvivekiyer in #9544

  • Improve primary key serialization performance by @KKcorps in #9538

  • [Upsert] Skip removing upsert metadata when shutting down the server by @Jackie-Jiang in #9551

  • add array element at function by @walterddr in #9554

  • Handle the case when enableNullHandling is true and an aggregation function is used w/ a column that has an empty null bitmap by @nizarhejazi in #9566

  • Support segment storage format without forward index by @somandal in #9333

  • Adding SegmentNameGenerator type inference if not explicitly set in config by @timsants in #9550

  • add version information to JMX metrics & component logs by @agavra in #9578

  • remove unused RecordTransform/RecordFilter classes by @agavra in #9607

  • Support rewriting forward index upon changing compression type for existing raw MV column by @vvivekiyer in #9510

  • Support Avro's Fixed data type by @sajjad-moradi in #9642

  • [feature] [kubernetes] add loadBalancerSourceRanges to service-external.yaml for controller and broker by @jameskelleher in #9494

  • Limit up to 10 unavailable segments to be printed in the query exception by @Jackie-Jiang in #9617

  • remove more unused filter code by @agavra in #9620

  • Do not cache record reader in segment by @Jackie-Jiang in #9604

  • make first part of user agent header configurable by @rino-kadijk in #9471

  • optimize order by sorted ASC, unsorted and order by DESC cases by @gortiz in #8979

  • Enhance cluster config update API to handle non-string values properly by @Jackie-Jiang in #9635

  • Reverts recommender REST API back to PUT (reverts PR #9326) by @yuanbenson in #9638

  • Remove invalid pruner names from server config by @Jackie-Jiang in #9646

  • Using usageHelp instead of deprecated help in picocli commands by @navina in #9608

  • Handle unique query id on server by @Jackie-Jiang in #9648

  • stateless group marker missing several by @walterddr in #9673

  • Support reloading consuming segment using force commit by @Jackie-Jiang in #9640

  • Improve star-tree to use star-node when the predicate matches all the non-star nodes by @Jackie-Jiang in #9667

  • add FetchPlanner interface to decide what column index to prefetch by @klsince in #9668

  • Improve star-tree traversal using ArrayDeque by @Jackie-Jiang in #9688

  • Handle errors in combine operator by @Jackie-Jiang in #9689

  • return different error code if old version is not on master by @SabrinaZhaozyf in #9686

  • Support creating dictionary at runtime for an existing column by @vvivekiyer in #9678

  • check mutable segment explicitly instead of checking existence of indexDir by @klsince in #9718

  • Remove leftover file before downloading segmentTar by @npawar in #9719

  • add index key and size map to segment metadata by @walterddr in #9712

  • Use ideal state as source of truth for segment existence by @Jackie-Jiang in #9735

  • Close Filesystem on exit with Minion Tasks by @KKcorps in #9681

  • render the tables list even as the table sizes are loading by @jadami10 in #9741

  • Add Support for IP Address Function by @SabrinaZhaozyf in #9501

  • bubble up error messages from broker by @agavra in #9754

  • Add support to disable the forward index for existing columns by @somandal in #9740

  • show table metadata info in aggregate index size form by @walterddr in #9733

  • Preprocess immutable segments from REALTIME table conditionally when loading them by @klsince in #9772

  • revert default timeout nano change in QueryConfig by @agavra in #9790

  • AdaptiveServerSelection: Update stats for servers that have not responded by @vvivekiyer in #9801

  • Add null value index for default column by @KKcorps in #9777

  • [MergeRollupTask] include partition info into segment name by @zhtaoxiang in #9815

  • Adding a consumer lag as metric via a periodic task in controller by @navina in #9800

  • Deserialize Hyperloglog objects more optimally by @priyen in #9749

  • Download offline segments from peers by @wirybeaver in #9710

  • Thread Level Usage Accounting and Query Killing on Server by @jasperjiaguo in #9727

  • Add max merger and min mergers for partial upsert by @deemoliu in #9665

  • #9518 added pinot helm 0.2.6 with secure version pinot 0.11.0 by @bagipriyank in #9519

  • Combine the read access for replication config by @snleee in #9849

  • add v1 ingress in helm chart by @jhisse in #9862

  • Optimize AdaptiveServerSelection for replicaGroup based routing by @vvivekiyer in #9803

  • Do not sort the instances in InstancePartitions by @Jackie-Jiang in #9866

  • Merge new columns in existing record with default merge strategy by @navina in #9851

  • Support disabling dictionary at runtime for an existing column by @vvivekiyer in #9868

  • support BOOL_AND and BOOL_OR aggregate functions by @agavra in #9848

  • Use Pulsar AdminClient to delete unused subscriptions by @navina in #9859

  • add table sort function for table size by @jadami10 in #9844

  • In Kafka consumer, seek offset only when needed by @Jackie-Jiang in #9896

  • fallback if no broker found for the specified table name by @klsince in #9914

  • Allow liveness check during server shutting down by @Jackie-Jiang in #9915

  • Allow segment upload via Metadata in MergeRollup Minion task by @KKcorps in #9825

  • Add back the Helix workaround for missing IS change by @Jackie-Jiang in #9921

  • Allow uploading real-time segments via CLI by @KKcorps in #9861

  • Add capability to update and delete table config via CLI by @KKcorps in #9852

  • default to TAR if push mode is not set by @klsince in #9935

  • load startree index via segment reader interface by @klsince in #9828

  • Allow collections for MV transform functions by @saurabhd336 in #9908

  • Construct new IndexLoadingConfig when loading completed real-time segments by @vvivekiyer in #9938

  • Make GET /tableConfigs backwards compatible in case schema does not match raw table name by @timsants in #9922

  • feat: add compressed file support for ORCRecordReader by @etolbakov in #9884

  • Add Variance and Standard Deviation Aggregation Functions by @snleee in #9910

  • enable MergeRollupTask on real-time tables by @zhtaoxiang in #9890

  • Update cardinality when converting raw column to dict based by @vvivekiyer in #9875

  • Add back auth token for UploadSegmentCommand by @timsants in #9960

  • Improving gz support for avro record readers by @snleee in #9951

  • Default column handling of noForwardIndex and regeneration of forward index on reload path by @somandal in #9810

  • [Feature] Support coalesce literal by @61yao in #9958

  • Ability to initialize S3PinotFs with serverSideEncryption properties when passing client directly by @npawar in #9988

  • handle pending minion tasks properly when getting the task progress status by @klsince in #9911

  • allow gauge stored in metric registry to be updated by @zhtaoxiang in #9961

  • support case-insensitive query options in SET syntax by @agavra in #9912

  • pin versions-maven-plugin to 2.13.0 by @jadami10 in #9993

  • Pulsar Connection handler should not spin up a consumer / reader by @navina in #9893

  • Handle in-memory segment metadata for index checking by @Jackie-Jiang in #10017

  • Support the cross-account access using IAM role for S3 PinotFS by @snleee in #10009

  • report minion task metadata last update time as metric by @zhtaoxiang in #9954

  • support SKEWNESS and KURTOSIS aggregates by @agavra in #10021

  • emit minion task generation time and error metrics by @zhtaoxiang in #10026

  • Use the same default time value for all replicas by @Jackie-Jiang in #10029

  • Reduce the number of segments to wait for convergence when rebalancing by @saurabhd336 in #10028

UI Update & Improvement

  • Allow hiding query console tab based on cluster config (#9261)

  • Allow hiding pinot broker swagger UI by config (#9343)

  • Add UI to show fine-grained minion task progress (#9488)

  • Add UI to track segment reload progress (#9521)

  • Show minion task runtime config details in UI (#9652)

  • Redefine the segment status (#9699)

  • Show an option to reload the segments during edit schema (#9762)

  • Load schema UI async (#9781)

  • Fix blank screen when redirect to unknown app route (#9888)

Library version upgrade

  • Upgrade h3 lib from 3.7.2 to 4.0.0 to lower glibc requirement (#9335)

  • Upgrade ZK version to 3.6.3 (#9612)

  • Upgrade snakeyaml from 1.30 to 1.33 (#9464)

  • Upgrade RoaringBitmap from 0.9.28 to 0.9.35 (#9730)

  • Upgrade spotless-maven-plugin from 2.9.0 to 2.28.0 (#9877)

  • Upgrade decode-uri-component from 0.2.0 to 0.2.2 (#9941)

BugFixes

  • Fix bug with logging request headers by @abhs50 in #9247

  • Fix a UT that only shows up on host with more cores by @klsince in #9257

  • Fix message count by @Jackie-Jiang in #9271

  • Fix issue with auth AccessType in Schema REST endpoints by @sajjad-moradi in #9293

  • Fix PerfBenchmarkRunner to skip the tmp dir by @Jackie-Jiang in #9298

  • Fix thrift deserializer thread safety issue by @saurabhd336 in #9299

  • Fix transformation to string for BOOLEAN and TIMESTAMP by @Jackie-Jiang in #9287

  • [hotfix] Add VARBINARY column to switch case branch by @walterddr in #9313

  • Fix annotation for "/recommender" endpoint by @sajjad-moradi in #9326

  • Fix jdk8 build issue due to missing pom dependency by @somandal in #9351

  • Fix pom to use pinot-common-jdk8 for pinot-connector jkd8 java client by @somandal in #9353

  • Fix log to reflect job type by @KKcorps in #9381

  • [Bugfix] schema update bug fix by @MeihanLi in #9382

  • fix histogram null pointer exception by @jasperjiaguo in #9428

  • Fix thread safety issues with SDF (WIP) by @saurabhd336 in #9425

  • Bug fix: failure status in ingestion jobs doesn't reflect in exit code by @KKcorps in #9410

  • Fix skip segment logic in MinMaxValueBasedSelectionOrderByCombineOperator by @Jackie-Jiang in #9434

  • Fix the bug of hybrid table request using the same request id by @Jackie-Jiang in #9443

  • Fix the range check for range index on raw column by @Jackie-Jiang in #9453

  • Fix Data-Correctness Bug in GTE Comparison in BinaryOperatorTransformFunction by @ankitsultana in #9461

  • extend PinotFS impls with listFilesWithMetadata and some bugfix by @klsince in #9478

  • fix null transform bound check by @walterddr in #9495

  • Fix JsonExtractScalar when no value is extracted by @Jackie-Jiang in #9500

  • Fix AddTable for real-time tables by @npawar in #9506

  • Fix some type convert scalar functions by @Jackie-Jiang in #9509

  • fix spammy logs for ConfluentSchemaRegistryRealtimeClusterIntegrationTest [MINOR] by @agavra in #9516

  • Fix timestamp index on column of preserved key by @Jackie-Jiang in #9533

  • Fix record extractor when ByteBuffer can be reused by @Jackie-Jiang in #9549

  • Fix explain plan ALL_SEGMENTS_PRUNED_ON_SERVER node by @somandal in #9572

  • Fix time validation when data type needs to be converted by @Jackie-Jiang in #9569

  • UI: fix incorrect task finish time by @jayeshchoudhary in #9557

  • Fix the bug where uploaded segments cannot be deleted on real-time table by @Jackie-Jiang in #9579

  • [bugfix] correct the dir for building segments in FileIngestionHelper by @zhtaoxiang in #9591

  • Fix NonAggregationGroupByToDistinctQueryRewriter by @Jackie-Jiang in #9605

  • fix distinct result return by @walterddr in #9582

  • Fix GcsPinotFS by @lfernandez93 in #9556

  • fix DataSchema thread-safe issue by @walterddr in #9619

  • Bug fix: Add missing table config fetch for /tableConfigs list all by @timsants in #9603

  • Fix re-uploading segment when the previous upload failed by @Jackie-Jiang in #9631

  • Fix string split which should be on whole separator by @Jackie-Jiang in #9650

  • Fix server request sent delay to be non-negative by @Jackie-Jiang in #9656

  • bugfix: Add missing BIG_DECIMAL support for GenericRow serde by @timsants in #9661

  • Fix extra restlet resource test which should be stateless by @Jackie-Jiang in #9674

  • AdaptiveServerSelection: Fix timer by @vvivekiyer in #9697

  • fix PinotVersion to be compatible with prometheus by @agavra in #9701

  • Fix the setup for ControllerTest shared cluster by @Jackie-Jiang in #9704

  • [hotfix]groovy class cache leak by @walterddr in #9716

  • Fix TIMESTAMP index handling in SegmentMapper by @Jackie-Jiang in #9722

  • Fix the server admin endpoint cache to reflect the config changes by @Jackie-Jiang in #9734

  • [bugfix] fix case-when issue by @walterddr in #9702

  • [bugfix] Let StartControllerCommand also handle "pinot.zk.server", "pinot.cluster.name" in default conf/pinot-controller.conf by @thangnd197 in #9739

  • [hotfix] semi-join opt by @walterddr in #9779

  • Fixing the rebalance issue for real-time table with tier by @snleee in #9780

  • UI: show segment debug details when segment is in bad state by @jayeshchoudhary in #9700

  • Fix the replication in segment assignment strategy by @GSharayu in #9816

  • fix potential fd leakage for SegmentProcessorFramework by @klsince in #9797

  • Fix NPE when reading ZK address from controller config by @Jackie-Jiang in #9751

  • have query table list show search bar; fix InstancesTables filter by @jadami10 in #9742

  • [pinot-spark-connector] Fix empty data table handling in GRPC reader by @cbalci in #9837

  • [bugfix] fix mergeRollupTask metrics by @zhtaoxiang in #9864

  • Bug fix: Get correct primary key count by @KKcorps in #9876

  • Fix issues for real-time table reload by @Jackie-Jiang in #9885

  • UI: fix segment status color remains same in different table page by @jayeshchoudhary in #9891

  • Fix bloom filter creation on BYTES by @Jackie-Jiang in #9898

  • [hotfix] broker selection not using table name by @walterddr in #9902

  • Fix race condition when 2 segment upload occurred for the same segment by @jackjlli in #9905

  • fix timezone_hour/timezone_minute functions by @agavra in #9949

  • [Bugfix] Move brokerId extraction to BaseBrokerStarter by @jackjlli in #9965

  • Fix ser/de for StringLongPair by @Jackie-Jiang in #9985

  • bugfix dir check for HadoopPinotFS.copyFromLocalDir by @klsince in #9979

  • Bugfix: Use correct exception import in TableRebalancer. by @mayankshriv in #10025

  • Fix NPE in AbstractMetrics From Race Condition by @ankitsultana in #10022

1.1.0

Release Notes for 1.1.0

Summary

This release comes with several features, including SQL, UI, and performance enhancements. Also included are bug fixes across multiple features such as the V2 multi-stage query engine, ingestion, storage format, and SQL support.

Multi-stage query engine

Features

Support RelDistribution-based trait planning (#11976, #12079)

  • Adds support for RelDistribution optimization for more accurate leaf-stage direct exchange/shuffle. Also extends partition optimization beyond leaf stage to entire query plan.

  • Applies optimization based on distribution trait in the mailbox/worker assignment stage

    • Fixes the previous direct exchange, which was decided based on the table partition hint. Now direct exchange is decided via the distribution trait: it is applied if-and-only-if the propagated trait matches the exchange requirement.

    • As a side effect, the is_colocated_by_join_keys query option is reintroduced to ensure dynamic broadcast, which can also benefit from the direct exchange optimization

    • Allows propagation of partition distribution trait info across the tree so it can be used during the Physical Planning phase; the concrete scenarios will follow in separate PRs

  • Note on backward incompatibility

    • is_colocated_by_join_keys hint is now required for making colocated joins

      • this should only affect semi-joins, because they are the only join type that utilizes broadcast exchange pulled up to act as a direct exchange.

      • inner/left/right/full joins automatically apply colocation, so the backward incompatibility should not affect them.

Leaf stage planning with multi-semi join support (#11937)

  • Solves the limitation of PinotQuery, which supports only a limited set of PlanNodes.

  • Splits the ServerRequest planning into 2 stages

    • First, plan as much as possible into PinotQuery

    • Any remaining nodes that cannot be planned into PinotQuery are run locally with the LeafStageTransferrableBlockOperator as the input.

Support for ArrayAgg aggregation function (#11822)

  • Usage: ArrayAgg(column, 'dataType' [, 'isDistinct'])

  • Float type column is treated as Double in the multistage engine, so FLOAT type is not supported.

  • Supports data types BOOLEAN, INT, LONG, FLOAT (only in V1), DOUBLE, STRING, and TIMESTAMP. E.g. ArrayAgg(intCol, 'INT') returns ARRAY<INT>
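
  • A minimal query sketch based on the signature above (table and column names are hypothetical; the optional third argument requests distinct values):

SELECT userId,
       ArrayAgg(eventType, 'STRING') AS eventTypes,
       ArrayAgg(durationMs, 'LONG', 'true') AS distinctDurations
FROM events
GROUP BY userId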

Enhancements

  • Canonicalize SqlKind.OTHERS and SqlKind.OTHER_FUNCTIONS and support concat as the || operator (#12025)

  • Capability for constant filter in QueryContext, with support for server to handle it (#11956)

  • Tests for filter pushdown (#11994)

  • Enhancements to query plan tests (#11966)

  • Refactor PlanFragmenter to make the logic clear (#11912)

  • Observability enhancements to emit metrics for grpc request and multi-stage leaf stage (#11838)

    • pinot.server.query.log.maxRatePerSecond: query log max rate (QPS, default 10K)

    • pinot.server.query.log.droppedReportMaxRatePerSecond: dropped query log report max rate (QPS, default 1)

  • Security enhancement to add RBAC authorization checks for multi-stage query engine (#11830)

  • Enhancement to leaf-stage execution stats NPE handling (#11805)

  • Enhancement to add a framework to back-propagate metadata across opChains (#11746)

  • Use of BinaryArray to wire proto for multi-stage engine bytes literal handling (#11738)

  • Enable dynamic broadcast for SEMI joins. Adds a fallback option to enable hash table join using joinOptions(join_strategy = 'hash_table') (#11696)

  • Improvements to dispatch exception handling (#11688)

  • Allow malformed dateTime string to return default value configurable in the function signature (#11258)

    • fromDateTime(colContainsMalformedStr, '<dateTimeFormat>', '<timezone>', <default_value>)
  • Improvement in multi-stage aggregation to directly store column index as identifier (#11617)

  • Perf optimization to avoid unnecessary rows conversion in aggregation (#11607)

  • Enhance SegmentPartitionMetadataManager to handle new segment (#11585)

  • Optimize mailbox info in query plan to reduce memory footprint (#12382)

    • This PR changes the proto object structure, which will cause backward incompatibility when the broker and server are running different versions.

  • Optimizations to query plan serialization (#12370)

  • Optimization for parallel execution of Ser/de stage plan (#12363)

  • Optimizations in query dispatch (#12358)

  • Perf optimization for group-by and join for single key scenario (#11630)

Bugfixes, refactoring, cleanups, tests

  • Bugfix for evaluation of chained literal functions (#12248)

  • Fixes to sort copy rule (#12251 and #12237)

  • Fixes duplicate results for literal queries (#12240)

  • Bugfix to use UTF-8 encoding for default Charset (#12213)

  • Bugfix to escape table name when routing queries (#12212)

  • Refactoring of planner code and removing unnecessary rules (#12070, #12052)

  • Fix to remove unnecessary project after aggregate during relBuilder (#12058)

  • Fixes issues with multi-semi-join (#12038)

  • Fixes leaf limit refactor issue (#12001)

  • Add back filter merge after rule (#11989)

  • Fix operator EOS pull (#11970)

  • Fix type cast issue with dateTimeConvert scalar function (#11839, #11971)

  • Fix to set explicit warning flags set on each stage stats (#11936)

  • Fix mailbox visitor mismatch receive/send (#11908)

  • Fix to eliminate multiple exchanges in nested semi-join queries (#11882)

  • Bugfix for multiple consecutive Exchange returning empty response (#11885)

  • Fixing unit-test-2 build (#11889)

  • Fix issue with realtime partition mismatch metric (#11871)

  • Fix the NPE for rebalance retry (#11883)

  • Bugfix to make Agg literal attach happen after BASIC_RULES (#11863)

  • Fix NPE by init execution stats map (#11801)

  • Test cases for special column escape (#11737)

  • Fix StPoint scalar function usage in multi-stage engine intermediate stage (#11731)

  • Clean up for transform function type (#11726)

  • Add capability to ignore test (#11703)

  • Fix custom property naming (#11675)

  • Log warning when multi-stage engine planning throws exception (#11595)

  • Fix usage of metadata overrides (#11587)

  • Test change to enable metadata manager by default for colocated join quickstart (#11579)

  • Tests for IN/NOT-IN operation (#12349)

  • Fix stage id in stage plan (#12366)

  • Bugfix for IN and NOT IN filters within case statements (#12305)

Notable features

Server-level throttling for realtime consumption (#12292)

  • Use server config pinot.server.consumption.rate.limit to enable this feature

  • Server rate limiter is disabled by default (default value 0)
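
  • A minimal sketch of enabling the limiter in the server configuration (the numeric value is illustrative):

# conf/pinot-server.conf
pinot.server.consumption.rate.limit=50000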

Reduce segment generation disk footprint for Minion Tasks (#12220)

  • Supported in MergeRollupTask and RealtimeToOfflineSegmentsTask minion tasks

  • Use taskConfig segmentMapperFileSizeThresholdInBytes to specify the threshold size

"task": {
  "taskTypeConfigsMap": {
    "<task_name>": {
      "segmentMapperFileSizeThresholdInBytes": "1000000000"
    }
  }
}

Support for swapping of TLS keystore/truststore (#12277, #12325)

  • Security feature that makes the keystore/truststore swappable.

  • Auto-reloads keystore/truststore (without need for a restart) if they are local files

Sticky query routing (#12276)

  • Adds support for deterministic and sticky routing for a query / table / broker. This setting leads to the same server / set of servers (for MultiStageReplicaGroupSelector) being used for all queries of a given table.

  • Query option (takes precedence over the fixed routing setting at table / broker config level): SET "useFixedReplica"=true;

  • Table config (takes precedence over fixed routing setting at broker config level)

    "routing": {
       ...          
       "useFixedReplica": true
    }
  • Broker conf - pinot.broker.use.fixed.replica=true

Table Config to disallow duplicate primary key for dimension tables (#12290)

  • Use tableConfig dimensionTableConfig.errorOnDuplicatePrimaryKey=true to enable this behavior

  • Disabled by default
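
  • A minimal table config sketch enabling this check (all other fields omitted):

"dimensionTableConfig": {
  "errorOnDuplicatePrimaryKey": true
}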

Partition-level ForceCommit for realtime tables (#12088)

  • Support to force-commit specific partitions of a realtime table.

  • Partitions can be specified to the forceCommit API as a comma separated list of partition names or consuming segment names

Support initializing broker tags from config (#12175)

  • Support to give the broker initial tags on startup.

  • Automatically updates brokerResource when broker joins the cluster for the first time

  • Broker tags are provided as comma-separated values in pinot.broker.instance.tags
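
  • A broker startup config sketch (the tag values are illustrative):

# conf/pinot-broker.conf
pinot.broker.instance.tags=DefaultTenant_BROKER,tenantA_BROKER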

Support for StreamNative OAuth2 authentication for Pulsar (#12068)

  • StreamNative (the cloud SAAS offering of Pulsar) uses OAuth2 to authenticate clients to their Pulsar clusters.

  • For more information, see how to Configure OAuth2 authentication in Pulsar clients

  • Can be configured by adding the following properties to streamConfigs:

"stream.pulsar.issuerUrl": "https://auth.streamnative.cloud"
"stream.pulsar.credsFilePath": "file:///path/to/private_creds_file"
"stream.pulsar.audience": "urn:sn:pulsar:test:test-cluster"

Introduce low disk mode to table rebalance (#12072)

  • Introduces a new table rebalance boolean config lowDiskMode. Default value is false.

  • Applicable for rebalance with downtime=false.

  • When enabled, segments will first be offloaded from servers, then added to servers after offload is done. It may increase the total time of the rebalance, but can be useful when servers are low on disk space, and we want to scale up the cluster and rebalance the table to more servers.

  • #12112 adds the UI capability to toggle this option
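
  • As a sketch, assuming the standard table rebalance endpoint and that lowDiskMode is exposed as a query parameter (host, table name, and exact parameter spelling should be verified against the controller Swagger UI):

curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=REALTIME&downtime=false&lowDiskMode=true"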

Support vector index and hierarchical navigable small worlds (HNSW) (#11977)

  • Supports Vector Index on float array/multi-value columns

  • Add predicate and function to retrieve topK closest vector. Example query

SELECT ProductId, UserId, l2_distance(embedding, ARRAY[-0.0013143676,-0.011042999,...]) AS l2_dist, n_tokens, combined
FROM fineFoodReviews
WHERE VECTOR_SIMILARITY(embedding, ARRAY[-0.0013143676,-0.011042999,...], 5)  
ORDER by l2_dist ASC 
LIMIT 10
  • The function l2_distance will return a double value where the first parameter is the embedding column and the second parameter is the search term embedding literal.

  • Since VECTOR_SIMILARITY is a predicate, once topK is configured, it returns the top-K rows per segment. If you combine this index with other predicates, you may not get the expected number of rows, because the records matching the other predicates might not be within the top-K rows.

Support for retention on deleted keys of upsert tables (#12037)

  • Adds an upsert config deletedKeysTTL, which removes deleted keys from the in-memory hash map and marks the validDocID as invalid after the deletedKeysTTL threshold period.

  • Disabled by default. Enabled only if a valid value for deletedKeysTTL is set.
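
  • A minimal upsertConfig sketch (the TTL value and the deleteRecordColumn name are illustrative):

"upsertConfig": {
  "mode": "FULL",
  "deleteRecordColumn": "isDeleted",
  "deletedKeysTTL": 86400000
}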

Configurable Lucene analyzer (#12027)

  • Introduces the capability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis.

  • Sample usage

fieldConfigList: [
   {
        "name": "columnName",
        "indexType": "TEXT",
        "indexTypes": [
          "TEXT"
        ],
        "properties": {
          "luceneAnalyzerClass": "org.apache.lucene.analysis.core.KeywordAnalyzer"
        },
      }
  ]
  • Default behavior falls back to using the StandardAnalyzer unless the luceneAnalyzerClass property is specified.

Support for murmur3 as a partition function (#12049)

  • Murmur3 support with optional fields seed and variant for the hash in the functionConfig field of columnPartitionMap. Default value for seed is 0.

  • Added support for 2 variants of Murmur3: x86_32 and x64_32, configurable using the variant field in functionConfig. If no variant is provided, the x86_32 variant is used, as it was part of the original implementation.

  • Examples of functionConfig:

    "tableIndexConfig": {
          ..
          "segmentPartitionConfig": {
            "columnPartitionMap": {
              "memberId": {
                "functionName": "Murmur3",
                "numPartitions": 3 
              },
              ..
            }
          }

    Here there is no functionConfig configured, so the seed value will be 0 and variant will be x86_32.

    "tableIndexConfig": {
          ..
          "segmentPartitionConfig": {
            "columnPartitionMap": {
              "memberId": {
                "functionName": "Murmur3",
                "numPartitions": 3,
                "functionConfig": {
                   "seed": "9001"
                 },
              },
              ..
            }
          }

    Here the seed is configured as 9001 but as no variant is provided, x86_32 will be picked up.

     "tableIndexConfig": {
          ..
          "segmentPartitionConfig": {
            "columnPartitionMap": {
              "memberId": {
                "functionName": "Murmur3",
                "numPartitions": 3,
                "functionConfig" :{
                   "seed": "9001"
                   "variant": "x64_32"
                 },
              },
              ..
            }
          }

    Here the variant is mentioned so Murmur3 will use the x64_32 variant with 9001 as seed.

  • Note for users using Debezium and Murmur3 as the partitioning function:

    • The partitioning key should be set up on either byte[], String, or long[] columns.

    • On the Pinot side, variant should be set to x64_32 and seed should be set to 9001.

New optimized MV forward index to only store unique MV values

  • Adds new MV dictionary encoded forward index format that only stores the unique MV entries.

  • This new index format can significantly reduce the index size when the MV entries repeat a lot

  • The new index format can be enabled during index creation, derived column creation, and segment reload

  • To enable the new index format, set the compression codec in the FieldConfig:

    {
      "name": "myCol",
      "encodingType": "DICTIONARY",
      "compressionCodec": "MV_ENTRY_DICT"
    }

    Or use the new index JSON:

    {
      "name": "myCol",
      "encodingType": "DICTIONARY",
      "indexes": {
        "forward": {
          "dictIdCompressionType": "MV_ENTRY_DICT"
        }
      }
    }

Support for explicit null handling modes (#11960)

  • Adds support for 2 possible ways to handle null:

    • Table mode - which already exists

    • Column mode, which means that each column specifies its own nullability in the FieldSpec

  • Column mode can be enabled by the below config.

  • The default value for enableColumnBasedNullHandling is false. When set to true, Pinot will ignore TableConfig.IndexingConfig.nullHandlingEnabled and columns will be nullable if and only if FieldSpec.notNull is false, which is also the default value.

{
  "schemaName": "blablabla",
  "dimensionFieldSpecs": [
    {
      "dataType": "INT",
      "name": "nullableField",
      "notNull": false
    },
    {
      "dataType": "INT",
      "name": "notNullableField",
      "notNull": true
    },
    {
      "dataType": "INT",
      "name": "defaultNullableField"
    },
    ...
  ],
  "enableColumnBasedNullHandling": true/false
}

Support tracking out of order events in Upsert (#11877)

  • Adds a new upsert config outOfOrderRecordColumn

  • When set to a non-null value, Pinot checks whether an event is out-of-order and updates the corresponding column value to true / false accordingly.

  • This helps track which events are out-of-order when using skipUpsert
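
  • A minimal upsertConfig sketch (the column name is illustrative; the column is expected to exist in the table schema):

"upsertConfig": {
  "mode": "FULL",
  "outOfOrderRecordColumn": "isOutOfOrder"
}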

Compression configuration support for aggregationConfigs to StartreeIndexConfigs (#11744)

  • Can be used to save space, for example when a functionColumnPairs entry has an output type of bytes, such as when you use distinctcountrawhll.

  • Sample config

"starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "a",
            "b",
            "c"
          ],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": [],
          "aggregationConfigs": [
            {
              "columnName": "column1",
              "aggregationFunction": "SUM",
              "compressionCodec": "SNAPPY"
            },
            {
              "columnName": "column2",
              "aggregationFunction": "distinctcounthll",
              "compressionCodec": "LZ4"
            }
          ],
          "maxLeafRecords": 10000
        }
      ]

Preconfiguration based mirror instance assignment (#11578)

  • Supports instance assignment based on a pre-configured instance assignment map.

  • The assignment will always respect the mirrored servers in the pre-configured map

  • More details here

  • Sample table config

"instanceAssignmentConfigMap": {
  "CONSUMING": {
    "partitionSelector": "MIRROR_SERVER_SET_PARTITION_SELECTOR",
    "replicaGroupPartitionConfig": { ... },
     "tagPoolConfig": {
       ...
       "tag": "mt1_REALTIME"
     }
     ...
 }
 "COMPLETED": {
   "partitionSelector": "MIRROR_SERVER_SET_PARTITION_SELECTOR",
   "replicaGroupPartitionConfig": { ... },
    "tagPoolConfig": {
       ...
       "tag": "mt1_OFFLINE"
     }
     ...
 },
 "instancePartitionsMap": {
      "CONSUMING": “mt1_CONSUMING"
      "COMPLETED": "mt1_OFFLINE"
 },

Support for listing dimension tables (#11859)

  • Adds dimension as a valid option to table "type" in the /tables controller API

Support in upsert for dropping out of order events (#11811)

  • This patch adds a new config for upsert: dropOutOfOrderRecord

  • If set to true, Pinot doesn't persist out-of-order events in the segment.

  • This feature is useful to

    • Save disk-usage

    • Avoid confusion when using skipUpsert for partial-upsert tables, where nulls start showing up for columns that previously had non-null values and it is unclear whether an event is out-of-order or not.
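
  • A minimal upsertConfig sketch enabling this behavior:

"upsertConfig": {
  "mode": "FULL",
  "dropOutOfOrderRecord": true
}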

Support to retry failed table rebalance tasks (#11740)

  • New configs for the RebalanceChecker periodic task:

    • controller.rebalance.checker.frequencyPeriod: 5min by default ; -1 to disable

    • controller.rebalanceChecker.initialDelayInSeconds: 2min+ by default

  • New configs added for RebalanceConfig:

    • heartbeatIntervalInMs: 300_000 i.e. 5min

    • heartbeatTimeoutInMs: 3600_000 i.e. 1hr

    • maxAttempts: 3 by default, i.e. the original run plus two retries

    • retryInitialDelayInMs: 300_000 i.e. 5min, for exponential backoff w/ jitters

  • New metrics to monitor rebalance and its retries:

    • TABLE_REBALANCE_FAILURE("TableRebalanceFailure", false), emit from TableRebalancer.rebalanceTable()

    • TABLE_REBALANCE_EXECUTION_TIME_MS("tableRebalanceExecutionTimeMs", false), emit from TableRebalancer.rebalanceTable()

    • TABLE_REBALANCE_FAILURE_DETECTED("TableRebalanceFailureDetected", false), emit from RebalanceChecker

    • TABLE_REBALANCE_RETRY("TableRebalanceRetry", false), emit from RebalanceChecker

  • New restful API

    • DELETE /tables/{tableName}/rebalance API to stop rebalance. In comparison, POST /tables/{tableName}/rebalance was used to start one.
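
    • For example, a sketch of stopping an in-progress rebalance (the host is an assumption, and additional query parameters such as the table type may be required; check the controller Swagger UI):

curl -X DELETE "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE"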

Support for UltraLogLog (#11835)

  • UltraLogLog aggregations for Count Distinct (distinctCountULL and distinctCountRawULL)

  • UltraLogLog creation via Transform Function

  • UltraLogLog merging in MergeRollup

  • Support for UltraLogLog in Star-Tree indexes
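
  • A minimal query sketch using the new aggregations (table and column names are hypothetical):

SELECT distinctCountULL(userId) AS approxDistinctUsers,
       distinctCountRawULL(userId) AS serializedUllSketch
FROM events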

Support for Apache Datasketches CPC sketch (#11774)

  • Ingestion via transformation function

  • Extracting estimates via query aggregation functions

  • Segment rollup aggregation

  • StarTree aggregation

Support to reduce DirectMemory OOM chances on broker (#11710)

  • Broadly there are two configs that will enable this feature:

    • maxServerResponseSizeBytes: Maximum serialized response size across all servers for a query. This value is equally divided across all servers processing the query.

    • maxQueryResponseSizeBytes: Maximum length of the serialized response per server for a query

  • Configs are available as queryOption, tableConfig and Broker config. The priority of enforcement is as follows:

    The overriding order of priority is:
    1. QueryOption  -> maxServerResponseSizeBytes
    2. QueryOption  -> maxQueryResponseSizeBytes
    3. TableConfig  -> maxServerResponseSizeBytes
    4. TableConfig  -> maxQueryResponseSizeBytes
    5. BrokerConfig -> pinot.broker.max.server.response.size.bytes
    6. BrokerConfig -> pinot.broker.max.query.response.size.bytes
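
  • For example, a per-query override can be passed as a query option (the value, table, and columns are illustrative):

SET maxQueryResponseSizeBytes=100000000;
SELECT col1, col2 FROM myTable LIMIT 1000000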

UI support to allow schema to be created with JSON config (#11809)

  • This is helpful when the user has the entire JSON handy

  • The UI still keeps the form-based way to add a schema along with the JSON view

Support in JSON index for ignoring values longer than a given length (#11604)

  • Use option maxValueLength in jsonIndexConfig to restrict length of values

  • A value of 0 (or when the key is omitted) means there is no restriction
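
  • A sketch of the per-column config, assuming it is placed under tableIndexConfig.jsonIndexConfigs (the column name and value are illustrative):

"jsonIndexConfigs": {
  "jsonCol": {
    "maxValueLength": 1000
  }
}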

Support for MultiValue VarByte V4 index writer (#11674)

  • Supports serializing and writing MV columns in VarByteChunkForwardIndexWriterV4

  • Supports V4 reader that can be used to read SV var length, MV fixed length and MV var length buffers encoded with V4 writer

Improved scalar function support for multi-value columns (#11555, #11654)

arrayIndexOfInt(int[] value, int valToFind)
arrayIndexOfLong(long[] value, long valToFind)
arrayIndexOfFloat(float[] value, float valToFind)
arrayIndexOfDouble(double[] value, double valToFind)
arrayIndexOfString(String[] value, String valToFind)
intersectIndices(int[] values1, int[] values2)

Support for FrequentStringsSketch and FrequentLongsSketch aggregation functions (#11098)

  • Approximate aggregation functions for estimating the frequencies of items in a dataset in a memory-efficient way. More details in the Apache Datasketches library.

FREQUENTLONGSSKETCH(col, maxMapSize=256) -> Base64 encoded sketch object
FREQUENTSTRINGSSKETCH(col, maxMapSize=256) -> Base64 encoded sketch object

Controller API for table index (#11576)

  • Table index API to get the aggregate index details of all segments for a table.

    • URL: /tables/{tableName}/indexes

  • Response format

    {
        "totalSegments": 31,
        "columnToIndexesCount":
        {
            "col1":
            {
                "dictionary": 31,
                "bloom": 0,
                "null": 0,
                "forward": 31,
                ...
                "inverted": 0,
                "some-dynamically-injected-index-type": 31,
            },
            "col2":
            {
                ...
            }
            ...
    }
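
  • For example (the host is an assumption):

curl -X GET "http://localhost:9000/tables/myTable/indexes"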

Support for configurable rebalance delay at lead controller (#11509)

  • The lead controller rebalance delay is now configurable with controller.resource.rebalance.delay_ms

  • Changing rebalance configurations will now update the lead controller resource
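
  • A controller config sketch (the value is illustrative, in milliseconds):

# conf/pinot-controller.conf
controller.resource.rebalance.delay_ms=300000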

Support for configuration through environment variables (#12307)

  • Adds support for Pinot configuration through ENV variables with Dynamic mapping.

  • More details in issue: #10651

  • Sample configs through ENV

export PINOT_CONTROLLER_HOST=host
export PINOT_SERVER_PROPERTY_WHATEVER=whatever_property
export ANOTHER_VARIABLE=random

Add hyperLogLogPlus aggregation function for distinct count (#11346)

  • HLL++ has higher accuracy than HLL when the cardinality of the dimension is in the 10k-100k range.

  • More details here

DISTINCTCOUNTHLLPLUS(some_id, 12)

Support for clpMatch

  • Adds query rewriting logic to transform a "virtual" UDF, clpMatch, into a boolean expression on the columns of a CLP-encoded field.

  • To use the rewriter, modify broker config to add org.apache.pinot.sql.parsers.rewriter.ClpRewriter to pinot.broker.query.rewriter.class.names.
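
  • A broker config sketch; the property holds the full ordered list of rewriter classes, so keep the existing defaults and append ClpRewriter (the placeholder below stands for whatever rewriters are already configured):

# conf/pinot-broker.conf
pinot.broker.query.rewriter.class.names=<existing rewriter classes>,org.apache.pinot.sql.parsers.rewriter.ClpRewriter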

Support for DATETIMECONVERTWINDOWHOP function (#11773)

Support for JSON_EXTRACT_INDEX transform function to leverage json index for json value extraction (#11739)

Support for ArrayAgg aggregation function (#11822)

GenerateData command support for generating data in JSON format (#11778)

Enhancements

SQL

  • Support ARRAY function as a literal evaluation (#12278)

  • Support for ARRAY literal transform functions (#12118)

  • Theta Sketch Aggregation enhancements (#12042)

    • Adds configuration options for DistinctCountThetaSketchAggregationFunction

    • Respects ordering for existing Theta sketches to use "early-stop" optimisations for unions

  • Add query option override for Broker MinGroupTrimSize (#11984)

  • Support for 2 new scalar functions for bytes: toUUIDBytes and fromUUIDBytes (#11988)

  • Config option to make groupBy trim size configurable at Broker (#11958)

  • Pre-aggregation support for distinct count hll++ (#11747)

  • Add float type into literal thrift to preserve literal type conforming to SQL standards (#11697)

  • Enhancement to add query function override for Aggregate functions of multi valued columns (#11307)

  • Perf optimization in IN clause evaluation (#11557)

  • Add TextMatchFilterOptimizer to maximally push down text_match filters to Lucene (#12339)

UI

  • Async rendering of UI elements to load UI elements async resulting in faster page loads (#12210)

  • Make the table name link clickable in task details (#12253)

  • Swagger UI enhancements to resumeConsumption API call (#12200)

  • Adds support for CTRL key as a modifier for Query shortcuts (#12087)

  • UI enhancement to show partial index in reload (#11913)

  • UI improvement to add Links to Instance in Table and Segment View (#11807)

  • Fixes reload to use the right indexes API instead of fetching all segment metadata (#11793)

  • Enhancement to add toggle to hide/show query exceptions (#11611)

Misc

  • Enhancement to reduce the heap usage of String Dictionaries that are loaded on-heap (#12223)

  • Wire soft upsert delete for Compaction task (#12330)

  • Upsert compaction debuggability APIs for validDocId metadata (#12275)

  • Make server resource classes configurable (#12324)

  • Shared aggregations for Startree index - mapping from aggregation used in the query to aggregation used to store pre-aggregated values (#12164)

  • Increased fetch timeout for Kinesis to prevent stuck Kinesis consumers

  • Metric to track table rebalance (#12270)

  • Allow server-level configs for upsert metadata (#18851)

  • Support to dynamically initialize Kafka client SSL configs (#12249)

  • Optimize segment metadata file creation without having to download full segment (#12255)

  • Allow string / numeric data type for deleteRecordColumn config (#12222)

  • Atomic and Deterministic snapshots of validDocId for upsert tables (#12232, #12246)

  • Observability enhancement to add column name when JSON index building fails (#12151)

  • Creation of DateTimeGenerator for DATE_TIME field type columns (#12206)

  • Add singleton registry for all controller and minion metrics (#12119)

  • Support helm chart server separate liveness and readiness probe endpoints (#11800)

  • Observability enhancement to add metrics for Table Disabled and Consumption Paused (#12000)

  • Support for SegmentGenerationAndPushTask to push segment to realtime table (#12084)

  • Enhancement to make the deep store upload retry async with configurable parallelism (#12017)

  • Optimizations in segment commit to not read partition group metadata (#11943)

  • Replace timer with scheduled executor service in IngestionDelayTracker to reduce number of threads (#11849)

  • Adds an option skipControllerCertValidation to skip controller cert validation in AddTableCommand (#11967)

  • Adds instrumentation for DataTable Creation (#11942)

  • Improve performance of ZkBasicAuthAccessFactory by caching Bcrypt password (#11904)

  • Adds support to fetch metadata for a specific list of segments (#11949)

  • Allow user to specify a local temp directory for quickstart (#11961)

  • Optimization for server to directly return final result for queries hitting single server (#11938)

  • Explain plan optimization to early release AcquireReleaseColumnsSegmentOperator (#11945)

  • Observability metric to track query timeouts (#11892)

  • Add support for auth in QueryRunner (#11897)

  • Allow users to pass custom RecordTransformers to SegmentProcessorFramework (#11887)

  • Add isPartialResult flag to broker response (#11592)

  • Add new configs to Google Cloud Storage (GCS) connector: jsonKey (#11890)

    • jsonKey is the GCP credential key in string format (either a plain string or a base64-encoded string). Refer to Creating and managing service account keys to download the keys.

  • Performance enhancement to build segments in column orientation (#11776)

    • Disabled by default. Can be enabled by setting table config columnMajorSegmentBuilderEnabled

  • Observability enhancements to emit metrics for grpc request and multi-stage leaf stage (#11838)

    • pinot.server.query.log.maxRatePerSecond: query log max rate (QPS, default 10K)

    • pinot.server.query.log.droppedReportMaxRatePerSecond: dropped query log report max rate (QPS, default 1)

  • Observability improvement to expose GRPC metrics (#11842)

  • Improvements to response format for reload API to be pretty printed (#11608)

  • Enhancements to support Java 21 (#11672)

  • Add more information in RequestContext class (#11708)

  • Support to read exact buffer byte ranges corresponding to a given forward index doc id (#11729)

  • Enhance Broker reducer to handle expression format change (#11762)

  • Capture build scans on ge.apache.org to benefit from deep build insights (#11767)

  • Performance enhancement in multiple places by updating initial capacity of HashMap (#11709)

  • Support for building indexes post segment file creation, allowing indexes that may depend on a completed segment to be built as part of the segment creation process (#11711)

  • Support excluding time values in SimpleSegmentNameGenerator (#11650)

  • Perf enhancement to reduce cpu usage by avoiding throwing an exception during query execution (#11715)

  • Added framework for supporting nulls in ScalarTransformFunctionWrapper in the future (#11653)

  • Observability change to metrics to export netty direct memory used and max (#11575)

  • Observability change to add a metric to measure total thread cpu time for a table (#11713)

  • Observability change to use SlidingTimeWindowArrayReservoir in dropwizard metrics (#11695)

  • Minor improvements to upsert preload (#11694)

  • Observability changes to expose additional Realtime Ingestion Metrics (#11685)

  • Perf enhancement to remove the global lock in SegmentCompletionManager (#11679)

  • Enhancements to unify tmp file naming format and delete tmp files at a regular cadence by extending the ControllerPeriodicTask (#10815)

    • controller.realtime.segment.tmpFileAsyncDeletionEnabled (default false)

    • controller.realtime.segment.tmpFileRetentionInSeconds (default 3600)

  • Improvements to skip unparseable records in the csv record reader (#11540, #11594)

  • Enhancements to allow override/force options when adding a schema (#11572)

  • Enhancement to handle direct memory OOM on brokers (#11496)

  • Enhancement to metadata API to return upsert partition to primary key count map for both controller and server APIs (#12334)

  • Enhancements to peer server segment download by retrying both peer discovery and download. (#12317)

  • Helper functions in StarTreeBuilderUtils and StarTreeV2BuilderConfig (#12361)

  • Perf optimizations to release all segments of a table in releaseAndRemoveAllSegments method (#12297)

  • Enhancement to Maintain pool selection for the minimizeDataMovement instance partition assignment strategy (#11953)

  • Upsert enhancement to assign segments with respect to ideal state (#11628)

  • Observability change to export Additional Upsert Metrics to Prom (#11660)

  • Observability enhancement to add CPU metrics for minion purge task (#12337)

  • Add HttpHeaders in broker event listener requestContext (#12258)

Bug fixes, refactoring, cleanups, deprecations

  • Upsert bugfix in "rewind()" for CompactedPinotSegmentRecordReader (#12329)

  • Fix error message format for Preconditions check failures (#12327)

  • Bugfix to distribute Pinot as a multi-release JAR (#12131, #12300)

  • Fixes in upsert metadata manager (#12319)

  • Security fix to allow querying tables with table-type suffix (#12310)

  • Bugfix to ensure tagConfigOverride config is null for upsert tables (#12233 and #12311)

  • Increased fetch timeout for Kinesis to prevent stuck Kinesis consumers (#12214)

  • Fixes to catch-all Regex for JXM -> Prom Exporter (#12073 and #12295)

  • Fixes lucene index errors when using QuickStart (#12289)

  • Null handling bugfix for sketch group-by queries (#12259)

  • Null pointer exception fixes in Controller SQL resource (#12211)

  • Synchronization fixes to replace upsert segments (#12105 and #12241)

  • Bugfix for S3 connection pool error when AWS session tokens expire after an hour (#12221)

  • FileWriter fixes to append headerline only for required formats like csv (#12208)

  • Security bugfix for pulsar OAuth2 authentication (#12195)

  • Bugfix to appropriately compute "segment.flush.threshold.size" when force-committing realtime segments (#12188)

  • Fixes rebalance converge check that reports success before rebalance completes (#12182)

  • Fixes upsertPrimaryKeysCount metric reporting when table is deleted (#12169)

  • Update LICENSE-binary for commons-configuration2 upgrade (#12165)

  • Improve error logging when preloading segments that do not exist on the server (#12153)

  • Fixes to file access resource leaks (#12129)

  • Ingestion bugfix to avoid unnecessary transformers in CompositeTransformer (#12138)

  • Improve logging to print OS name during service startup (#12135)

  • Improve logging in multiple files (#12134, #12137, #12127, #12121)

  • Test fixes for ExprMinMaxRewriterTest.testQueryRewrite (#12047)

  • Fixes default path of log4j in helmchart (#12069, #12083)

  • Fix default brokerUpdateFrequencyInMillis for connector (#12093)

  • Updates to README file (#12075)

  • Fix to remove unnecessary locking during segment preloading (#12077)

  • Fix bug with silently ignoring force commit call failures (#12044)

  • Upsert bugfix to allow optional segments that can be skipped by servers without failing the query (#11978)

  • Fix incorrect handling of consumer creation errors (#12045)

  • Fix the memory leak issue on CommonsConfigurationUtils (#12056)

  • Fix rebalance on upsert table (#12054)

  • Add new Transformer to transform -0.0 and NaN (#12032)

  • Improve inverted index validation in table config to enhance user experience (#12043)

  • Fixes test flakiness by replacing HashSet/HashMap with LinkedHashSet/LinkedHashMap (#11941)

  • Flaky test fix for ServerRoutingStatsManagerTest.testQuerySubmitAndCompletionStats (#12029)

  • Fix derived column from MV column (#12028)

  • Support for leveraging StarTree index in conjunction with filtered aggregations (#11886)

  • Improves tableConfig validation for enabling size based threshold for realtime tables (#12016)

  • Fix flaky PinotTenantRestletResourceTest (#12026)

  • Fix flaky PinotTenantRestletResourceTest (#12019)

  • Fix the race condition of concurrent modification to segment data managers (#12004)

  • Fix the misuse of star-tree when all predicates are always false under OR (#12003)

  • Fix the test failures caused by instance drop failure (#12002)

  • Fix fromULL scalar function (#11995)

  • Fix to exclude module-info.class during shade operations (#11975)

  • Fix the wrong import for Preconditions (#11979)

  • Add check for illegal character '/' in taskName (#11955)

  • Bugfix to only register new segments when they are fully initialized by partitionUpsertMetadataManager (#11964)

  • Observability fix to add logs to track sequence of events for table creation (#11946)

  • Fix the NPE in minimizeDataMovement instance assignment strategy (#11952)

  • Fix to add catch all logging for exception during DQL/DML process (#11944)

  • Fix bug where we don't handle cases where an upsert table has both upsert deletion and upsert TTL configs (#11791)

  • Removing direct dependencies on commons-logging and replacing with jcl-over-slf4j (#11920)

  • Fix NPE for IN clause on constant STRING dictionary (#11930)

  • Fix flaky OfflineClusterIntegrationTest on server response size tests (#11926)

  • Avoid npe when checking mirror server set assignment (#11915)

  • Deprecate _segmentAssignmentStrategy in favor of SegmentsValidationAndRetentionConfig #11869

  • Bugfix to capture auth phase timing even if access is denied (#11884)

  • Bugfix to mark rows as invalid in case primary time column is out of range (#11907)

  • Fix to randomize server port to avoid port-already-bound issue (#11861)

  • Add LazyRow abstraction for previously indexed record (#11826)

  • Config Validation for upsert table to not assign COMPLETED segments to another server (#11852)

  • Bugfix to resolve dependency conflict in pinot-protobuf module (#11867)

  • Fix case of useMultistageEngine property reference in JsonAsyncHttpPinotClientTransportFactory (#11820)

  • Bugfix to add woodstox-core to pinot-s3 dependencies and fix stack trace (#11799)

  • Fix to move pinot-segment-local test from unit test suite 1 to 2 (#11865)

  • Observability fix to log upsert config when initializing the metadata manager (#11864)

  • Fix to improve tests when errors are received in the consumer thread (#11858)

  • Fix for flaky ArrayAgg test (#11860)

  • Fix for flaky tests in TupleSelectionTransformFunctionsTest (#11848)

  • Fix for arrayAgg null support (#11853)

  • Fix the bug of reading decimal value stored in int32 or int64 (#11840)

  • Remove duplicate pinot-integration-tests from unit test suite 2 (#11844)

  • Fix for a null handling error in queries (#11829)

  • Fix the way of fetching the segment zk metadata for task generators (#11832)

  • Make testInvalidateCachedControllerLeader times based on getMinInvalidateIntervalMs (#11815)

  • Update doap to reflect latest release (#11827)

  • Clean up integration test pom file (#11817)

  • Bugfix to exclude OFFLINE segments when reading server to segments map (#11818)

  • Add tests for zstd compressed parquet files (#11808)

  • Fix job submission time for reload and force commit job (#11803)

  • Remove actually unsupported config that selectively enable nullable columns (#10653)

  • Fix LLCRealtimeClusterIntegrationTest.testReset (#11806)

  • Use expected version in api for table config read modify write change (#11782)

  • Move jobId out of rebalanceConfig (#11790)

  • Fix PeerServerSegmentFinder not respecting HTTPS port (#11752)

  • Enhanced geospatial v2 integration tests (#11741)

  • Add integration test for rebalance in upsert tables (#11568)

  • Fix trivy CI issue (#11757)

  • Cleanup rebalance configs by adding a RebalanceConfig class (#11730)

  • Fix a protobuf comment to be more precise (#11735)

  • Move scala dependencies to root pom (#11671)

  • Fix ProtoBuf inputformat plug-in handling for null values (#11723)

  • Bugfix where segment download URI is invalid after same CRC refresh using tar push (#11720)

  • Fix in TableCacheTest (#11717)

  • Add more test for broker jersey bounded thread pool (#11705)

  • Fix bug in gapfill with SumAvgGapfillProcessor. (#11714)

  • Bugfix to allow GcsPinotFS to work with granular permissions (#11655)

  • Fix default log4j2 config file path in helm chart (#11707)

  • Refactor code and doc occurrences of argmin/max -> exprmin/max (#11700)

  • Make constructor and functions public to be used from scheduler plugins (#11699)

  • Bugfix to change json_format to return java null when java null is received (#11673)

  • Fix the potential access to upsert metadata manager after it is closed (#11692)

  • Bugfix to use isOptional instead of the deprecated hasOptional Keyword (#11682)

  • Fix logging issue in RealtimeTableDataManager (#11693)

  • Cleanup some reader/writer logic for raw forward index (#11669)

  • Do not execute spotless in Java 21 (#11670)

  • Update license-maven-plugin (#11665)

  • Bugfix to allow deletion of local files with special characters (#11664)

  • Clean up CaseTransformFunction::constructStatementListLegacy. (#11339)

  • Bugfix to force FileChannel to commit data to disk (#11625)

  • Remove the old deprecated commit end without metadata (#11662)

  • Fix for a jackson vulnerability (#11619)

  • Refactor BasicAuthUtils from pinot-core to pinot-common and remove pinot-core dependency from pinot-jdbc-client (#11620)

  • Bugfix to support several extensions for different indexes (#11600)

  • Fix the alias handling in single-stage engine (#11610)

  • Fix to use constant null place holder (#11615)

  • Refactor to move all BlockValSet into the same package (#11616)

  • Remove deprecated Request class from pinot-java-client (#11614)

  • Refactoring to remove old thirdeye files. (#11609)

  • Testing fix to use builder method in integration test (#11564)

  • Fix the broken Pinot JDBC client. (#11606)

  • Bugfix to change the Forbidden error to Unauthorized (#11501)

  • Fix for schema add UI issue that passing wrong data in the request header (#11602)

  • Remove/Deprecate HLC handling code (#11590)

  • Fix the bug of using push time to identify a newly created segment (#11599)

  • Bugfix in CSVRecordReader when using line iterator (#11581)

  • Improved validation for single-argument aggregation functions (#11556)

  • Fix to not emit lag once tabledatamanager shutdown (#11534)

  • Bugfix to fail reload if derived columns can't be created (#11559)

  • Fix the double unescape of property value (#12405)

  • Fix for the backward compatible issue that existing metadata may contain unescaped characters (#12393)

  • Skip invalid json string rather than throwing error during json indexing (#12238)

  • Fixing the multiple files concurrent write issue when reloading SSLFactory (#12384)

  • Fix memory leaking issue by making thread local variable static (#12242)

  • Bugfix for Upsert compaction task generator (#12380)

  • Log information about SSLFactory renewal (#12357)

  • Fixing array literal usage for vector (#12365)

  • Fixing quickstart table baseballStats minion ingestion (#12371)

  • Fix backward compatible issue in DistinctCountThetaSketchAggregationFunction (#12347)

  • Bugfix to skip instead of throwing error on 'getValidDocIdMetadata' (#12360)

  • Fix to clean up segment metadata when the associated segment gets deleted from remote store (#12350)

  • Fix getBigDecimal() scale throwing rounding error (#12326)

  • Workaround fix for the problem of Helix sending 2 transitions for CONSUMING -> DROPPED (#12351)

  • Bugfix for making nonLeaderForTables exhaustive (#12345)

  • Bugfixes for graceful interrupt handling of mutable lucene index (#11558,#12274)

  • Remove split commit and some deprecated config for real-time protocol on controller (#11663)

  • Update the table config in quick start (#11652)

  • Deprecate k8s skaffold scripts and move helm to project root directory (#11648)

  • Fix NPE in SingleColumnKeySelector (#11644)

  • Simplify kafka build and remove old kafka 0.9 files (#11638)

  • Adding comments for docker image tags, make a hyper link of helmChart from root directory (#11646)

  • Improve the error response on controller. (#11624)

  • Simplify authorization for table config get (#11640)

  • Bugfix to remove segments with empty download url in UpsertCompactionTask (#12320)

  • Test changes to make taskManager resources protected for derived classes to override in their setUp() method. (#12335)

Backward incompatible Changes

  • Fix a race condition for upsert compaction (#12346). Notes on backward incompatibility below:

    • This PR is introducing backward incompatibility for UpsertCompactionTask. Previously, we allowed configuring the compaction task without the snapshot enabled. We found that using in-memory based validDocIds is a bit dangerous, as it will not give us consistency (e.g. fetching the validDocIds bitmap while the server is restarting & updating validDocIds).

      We now enforce the enableSnapshot=true for UpsertCompactionTask if the advanced customer wants to run the compaction with the in-memory validDocId bitmap.

      {
        "upsertConfig": {
          "mode": "FULL",
          "enableSnapshot": true
        }
      }
      ...
      "task": {
        "taskTypeConfigsMap": {
          "UpsertCompactionTask": {
            "schedule": "0 */5 * ? * *",
            "bufferTimePeriod": "7d",
            "invalidRecordsThresholdPercent": "30",
            "invalidRecordsThresholdCount": "100000",
            "invalidDocIdsType": "SNAPSHOT/IN_MEMORY/IN_MEMORY_WITH_DELETE"
          }
        }
      }

      Also, we allow to configure invalidDocIdsType to UpsertCompactionTask for advanced user.

      1. snapshot: Default validDocIds type. This indicates that the validDocIds bitmap is loaded from the snapshot in the Pinot segment. UpsertConfig's enableSnapshot must be enabled for this type.

      2. onHeap: the validDocIds bitmap will be fetched from the server.

      3. onHeapWithDelete: the validDocIds bitmap will be fetched from the server. This will also take into account the deleted documents. UpsertConfig's deleteRecordColumn must be provided for this type.

  • Removal of the feature flag allow.table.name.with.database (#12402)

  • Error handling to throw exception when schema name doesn't match table name during table creation (#11591)

  • Fix type cast issue with dateTimeConvert scalar function (#11839, #11971)

  • Incompatible API fix to remove table state update operation in GET call (#11621)

  • Use string to represent BigDecimal datatype in JSON response (#11716)

  • Single quoted literal will not have its type auto-derived to maintain SQL compatibility (#11763)

  • Changes to always use split commit on server and disables the option to disable it (#11680, #11687)

  • Change to not allow NaN as default value for Float and Double in Schemas (#11661)

  • Code cleanup and refactor that removes TableDataManagerConfig (#12189)

  • Fix partition handling for consistency of values between query and segment (#12115)

  • Changes for migration to commons-configuration2 (#11985)

  • Cleanup to simplify the upsert metadata manager constructor (#12120)

  • Fixes typo in pom.xml (#11997)

  • JDBC Driver fixes to support Jetbrains Intellij/Datagrip database tooling (#11814)

  • Fix regression in ForwardIndexType for noDictionaryConfig and noDictionaryColumns (#11784)

  • Separate pr test scripts and codecov (#11804)

  • Bugfix to make reload status only count online/consuming segments (#11787)

  • Fix flaky TableViewsTest (#11770)

  • Fix a flaky test (#11771)

  • Cleanup to free more disk for trivy job (#11780)

  • Fix schema name in table config during controller startup (#11574)

  • Prevent NPE when attempt to fetch partition information fails (#11769)

  • Added UTs for null handling in CaseTransform function. (#11721)

  • Bugfix to disallow peer download when replication is < 2 (#11469)

  • Updates to Docker image and GitHub action scripts (#12378)

  • Enhancements to queries test framework (#12215)

Library upgrades and dependencies

  • Update maven-jar-plugin and maven-enforcer-plugin version (#11637)

  • Update testng as the test provider explicitly instead of relying on the classpath. (#11612)

  • Update compatibility verifier version (#11684)

  • Upgrade Avro dependency to 1.10.2 (#11698)

  • Upgrade testng version to 7.8.0 (#11462)

  • Update lombok version and config (#11742)

  • Upgrading Apache Helix to 1.3.1 version (#11754)

  • Upgrade spark from 3.2 to 3.5 (#11702)

  • Added commons-configuration2 dependency. (#11792)

  • Upgrade confluent libraries to 7.2.6 to fix some errors related to optional proto fields (#11753)

  • Upgrade lucene to 9.8.0 and upgrade text index version (#11857)

  • Upgrade the PinotConfiguration to commons-configuration2 (#11916)

  • Pre-PinotConfig commons-configuration2 upgrade (#11868)

  • Bump commons-codec:commons-codec from 1.15 to 1.16.0 (#12204)

  • Bump flink.version from 1.12.0 to 1.14.6 (#12202)

  • Bump com.yscope.clp:clp-ffi from 0.4.3 to 0.4.4 (#12203)

  • Bump org.apache.spark:spark-launcher_2.12 from 3.2.1 to 3.5.0 (#12199)

  • Bump io.grpc:grpc-context from 1.59.0 to 1.60.1 (#12198)

  • Bump com.azure:azure-core from 1.37.0 to 1.45.1 (#12193)

  • Bump org.freemarker:freemarker from 2.3.30 to 2.3.32 (#12192)

  • Bump com.google.auto.service:auto-service from 1.0.1 to 1.1.1 (#12183)

  • Bump dropwizard-metrics.version from 4.2.22 to 4.2.23 (#12178)

  • Bump org.apache.yetus:audience-annotations from 0.13.0 to 0.15.0 (#12170)

  • Bump com.gradle:common-custom-user-data-maven-extension (#12171)

  • Bump org.apache.httpcomponents:httpclient from 4.5.13 to 4.5.14 (#12172)

  • Bump org.glassfish.tyrus.bundles:tyrus-standalone-client (#12162)

  • Bump com.google.api.grpc:proto-google-common-protos (#12159)

  • Bump org.apache.datasketches:datasketches-java from 4.1.0 to 5.0.0 (#12161)

  • Bump org.apache.zookeeper:zookeeper from 3.6.3 to 3.7.2 (#12152)

  • Bump org.apache.commons:commons-collections4 from 4.1 to 4.4 (#12149)

  • Bump log4j.version from 2.20.0 to 2.22.0 (#12143)

  • Bump com.github.luben:zstd-jni from 1.5.5-6 to 1.5.5-11 (#12125)

  • Bump com.google.guava:guava from 32.0.1-jre to 32.1.3-jre (#12124)

  • Bump org.apache.avro:avro from 1.10.2 to 1.11.3 (#12116)

  • Bump org.apache.maven.plugins:maven-assembly-plugin from 3.1.1 to 3.6.0 (#12109)

  • Bump net.java.dev.javacc:javacc from 7.0.10 to 7.0.13 (#12103)

  • Bump com.azure:azure-identity from 1.8.1 to 1.11.1 (#12095)

  • Bump xml-apis:xml-apis from 1.4.01 to 2.0.2 (#12082)

  • Bump up the parquet version to 1.13.1 (#12076)

  • Bump io.grpc:grpc-context from 1.14.0 to 1.59.0 (#12034)

  • Bump org.reactivestreams:reactive-streams from 1.0.3 to 1.0.4 (#12033)

  • Bump org.codehaus.mojo:appassembler-maven-plugin from 1.10 to 2.1.0 (#12030)

  • Bump com.google.code.findbugs:jsr305 from 3.0.0 to 3.0.2 (#12031)

  • Bump org.jacoco:jacoco-maven-plugin from 0.8.9 to 0.8.11 (#12024)

  • Bump dropwizard-metrics.version from 4.2.2 to 4.2.22 (#12022)

  • Bump grpc.version from 1.53.0 to 1.59.0 (#12023)

  • Bump com.google.code.gson:gson from 2.2.4 to 2.10.1 (#12009)

  • Bump net.nicoulaj.maven.plugins:checksum-maven-plugin from 1.8 to 1.11 (#12008)

  • Bump circe.version from 0.14.2 to 0.14.6 (#12006)

  • Bump com.mercateo:test-clock from 1.0.2 to 1.0.4 (#12005)

  • Bump simpleclient_common.version from 0.8.1 to 0.16.0 (#11986)

  • Bump com.jayway.jsonpath:json-path from 2.7.0 to 2.8.0 (#11987)

  • Bump commons-net:commons-net from 3.1 to 3.10.0 (#11982)

  • Bump org.scalatest:scalatest-maven-plugin from 1.0 to 2.2.0 (#11973)

  • Bump io.netty:netty-bom from 4.1.94.Final to 4.1.100.Final (#11972)

  • Bump com.google.errorprone:error_prone_annotations from 2.3.4 to 2.23.0 (#11905)

  • Bump net.minidev:json-smart from 2.4.10 to 2.5.0 (#11875)

  • Bump org.yaml:snakeyaml from 2.0 to 2.2 (#11876)

  • Bump browserify-sign in /pinot-controller/src/main/resources (#11896)

  • Bump org.easymock:easymock from 4.2 to 5.2.0 (#11854)

  • Bump org.codehaus.mojo:exec-maven-plugin from 1.5.0 to 3.1.0 (#11856)

  • Bump com.github.luben:zstd-jni from 1.5.2-3 to 1.5.5-6 (#11855)

  • Bump aws.sdk.version from 2.20.94 to 2.20.137 (#11463)

  • Bump org.xerial.snappy:snappy-java from 1.1.10.1 to 1.1.10.4 (#11678)