Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Here you will find a collection of ready-made sample applications and examples for real-world data
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The Pinot Controller is responsible for the following:
Maintaining global metadata (e.g. configs and schemas) of the system with the help of Zookeeper which is used as the persistent metadata store.
Hosting the Helix Controller and managing other Pinot components (brokers, servers, minions)
Maintaining the mapping of which servers are responsible for which segments. This mapping is used by the servers to download the portion of the segments that they are responsible for. This mapping is also used by the broker to decide which servers to route the queries to.
Serving admin endpoints for viewing, creating, updating, and deleting configs, which are used to manage and operate the cluster.
Serving endpoints for segment uploads, which are used in offline data pushes. They are responsible for initializing real-time consumption and coordination of persisting real-time segments into the segment store periodically.
Undertaking other management activities such as managing retention of segments, validations.
For redundancy, there can be multiple instances of Pinot controllers. Pinot expects that all controllers are configured with the same back-end storage system so that they have a common view of the segments (e.g. NFS). Pinot can use other storage systems such as HDFS or ADLS.
Make sure you've setup Zookeeper. If you're using docker, make sure to pull the pinot docker image. To start a controller
Learn about the various components of Pinot and terminologies used to describe data stored in Pinot
Pinot is designed to deliver low latency queries on large datasets. In order to achieve this performance, Pinot stores data in a columnar format and adds additional indices to perform fast filtering, aggregation and group by.
Raw data is broken into small data shards and each shard is converted into a unit known as a segment. One or more segments together form a table, which is the logical container for querying Pinot using SQL/PQL.
Pinot uses a variety of terms which can refer to either abstractions that model the storage of data or infrastructure components that drive the functionality of the system.
Similar to traditional databases, Pinot has the concept of a table—a logical abstraction to refer to a collection of related data. As is the case with RDBMS, a table is a construct that consists of columns and rows (documents) that are queried using SQL. A table is associated with a schema which defines the columns in a table as well as their data types.
As opposed to RDBMS schemas, multiple tables can be created in Pinot (real-time or batch) that inherit a single schema definition. Tables are independently configured for concerns such as indexing strategies, partitioning, tenants, data sources, and/or replication.
Pinot has a distributed systems architecture that scales horizontally. Pinot expects the size of a table to grow infinitely over time. In order to achieve this, all data needs to be distributed across multiple nodes. Pinot achieves this by breaking data into smaller chunks known as segments (this is similar to shards/partitions in HA relational databases). Segments can also be seen as time-based partitions.
In order to support multi-tenancy, Pinot has first class support for tenants. A table is associated with a tenant. This allows all tables belonging to a particular logical namespace to be grouped under a single tenant name and isolated from other tenants. This isolation between tenants provides different namespaces for applications and teams to prevent sharing tables or schemas. Development teams building applications will never have to operate an independent deployment of Pinot. An organization can operate a single cluster and scale it out as new tenants increase the overall volume of queries. Developers can manage their own schemas and tables without being impacted by any other tenant on a cluster.
By default, all tables belong to a default tenant named "default". The concept of tenants is very important, as it satisfies the architectural principle of a "database per service/application" without having to operate many independent data stores. Further, tenants will schedule resources so that segments (shards) are able to restrict a table's data to reside only on a specified set of nodes. Similar to the kind of isolation that is ubiquitously used in Linux containers, compute resources in Pinot can be scheduled to prevent resource contention between tenants.
Logically, a cluster is simply a group of tenants. As with the classical definition of a cluster, it is also a grouping of a set of compute nodes. Typically, there is only one cluster per environment/data center. There is no needed to create multiple clusters since Pinot supports the concept of tenants. At LinkedIn, the largest Pinot cluster consists of 1000+ nodes distributed across a data center. The number of nodes in a cluster can be added in a way that will linearly increase performance and availability of queries. The number of nodes and the compute resources per node will reliably predict the QPS for a Pinot cluster, and as such, capacity planning can be easily achieved using SLAs that assert performance expectations for end-user applications.
Auto-scaling is also achievable, however, a set amount of nodes is recommended to keep QPS consistent when query loads vary in sudden unpredictable end-user usage scenarios.
A Pinot cluster is comprised of multiple distributed system components. These components are useful to understand for operators that are monitoring system usage or are debugging an issue with a cluster deployment.
Controller
Server
Broker
Minion (optional)
The benefits of scale that make Pinot linearly scalable for an unbounded number of nodes is made possible through its integration with Apache Zookeeper and Apache Helix.
Helix is a cluster management solution that was designed and created by the authors of Pinot at LinkedIn. Helix drives the state of a Pinot cluster from a transient state to an ideal state, acting as the fault-tolerant distributed state store that guarantees consistency. Helix is embedded as agents that operate within a controller, broker, and server, and does not exist as an independent and horizontally scaled component.
A controller is the core orchestrator that drives the consistency and routing in a Pinot cluster. Controllers are horizontally scaled as an independent component (container) and has visibility of the state of all other components in a cluster. The controller reacts and responds to state changes in the system and schedules the allocation of resources for tables, segments, or nodes. As mentioned earlier, Helix is embedded within the controller as an agent that is a participant responsible for observing and driving state changes that are subscribed to by other components.
In addition to cluster management, resource allocation, and scheduling, the controller is also the HTTP gateway for REST API administration of a Pinot deployment. A web-based query console is also provided for operators to quickly and easily run SQL/PQL queries.
A broker receives queries from a client and routes their execution to one or more Pinot servers before returning a consolidated response.
Servers host segments (shards) that are scheduled and allocated across multiple nodes and routed on an assignment to a tenant (there is a single tenant by default). Servers are independent containers that scale horizontally and are notified by Helix through state changes driven by the controller. A server can either be a real-time server or an offline server.
A real-time and offline server have very different resource usage requirements, where real-time servers are continually consuming new messages from external systems (such as Kafka topics) that are ingested and allocated on segments of a tenant. Because of this, resource isolation can be used to prioritize high-throughput real-time data streams that are ingested and then made available for query through a broker.
Pinot minion is an optional component that can be used to run background tasks such as "purge" for GDPR (General Data Protection Regulation). As Pinot is an immutable aggregate store, records containing sensitive private data need to be purged on a request-by-request basis. Minion provides a solution for this purpose that complies with GDPR while optimizing Pinot segments and building additional indices that guarantees performance in the presence of the possibility of data deletion. One can also write a custom task that runs on a periodic basis. While it's possible to perform these tasks on the Pinot servers directly, having a separate process (Minion) lessens the overall degradation of query latency as segments are impacted by mutable writes.
This page covers everything you need to know about how queries are computed in Pinot's distributed systems architecture.
This page will introduce you to the guiding principles behind the design of Apache Pinot. Here you will learn the distributed systems architecture that allows Pinot to scale the performance of queries linearly based on the number of nodes in a cluster. You'll also be introduced to the two different types of tables used to ingest and query data in offline (batch) or real-time (stream) mode.
It's recommended that you read Basic Concepts to better understand the terms used in this guide.
Pinot was designed by engineers at LinkedIn and Uber to scale query performance based on the number of nodes in a cluster. As you add more nodes, query performance will always improve based on the expected query volume per second quota. To achieve horizontal scalability to an unbounded number of nodes and data storage, without performance degradation, the following guiding design principles were established.
Highly available: Pinot is built to serve low latency analytical queries for customer facing applications. By design, there is no single point of failure in Pinot. The system continues to serve queries when a node goes down.
Horizontally scalable: Ability to scale by adding new nodes as a workload changes.
Latency vs Storage: Pinot is built to provide low latency even at high-throughput. Features such as segment assignment strategy, routing strategy, star-tree indexing were developed to achieve this.
Immutable data: Pinot assumes that all data stored is immutable. For GDPR compliance, we provide an add-on solution for purging data while maintaining performance guarantees.
Dynamic configuration changes: Operations such as adding new tables, expanding a cluster, ingesting data, modifying indexing config, and re-balancing must be performed without impacting query availability or performance.
As described in the concepts, Pinot has multiple distributed system components: Controller, Broker, Server, and Minion.
Pinot uses Apache Helix for cluster management. Helix is embedded as an agent within the different components and uses Apache Zookeeper for coordination and maintaining the overall cluster state and health.
All Pinot servers and brokers are managed by Helix. Helix is a generic cluster management framework to manage partitions and replicas in a distributed system. It's helpful to think of Helix as an event-driven discovery service with push and pull notifications that drives the state of a cluster to an ideal configuration. A finite-state machine maintains a contract of stateful operations that drives the health of the cluster towards its optimal configuration. Query load is optimized as Helix updates routing configurations between nodes based on where data is stored in the cluster.
Helix divides nodes into three logical components based on their responsibilities:
Participant: These are the nodes in the cluster that actually host the distributed storage resources.
Spectator: These nodes observe the current state of each participant and routes requests accordingly. Routers, for example, need to know the instance on which a partition is hosted and its state in order to route the request to the appropriate endpoint. Routing is continually being changed to optimize cluster performance as storage primitives are added and changed.
Controller: The controller observes and manages the state of participant nodes. The controller is responsible for coordinating all state transitions in the cluster and ensures that state constraints are satisfied while maintaining cluster stability.
Helix uses Zookeeper to maintain cluster state. Each component in a Pinot cluster takes a Zookeeper address as a startup parameter. The various components that are distributed in a Pinot cluster will watch Zookeeper notifications and issue updates via its embedded Helix-defined agent.
Helix agents use Zookeeper to store and update configurations, as well as for distributed coordination. Zookeeper stores the following information about the cluster:
Knowing the ZNode
layout structure in Zookeeper for Helix agents in a cluster is useful for operations and/or troubleshooting cluster state and health.
Pinot's controller acts as the driver of the cluster's overall state and health. Because of its role as a Helix participant and spectator, which drives the state of other components, it is the first component that is typically started after Zookeeper. Two parameters are required for starting a controller: Zookeeper address and cluster name. The controller will automatically create a cluster via Helix if it does not yet exist.
To achieve fault tolerance, one can start multiple controllers (typically three) and one of them will act as a leader. If the leader crashes or dies, another leader is automatically elected. Leader election is achieved using Apache Helix. Having at-least one controller is required to perform any DDL equivalent operation on the cluster, such as adding a table or a segment.
The controller does not interfere with query execution. Query execution is not impacted even when all controllers nodes are offline. If all controller nodes are offline, the state of the cluster will stay as it was when the last leader went down. When a new leader comes online, a cluster resumes re-balancing activity and can accept new tables or segments.
The controller provides a REST interface to perform CRUD operations on all logical storage resources (servers, brokers, tables, and segments).
See Pinot Data Explorer for more information on the web-based admin tool.
The responsibility of the broker is to route a given query to an appropriate server instance. A broker will collect and merge the responses from all servers into a final result and send it back to the requesting client. The broker provides HTTP endpoints that accept SQL queries and returns the response in JSON format.
Brokers need three key things to start.
Cluster name
Zookeeper address
Broker instance name
At the start, a broker registers as a Helix Participant and awaits notifications from other Helix agents. These notifications will be handled for table creation, a new segment being loaded, or a server starting up/or going down, in addition to any configuration changes.
Service Discovery/Routing Table
Irrespective of the kind of notification, the key responsibility of a broker is to maintain the query routing table. The query routing table is simply a mapping between segments and the servers that a segment resides on. Typically, a segment resides on more than one server. The broker computes multiple routing tables depending on the configured routing strategy for a table. The default strategy is to balance the query load across all available servers.
There are advanced routing strategies available such as ReplicaAware routing, partition-based routing, and minimal server selection routing. These strategies are meant for special or generic cases that are meant to serve very high throughput queries.
Query processing
For every query, a cluster's broker performs the following:
Fetches the routes that are computed for a query based on the routing strategy defined in a table's configuration.
Computes the list of segments to query from on each server.
Scatter-Gather: sends the requests to each server and gathers the responses.
Merge: merges the query results returned from each server.
Sends the query result to the client.
Fault tolerance
Broker instances scale horizontally without an upper bound. In a majority of cases, only three brokers are required. If most query results that are returned to a client are <1MB in size per query, one can run a broker and servers inside the same instance container. This lowers the overall footprint of a cluster deployment for use cases that do not need to guarantee a strict SLA on query performance in production.
Servers host segments and do most of the heavy lifting during query processing. Though the architecture shows that there are two kinds of servers, real-time and offline, a server does not really know if it's going to be a real-time server or an offline server. The responsibility of a server depends on the table assignment strategy.
In theory, a server can host both real-time segments and offline segments. However, in practice, we use different types of machine SKUs for real-time servers and offline servers. The advantage of separating real-time servers and offline servers is to allow each to scale independently.
Offline servers
Offline servers typically host segments that are immutable. In this case, segments are created outside of a cluster and uploaded via a shell-based curl request. Based on the replication factor and the segment assignment strategy, the controller picks one or more servers to host the segment. Servers are notified via Helix about the new segments. Servers fetch the segments from deep store and load them before being ready to serve query requests. At this point, the cluster's broker detects that new segments are available and starts including them in query responses.
Real-time servers
Real-time servers are different from the offline servers. Real-time server nodes ingest data from streaming sources, such as Kafka, and generate the indexed segments in-memory (flushing segments to disk periodically). In memory segments are also known as consuming segments. These consuming segments get flushed periodically based on completion threshold (based on number of rows, time or segment size). At this point, they are known as completed segments. Completed segments are similar to the offline server's segments. Queries go over the in-flight (consuming) segments and the completed segments.
Minion is an optional component and is not required to get started with Pinot. Minion is used for purging data from a Pinot cluster (for reasons such as GDPR compliance in the UK).
Within Pinot, a logical table is modeled as one of two types of physical tables: offline or real-time. The reason for having two types of tables is because each one follows a different state model.
A real-time and offline table provide different configuration options for indexing and, in the case of real-time, the connector properties for the stream data source (i.e. Kafka). Table types also allow users to use different containers for real-time and offline server nodes. For instance, offline servers might use virtual machines with larger storage capacity where real-time servers might need higher system memory and/or more CPU cores.
The two types of tables also scale differently.
Real-time tables have a smaller retention period and scales query performance based on the ingestion rate.
Offline tables have larger retention and scales performance based on the size of stored data.
There are a few things to keep in mind when configuring the different types of tables for your workloads. When ingesting data from the same source, you can have two tables that ingest the same data that are configured differently for real-time and offline queries. Even though the two tables have the same data, performance will scale differently for queries based on your requirements. In this scenario, real-time and offline tables must share the same schema.
Tables for real-time and offline can be configured differently depending on usage requirements. For example, you can choose to enable star-tree indexing for an offline table, while the real-time table with the same schema may not need it.
In batch mode, data is ingested into Pinot via an ingestion job. An ingestion job transforms a raw data source (such as a CSV file) into segments. Once segments are generated for the imported data, an ingestion job stores them into the cluster's segment store (a.k.a deep store) and notifies the controller. The notification is processed and the result is that the Helix agent on the controller updates the ideal state configuration in Zookeeper. Helix will then notify the offline server that there are new segments available. In response to the notification from the controller, the offline server downloads the newly created segments directly from the cluster's segment store. The cluster's broker, which watches for state changes in Helix, detects the new segments and adds them to the list of segments to query (segment-to-server routing table).
At table creation, a controller creates a new entry in Zookeeper for the consuming segment. Helix notices the new segment and notifies the real-time server, which starts consuming data from the streaming source. The broker, which watches for changes, detects the new segments and adds them to the list of segments to query (segment-to-server routing table).
Whenever the segment is complete (i.e. full), the real-time server notifies the Controller, which checks with all replicas and picks a winner to commit the segment to. The winner commits the segment and uploads it to the cluster's segment store, updating the state of the segment from "consuming" to "online". The controller then prepares a new segment in a "consuming" state.
Queries are received by brokers—which checks the request against the segment-to-server routing table—scattering the request between real-time and offline servers.
The two tables then process the request by filtering and aggregating the queried data, which is then returned back to the broker. Finally, the broker gathers together all of the pieces of the query response and responds back to the client with the result.
Brokers handle Pinot queries. They accept queries from clients and forward them to the right servers. They collect results back from the servers and consolidate them into a single response, to send back to the client.
Pinot Brokers are modeled as Helix Spectators. They need to know the location of each segment of a table (and each replica of the segments) and route requests to the appropriate server that hosts the segments of the table being queried.
The broker ensures that all the rows of the table are queried exactly once so as to return correct, consistent results for a query. The brokers may optimize to prune some of the segments as long as accuracy is not sacrificed.
Helix provides the framework by which spectators can learn the location in which each partition of a resource (i.e. participant) resides. The brokers use this mechanism to learn the servers that host specific segments of a table.
In the case of hybrid tables, the brokers ensure that the overlap between real-time and offline segment data is queried exactly once, by performing offline and real-time federation.
Let's take this example, we have real-time data for 5 days - March 23 to March 27, and offline data has been pushed until Mar 25, which is 2 days behind real-time. The brokers maintain this time boundary.
Suppose, we get a query to this table : select sum(metric) from table
. The broker will split the query into 2 queries based on this time boundary - one for offline and one for realtime. This query becomes - select sum(metric) from table_REALTIME where date >= Mar 25
and select sum(metric) from table_OFFLINE where date < Mar 25
The broker merges results from both these queries before returning the result to the client.
Cluster is a set of nodes comprising of servers, brokers, controllers and minions.
Briefly, Helix divides nodes into three logical components based on their responsibilities
The nodes that host distributed, partitioned resources
The nodes that observe the current state of each Participant and use that information to access the resources. Spectators are notified of state changes in the cluster (state of a participant, or that of a partition in a participant).
The node that observes and controls the Participant nodes. It is responsible for coordinating all transitions in the cluster and ensuring that state constraints are satisfied while maintaining cluster stability.
To setup a Pinot cluster, we need to first start Zookeeper.
Once we've started Zookeeper, we can start other components to join this cluster. If you're using docker, pull the latest apachepinot/pinot
image.
To start other components to join the cluster
Apache Pinot, a real-time distributed OLAP datastore, purpose-built for low-latency high throughput analytics, perfect for user-facing analytical workloads.
We'd love to hear from you!
Pinot is a real-time distributed OLAP datastore, purpose-built to provide ultra low-latency analytics, even at extremely high throughput. It can ingest directly from streaming data sources - such as Apache Kafka and Amazon Kinesis - and make the events available for querying instantly. It can also ingest from batch data sources such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.
At the heart of the system is a columnar store, with several smart indexing and pre-aggregation techniques for low latency. This makes Pinot the most perfect fit for user-facing realtime analytics. At the same time, Pinot is also a great choice for other analytical use-cases, such as internal dashboards, anomaly detection, and ad-hoc data exploration.
Pinot was built by engineers at LinkedIn and Uber and is designed to scale up and out with no upper bound. Performance always remains constant based on the size of your cluster and an expected query per second (QPS) threshold.
User-facing analytics, or site-facing analytics, is the analytical tools and applications that you would expose directly to the end-users of your product. In a user-facing analytics application, think of the user-base as ALL end users of an App. This App could be a social networking app, or a food delivery app - anything at all. It’s not just a few analysts doing offline analysis, or a handful of data scientists in a company running ad-hoc queries. This is ALL end-users, receiving personalized analytics on their personal devices (think 100s of 1000s of queries per second). These queries are triggered by apps, and not written by people, and so the scale will be as much as the active users on that App (think millions of events/sec)
And, this is for all the freshest possible data, which touches on the other aspect here - realtime analytics. "Yesterday" might be a long time ago for some businesses and they cannot wait for ETLs and batch jobs. The data needs to be used for analytics, as soon as it is generated (think latencies < 1s).
Wanting such a user-facing analytics application, using realtime events, sounds great. But what does it mean for the underlying infrastructure, to support such an analytical workload?
Such applications require the freshest possible data, and so the system needs to be able to ingest data in realtime and make it available for querying, also in realtime.
Data for such apps tend to be event data, for a wide range of actions, coming from multiple sources, and so the data comes in at a very high velocity and tends to be highly dimensional.
Queries are triggered by end-users interacting with apps - with queries per second in hundreds of thousands, with arbitrary query patterns, and latencies are expected to be in milliseconds for good user-experience.
And further do all of the above, while being scalable, reliable, highly available and have a low cost to serve.
This video talks more about user-facing realtime analytics, and how Pinot is used to achieve that.
Here's another great video that goes into the details of how Pinot tackles some of the challenges faced in handling a user-facing analytics workload.
Pinot originated at LinkedIn which currently has one of the largest deployment powering more than 50+ user facing applications such as Viewed My Profile, Talent Analytics, Company Analytics, Ad Analytics and many more. At LinkedIn, Pinot also serves as the backend for to visualize and monitor 10,000+ business metrics.
A column-oriented database with various compression schemes such as Run Length, Fixed Bit Length
Pluggable indexing technologies - Sorted Index, Bitmap Index, Inverted Index, StarTree Index, Range Index,
Ability to optimize query/execution plan based on query and segment metadata
Near real-time ingestion from streams such as Kafka, Kinesis and batch ingestion from sources such as Hadoop, S3, Azure, GCS
SQL-like language that supports selection, aggregation, filtering, group by, order by, distinct queries on data
Support for multi-valued fields
Horizontally scalable and fault-tolerant
Pinot is designed to execute OLAP queries with low latency. It is suited in contexts where fast analytics, such as aggregations, are needed on immutable data, possibly, with real-time data ingestion.
User facing Analytics Products
Real-time Dashboard for Business Metrics
Pinot can be also be used to perform typical analytical operations such as slice and dice, drill down, roll up, and pivot on large scale multi-dimensional data. For instance, at LinkedIn, Pinot powers dashboards for thousands of business metrics. One can connect various BI tools such Superset, Tableau, or PowerBI to visualize data in Pinot.
Anomaly Detection
While Pinot doesn't match the typical mold of a database product, it is best understood based on your role as either an analyst, data scientist, or application developer.
Enterprise business intelligence
For analysts and data scientists, Pinot is best viewed as a highly-scalable data platform for business intelligence. In this view, Pinot converges big data platforms with the traditional role of a data warehouse, making it a suitable replacement for analysis and reporting.
Enterprise application development
For application developers, Pinot is best viewed as an immutable aggregate store that sources events from streaming data sources, such as Kafka, and makes it available for query using SQL.
As is the case with a microservice architecture, data encapsulation ends up requiring each application to provision its own data store, as opposed to sharing one OLTP database for reads and writes. In this case, it becomes difficult to query the complete view of a domain because it becomes stored in many different databases. This is costly in terms of performance, since it requires joins across multiple microservices that expose their data over HTTP under a REST API. To prevent this, Pinot can be used to aggregate all of the data across a microservice architecture into one easily queryable view of the domain.
Our documentation is structured to let you quickly get to the content you need and is organized around the different concerns of users, operators, and developers. If you're new to Pinot and want to learn things by example, please take a look at our getting started section.
Pinot works very well for querying time series data with many dimensions and metrics over a vast unbounded space of records that scales linearly on a per node basis. Filters and aggregations are both easy and fast.
Pinot may be deployed to and operated on a cloud provider or a local or virtual machine. You may get started either with a bare-metal installation or a Kubernetes one (either locally or in the cloud). To get immediately started with Pinot, check out these quick start guides for bootstrapping a Pinot cluster using Docker or Kubernetes.
For a high-level overview that explains how Pinot works, please take a look at our basic concepts section.
To understand the distributed systems architecture that explains Pinot's operating model, please take a look at our basic architecture section.
This section is a reference for the definition of major components and logical abstractions used in Pinot. Please visit the section to get a general overview that ties together all of the reference material in this section.
Make sure you've . If you're using docker, make sure to . To start a broker
Pinot leverages for cluster management. Helix is a cluster management framework to manage replicated, partitioned resources in a distributed system. Helix uses Zookeeper to store cluster state and metadata.
Pinot Servers are modeled as Participants, more details about server nodes can be found in . Pinot Brokers are modeled as Spectators, more details about broker nodes can be found in . Pinot Controllers are modeled as Controllers, more details about controller nodes can be found in .
Another way to visualize the cluster is a logical view, wherein a cluster contains , tenants contain , and tables contain .
Typically, there is only one cluster per environment/data center. There is no needed to create multiple Pinot clusters since Pinot supports the concept of . At LinkedIn, the largest Pinot cluster consists of 1000+ nodes.
Start to browse Zookeeper data at .
Download Pinot Distribution using instructions in
Install to view the data in Zookeeper, and connect to localhost:2181
(Optional) You can also follow the instructions to build your own images.
Explore your cluster via
Join us in our Slack channel for questions, troubleshooting, and feedback. You can request an invite from - .
With Pinot's growing popularity, several companies are now using it in production to power a variety of analytics use cases. A detailed list of companies using Pinot can be found .
Pinot is the perfect choice for user-facing analytics products. Pinot was originally built at LinkedIn to power rich interactive real-time analytic applications such as , , , and many more. is another example of a customer-facing Analytics App. At LinkedIn, Pinot powers 50+ user-facing products, ingesting millions of events per second and serving 100k+ queries per second at millisecond latency.
Instructions to connect Pinot with Superset can found .
In addition to visualizing data in Pinot, one can run Machine Learning Algorithms to detect Anomalies on the data stored in Pinot. See for more information on how to use Pinot for Anomaly Detection and Root Cause Analysis.
Pinot prevent any possibility of sharing ownership of database tables across microservice teams. Developers can create their own query models of data from multiple systems of record depending on their use case and needs. As with all aggregate stores, query models are eventually consistent and immutable.
To start importing data into Pinot, check out our guides on batch import and stream ingestion based on our .
Pinot supports SQL for querying read-only data. Learn more about querying Pinot for time series data in our guide.
Component
Helix Mapping
Segment
Modeled as a Helix Partition. Each segment can have multiple copies referred to as Replicas.
Table
Modeled as a Helix Resource. Multiple segments are grouped into a table. All segments belonging to a Pinot Table have the same schema.
Controller
Embeds the Helix agent that drives the overall state of the cluster.
Server
Broker
Broker is modeled as a Helix Spectator that observes the cluster for changes in the state of segments and servers. In order to support multi-tenancy, brokers are also modeled as Helix Participants.
Minion
Pinot Minion is modeled as a Helix Participant.
Resource
Stored Properties
Controller
The controller that is assigned as the current leader
Servers/Brokers
A list of servers/brokers and their configuration
Health status
Tables
List of tables
Table configurations
Table schema information
List of segments within a table
Segment
Exact server location(s) of a segment (routing table)
State of each segment (online/offline/error/consuming)
Meta data about each segment
Pinot has the concept of a table, which is a logical abstraction to refer to a collection of related data. Pinot has a distributed architecture and scales horizontally. Pinot expects the size of a table to grow infinitely over time. In order to achieve this, the entire data needs to be distributed across multiple nodes.
Pinot achieves this by breaking the data into smaller chunks known as segments (similar to shards/partitions in relational databases). Segments can be seen as time-based partitions.
Thus, a segment is a horizontal shard representing a chunk of table data with some number of rows. The segment stores data for all columns of the table. Each segment packs the data in a columnar fashion, along with the dictionaries and indices for the columns. The segment is laid out in a columnar format so that it can be directly mapped into memory for serving queries.
Columns can be single or multi-valued and the following types are supported: STRING, INT, LONG, FLOAT, DOUBLE or BYTES.
Columns may be declared to be metric or dimension (or specifically as a time dimension) in the schema. Columns can have default null values. For example, the default null value of a integer column can be 0. The default value for bytes columns must be hex-encoded before it's added to the schema.
Pinot uses dictionary encoding to store values as a dictionary ID. Columns may be configured to be “no-dictionary” column in which case raw values are stored. Dictionary IDs are encoded using minimum number of bits for efficient storage (e.g. a column with a cardinality of 3 will use only 2 bits for each dictionary ID).
A forward index is built for each column and compressed for efficient memory use. In addition, you can optionally configure inverted indices for any set of columns. Inverted indices take up more storage, but improve query performance. Specialized indexes like Star-Tree index are also supported. For more details, see Indexing.
Once the table is configured, we can load some data. Loading data involves generating pinot segments from raw data and pushing them to the pinot cluster. Data can be loaded in batch mode or streaming mode. For more details, see the ingestion overview page.
Below are instructions to generate and push segments to Pinot via standalone scripts. For a production setup, you should use frameworks such as Hadoop or Spark. For more details on setting up data ingestion jobs, see Import Data.
To generate a segment, we need to first create a job spec YAML file. This file contains all the information regarding data format, input data location, and pinot cluster coordinates. Note that this assumes that the controller is RUNNING to fetch the table config and schema. If not, you will have to configure the spec to point at their location. For full configurations, see Ingestion Job Spec.
To create and push the segment in one go, use
Sample Console Output
Alternately, you can separately create and then push, by changing the jobType to SegmentCreation
or SegmenTarPush
.
The Ingestion job spec supports templating with Groovy Syntax.
This is convenient if you want to generate one ingestion job template file and schedule it on a daily basis with extra parameters updated daily.
e.g. you could set inputDirURI
with parameters to indicate the date, so that the ingestion job only processes the data for a particular date. Below is an example that templates the date for input and output directories.
You can pass in arguments containing values for ${year}, ${month}, ${day}
when kicking off the ingestion job: -values $param=value1 $param2=value2
...
This ingestion job only generates segments for date 2014-01-03
Prerequisites
Below is an example of how to publish sample data to your stream. As soon as data is available to the realtime stream, it starts getting consumed by the realtime servers
Run below command to stream JSON data into Kafka topic: flights-realtime
Run below command to stream JSON data into Kafka topic: flights-realtime
A tenant is a logical component defined as a group of server/broker nodes with the same Helix tag.
In order to support multi-tenancy, Pinot has first-class support for tenants. Every table is associated with a server tenant and a broker tenant. This controls the nodes that will be used by this table as servers and brokers. This allows all tables belonging to a particular use case to be grouped under a single tenant name.
The concept of tenants is very important when the multiple use cases are using Pinot and there is a need to provide quotas or some sort of isolation across tenants. For example, consider we have two tables Table A
and Table B
in the same Pinot cluster.
We can configure Table A
with server tenant Tenant A
and Table B
with server tenant Tenant B
. We can tag some of the server nodes for Tenant A
and some for Tenant B
. This will ensure that segments of Table A
only reside on servers tagged with Tenant A
, and segment of Table B
only reside on servers tagged with Tenant B
. The same isolation can be achieved at the broker level, by configuring broker tenants to the tables.
No need to create separate clusters for every table or use case!
This tenant is defined in the tenants section of the table config.
This section contains 2 main fields broker
and server
, which decide the tenants used for the broker and server components of this table.
In the above example:
The table will be served by brokers that have been tagged as brokerTenantName_BROKER
in Helix.
If this were an offline table, the offline segments for the table will be hosted in Pinot servers tagged in Helix as serverTenantName_OFFLINE
If this were a real-time table, the real-time segments (both consuming as well as completed ones) will be hosted in pinot servers tagged in Helix as serverTenantName_REALTIME
.
Here's a sample broker tenant config. This will create a broker tenant sampleBrokerTenant
by tagging 3 untagged broker nodes as sampleBrokerTenant_BROKER
.
To create this tenant use the following command. The creation will fail if number of untagged broker nodes is less than numberOfInstances
.
Follow instructions in Getting Pinot to get Pinot locally, and then
Check out the table config in the Rest API to make sure it was successfully uploaded.
Here's a sample server tenant config. This will create a server tenant sampleServerTenant
by tagging 1 untagged server node as sampleServerTenant_OFFLINE
and 1 untagged server node as sampleServerTenant_REALTIME
.
To create this tenant use the following command. The creation will fail if number of untagged server nodes is less than offlineInstances
+ realtimeInstances
.
Follow instructions in Getting Pinot to get Pinot locally, and then
Check out the table config in the Rest API to make sure it was successfully uploaded.
Servers host the data segments and serve queries off the data they host. There are two types of servers:
Offline Offline servers are responsible for downloading segments from the segment store, to host and serve queries off. When a new segment is uploaded to the controller, the controller decides the servers (as many as replication) that will host the new segment and notifies them to download the segment from the segment store. On receiving this notification, the servers download the segment file and load the segment onto the server, to server queries off them.
Real-time Real-time servers directly ingest from a real-time stream (such as Kafka, EventHubs). Periodically, they make segments of the in-memory ingested data, based on certain thresholds. This segment is then persisted onto the segment store.
Pinot Servers are modeled as Helix Participants, hosting Pinot tables (referred to as resources in Helix terminology). Segments of a table are modeled as Helix partitions (of a resource). Thus, a Pinot server hosts one or more helix partitions of one or more helix resources (i.e. one or more segments of one or more tables).
Make sure you've setup Zookeeper. If you're using docker, make sure to pull the pinot docker image. To start a server
USAGE
Each table in Pinot is associated with a Schema. A schema defines what fields are present in the table along with the data types.
The schema is stored in the Zookeeper, along with the table configuration.
A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:
Data types determine the operations that can be performed on a column. Pinot supports the following data types:
BOOLEAN
, TIMESTAMP
, JSON
are added after release 0.7.1
. In release 0.7.1
and older releases, BOOLEAN
is equivalent to STRING.
Pinot also supports columns that contain lists or arrays of items, but there isn't an explicit data type to represent these lists or arrays. Instead, you can indicate that a dimension column accepts multiple values. For more information, see DimensionFieldSpec in the Schema configuration reference.
Since Pinot doesn't have a dedicated DATETIME
datatype support, you need to input time in either STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format such as epoch timestamp to do operations.
To achieve this conversion, you will need to provide the format of the date along with the data type in the schema. The format is described using the following syntax: timeSize:timeUnit:timeFormat:pattern
.
time size - the size of the time unit. This size is multiplied to the value present in the time column to get an actual timestamp. e.g. if timesize is 5 and value in time column is 4996308 minutes. The value that will be converted to epoch timestamp will be 4996308 * 5 * 60 * 1000 = 1498892400000 milliseconds.
If your date is not in EPOCH
format, this value is not used and can be set to 1 or any other integer.
time unit - one of TimeUnit enum values. e.g. HOURS
, MINUTES
etc. If your date is not in EPOCH
format, this value is not used and can be set to MILLISECONDS
or any other unit.
timeFormat - can be either EPOCH
or SIMPLE_DATE_FORMAT
. If it is SIMPLE_DATE_FORMAT
, the pattern string is also specified.
pattern - This is optional and is only specified when the date is in SIMPLE_DATE_FORMAT
. The pattern should be specified using the java SimpleDateFormat representation. e.g. 2020-08-21 can be represented as yyyy-MM-dd
.
Here are some sample date-time formats you can use in the schema:
1:MILLISECONDS:EPOCH
- used when timestamp is in the epoch milliseconds and stored in LONG
format
1:HOURS:EPOCH
- used when timestamp is in the epoch hours and stored in LONG
or INT
format
1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd
- when the date is in STRING
format and has the pattern year-month-date. e.g. 2020-08-21
1:HOURS:SIMPLE_DATE_FORMAT:EEE MMM dd HH:mm:ss ZZZ yyyy
- when date is in STRING
format. e.g. Mon Aug 24 12:36:50 America/Los_Angeles 2019
There are several built-in virtual columns inside the schema the can be used for debugging purposes:
These virtual columns can be used in queries in a similar way to regular columns.
First, Make sure your cluster is up and running.
Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.
For more details on constructing a schema file, see the Schema configuration reference.
Then, we can upload the sample schema provided above using either a Bash command or REST API call.
Check out the schema in the Rest API to make sure it was successfully uploaded
A Minion is a standby component that leverages the Helix Task Framework to offload computationally intensive tasks from other components.
It can be attached to an existing Pinot cluster and then execute tasks as provided by the controller. Custom tasks can be plugged via annotations into the cluster. Some typical minion tasks are:
Segment creation
Segment purge
Segment merge
Make sure you've setup Zookeeper. If you're using docker, make sure to pull the pinot docker image. To start a minion
PinotTaskGenerator interface defines the APIs for the controller to generate tasks for minions to execute.
Factory for PinotTaskExecutor
which defines the APIs for Minion to execute the tasks.
Factory for MinionEventObserver
which defines the APIs for task event callbacks on minion.
To be added
See Pinot managed Offline flows for details.
See Minion merge rollup task for details.
To be added
Tasks are enabled on a per-table basis. To enable a certain task type (e.g. myTask
) on a table, update the table config to include the task type:
Under each enable task type, custom properties can be configured for the task type.
Tasks can be scheduled periodically for all task types on all enabled tables. Enable auto task scheduling by configuring the schedule frequency in the controller config with the key controller.task.frequencyInSeconds
.
Tasks can also be scheduled based on cron expressions. The cron expression is set in the schedule
config for each task type separately. Thie optioncontroller.task.scheduler.enabled
should be set to true
to enable cron scheduling.
As shown below, the RealtimeToOfflineSegmentsTask will be scheduled at the first second of every minute (following the syntax defined here).
Tasks can be manually scheduled using the following controller rest APIs:
To plug in a custom task, implement PinotTaskGenerator
, PinotTaskExecutorFactory
and MinionEventObserverFactory
(optional) for the task type (all of them should return the same string for getTaskType()
), and annotate them with the following annotations:
After annotating the classes, put them under the package of name org.apache.pinot.*.plugin.minion.tasks.*
, then they will be auto-registered by the controller and minion.
See SimpleMinionClusterIntegrationTest where the TestTask
is plugged-in.
There is a controller job that runs every 5 minutes by default and emits metrics about Minion tasks scheduled in Pinot. The following metrics are emitted for each task type:
NumMinionTasksInProgress: Number of running tasks
NumMinionSubtasksRunning: Number of running sub-tasks
NumMinionSubtasksWaiting: Number of waiting sub-tasks (unassigned to a minion as yet)
NumMinionSubtasksError: Number of error sub-tasks (completed with an error/exception)
PercentMinionSubtasksInQueue: Percent of sub-tasks in waiting or running states
PercentMinionSubtasksInError: Percent of sub-tasks in error
For each task, the Minion will emit these metrics:
TASK_QUEUEING: Task queueing time (task_dequeue_time - task_inqueue_time), assuming the time drift between helix controller and pinot minion is minor, otherwise the value may be negative
TASK_EXECUTION: Task execution time, which is the time spent on executing the task
NUMBER_OF_TASKS: number of tasks in progress on that minion. Whenever a Minion starts a task, increase the Gauge by 1, whenever a Minion completes (either succeeded or failed) a task, decrease it by 1
This section contains quick start guides to help you get up and running with Pinot.
We want your experience getting started with Pinot to be both low effort and high reward. Here you'll find a collection of quick start guides that contain starter distributions of the Pinot platform.
This video will show you a step-by-step walk through for launching the individual components of Pinot and scaling them to multiple instances. This is an excellent resource for developers and operators that want to understand setting up each component and debugging a cluster.
We also have a step-by-step guide for manually setting up a Pinot cluster using Docker or shell scripts.
Explore the data on our Pinot cluster
Once you have set up the Cluster, you can start exploring the data and the APIs. Pinot 0.5.0 comes bundled with a completely new UI.
Let's take a look at the following two features on the UI
We can see our baseballStats
table listed on the left (you will see meetupRSVP
or airlineStats
if you used the streaming or the hybrid quick start). Click on the table name to display all the names along with the data types of the columns of the table.
You can also execute a sample query select * from baseballStats limit 10
by typing it in the text box and clicking the Run Query
button.
Cmd + Enter
can also be used to run the query when focused on the console.
You can also try out the following queries:
Pinot supports the following types of table:
The user querying the database does not need to know the type of the table. They only need to specify the table name in the query.
e.g. regardless of whether we have an offline table myTable_OFFLINE
, a real-time table myTable_REALTIME
, or a hybrid table containing both of these, the query will be:
You can use the following properties to make your tables faster or leaner:
Segment
Indexing
Tenants
For real-time tables, segments are built in a specific interval inside Pinot. You can tune the following for the real-time segments:
The Pinot real-time consumer ingests the data, creates the segment, and then flushes the in-memory segment to disk. Pinot allows you to configure when to flush the segment in the following ways:
Number of consumed rows - After consuming X no. of rows from the stream, Pinot will persist the segment to disk
Number of desired rows per segment - Pinot learns and then estimates the number of rows that need to be consumed so that the persisted segment is approximately the size. The learning phase starts by setting the number of rows to 100,000 (this value can be changed) and adjusts it to reach the desired segment size. The segment size may go significantly over the desired size during the learning phase. Pinot corrects the estimation as it goes along, so it is not guaranteed that the resulting completed segments are of the exact size as configured. You should set this value to optimize the performance of queries.
Max time duration to wait - Pinot consumers wait for the configured time duration after which segments are persisted to the disk.
Replicas A segment can have multiple replicas to provide higher availability. You can configure the number of replicas for a table segment using
However, in certain scenarios, the segment build can get very memory intensive. It might be desirable to enforce the non-committer servers to just download the segment from the controller, instead of building it again. You can do this by setting completionMode: "DOWNLOAD"
in the table configuration
Download Scheme
A Pinot server may fail to download segments from the deep store such as HDFS after its completion. However, you can configure servers to download these segments from peer servers instead of the deep store. Currently, only HTTP and HTTPS download schemes are supported. More methods such as gRPC/Thrift can be added in the future.
You can create multiple indices on a table to increase the performance of the queries. The following types of indices are supported:
Dictionary-encoded forward index with bit compression
Raw value forward index
Sorted forward index with run-length encoding
Bitmap inverted index
Sorted inverted index
You can aggregate the real-time stream data as it is consumed to reduce segment sizes. We sum the metric column values of all rows that have the same dimensions and create a single row in the segment. This feature is only available on REALTIME
tables.
The only supported aggregation is SUM
. The columns on which pre-aggregation is to be done need to satisfy the following requirements:
All metrics should be listed in noDictionaryColumns
.
There should not be any multi-value dimensions.
All dimension columns are treated to have a dictionary, even if they appear as noDictionaryColumns
in the config.
The following table config snippet shows an example of enabling pre-aggregation during real-time ingestion.
You can also override if a table should move to a server with different tenant based on segment status.
A tagOverrideConfig
can be added under the tenants
section for realtime tables, to override tags for consuming and completed segments. For example:
A hybrid table is a table composed of 2 tables, one offline and one real-time that share the same name. In such a table, offline segments may be pushed periodically. The retention on the offline table can be set to a high value since segments are coming in on a periodic basis, whereas the retention on the real-time part can be small.
Once an offline segment is pushed to cover a recent time period, the brokers automatically switch to using the offline table for segments for that time period and use the real-time table only for data not available in the offline table.
A typical scenario is pushing a deduped cleaned up data into an offline table every day while consuming real-time data as and when it arrives. The data can be kept in offline tables for even a few years while the real-time data would be cleaned every few days.
Prerequisites
Sample Console Output
Start Kafka
Create a Kafka Topic
Create a Streaming table
Sample output
Start Kafka-Zookeeper
Start Kafka
Create stream table
Note that creating a hybrid table has to be done in 2 separate steps of creating an offline and real-time table individually.
This quick start guide will help you bootstrap a Pinot standalone instance on your local machine.
In this guide you'll learn how to download and install Apache Pinot as a standalone instance.
First, let's download the Pinot distribution for this tutorial. You can either build the distribution from source or download a packaged release.
Prerequisites
Install JDK11 or higher (JDK16 is not yet supported) For JDK 8 support use Pinot 0.7.1 or compile from source
Prerequisites
Add maven option -Djdk.version=8
when building with JDK 8
Note that Pinot scripts is located under pinot-distribution/target not target directory under root.
Once you have the tar file,
We'll be using the quick-start scripts provided along with pinot distribution, which do the following:
Set up the Pinot cluster QuickStartCluster
Create a sample table and load sample data
Batch quick start creates the pinot cluster, creates an offline table baseballStats
and pushes sample offline data to the table.
Streaming quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a realtime table meetupRSVP
which ingests data from the Kafka topic.
Hybrid quick start sets up a Kafka cluster and pushes sample data to a Kafka topic. Then, it creates the Pinot cluster and creates a hybrid table airlineStats
. The realtime table ingests data from the Kafka topic. Lastly, sample data is pushed into the offline table.
Rest API | Description |
---|---|
Implementation | Annotation |
---|---|
You can find the commands that are shown in this video on GitHub
Getting data into Pinot is easy. Take a look at these two quick start guides which will help you get up and running with sample data for offline and real-time .
Navigate to in your browser to open the controller UI.
Let us run some queries on the data in the Pinot cluster. Head over to to see the querying interface.
Pinot supports a subset of standard SQL. For more information, see .
The contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management including health check, instances management, schema and table management, data segments management.
Let's check out the tables in this cluster by going to and click on Try it out!
. We can see thebaseballStats
table listed here. We can also see the exactcurl
call made to the controller API.
You can look at the configuration of this table by going to , type in baseballStats
in the table name, and click Try it out!
Let's check out the schemas in the cluster by going to and click Try it out!
. We can see a schema called baseballStats
in this list.
Take a look at the schema by going to , type baseballStats
in the schema name, and click Try it out!
.
Finally, let's checkout the data segments in the cluster by going to , type in baseballStats
in the table name, and click Try it out!
. There's 1 segment for this table, called baseballStats_OFFLINE_0
.
To learn how to uploaded your own data and schema, head over to or .
A table is a logical abstraction that represents a collection of related data. It is composed of columns and rows (known as documents in Pinot). The columns, data types, and other metadata related to the table are defined using a .
Pinot breaks a table into multiple and stores these segments in a deep-store such as HDFS as well as Pinot servers.
In the Pinot cluster, a table is modeled as a and each segment of a table is modeled as a .
Type | Description |
---|
is used to define the table properties, such as name, type, indexing, routing, retention etc. It is written in JSON format and is stored in Zookeeper, along with the table schema.
A table is comprised of small chunks of data. These chunks are known as Segments. To learn more about how Pinot creates and manages segments see
For offline tables, Segments are built outside of pinot and uploaded using a distributed executor such as Spark or Hadoop. For more details, see .
Completion Mode By default, if the in-memory segment in the is equivalent to the committed segment, then the non-winner server builds and replaces the segment. If the available segment is not equivalent to the committed segment, the server simply downloads the committed segment from the controller.
For more details on why this is needed, see
For more details about peer segment download during real-time ingestion, please refer to this design doc on
For more details on each indexing mechanism and corresponding configurations, see .
You can also set up on columns to make queries faster. Further, you can also keep segments in off-heap instead of on-heap memory for faster queries.
Each table is associated with a tenant. A segment resides on the server, which has the same tenant as itself. For more details on how tenants work, see .
In the above example, the consuming segments will still be assigned to serverTenantName_REALTIME
hosts, but once they are completed, the segments will be moved to serverTeantnName_OFFLINE
. It is possible to specify the full name of any tag in this section (so, for example, you could decide that completed segments for this table should be in pinot servers tagged as allTables_COMPLETED
). To learn more about this config, see the section.
To understand how time boundary works in the case of a hybrid table, see .
Create a table config for your data, or see for all possible batch/streaming tables.
Check out the table config in the to make sure it was successfully uploaded.
Check out the table config in the to make sure it was successfully uploaded.
This is a quick start guide that will show you how to quickly start an example recipe in a standalone instance and is meant for learning. To run Pinot in cluster mode, please take a look at .
Follow these steps to checkout code from and build Pinot locally
Install 3.6 or higher
Download the latest binary release from , or use this command
The following quick start scripts are available. Please note though, these scripts launch the Pinot cluster with minimal resources. If you intend to play with sizable data (more than few MB), you may want to follow the and provide required resources.
That's it! We've spun up a Pinot cluster. You can continue playing with other types of quick start, or simply head on to to check out the data in the baseballStats
table.
We now have a Pinot cluster with a realtime table! You can head over to to check out the data in the meetupRSVP
table.
Let's head over to to check out the data we pushed to the airlineStats
table.
Category
Description
Dimension
Dimension columns are typically used in slice and dice operations for answering business queries. Some operations for which dimension columns are used:
GROUP BY
- group by one or more dimension columns along with aggregations on one or more metric columns
Filter clauses such as WHERE
Metric
These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns.
Some operation for which metric columns are used:
Aggregation - SUM
, MIN
, MAX
, COUNT
, AVG
etc
Filter clause such as WHERE
DateTime
This column represents time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config.
The primary time column is used by Pinot to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND
and optional if the push type is REFRESH
.
Common operations that can be done on time column:
GROUP BY
Filter clauses such as WHERE
Data Type
Default Dimension Value
Default Metric Value
INT
0
LONG
0
FLOAT
0.0
DOUBLE
0.0
BOOLEAN
0 (false)
N/A
TIMESTAMP
0 (1970-01-01 00:00:00 UTC)
N/A
STRING
"null"
N/A
JSON
"null"
N/A
BYTES
byte array of length 0
byte array of length 0
Column Name
Column Type
Data Type
Description
$hostName
Dimension
STRING
Name of the server hosting the data
$segmentName
Dimension
STRING
Name of the segment containing the record
$docId
Dimension
INT
Document id of the record within the segment
POST /tasks/schedule
Schedule tasks for all task types on all enabled tables
POST /tasks/schedule?taskType=myTask
Schedule tasks for the given task type on all enabled tables
POST /tasks/schedule?tableName=myTable_OFFLINE
Schedule tasks for all task types on the given table
POST /tasks/schedule?taskType=myTask&tableName=myTable_OFFLINE
Schedule tasks for the given task type on the given table
PinotTaskGenerator
@TaskGenerator
PinotTaskExecutorFactory
@TaskExecutorFactory
MinionEventObserverFactory
@EventObserverFactory
Offline | Offline tables ingest pre-built pinot-segments from external data stores. This is generally used for batch ingestion. |
Realtime | Realtime tables ingest data from streams (such as Kafka) and build segments from the consumed data. |
Hybrid | A hybrid Pinot table has both realtime as well as offline tables under the hood. By default, all tables in Pinot are Hybrid in nature. |
This page contains multiple quick start guides for deploying Pinot to a public cloud provider.
The following quick start guides will show you how to run an Apache Pinot cluster using Kubernetes on different public cloud providers.
This quick start guide will show you how to run a Pinot cluster using Docker.
This is a quickstart guide that will show you how to quickly start an example recipe in a standalone instance and is meant for learning. To run Pinot in cluster mode, please take a look at Manual cluster setup.
Prerequisites
Install Docker
You can also try Kubernetes quick start if you already have a local minikube cluster installed or Docker Kubernetes setup.
If running locally, please ensure your docker cluster has enough resources, below is a sample config.
We'll be using our docker image apachepinot/pinot:latest
to run this quick start, which does the following:
Sets up the Pinot cluster
Creates a sample table and loads sample data
The following quick-start scripts are available
Batch example
Streaming example
Hybrid example
Before running the scripts, create an isolated bridge network pinot-demo
in docker. This will allow all docker containers to easily communicate with each other. You can create the network using the following command -
In this example we demonstrate how to do batch processing with Pinot.
Starts Pinot deployment by starting
Apache Zookeeper
Pinot Controller
Pinot Broker
Pinot Server
Creates a demo table
baseballStats
Launches a standalone data ingestion job
Builds one Pinot segment for a given CSV data file for table baseballStats
Pushes the built segment to the Pinot controller
Issues sample queries to Pinot
Once the Docker container is running, you can view the logs by running the following command.
That's it! We've spun up a Pinot cluster.
It may take a while for all the Pinot components to start and for the sample data to be loaded.
Use the below command to check the status in the container logs.
Your cluster is ready once you see the cluster setup completion messages and sample queries, as demonstrated below.
You can head over to Exploring Pinot to check out the data in the baseballStats
table.
In this example we demonstrate how to do stream processing with Pinot.
Starts Pinot deployment by starting
Apache Kafka
Apache Zookeeper
Pinot Controller
Pinot Broker
Pinot Server
Creates a demo table
meetupRsvp
Launches a meetup
stream
Publishes data to a Kafka topic meetupRSVPEvents
to be subscribed to by Pinot
Issues sample queries to Pinot
Once the cluster is up, you can head over to Exploring Pinot to check out the data in the meetupRSVPEvents
table.
In this example we demonstrate how to do hybrid stream and batch processing with Pinot.
Starts Pinot deployment by starting
Apache Kafka
Apache Zookeeper
Pinot Controller
Pinot Broker
Pinot Server
Creates a demo table
airlineStats
Launches a standalone data ingestion job
Builds Pinot segments under a given directory of Avro files for table airlineStats
Pushes built segments to Pinot controller
Launches a stream of flights stats
Publishes data to a Kafka topic airlineStatsEvents
to be subscribed to by Pinot
Issues sample queries to Pinot
Once the cluster is up, you can head over to Exploring Pinot to check out the data in the airlineStats
table.
Pinot quick start in Kubernetes
This quickstart assumes that you already have a running Kubernetes cluster. Please follow the links below to set up a Kubernetes cluster.
Install Minikube for local setup (make sure to run with enough resources e.g. minikube start --vm=true --cpus=4 --memory=8g --disk-size=50g)
Before continuing, please make sure that you've downloaded Apache Pinot. The scripts for the setup in this guide can be found in our open source project on GitHub.
The scripts can be found in the Pinot source at ./pinot/kubernetes/helm
Pinot repo has pre-packaged HelmCharts for Pinot and Presto. Helm Repo index file is here.
NOTE: Please specify StorageClass based on your cloud vendor. For Pinot Server, please don't mount blob store like AzureFile/GoogleCloudStorage/S3 as the data serving file system.
Only use Amazon EBS/GCP Persistent Disk/Azure Disk style disks.
For AWS: "gp2"
For GCP: "pd-ssd" or "standard"
For Azure: "AzureDisk"
For Docker-Desktop: "hostpath"
For Helm v2.12.1
If your Kubernetes cluster is recently provisioned, ensure Helm is initialized by running:
Then deploy a new HA Pinot cluster using the following command:
For Helm v3.0.0
Error: Please run the below command if encountering the following issue:
Resolution:
Error: Please run the command below if encountering a permission issue:
Error: release pinot failed: namespaces "pinot-quickstart" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "namespaces" in API group "" in the namespace "pinot-quickstart"
Resolution:
Ensure the Kafka deployment is ready before executing the scripts in the following next steps.
The scripts below will create two Kafka topics for data ingestion:
The script below will deploy 3 batch jobs.
Ingest 19492 JSON messages to Kafka topic flights-realtime
at a speed of 1 msg/sec
Ingest 19492 Avro messages to Kafka topic flights-realtime-avro
at a speed of 1 msg/sec
Upload Pinot schema airlineStats
Create Pinot table airlineStats
to ingest data from JSON encoded Kafka topic flights-realtime
Create Pinot table airlineStatsAvro
to ingest data from Avro encoded Kafka topic flights-realtime-avro
Please use the script below to perform local port-forwarding, which will also open Pinot query console in your default web browser.
This script can be found in the Pinot source at ./pinot/kubernetes/helm/pinot
Open superset.yaml
file and goto the line showing storageClass
. And change it based on your cloud vendor. kubectl get sc
will get you the storageClass
value for your Kubernetes system. E.g.
For AWS: "gp2"
For GCP: "pd-ssd" or "standard"
For Azure: "AzureDisk"
For Docker-Desktop: "hostpath"
Then run:
Ensure your cluster is up by running:
You can run below command to navigate superset in your browser with the previous admin credential.
You can open the imported dashboard by clicking Dashboards
banner and then click on AirlineStats
.
You can run the command below to deploy Trino with the Pinot plugin installed.
The above command adds Trino HelmChart repo. You can then run the below command to see the charts.
In order to connect Trino to Pinot, we need to add Pinot catalog, which requires extra configurations. You can run the below command to get all the configurable values.
To add Pinot catalog, you can edit the additionalCatalogs
section by adding:
Pinot is deployed at namespace pinot-quickstart
, so the controller serviceURL is pinot-controller.pinot-quickstart:9000
After modifying the /tmp/trino-values.yaml
file, you can deploy Trino with:
Once you deployed the Trino, You can check Trino deployment status by:
Once Trino is deployed, you can run the below command to get a runnable Trino CLI.
6.2.2 Port forward Trino service to your local if it's not already exposed
6.2.3 Use Trino console client to connect to Trino service
6.2.4 Query Pinot data using Trino CLI
List all catalogs
List All tables
Show schema
Count total documents
You can run the command below to deploy a customized Presto with the Pinot plugin installed.
The above command deploys Presto with default configs. For customizing your deployment, you can run the below command to get all the configurable values.
After modifying the /tmp/presto-values.yaml
file, you can deploy Presto with:
Once you deployed the Presto, You can check Presto deployment status by:
Once Presto is deployed, you can run the below command from here, or just follow steps 6.2.1 to 6.2.3.
6.2.2 Port forward presto-coordinator port 8080 to localhost port 18080
6.2.4 Query Pinot data using Presto CLI
List all catalogs
List All tables
Show schema
Count total documents
This starter guide provides a quick start for running Pinot on Microsoft Azure
For Mac User
Please check kubectl version after installation.
QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
For Mac User
Please check helm version after installation.
This QuickStart provides helm supports for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.
For Mac User
Below script will open default browser to sign-in to your Azure Account.
Below script will create a resource group in location eastus.
Below script will create a 3 nodes cluster named pinot-quickstart for demo purposes.
Please modify the parameters in the example command below:
Once the command is succeed, it's ready to be used.
Simply run below command to get the credential for the cluster pinot-quickstart that you just created or your existing cluster.
To verify the connection, you can run:
This starter provides a quick start for running Pinot on Google Cloud Platform (GCP)
For Mac User
Please check kubectl version after installation.
QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
For Mac User
Please check helm version after installation.
This QuickStart provides helm supports for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.
Install Google Cloud SDK
Restart your shell
Below script will create a 3 nodes cluster named pinot-quickstart in us-west1-b with n1-standard-2 machines for demo purposes.
Please modify the parameters in the example command below:
You can monitor cluster status by command:
Once the cluster is in RUNNING status, it's ready to be used.
Simply run below command to get the credential for the cluster pinot-quickstart that you just created or your existing cluster.
To verify the connection, you can run:
This guide provides a quick start for running Pinot on Amazon Web Services (AWS).
For Mac User
Please check kubectl version after installation.
QuickStart scripts are tested under kubectl client version v1.16.3 and server version v1.13.12
For Mac User
Please check helm version after installation.
This QuickStart provides helm supports for helm v3.0.0 and v2.12.1. Please pick the script based on your helm version.
For Mac User
For Mac User
Environment variables AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
will override AWS configuration stored in file ~/.aws/credentials
The script below will create a 1 node cluster named pinot-quickstart in us-west-2 with a t3.xlarge machine for demo purposes:
You can monitor the cluster status via this command:
Once the cluster is in ACTIVE status, it's ready to be used.
Simply run below command to get the credential for the cluster pinot-quickstart that you just created or your existing cluster.
To verify the connection, you can run:
This document provides the basic instruction to set up a Kubernetes Cluster on
Please follow this link () to install kubectl.
Please follow this link () to install helm.
Please follow this link () to install Azure CLI.
Please follow this to deploy your Pinot Demo.
This document provides the basic instruction to set up a Kubernetes Cluster on
Please follow this link () to install kubectl.
Please follow this link () to install helm.
Please follow this link () to install Google Cloud SDK.
Please follow this to deploy your Pinot Demo.
This document provides the basic instruction to set up a Kubernetes Cluster on
Please follow this link () to install kubectl.
Please follow this link () to install helm.
Please follow this link () to install AWS CLI.
Please follow this link () to install AWS CLI.
For first time AWS user, please register your account at .
Once created the account, you can go to to create a user and create access keys under Security Credential tab.
Please follow this to deploy your Pinot Demo.
Pinot offers various ways to assist with troubleshooting and debugging problems that might happen. It is recommended to start off with the debug api which may quickly surface some of the commonly occurring problems. The debug api provides information such as tableSize, ingestion status, any error messages related to state transition in server, among other things.
The table debug api can be invoked via the Swagger UI as follows:
It can also be invoked directly by accessing the URL as follows. The api requires the tableName
, and can optionally take tableType (offline|realtime)
and verbosity
level.
Pinot also provides a wide-variety of operational metrics that can be used for creating dashboards, alerting and monitoring. Also, all pinot components log debug information related to error conditions that can be used for troubleshooting.
Please use these steps:
If the query executes, look at the query result. Specifically look at numEntriesScannedInFilter
and numDocsScanned
.
If numEntriesScannedInFilter
is very high, consider adding indexes for the corresponding columns being used in the filter predicates. You should also think about partitioning the incoming data based on the dimension most heavily used in your filter queries.
If numDocsScanned
is very high, that means the selectivity for the query is low and lots of documents need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.
If the query is not executing, you can extend the query timeout by appending a timeoutMs
parameter to the query (eg: select * from mytable limit 10 option(timeoutMs=60000)
). Then you can repeat step 1.
You can also look at GC stats for the corresponding Pinot servers. If a particular server seems to be running full GC all the time, you can do a couple of things such as
Increase JVM heap (Xmx)
Consider using off-heap memory for segments
Decrease the total number of segments per server (by partitioning the data in a better way)
The Docker instructions on this page are still WIP
So far, we setup our cluster, ran some queries on the demo tables and explored the admin endpoints. We also uploaded some sample batch data for transcript table.
Now, it's time to ingest from a sample stream into Pinot. The rest of the instructions assume you're using Pinot running in Docker (inside a pinot-quickstart container).
First, we need to setup a stream. Pinot has out-of-the-box realtime ingestion support for Kafka. Other streams can be plugged in, more details in Pluggable Streams.
Let's setup a demo Kafka cluster locally, and create a sample topic transcript-topic
Start Kafka
Create a Kafka Topic
Start Kafka
Start Kafka cluster on port 9876
using the same Zookeeper from the quick-start examples
Create a Kafka topic
Download the latest Kafka. Create a topic
If you followed the Batch upload sample data, you have already pushed a schema for your sample table. If not, head over to Creating a schema on that page, to learn how to create a schema for your sample data.
If you followed Batch upload sample data, you learnt how to push an offline table and schema. Similar to the offline table config, we will create a realtime table config for the sample. Here's the realtime table config for the transcript table. For a more detailed overview about table, checkout Table.
Now that we have our table and schema, let's upload them to the cluster. As soon as the realtime table is created, it will begin ingesting from the Kafka topic.
Here's a JSON file for transcript table data:
Push sample JSON into Kafka topic, using the Kafka script from the Kafka download
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to checkout the realtime data
This quick start guide will show you how to set up a Pinot cluster manually.
A manual cluster setup consists of the following components - 1. Zookeeper 2. Controller 3. Broker 4. Server 5. Kafka
We will run each of these components in separate containers
If running locally, please ensure your docker cluster has enough resources, below is a sample config.
You can try out the pre-built Pinot all-in-one docker image.
(Optional) You can also follow the instructions here to build your own images.
Create an isolated bridge network in docker
Start Zookeeper in daemon mode. This is a single node zookeeper setup. Zookeeper is the central metadata store for Pinot and should be set up with replication for production use. For more information, see Running Replicated Zookeeper.
Start Pinot Controller in daemon and connect to Zookeeper.
The command below expects a 4GB memory container. Please tune-Xms
and-Xmx
if your machine doesn't have enough resources.
Start Pinot Broker in daemon and connect to Zookeeper.
The command below expects a 4GB memory container. Please tune-Xms
and-Xmx
if your machine doesn't have enough resources.
Start Pinot Server in daemon and connect to Zookeeper.
The command below expects a 16GB memory container. Please tune-Xms
and-Xmx
if your machine doesn't have enough resources.
Optionally, you can also start Kafka for setting up realtime streams. This brings up the Kafka broker on port 9092.
Now all Pinot related components are started as an empty cluster.
You can run the below command to check container status.
Sample Console Output
Prerequisites
Follow this instruction in Getting Pinot to get Pinot
You can use Zooinspector to browse the Zookeeper instance.
The examples below are for Java 8 users.
For Java 11+ users, please remove the GC settings insideJAVA_OPTS.
So it looks like: export JAVA_OPTS="-Xms4G -Xmx8G"
Now all Pinot related components are started as an empty cluster.
Run docker-compose up
to launch all the components.
You can run the below command to check container status.
Sample Console Output
Now it's time to start adding data to the cluster. Check out some of the Recipes or follow the Batch upload sample data and Stream sample data for instructions on loading your own data.
Step-by-step guide on pushing your own data into the Pinot cluster
So far, we setup our cluster, ran some queries, explored the admin endpoints. Now, it's time to get our own data into Pinot. The rest of the instructions assume you're using Pinot running in Docker (inside a pinot-quickstart container).
Let's gather our data files and put it in pinot-quick-start/rawdata
.
Supported file formats are CVS, JSON, AVRO, PARQUET, THRIFT, ORC. If you don't have sample data, you can use this sample CSV.
Schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.
Briefly, we categorize our columns into 3 types
For example, in our sample table, the playerID, yearID, teamID, league, playerName
columns are the dimensions, the playerStint, numberOfgames, numberOfGamesAsBatter, AtBatting, runs, hits, doules, triples, homeRuns, runsBattedIn, stolenBases, caughtStealing, baseOnBalls, strikeouts, intentionalWalks, hitsByPitch, sacrificeHits, sacrificeFlies, groundedIntoDoublePlays, G_old
columns are the metrics and there is no time column.
Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the reference below.
A table config is used to define the config related to the Pinot table. A detailed overview of the table can be found in Table.
Here's the table config for the sample CSV file. You can use this as a reference to build your own table config. Simply edit the tableName and schemaName.
Check the directory structure so far
Upload the table config using the following command
Check out the table config and schema in the Rest API to make sure it was successfully uploaded.
A Pinot table's data is stored as Pinot segments. A detailed overview of the segment can be found in Segment.
To generate a segment, we need to first create a job spec yaml file. JobSpec yaml file has all the information regarding data format, input data location and pinot cluster coordinates. You can just copy over this job spec file. If you're using your own data, be sure to 1) replace transcript
with your table name 2) set the right recordReaderSpec
Use the following command to generate a segment and upload it
Sample output
Check that your segment made it to the table using the Rest API
You're all set! You should see your table in the Query Console and be able to run queries against it now.
This page has a collection of frequently asked questions with answers from the community.
This is a list of frequent questions most often asked in our troubleshooting channel on Slack. Please feel free to contribute your questions and answers here and make a pull request.
Below is an example of AWS EKS.
In the K8s cluster, check the storage class: in AWS, it should be gp2.
Then update StorageClass to ensure:
Once StorageClass is updated, it should be like:
Once the storage class is updated, then we can update PVC for the server disk size.
Now we want to double the disk size for pinot-server-3.
Below is an example of current disks:
Below is the output of data-pinot-server-3
Now, let's change the PVC size to 2T by editing the server PVC.
Once updated, the spec's PVC size is updated to 2T, but the status's PVC size is still 1T.
Restart pinot-server-3 pod:
Recheck PVC size:
FAQ for general questions around Pinot
When data is pushed in to Pinot, it makes a backup copy of the data and stores it on the configured deep-storage (S3/GCP/ADLS/NFS/etc). This copy is stored as tar.gz Pinot segments. Note, that pinot servers keep a (untarred) copy of the segments on their local disk as well. This is done for performance reasons.
Pinot uses Apache Helix for cluster management, which in turn is built on top of Zookeeper. Helix uses Zookeeper to store the cluster state, including Ideal State, External View, Participants, etc. Besides that, Pinot uses Zookeeper to store other information such as Table configs, schema, Segment Metadata, etc.
Please check the JDK version you are using. The release 0.8.0 binary is on JDK 11. You may be getting this error if you are using JDK8. In that case, please consider using JDK11, or you will need to download the for the release and it locally.
Column Type
Description
Dimensions
Typically used in filters and group by, for slicing and dicing into data
Metrics
Typically used in aggregations, represents the quantitative data
Time
Optional column, represents the timestamp associated with each row
While Pinot can work with segments of various sizes, for optimal use of Pinot, you want to get your segments sized in the 100MB to 500MB (un-tarred/uncompressed) range. Please note that having too many (thousands or more) of tiny segments for a single table just creates more overhead in terms of the metadata storage in Zookeeper as well as in the Pinot servers' heap. At the same time, having too few really large (GBs) segments reduces parallelism of query execution, as on the server side, the thread parallelism of query execution is at segment level.
Yes. Each table can be independently configured to consume from any given Kafka topic, regardless of whether there are other tables that are also consuming from the same Kafka topic.
Setup partitioner in the Kafka producer: https://docs.confluent.io/current/clients/producer.html
The partitioning logic in the stream should match the partitioning config in Pinot. Kafka uses murmur2
, and the equivalent in Pinot is Murmur
function.
Set partitioning config as below using same column used in Kafka
and also set
More details about how partitioner works in Pinot here.
For JSON, you can use hex encoded string to ingest BYTES
We have json_format(field) function which can store a top level json field as a STRING in Pinot.
Then you can use these json functions during query time, to extract fields from the json string.
NOTE This works well if some of your fields are nested json, but most of your fields are top level json keys. If all of your fields are within a nested JSON key, you will have to store the entire payload as 1 column, which is not ideal.
Support for flattening during ingestion is on the roadmap: https://github.com/apache/pinot/issues/5264
To use explicit code points, you must double-quote (not single-quote) the string, and escape the code point via "\uHHHH", where HHHH is the four digit hex code for the character. See https://yaml.org/spec/spec.html#escaping/in%20double-quoted%20scalars/ for more details.
By default, Pinot limits the length of a String column to 512 bytes. If you want to overwrite this value, you can set the maxLength attribute in the schema as follows:
Events are available to be read by queries as soon as they are ingested. This is because events are instantly indexed in-memory upon ingestion.
The ingestion of events into the real-time table is not transactional, so replicas of the open segment are not immediately consistent. Pinot trades consistency for availability upon network partitioning (CAP theorem) to provide ultra-low ingestion latencies at high throughput.
However, when the open segment is closed and its in-memory indexes are flushed to persistent storage, all its replicas are guaranteed to be consistent, with the commit protocol.
Inverted indexes are set in the tableConfig's tableIndexConfig -> invertedIndexColumns list. Here's the documentation for tableIndexConfig: https://docs.pinot.apache.org/basics/components/table#tableindexconfig-1 along with a sample table that has set inverted indexes on some columns.
Applying inverted indexes to a table config will generate inverted index to all new segments. In order to apply the inverted indexes to all existing segments, follow steps in How to apply inverted index to existing setup?
Add the columns you wish to index to the tableIndexConfig-> invertedIndexColumns list. This sample table config show inverted indexes set: https://docs.pinot.apache.org/basics/components/table#offline-table-config To update the table config use the Pinot Swagger API: http://localhost:9000/help#!/Table/updateTableConfig
Invoke the reload API: http://localhost:9000/help#!/Segment/reloadAllSegments
Right now, there’s no easy way to confirm that reload succeeded. One way it to check out the index_map file inside the segment metadata, you should see inverted index entries for the new columns. An API for this is coming soon: https://github.com/apache/pinot/issues/5390
Star-tree indexes are configured in the table config under the tableIndexConfig -> starTreeIndexConfigs (list) and enableDefaultStarTree (boolean). Read more about how to configure star-tree indexes: https://docs.pinot.apache.org/basics/indexing/star-tree-index#index-generation
The new segments will have star-tree indexes generated after applying the star-tree index configs to the table config. Currently Pinot does not support adding star-tree indexes to the existing segments.
Pinot does not require ordering of event time stamps. Out of order events are still consumed and indexed into the "currently consuming" segment. In a pathological case, if you have a 2 day old event come in "now", it will still be stored in the segment that is open for consumption "now". There is no strict time-based partitioning for segments, but star-indexes and hybrid tables will handle this as appropriate.
See the Components > Broker for more details about how hybrid tables handle this. Specifically, the time-boundary is computed as max(OfflineTIme) - 1 unit of granularity
. Pinot does store the min-max time for each segment and uses it for pruning segments, so segments with multiple time intervals may not be perfectly pruned.
When generating star-indexes, the time column will be part of the star-tree so the tree can still be efficiently queried for segments with multiple time intervals.
max(OfflineTime)
to determine the time-boundary, and instead using an offset?This lets you have an old event up come in without building complex offline pipelines that perfectly partition your events by event timestamps. With this offset, even if your offline data pipeline produces segments with a maximum timestamp, Pinot will not use the offline dataset for that last chunk of segments. The expectation is if you process offline the next time-range of data, your data pipeline will include any late events.
It might seem odd that segments are not strictly time-partitioned, unlike similar systems such as Apache Druid. This allows real-time ingestion to consume out-of-order events. Even though segments are not strictly time-partitioned, Pinot will still index, prune, and query segments intelligently by time-intervals to for performance of hybrid tables and time-filtered data.
When generating offline segments, the segments generated such that segments only contain one time-interval and are well partitioned by the time column.
This essentially implies that the Pinot Broker assigned to the table specified in the query was not found. A common root cause for this is a typo in the table name in the query. Another uncommon reason could be if there wasn't actually a broker with required broker tenant tag for the table.
Here's the page explaining the Pinot response format: https://docs.pinot.apache.org/users/api/querying-pinot-using-standard-sql/response-format
"timestamp" is a reserved keyword in SQL. Escape timestamp with double quotes.
Other commonly encountered reserved keywords are date, time, table.
For filtering on STRING columns, use single quotes
The fields in the ORDER BY
clause must be one of the group by clauses or aggregations, BEFORE applying the alias. Therefore, this will not work
Instead, this will work
No. Pagination only works for SELECTION queries
You can add this at the end of your query: option(timeoutMs=X)
. For eg: the following example will use a timeout of 20 seconds for the query:
In order to speed up aggregations, you can enable metrics aggregation on the required column by adding a metric field in the corresponding schema and setting aggregateMetrics
to true in the table config. You can also use a star-tree index config for such columns (read more about star-tree here)
There are 2 ways to verify this:
Log in to a server that hosts segments of this table. Inside the data directory, locate the segment directory for this table. In this directory, there is a file named index_map
which lists all the indexes and other data structures created for each segment. Verify that the requested index is present here.
During query: Use the column in the filter predicate and check the value of numEntriesScannedInFilter
. If this value is 0, then indexing is working as expected (works for Inverted index)
Yes, Pinot uses a default value of LIMIT 10 in queries. The reason behind this default value is to avoid unintentionally submitting expensive queries that end up fetching or processing a lot of data from Pinot. Users can always overwrite this by explicitly specifying a LIMIT value.
Pinot does not cache query results, each query is computed in its entirety. Note though, running the same or similar query multiple times will naturally pull in segment pages into memory making subsequent calls faster. Also, for realtime systems, the data is changing in realtime, so results cannot be cached. For offline-only systems, caching layer can be built on top of Pinot, with invalidation mechanism built-in to invalidate the cache when data is pushed into Pinot.
The query execution engine will prefer to use StarTree index for all queries where it can be used. The criteria to determine whether StarTree index can be used is as follows:
All aggregation function + column pairs in the query must exist in the StarTree index.
All dimensions that appear in filter predicates and group-by should be StarTree dimensions.
For queries where above is true, StarTree index is used. For other queries, the execution engine will default to using the next best index available.
Typically, Pinot components try to use as much off-heap (MMAP/DirectMemory) where ever possible. For example, Pinot servers load segments in memory-mapped files in MMAP mode (recommended), or direct memory in HEAP mode. Heap memory is used mostly for query execution and storing some metadata. We have seen production deployments with high throughput and low-latency work well with just 16 GB of heap for Pinot servers and brokers. Pinot controller may also cache some metadata (table configs etc) in heap, so if there are just a few tables in the Pinot cluster, a few GB of heap should suffice.
Pinot relies on deep-storage for storing backup copy of segments (offline as well as realtime). It relies on Zookeeper to store metadata (table configs, schema, cluster state, etc). It does not explicitly provide tools to take backups or restore these data, but relies on the deep-storage (ADLS/S3/GCP/etc), and ZK to persist these data/metadata.
Changing a column name or data type is considered backward incompatible change. While Pinot does support schema evolution for backward compatible changes, it does not support backward incompatible changes like changing name/data-type of a column.
You can change the number of replicas by updating the table config's segmentsConfig section. Make sure you have at least as many servers as the replication.
For OFFLINE table, update replication
For REALTIME table update replicasPerPartition
After changing the replication, run a table rebalance.
Refer to Rebalance.
The number of segments generated depends on the number of input files. If you provide only 1 input file, you will get 1 segment. If you break up the input file into multiple files, you will get as many segments as the input files.
This typically happens when the server is unable to load the segment. Possible causes: Out-Of-Memory, no-disk space, unable to download segment from deep-store, and similar other errors. Please check server logs for more information.
Use the segment reset controller REST API to reset the segment:
RESET: this gets a segment in ERROR state back to ONLINE or CONSUMING state. Behind the scenes, Pinot controller takes the segment to OFFLINE state, waits for External View to stabilize, and then moves it back to ONLINE/CONSUMING state, thus effectively resetting segments or consumers in error states.
REFRESH: this replaces the segment with a new one, with the same name but often different data. Under the hood, Pinot controller sets new segment metadata in Zookeeper, and notifies brokers and servers to check their local states about this segment and update accordingly. Servers also download the new segment to replace the old one, when both have different checksums. There is no separate rest API for refreshing, and it is done as part of SegmentUpload API today.
RELOAD: this reloads the segment, often to generate a new index as updated in table config. Underlying, Pinot server gets the new table config from Zookeeper, and uses it to guide the segment reloading. In fact, the last step of REFRESH as explained above is to load the segment into memory to serve queries. There is a dedicated rest API for reloading. By default, it doesn't download segment. But option is provided to force server to download segment to replace the local one cleanly.
In addition, RESET brings the segment OFFLINE temporarily; while REFRESH and RELOAD swap the segment on server atomically without bringing down the segment or affecting ongoing queries.
Set this property in your controller.conf file
Now your brokers and servers should join the cluster as broker_untagged
and server_untagged
. You can then directly use the POST /tenants
API to create the desired tenants
Yes, replica groups work for realtime. There's 2 parts to enabling replica groups:
Replica groups segment assignment
Replica group query routing
Replica group segment assignment
Replica group segment assignment is achieved in realtime, if number of servers is a multiple of number of replicas. The partitions get uniformly sprayed across the servers, creating replica groups. For example, consider we have 6 partitions, 2 replicas, and 4 servers.
As you can see, the set (S0, S2) contains r1 of every partition, and (s1, S3) contains r2 of every partition. The query will only be routed to one of the sets, and not span every server. If you are are adding/removing servers from an existing table setup, you have to run rebalance for segment assignment changes to take effect.
Replica group query routing
Once replica group segment assignment is in effect, the query routing can take advantage of it. For replica group based query routing, set the following in the table config's routing section, and then restart brokers
You can follow the [wiki] to build pinot distribution from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
Next, you need to change the execution config in the job spec to the following -
You can check out the sample job spec here.
Finally execute the hadoop job using the command -
Please ensure environment variables PINOT_ROOT_DIR
and PINOT_VERSION
are set properly.
We’ve seen some requests that data should be massaged (like partitioning, sorting, resizing) before creating and pushing segments to Pinot.
The MapReduce job called SegmentPreprocessingJob
would be the best fit for this use case, regardless of whether the input data is of AVRO or ORC format.
Check the below example to see how to use SegmentPreprocessingJob
.
In Hadoop properties, set the following to enable this job:
In table config, specify the operations in preprocessing.operations
that you'd like to enable in the MR job, and then specify the exact configs regarding those operations:
Minimum number of reducers. Optional. Fetched when partitioning gets disabled and resizing is enabled. This parameter is to avoid having too many small input files for Pinot, which leads to the case where Pinot server is holding too many small segments, causing too many threads.
Maximum number of records per reducer. Optional.Unlike, “preprocessing.num.reducers”, this parameter is to avoid having too few large input files for Pinot, which misses the advantage of muti-threading when querying. When not set, each reducer will finally generate one output file. When set (e.g. M), the original output file will be split into multiple files and each new output file contains at most M records. It does not matter whether partitioning is enabled or not.
Pinot supports as a processor to create and push segment files to the database. Pinot distribution is bundled with the Spark code to process your files and convert and upload them to Pinot.
For more details on this MR job, please refer to this .
r1
r2
p1
S0
S1
p2
S2
S3
p3
S0
S1
p4
S2
S3
p5
S0
S1
p6
S2
S3
This section is an overview of the various options for importing data into Pinot.
There are multiple options for importing data into Pinot. These guides are ready-made examples that show you step-by-step instructions for importing records into Pinot, supported by our plugin architecture.
These guides are meant to get you up and running with imported data as quick as possible. Pinot supports multiple file input formats without needing to change anything other than the file name. Each example imports a ready-made dataset so you can see how things work without needing to bring your own dataset.
This guide will show you how to import data using stream ingestion from Apache Kafka topics.
This guide will show you how to import data using stream ingestion with upsert.
By default, Pinot does not come with a storage layer, so all the data sent, won't be stored in case of system crash. In order to persistently store the generated segments, you will need to change controller and server configs to add a deep storage. Checkout File systems for all the info and related configs.
These guides will show you how to import data as well as persist it in the file systems.
These guides will show you how to import data from a Pinot supported input format.
This guide will show you how to handle the complex type in the ingested data, such as map and array.
Pinot supports Apache spark as a processor to create and push segment files to the database. Pinot distribution is bundled with the Spark code to process your files and convert and upload them to Pinot.
You can check out the sample job spec here.
Now, add the pinot jar to spark's classpath using following options -
Please ensure environment variables PINOT_ROOT_DIR
and PINOT_VERSION
are set properly.
Finally execute the spark job using the command -
Note: You should change the master
to yarn
and deploy-mode
to cluster
for production.
You can follow the to build pinot distribution from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
Next, you need to change the execution config in the to the following -
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.
In this guide, you'll learn how to import data into Pinot using Apache Kafka for real-time stream ingestion. Pinot has out-of-the-box real-time ingestion support for Kafka.
Let's setup a demo Kafka cluster locally, and create a sample topic transcript-topic
Start Kafka
Create a Kafka Topic
Start Kafka
Start Kafka cluster on port 9876
using the same Zookeeper from the quick-start examples.
Create a Kafka topic
Download the latest Kafka. Create a topic.
We will publish the data in the same format as mentioned in the Stream ingestion docs. So you can use the same schema mentioned under Create Schema Configuration.
The real-time table configuration for the transcript
table described in the schema from the previous step.
For Kafka, we use streamType as kafka
. Currently only JSON format is supported but you can easily write your own decoder by extending the StreamMessageDecoder
interface. You can then access your decoder class by putting the jar file in plugins
directory
The lowLevel
consumer reads data per partition whereas the highLevel
consumer utilises Kafka high level consumer to read data from the whole stream. It doesn't have the control over which partition to read at a particular momemt.
For Kafka versions below 2.X, use org.apache.pinot.plugin.stream.kafka09.KafkaConsumerFactory
For Kafka version 2.X and above, use
org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory
You can set the offset to -
smallest
to start consumer from the earliest offset
largest
to start consumer from the latest offset
timestamp in milliseconds
to start the consumer from the offset after the timestamp.
The resulting configuration should look as follows -
Update table config for both high level and low level consumer: Update config: stream.kafka.consumer.factory.class.name
from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory
to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory
.
If using Stream(High) level consumer: Please also add config stream.kafka.hlc.bootstrap.server
into tableIndexConfig.streamConfigs
. This config should be the URI of Kafka broker lists, e.g. localhost:9092
.
This connector is also suitable for Kafka lib version higher than 2.0.0
. In Kafka 2.0 connector pom.xml, change the kafka.lib.version
from 2.0.0
to 2.1.1
will make this Connector working with Kafka 2.1.1
.
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the real-time table is created, it will begin ingesting available records from the Kafka topic.
We will publish data in the following format to Kafka. Let us save the data in a file named as transcript.json
.
Push sample JSON into the transcript-topic
Kafka topic, using the Kafka console producer. This will add 12 records to the topic described in the transcript.json
file.
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to checkout the real-time data.
Here is an example config which uses SSL based authentication to talk with kafka and schema-registry. Notice there are two sets of SSL options, ones starting with ssl.
are for kafka consumer and ones with stream.kafka.decoder.prop.schema.registry.
are for SchemaRegistryClient
used by KafkaConfluentSchemaRegistryAvroMessageDecoder
.
With Kafka consumer 2.0, you can ingest transactionally committed messages only by configuring kafka.isolation.level
to read_committed
. For example,
Note that the default value of this config read_uncommitted
to read all messages. Also, this config supports low-level consumer only.
Pinot batch ingestion involves two parts: routing ingestion job(hourly/daily) and backfill. Here are some tutorials on how routine batch ingestion works in Pinot Offline Table:
High Level Idea
Organize raw data into buckets (eg: /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (eg: /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro)
Run a Pinot batch ingestion job, which points to a specific date folder like ‘/var/pinot/airlineStats/rawdata/2014/01/01’. The segment generation job will convert each such avro file into a Pinot segment for that day and give it a unique name.
Run Pinot segment push job to upload those segments with those uniques names via a Controller API
IMPORTANT: The segment name is the unique identifier used to uniquely identify that segment in Pinot. If the controller gets an upload request for a segment with the same name - it will attempt to replace it with the new one.
This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.
Pinot supports data modification only at the segment level, which means we should update entire segments for doing backfills. The high level idea is to repeat steps 2 (segment generation) and 3 (segment upload) mentioned above:
Backfill jobs must run at the same granularity as the daily job. E.g., if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g.: ‘/var/pinot/airlineStats/rawdata/2014/01/01’)
The backfill job will then generate segments with the same name as the original job (with the new data).
When uploading those segments to Pinot, the controller will replace the old segments with the new ones (segment names act like primary keys within Pinot) one by one.
Backfill jobs expect the same number of (or more) data files on the backfill date. So the segment generation job will create the same number of (or more) segments than the original run.
E.g. assuming table airlineStats has 2 segments(airlineStats_2014-01-01_2014-01-01_0, airlineStats_2014-01-01_2014-01-01_1) on date 2014/01/01 and the backfill input directory contains only 1 input file. Then the segment generation job will create just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only segment airlineStats_2014-01-01_2014-01-01_0 got replaced and stale data in segment airlineStats_2014-01-01_2014-01-01_1 are still there.
In case the raw data is modified in such a way that the original time bucket has fewer input files than the first ingestion run, backfill will fail.
Dimension tables in Apache Pinot.
Dimension tables are a special kind of offline tables from which data can be looked up via the lookup UDF, providing a join like functionality. These dimension tables are replicated on all the hosts for a given tenant to allow faster lookups.
To mark an offline table as a dim table the configuration isDimTable
should be set to true in the table config as shown below
As dimension table are used to perform lookups of dimension values, they are required to have a primary key (can be a composite key).
As mentioned above, when a table is marked as a dimension table it will be replicated on all the hosts, because of this the size of the dim table has to be small. The maximum size quota for a dimension table in a cluster is controlled by controller.dimTable.maxSize
controller property. Table creation will fail if the storage quota exceeds this maximum size.
This is not tested in production. You may hit some snags while trying to use this.
To ingest events from an Amazon Kinesis stream into Pinot, set the following configs into the table config
where the Kinesis specific properties are:
Environment Variables - AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
(RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY
and AWS_SECRET_KEY
(only recognized by Java SDK)
Java System Properties - aws.accessKeyId
and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials)
shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
environment variable is set and security manager has permission to access the variable,
Instance profile credentials delivered through the Amazon EC2 metadata service
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups. You can also specify other aws fields such as AWS_SESSION_TOKEN as environment variables and config and it will work.
ShardID is of the format "shardId-000000000001". We use the numeric part as partitionId. Our partitionId variable is integer. If shardIds grow beyond Integer.MAX_VALUE, we will overflow
Segment size based thresholds for segment completion will not work. It assumes that partition "0" always exists. However, once the shard 0 is split/merged, we will no longer have partition 0.
Apache Pinot allows user to consume data from streams and push it directly to pinot database. This process is known as Stream Ingestion. Stream Ingestion allows user to query data within seconds of publishing.
Stream Ingestion provides support for checkpoints out of the box for preventing data loss.
Stream ingestion requires the following steps -
Create schema configuration
Create table configuration
Upload table and schema spec
Let's take a look at each of the following steps in a bit more detail. Let us assume the data to be ingested is in the following format -
Schema defines the fields along with their data types which are available in the datasource. Schema also defines the fields which serve as dimensions
, metrics
and timestamp
respectively.
The realtime table configuration consists of the the following fields -
tableName - The name of the table where the data should flow
tableType - The internal type for the table. Should always be set to REALTIME
for realtime ingestion
segmentsConfig -
tableIndexConfig - defines which column to use for indexing along with the type of index. You can refer [Indexing Configs] for full configuration. It consists of the following required fields -
loadMode - specifies how the segments should be loaded. Should be one of heap
or mmap
. Here's the difference between both the configs
streamConfig - specifies the datasource along with the necessary configs to start consuming the realtime data. The streamConfig can be thought of as equivalent of job spec in case of batch ingestion. The following options are supported in this config -
You can also specify additional configs for the consumer by prefixing the key with stream.[streamType] where streamType
is the name of the streaming platform. For our sample data and schema, the table config will look like -
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, pinot will start ingesting available records from the topic.
Batch ingestion allows users to create a table using data already present in a file system such as S3. This is particularly useful for the cases where the user wants to utilize Pinot's ability to query large data with minimal latency or test out new features using a simple data file.
Ingesting data from a filesystem involves the following steps -
Define Schema
Define Table Config
Upload Schema and Table configs
Upload data
Batch Ingestion currently supports the following mechanisms to upload the data -
Standalone
Here we'll take a look at the standalone local processing to get you started.
Let's create a table for the following CSV data source.
In our data, the only column on which aggregations can be performed is score. Secondly, timestampInEpoch is the only timestamp column. So, on our schema, we keep score as metric and timestampInEpoch as timestamp column.
Here, we have also defined two extra fields - format and granularity. The format specifies the formatting of our timestamp column in the data source. Currently, it is in milliseconds hence we have specified 1:MILLISECONDS:EPOCH
We define a tabletranscript
and map the schema created in the previous step to the table. For batch data, we keep the tableType
as OFFLINE
Now that we have both the configs, we can simply upload them and create a table. To achieve that, just run the command -
Check out the table config and schema in the [Rest API] to make sure it was successfully uploaded.
We now have an empty table in pinot. So as the next step we will upload our CSV file to this table.
A table is composed of multiple segments. The segments can be created using three ways
1) Minion based ingestion 2) Upload API 3) Ingestion jobs
There are 2 Controller APIs that can be used for a quick ingestion test using a small file.
When these APIs are invoked, the controller has to download the file and build the segment locally.
Hence, these APIs are NOT meant for production environments and for large input files.
This API creates a segment using the given file and pushes it to Pinot. All steps happen on the controller. Example usage:
To upload a JSON file data.json
to a table called foo_OFFLINE
, use below command
Note that query params need to be URLEncoded. For example, {"inputFormat":"json"} in the command below needs to be converted to %7B%22inputFormat%22%3A%22json%22%7D.
The batchConfigMapStr
can be used to pass in additional properties needed for decoding the file. For example, in case of csv, you may need to provide the delimiter
This API creates a segment using file at the given URI and pushes it to Pinot. Properties to access the FS need to be provided in the batchConfigMap. All steps happen on the controller. Example usage:
Segments can be created and uploaded using tasks known as DataIngestionJobs. A job also needs a config of its own. We call this config the JobSpec.
For our CSV file and table, the job spec should look like below.
Now that we have the job spec for our table transcript
, we can trigger the job using the following command
Once the job has successfully finished, you can head over to the [query console] and start playing with the data.
There are 3 ways to upload a Pinot segment:
This is the original and default push mechanism.
Tar push requires the segment to be stored locally or can be opened as an InputStream on PinotFS. So we can stream the entire segment tar file to the controller.
The push job will:
Upload the entire segment tar file to the Pinot controller.
Pinot controller will:
Save the segment into the controller segment directory(Local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
This push mechanism requires the segment Tar file stored on a deep store with a globally accessible segment tar URI.
URI push is light-weight on the client-side, and the controller side requires equivalent work as the Tar push.
The push job will:
POST this segment Tar URI to the Pinot controller.
Pinot controller will:
Download segment from the URI and save it to controller segment directory(Local or any PinotFS).
Extract segment metadata.
Add the segment to the table.
This push mechanism also requires the segment Tar file stored on a deep store with a globally accessible segment tar URI.
Metadata push is light-weight on the controller side, there is no deep store download involves from the controller side.
The push job will:
Download the segment based on URI.
Extract metadata.
Upload metadata to the Pinot Controller.
Pinot Controller will:
Add the segment to the table based on the metadata.
When pinot segment files are created in external systems (Hadoop/spark/etc), there are several ways to push those data to the Pinot Controller and Server:
Push segment to other systems and implement your own segment fetcher to pull data from those systems.
Since pinot is written in Java, you can set the following basic java configurations to tune the segment runner job -
Log4j2 file location with -Dlog4j2.configurationFile
Plugin directory location with -Dplugins.dir=/opt/pinot/plugins
JVM props, like -Xmx8g -Xms4G
If you are using the docker, you can set the following under JAVA_OPTS
variable.
You can set -D mapreduce.map.memory.mb=8192
to set the mapper memory size when submitting the Hadoop job.
You can add config spark.executor.memory
to tune the memory usage for segment creation when submitting the Spark job.
Kinesis supports authentication using the . The credential provider looks for the credentials in the following order -
Follow for more details on schema configuration. For our sample data, the schema configuration should look as follows
The next step is to create a table where all the ingested data will flow and can be queried. Unlike batch ingestion, table configuration for realtime ingestion also triggers the data ingestion job.For a more detailed overview about tables, check out the reference.
We are working on adding more integrations such as Kinesis out of the box. You can easily write your on ingestion plugin in case it is not supported out of the box. Follow for a walkthrough.
Refer to
You can refer to for more details.
Push segment to shared NFS and let pinot pull segment files from the location of that NFS. See .
Push segment to a Web server and let pinot pull segment files from the Web server with HTTP/HTTPS link. See .
Push segment to PinotFS(HDFS/S3/GCS/ADLS) and let pinot pull segment files from PinotFS URI. See and .
The first three options are supported out of the box within the Pinot package. As long your remote jobs send Pinot controller with the corresponding URI to the files it will pick up the file and allocate it to proper Pinot Servers and brokers. To enable Pinot support for PinotFS, you will need to provide configuration and proper Hadoop dependencies.
By default, Pinot does not come with a storage layer, so all the data sent, won't be stored in case of a system crash. In order to persistently store the generated segments, you will need to change controller and server configs to add deep storage. Checkout for all the info and related configs.
Property | Description |
streamType | This should be set to "kinesis" |
stream.kinesis.topic.name | Kinesis stream name |
region | Kinesis region e.g. us-west-1 |
accessKey | Kinesis access key |
secretKey | Kinesis secret key |
shardIteratorType | Set to "LATEST" for largest offset (default),TRIM_HORIZON for earliest offset. |
maxRecordsToFetch | ... Default is 20. |
Config key | Description | Supported values |
streamType | the streaming platform from which to consume the data |
|
stream.[streamType].consumer.type | whether to use per partition low-level consumer or high-level stream consumer |
|
stream.[streamType].topic.name | the datasource (e.g. topic, data stream) from which to consume the data | String |
stream.[streamType].decoder.class.name | name of the class to be used for parsing the data. The class should implement | String |
stream.[streamType].consumer.factory.class.name | name of the factory class to be used to provide the appropriate implementation of low level and high level consumer as well as the metadata | String |
stream.[streamType].consumer.prop.auto.offset.reset | determines the offset from which to start the ingestion |
|
realtime.segment.flush.threshold.time | Time threshold that will keep the realtime segment open for before we complete the segment |
realtime.segment.flush.threshold.size | Row count flush threshold for realtime segments. This behaves in a similar way for HLC and LLC. For HLC, since there is only one consumer per server, this size is used as the size of the consumption buffer and determines after how many rows we flush to disk. For example, if this threshold is set to two million rows, then a high level consumer would have a buffer size of two million. If this value is set to 0, then the consumers adjust the number of rows consumed by a partition such that the size of the completed segment is the desired size (unless threshold.time is reached first) |
realtime.segment.flush.desired.size | The desired size of a completed realtime segment.This config is used only if |
This section contains a collection of short guides to show you how to import from a Pinot supported file system.
FileSystem is an abstraction provided by Pinot to access data in distributed file systems
Pinot uses the distributed file systems or the following purposes -
Batch Ingestion Job - To read the input data (CSV, Avro, Thrift etc.) and to write generated segments to DFS
Controller - When a segment is uploaded to controller, the controller saves it in the DFS configured.
Server - When a server(s) is notified of a new segment, the server copy the segment from remote DFS to their local node using the DFS abstraction
Pinot allows you to choose the distributed file system provider. Currently, the following file systems are supported by Pinot out of the box.
To use a distributed file-system, you need to enable plugins in pinot. To do that, specify the plugin directory and include the required plugins -
Now, You can proceed to change the filesystem in the controller
and server
config as follows -
Here, scheme
refers to the prefix used in URI of the filesystem. e.g. for the URI, s3://bucket/path/to/file
the scheme is s3
You can also change the filesystem during ingestion. In the ingestion job spec, specify the filesystem with the following config -
You can enable Amazon S3 Filesystem backend by including the plugin pinot-s3
.
By default Pinot loads all the plugins, so you can just drop this plugin there. Also, if you specify -Dplugins.include
, you need to put all the plugins you want to use, e.g. pinot-json
, pinot-avro
, pinot-kafka-2.0...
You can also configure the S3 filesystem using the following options:
Each of these properties should be prefixed by pinot.[node].storage.factory.s3.
where node
is either controller
or server
depending on the config
e.g.
S3 Filesystem supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order -
Environment Variables - AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
(RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY
and AWS_SECRET_KEY
(only recognized by Java SDK)
Java System Properties - aws.accessKeyId
and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials)
shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
environment variable is set and security manager has permission to access the variable,
Instance profile credentials delivered through the Amazon EC2 metadata service
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.
Upsert support in Apache Pinot.
Pinot provides native support of upsert during the real-time ingestion (v0.6.0+). There are scenarios that the records need modifications, such as correcting a ride fare and updating a delivery status.
With the foundation of full upsert support in Pinot, another category of use cases on partial upsert are enabled (v0.8.0+). Partial upsert is convenient to users so that they only need to specify the columns whose value changes, and ignore the others.
To enable upsert on a Pinot table, there are a couple of configurations to make on the table configurations as well as on the input stream.
To update a record, a primary key is needed to uniquely identify the record. To define a primary key, add the field primaryKeyColumns
to the schema definition. For example, the schema definition of UpsertMeetupRSVP
in the quick start example has this definition.
Note this field expects a list of columns, as the primary key can be composite.
When two records of the same primary key are ingested, the record with the greater event time (as defined by the time column) is used. When records with the same primary key and event time, then the order is not determined. In most cases, the later ingested record will be used, but may not be so in the cases when the table has a column to sort by.
An important requirement for the Pinot upsert table is to partition the input stream by the primary key. For Kafka messages, this means the producer shall set the key in the send
API. If the original stream is not partitioned, then a streaming processing job (e.g. Flink) is needed to shuffle and repartition the input stream into a partitioned one for Pinot's ingestion.
There are a few configurations needed in the table configurations to enable upsert.
For append-only tables, the upsert mode defaults to NONE
. To enable the full upsert, set the mode
to FULL
for the full update. For example:
Pinot also added the partial update support in v0.8.0+. To enable the partial upsert, set the mode
to PARTIAL
and specify partialUpsertStrategies
for partial upsert columns. For example:
Pinot supports the following partial upsert strategies -
By default, Pinot uses the value in the time column to determine the latest record. That means, for two records with the same primary key, the record with the larger value of the time column is picked as the latest update. However, there are cases when users need to use another column to determine the order. In such case, you can use option comparisonColumn
to override the column used for comparison. For example,
The upsert Pinot table can use only the low-level consumer for the input streams. As a result, it uses the partitioned replica-group assignment for the segments. Moreover,upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure the data consistency across the segments. Accordingly, it requires to use strictReplicaGroup
as the routing strategy. To use that, configure instanceSelectorType
in Routing
as the following:
There are some limitations for the upsert Pinot tables.
First, the high-level consumer is not allowed for the input stream ingestion, which means stream.kafka.consumer.type
must be lowLevel
.
Second, the star-tree index cannot be used for indexing, as the star-tree index performs pre-aggregation during the ingestion.
Putting these together, you can find the table configurations of the quick start example as the following:
To illustrate how the full upsert works, the Pinot binary comes with a quick start example. Use the following command to creates a realtime upsert table meetupRSVP
.
You can also run partial upsert demo with the following command
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to checkout the realtime data.
For partial upsert you can see only the value from configured column changed based on specified partial upsert strategy.
An example for partial upsert is shown below, each of the event_id kept being unique during ingestion, meanwhile the value of rsvp_count incremented.
To see the difference from the append-only table, you can use a query option skipUpsert
to skip the upsert effect in the query result.
This guide shows you how to import data from HDFS.
By default Pinot loads all the plugins, so you can just drop this plugin there. Also, if you specify -Dplugins.include
, you need to put all the plugins you want to use, e.g. pinot-json
, pinot-avro
, pinot-kafka-2.0...
HDFS implementation provides the following options -
hadoop.conf.path
: Absolute path of the directory containing hadoop XML configuration files such as hdfs-site.xml, core-site.xml .
hadoop.write.checksum
: create checksum while pushing an object. Default is false
hadoop.kerberos.principle
hadoop.kerberos.keytab
Each of these properties should be prefixed by pinot.[node].storage.factory.class.hdfs.
where node
is either controller
or server
depending on the config
You will also need to provide proper Hadoop dependencies jars from your Hadoop installation to your Pinot startup scripts.
To push HDFS segment files to Pinot controller, you just need to ensure you have proper Hadoop configuration as we mentioned in the previous part. Then your remote segment creation/push job can send the HDFS path of your newly created segment files to the Pinot Controller and let it download the files.
For example, the following curl requests to Controller will notify it to download segment files to the proper table:
Standalone Job:
Hadoop Job:
This guide shows you how to import data from files stored in Azure Data Lake Storage (ADLS)
You can enable the Azure Data Lake Storage using the plugin pinot-adls
. In the controller or server, add the config -
By default Pinot loads all the plugins, so you can just drop this plugin there. Also, if you specify -Dplugins.include
, you need to put all the plugins you want to use, e.g. pinot-json
, pinot-avro
, pinot-kafka-2.0...
Azure Blob Storage provides the following options -
accountName
: Name of the azure account under which the storage is created
accessKey
: access key required for the authentication
fileSystemName
- name of the filesystem to use i.e. container name (container name is similar to bucket name in S3)
enableChecksum
- enable MD5 checksum for verification. Default is false
.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.abfss.
where node
is either controller
or server
depending on the config
e.g.
This guide shows you how to import data from GCP (Google Cloud Platform).
By default Pinot loads all the plugins, so you can just drop this plugin there. Also, if you specify -Dplugins.include
, you need to put all the plugins you want to use, e.g. pinot-json
, pinot-avro
, pinot-kafka-2.0...
GCP filesystems provides the following options -
projectId
- The name of the Google Cloud Platform project under which you have created your storage bucket.
Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs.
where node
is either controller
or server
depending on the config
e.g.
This section contains a collection of guides that will show you how to import data from a Pinot supported input format.
Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time that goes in serialization-deserialization and speed up the ingestion.
The input format can be changed using the recordReaderSpec
config in the ingestion job spec.
The config consists of the following keys -
dataFormat
- Name of the data format to consume.
className
- name of the class that implements the RecordReader
interface. This class is used for parsing the data.
configClassName
- name of the class that implements the RecordReaderConfig
interface. This class is used the parse the values mentioned in configs
configs
- Key value pair for format specific configs. This field can be left out.
Pinot supports the multiple input formats out of the box. You just need to specify the corresponding readers and the associated custom configs to switch between the formats.
CSV Record Reader supports the following configs -
fileFormat
- can be one of default, rfc4180, excel, tdf, mysql
header
- header of the file. The columnNames should be seperated by the delimiter mentioned in the config
delimiter
- The character seperating the columns
multiValueDelimiter
- The character seperating multiple values in a single column. This can be used to split a column into a list.
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimeter field to empty in the ingestion config.
multiValueDelimiter: ''
The Avro record reader converts the data in file to a GenericRecord
. A java class or .avro
file is not required.
Note: Thrift requires the generated class using .thrift
file to parse the data. The .class file should be available in the Pinot's classpath. You can put the files in the lib/
folder of pinot distribution directory.
The above class doesn't read the Parquet INT96
and Decimal
type.
Please use the below class to handle INT96
and Decimal
type.
ORC record reader supports the following data types -
In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.
The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc
) from the .proto
file using the command -
You can enable the using the plugin pinot-hdfs
. In the controller or server, add the config:
The kerberos
configs should be used only if your Hadoop installation is secured with Kerberos. Please check on how to generate Kerberos security identification.
You can enable the using the plugin pinot-gcs
. In the controller or server, add the config -
gcpKey
- Location of the json file containing GCP keys. You can refer to download the keys.
Configuration
Description
region
The AWS Data center region in which the bucket is located
accessKey
(Optional) AWS access key required for authentication. This should only be used for testing purposes as we don't store these keys in secret.
secretKey
(Optional) AWS secret key required for authentication. This should only be used for testing purposes as we don't store these keys in secret.
endpoint
(Optional) Override endpoint for s3 client.
disableAcl
If this is set tofalse
, bucket owner is granted full access to the objects created by pinot. Default value is true
.
serverSideEncryption
(Optional) The server-side encryption algorithm used when storing this object in Amazon S3 (Now supports aws:kms
), set to null to disable SSE.
ssekmsKeyId
(Optional, but required when serverSideEncryption=aws:kms
) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.
ssekmsEncryptionContext
(Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.
Strategy
Description
OVERWRITE
Overwrite the column of the last record
INCREMENT
Add the new value to the existing values
APPEND
Add the new item to the Pinot unordered set
UNION
Add the new item to the Pinot unordered set if not exists
Parquet Data Type | Java Data Type | Comment |
INT96 | INT64 | Parquet to Pinot |
DECIMAL | DOUBLE |
ORC Data Type | Java Data Type |
BOOLEAN | String |
SHORT | Integer |
INT | Integer |
LONG | Integer |
FLOAT | Float |
DOUBLE | Double |
STRING | String |
VARCHAR | String |
CHAR | String |
LIST | Object[] |
MAP | Map<Object, Object> |
DATE | Long |
TIMESTAMP | Long |
BINARY | byte[] |
BYTE | Integer |
This page describes the different indexing techniques available in Pinot
Pinot supports the following indexing techniques:
Dictionary-encoded forward index with bit compression
Raw value forward index
Sorted forward index with run-length encoding
Bitmap inverted index
Sorted inverted index
Each of these techniques has advantages in different query scenarios. By default, Pinot creates a dictionary-encoded forward index for each column.
There are 2 ways to create indexes for a Pinot table.
Indexing is enabled by specifying the desired column names in the table config. More details about how to configure each type of index can be found in the respective index's section above or in the Table Config section.
Indexes can also be dynamically added to or removed from segments at any point. Update your table config with the latest set of indexes you wish to have. For example, if you had an inverted index on foo
and now want to include bar
, you would update your table config from this:
To this:
Next, invoke reload API. This API sends reload messages via Helix to all servers, as part of which indexes are added or removed from the local segments. This happens without any downtime and is completely transparent to the queries. In the case of the addition of an index, only the new index is created and appended to the existing segment. In the case of the removal of an index, its related states are cleaned up from Pinot servers. You can find this API under Segments
tab on Swagger:
Or you can also find this action on the Pinot UI, on the specific table's page.
The inverted index will provide good performance for most use cases, especially if your use case doesn't have a strict low latency requirement. You should start by using this, and if your queries aren't fast enough, switch to advanced indices like the sorted or Star-Tree index.
When inverted index is enabled for a column, Pinot maintains a map from each value to a bitmap, which makes value lookup to be constant time. When you have a column that is used for filtering frequently, adding an inverted index will improve the performance greatly.
Inverted index can be configured for a table by setting it in the table config as
Sorted forward index can directly be used as inverted index, with log(n)
time lookup and it can benefit from data locality.
For the below example, if the query has a filter on memberId
, Pinot will perform binary search on memberId
values to find the range pair of docIds for corresponding filtering value. If the query requires to scan values for other columns after filtering, values within the range docId pair will be located together; therefore, we can benefit a lot from data locality.
Sorted index performs much better than inverted index; however, it can only be applied to one column. When the query performance with inverted index is not good enough and most of queries have a filter on a specific column (e.g. memberId), sorted index can improve the query performance.
For each unique value from a column, we assign an id to it, and build a dictionary from the id to the value. Then in the forward index, we only store the bit-compressed ids instead of the values. With few number of unique values, dictionary-encoding can significantly improve the space efficiency of the storage.
The below diagram shows the dictionary encoding for two columns with integer
and string
types. As seen in the colA
, dictionary encoding will save significant amount of space for duplicated values. On the other hand, colB
has no duplicated data. Dictionary encoding will not compress much data in this case where there are a lot of unique values in the column. For string
type, we pick the length of the longest value and use it as the length for dictionary’s fixed length value array. In this case, padding overhead can be high if there are a large number of unique values for a column.
In contrast to the dictionary-encoded forward index, raw value forward index directly stores values instead of ids.
Without the dictionary, the dictionary lookup step can be skipped for each value fetch. Also, the index can take advantage of the good locality of the values, thus improve the performance of scanning large number of values.
A typical use case to apply raw value forward index is when the column has a large number of unique values and the dictionary does not provide much compression. As seen the above diagram for dictionary encoding, scanning values with a dictionary involves a lot of random access because we need to perform dictionary look up. On the other hand, we can scan values sequentially with raw value forward index and this can improve performance a lot when applied appropriately.
Raw value forward index can be configured for a table by setting it in the table config as
When a column is physically sorted, Pinot uses a sorted forward index with run-length encoding on top of the dictionary-encoding. Instead of saving dictionary ids for each document id, we store a pair of start and end document id for each value. (The below diagram does not include dictionary encoding layer for simplicity.)
Sorted forward index has the advantages of both good compression and data locality. Sorted forward index can also be used as inverted index.
Sorted index can be configured for a table by setting it in the table config as
Note: A given Pinot table can only have 1 sorted column
Real-time server will sort data on sortedColumn
when generating segment internally. For offline push, input data needs to be sorted before running Pinot segment conversion and push job.
When applied correctly, one can find the following information on the segment metadata.
Complex-type handling in Apache Pinot.
It's common for the ingested data to have complex structure. For example, Avro schema has records and arrays, and JSON data has objects and arrays. In Apache Pinot, the data model supports primitive data types (including int, long, float, double, string, bytes), as well as limited multi-value types such as an array of primitive types. Such simple data types allow Pinot to build fast indexing structures for good query performance, but it requires some handling on the complex structures. There are in general two options for such handling: convert the complex-type data into JSON string and then build JSON index; or use the inbuilt complex-type handling rules in the ingestion config.
In this page, we'll show how to handle this complex-type structure with these two approaches, to process the example data in the following figure, which is a field group
from the Meetup events Quickstart example. Note this object has two child fields, and the child group
is a nested array with the element of object type.
Apache Pinot provides powerful JSON index to accelerate the value lookup and filtering for the column. To convert an object group
with complex type to JSON, you can add the following config to table config.
Note the config transformConfigs
transforms the object group
to a JSON string group_json
, which then creates the JSON indexing with config jsonIndexColumns
. To read the full spec, please check out this file. Also note that group
is a reserved keyword in SQL, and that's why it's quoted in the transformFunction
.
Additionall, you need to overwrite the maxLength
of the field group_json
on the schema, because by default, a string column has a limited length. For example,
For the full spec, please check out this file.
With this, you can start to query the nested fields under group
. For the deatils about the supported JSON function, please check out this guide).
Though JSON indexing is a handy way to process the complex types, there are some limitations:
It’s not performant to group by or order by a JSON field, because JSON_EXTRACT_SCALAR
is needed to extract the values in the GROUP BY and ORDER BY clauses, which invokes the function evaluation.
For cases that you want to use Pinot's multi-column functions such as DISTINCTCOUNTMV
Alternatively, from Pinot 0.8, you can use the complex-type handling in ingestion configurations to flatten and unnest the complex structure and convert them into primitive types. Then you can reduce the complex-type data into a flattened Pinot table, and query it via SQL. With the inbuilt processing rules, you do not need to write ETL jobs in another compute framework such as Flink or Spark.
To process this complex-type, you can add the configuration complexTypeConfig
to the ingestionConfig
. For example:
With the complexTypeConfig
, all the map objects will be flattened to direct fields automatically. And with unnestFields
, a record with the nested collection will unnest into multiple records. For instance, the example in the beginning will transform into two rows with this configuration example.
Note that
The nested field group_id
under group
is flattened to field group.group_id
. The default value of the delimiter is .
, you can choose other delimiter by changing the configuration delimiter
under complexTypeConfig
. This flattening rule also apllies on the maps in the collections to be unnested.
The nested array group_topics
under group
is unnested into the top-level, and convert the output to a collection of two rows. Note the handling of the nested field within group_topics
, and the eventual top-level field of group.group_topics.urlkey
. All the collections to unnest shall be included in configuration fieldsToUnnest
.
For the collections not in specified in fieldsToUnnest
, the ingestion by default will serialize them into JSON string, except for the array of primitive values, which will be ingested as multi-value column by default. The behavior is defined in config collectionNotUnnestedToJson
with default value to NON_PRIMITIVE
. Other behaviors include (1) ALL
, which aslo convert the array of primitive values to JSON string; (2) NONE
, this does not do conversion, but leave it to the users to use transform functions for handling.
You can find the full spec of the table config here and the table schema here.
With the flattening/unnesting, you can then query the table with primitive values using the SQL query like:
Note .
is a reserved character in SQL, so you need to quote the flattened column.
When there are complex structures, it could be challenging and tedious to figure out the Pinot schema manually. To help the schema inference, Pinot provides utility tools to take the Avro schema or JSON data as input and output the inferred Pinot schema.
To infer the Pinot schema from Avro schema, you can use the command like the following
Note you can input configurations like fieldsToUnnest
similar to the ones in complexTypeConfig
. And this will simulate the complex-type handling rules on the Avro schema and output the Pinot schema in the file specified in outputDir
.
Similarly, you can use the command like the following to infer the Pinot schema from a file of JSON objects.
You can check out an example of this run in this PR.
Unlike other index techniques which work on single column, Star-Tree index is built on multiple columns, and utilize the pre-aggregated results to significantly reduce the number of values to be processed, thus improve the query performance.
One of the biggest challenges in realtime OLAP systems is achieving and maintaining tight SLA’s on latency and throughput on large data sets. Existing techniques such as sorted index or inverted index help improve query latencies, but speed-ups are still limited by number of documents necessary to process for computing the results. On the other hand, pre-aggregating the results ensures a constant upper bound on query latencies, but can lead to storage space explosion.
Here we introduce star-tree index to utilize the pre-aggregated documents in a smart way to achieve low query latencies but also use the storage space efficiently for aggregation/group-by queries.
Consider the following data set as an example to discuss the existing approaches:
In this approach, data is sorted on a primary key, which is likely to appear as filter in most queries in the query set.
This reduces the time to search the documents for a given primary key value from linear scan O(n) to binary search O(logn), and also keeps good locality for the documents selected.
While this is a good improvement over linear scan, there are still a few issues with this approach:
While sorting on one column does not require additional space, sorting on additional columns would require additional storage space to re-index the records for the various sort orders.
While search time is reduced from O(n) to O(logn), overall latency is still a function of total number of documents need to be processed to answer a query.
In this approach, for each value of a given column, we maintain a list of document id’s where this value appears.
Below are the inverted indexes for columns ‘Browser’ and ‘Locale’ for our example data set:
For example, if we want to get all the documents where ‘Browser’ is ‘Firefox’, we can simply look up the inverted index for ‘Browser’ and identify that it appears in documents [1, 5, 6].
Using inverted index, we can reduce the search time to constant time O(1). However, the query latency is still a function of the selectivity of the query, i.e. increases with the number of documents need to be processed to answer the query.
In this technique, we pre-compute the answer for a given query set upfront.
In the example below, we have pre-aggregated the total impressions for each country:
Doing so makes answering queries about total impressions for a country just a value lookup, by eliminating the need of processing a large number of documents. However, to be able to answer with multiple predicates implies pre-aggregating for various combinations of different dimensions. This leads to exponential explosion in storage space.
On one end of the spectrum we have indexing techniques that improve search times with limited increase in space, but do not guarantee a hard upper bound on query latencies. On the other end of the spectrum we have pre-aggregation techniques that offer hard upper bound on query latencies, but suffer from exponential explosion of storage space
Space-Time Trade Off Between Different Techniques
We propose the Star-Tree data structure that offers a configurable trade-off between space and time and allows us to achieve hard upper bound for query latencies for a given use case. In the following sections we will define the Star-Tree data structure, and discuss how it is utilized within Pinot for achieving low latencies with high throughput.
Tree structure
Star-tree is a tree data structure that is consisted of the following properties:
Star-tree Structure
Root Node (Orange): Single root node, from which the rest of the tree can be traversed.
Leaf Node (Blue): A leaf node can containing at most T records, where T is configurable.
Non-leaf Node (Green): Nodes with more than T records are further split into children nodes.
Star-Node (Yellow): Non-leaf nodes can also have a special child node called the Star-Node. This node contains the pre-aggregated records after removing the dimension on which the data was split for this level.
Dimensions Split Order ([D1, D2]): Nodes at a given level in the tree are split into children nodes on all values of a particular dimension. The dimensions split order is an ordered list of dimensions that is used to determine the dimension to split on for a given level in the tree.
Node properties
The properties stored in each node are as follows:
Dimension: The dimension which the node is split on
Start/End Document Id: The range of documents this node points to
Aggregated Document Id: One single document which is the aggregation result of all documents pointed by this node
Star-tree index is generated in the following steps:
The data is first projected as per the dimensionsSplitOrder. Only the dimensions from the split order are reserved, others are dropped. For each unique combination of reserved dimensions, metrics are aggregated per configuration. The aggregated documents are written to a file and served as the initial Star-Tree documents (separate from the original documents).
Sort the Star-Tree documents based on the dimensionsSplitOrder. It is primary-sorted on the first dimension in this list, and then secondary sorted on the rest of the dimensions based on their order in the list. Each node in the tree points to a range in the sorted documents.
The tree structure can be created recursively (starting at root node) as follows:
If a node has more than T records, it is split into multiple children nodes, one for each value of the dimension in the split order corresponding to current level in the tree.
A Star-Node can be created (per configuration) for the current node, by dropping the dimension being split on, and aggregating the metrics for rows containing dimensions with identical values. These aggregated documents are appended to the end of the Star-Tree documents.
If there is only one value for the current dimension, Star-Node won’t be created because the documents under the Star-Node are identical to the single node.
The above step is repeated recursively until there are no more nodes to split.
Multiple Star-Trees can be generated based on different configurations (dimensionsSplitOrder, aggregations, T)
Aggregation is configured as a pair of aggregation function and the column to apply the aggregation.
All types of aggregation function with bounded-sized intermediate result are supported.
Supported functions
COUNT
MIN
MAX
SUM
AVG
MIN_MAX_RANGE
DISTINCT_COUNT_HLL
PERCENTILE_EST
PERCENTILE_TDIGEST
DISTINCT_COUNT_BITMAP
NOTE: The intermediate result RoaringBitmap is not bounded-sized, use carefully on high cardinality columns)
Unsupported functions
DISTINCT_COUNT
Intermediate result Set is unbounded
SEGMENT_PARTITIONED_DISTINCT_COUNT:
Intermediate result Set is unbounded
PERCENTILE
Intermediate result List is unbounded
Functions to be supported
DISTINCT_COUNT_THETA_SKETCH
ST_UNION
Multiple index generation configurations can be provided to generate multiple star-trees. Each configuration should contain the following properties:
dimensionsSplitOrder: An ordered list of dimension names can be specified to configure the split order. Only the dimensions in this list are reserved in the aggregated documents. The nodes will be split based on the order of this list. For example, split at level i is performed on the values of dimension at index i in the list.
The star-tree dimension does not have to be a dimension column in the table, it can also be time column, date-time column, or metric column if necessary.
The star-tree dimension column should be dictionary encoded in order to generate the star-tree index.
All columns in the filter and group-by clause of a query should be included in this list in order to use the star-tree index.
skipStarNodeCreationForDimensions (Optional, default empty): A list of dimension names for which to not create the Star-Node.
functionColumnPairs: A list of aggregation function and column pairs (split by double underscore “__”). E.g. SUM__Impressions (SUM of column Impressions) or COUNT__*.
The column within the function-column pair can be either dictionary encoded or raw.
All aggregations of a query should be included in this list in order to use the star-tree index.
maxLeafRecords (Optional, default 10000): The threshold T to determine whether to further split each node.
Default star-tree index can be added to the segment with a boolean config enableDefaultStarTree under the tableIndexConfig.
The default star-tree will have the following configuration:
All dictionary-encoded single-value dimensions with cardinality smaller or equal to a threshold (10000) will be included in the dimensionsSplitOrder, sorted by their cardinality in descending order.
All dictionary-encoded Time/DateTime columns will be appended to the dimensionsSplitOrder following the dimensions, sorted by their cardinality in descending order. Here we assume that time columns will be included in most queries as the range filter column and/or the group by column, so for better performance, we always include them as the last elements in the dimensionsSplitOrder.
Include COUNT(*) and SUM for all numeric metrics in the functionColumnPairs.
Use default maxLeafRecords (10000).
For our example data set, in order to solve the following query efficiently:
We may config the star-tree index as following:
The star-tree and documents should be something like below:
The values in the parentheses are the aggregated sum of Impressions for all the documents under the node.
Star-tree documents
For query execution, the idea is to first check metadata to determine whether the query can be solved with the Star-Tree documents, then traverse the Star-Tree to identify documents that satisfy all the predicates. After applying any remaining predicates that were missed while traversing the Star-Tree to the identified documents, apply aggregation/group-by on the qualified documents.
The algorithm to traverse the tree can be described as follows:
Start from root node.
For each level, what child node(s) to select depends on whether there are any predicates/group-by on the split dimension for the level in the query.
If there is no predicate or group-by on the split dimension, select the Star-Node if exists, or all child nodes to traverse further.
If there are predicate(s) on the split dimension, select the child node(s) that satisfy the predicate(s).
If there is no predicate, but there is a group-by on the split dimension, select all child nodes except Star-Node.
Recursively repeat the previous step until all leaf nodes are reached, or all predicates are satisfied.
Collect all the documents pointed by the selected nodes.
If all predicates and group-by's are satisfied, pick the single aggregated document from each selected node.
Otherwise, collect all the documents in the document range from each selected node.
Range indexing allows user to get better performance for queries which involve filtering over a range. e.g.
SELECT COUNT(*) from baseballStats where hits > 11
Range index is just a variant of inverted index. Instead of creating mapping from values to columns, we create mapping of a range of values to columns. You can use the range index by setting the following config in the table config json.
Currently, range indexing is only supported for dictionary columns. Range indexing support for raw value columns is WIP.
Country | Browser | Locale | Impressions |
CA | Chrome | en | 400 |
CA | Firefox | fr | 200 |
MX | Safari | es | 300 |
MX | Safari | en | 100 |
USA | Chrome | en | 600 |
USA | Firefox | es | 200 |
USA | Firefox | en | 400 |
Browser | Doc Id |
Firefox | 1,5,6 |
Chrome | 0,4 |
Safari | 2,3 |
Locale | Doc Id |
en | 0,3,4,6 |
es | 2,5 |
fr | 1 |
Country | Impressions |
CA | 600 |
MX | 400 |
USA | 1200 |
Country | Browser | Locale | SUM__Impressions |
CA | Chrome | en | 400 |
CA | Firefox | fr | 200 |
MX | Safari | en | 100 |
MX | Safari | es | 300 |
USA | Chrome | en | 600 |
USA | Firefox | en | 400 |
USA | Firefox | es | 200 |
CA | * | en | 400 |
CA | * | fr | 200 |
CA | * | * | 600 |
MX | Safari | * | 400 |
USA | Firefox | * | 600 |
USA | * | en | 1000 |
USA | * | es | 200 |
USA | * | * | 1200 |
* | Chrome | en | 1000 |
* | Firefox | en | 400 |
* | Firefox | es | 200 |
* | Firefox | fr | 200 |
* | Firefox | * | 800 |
* | Safari | en | 100 |
* | Safari | es | 300 |
* | Safari | * | 400 |
* | * | en | 1500 |
* | * | es | 500 |
* | * | fr | 200 |
* | * | * | 2200 |
Bloom filter helps prune segments that do not contain any record matching a EQUALITY predicate, e.g.
SELECT COUNT(*) from baseballStats where playerID = 12345
There are 3 parameters to configure the bloom filter:
fpp
: False positive probability of the bloom filter (from 0
to 1
, 0.05
by default). The lower the fpp
, the higher accuracy the bloom filter has, but it will also increase the size of the bloom filter.
maxSizeInBytes
: Maximum size of the bloom filter (unlimited by default). If a certain fpp
generates a bloom filter larger than this size, we will increase the fpp
to keep the bloom filter size within this limit.
loadOnHeap
: Whether to load the bloom filter using heap memory or off-heap memory (false
by default).
There are 2 ways of configuring bloom filter for a table in the table config:
Configure bloom filter columns with default settings
Configure bloom filter columns with customized parameters
Currently bloom filter can only be applied to the dictionary-encoded columns. Bloom filter support for raw value columns is WIP.
JSON index can be applied to JSON string columns to accelerate the value lookup and filtering for the column.
JSON string can be used to represent the array, map, nested field without forcing a fixed schema. It is very flexible, but the flexibility comes with a cost - filtering on JSON string column is very expensive.
Suppose we have some JSON records similar to the following sample record stored in the person
column:
Without an index, in order to look up a key and filter records based on the value, we need to scan and reconstruct the JSON object from the JSON string for every record, look up the key and then compare the value.
For example, in order to find all persons whose name is "adam", the query will look like:
JSON index is designed to accelerate the filtering on JSON string columns without scanning and reconstructing all the JSON objects.
To enable the JSON index, set the following config in the table config:
Note that JSON index can only be applied to STRING
columns whose values are JSON strings.
JSON index can be used via the JSON_MATCH
predicate: JSON_MATCH(<column>, '<filterExpression>')
. For example, to find all persons whose name is "adam", the query will look like:
Note that the quotes within the filter expression need to be escaped.
In release 0.7.1
, we use the old syntax for filterExpression
: 'name=''adam'''
Find all persons whose name is "adam":
In release 0.7.1
, we use the old syntax for filterExpression: 'name=''adam'''
Find all persons who have an address (one of the addresses) with number 112:
In release 0.7.1
, we use the old syntax for filterExpression: 'addresses.number=112'
Find all persons whose name is "adam" and also have an address (one of the addresses) with number 112:
In release 0.7.1
, we use the old syntax for filterExpression: 'name=''adam'' AND addresses.number=112'
Find all persons whose first address has number 112:
In release 0.7.1
, we use the old syntax for filterExpression: '"addresses[0].number"=112'
Find all persons who have phone field within the JSON:
In release 0.7.1
, we use the old syntax for filterExpression: 'phone IS NOT NULL'
Find all persons whose first address does not contain floor field within the JSON:
In release 0.7.1
, we use the old syntax for filterExpression: '"addresses[0].floor" IS NULL'
The JSON context is maintained for object elements within an array, i.e. the filter won't cross match different objects in the array.
To find all persons who live on "main st" in "ca":
This query won't match "adam" because none of his addresses matches both the street and the country.
If JSON context is not desired, use multiple separate JSON_MATCH
predicates. E.g. to find all persons who have addresses on "main st" and have addressed in "ca" (doesn't have to be the same address):
This query will match "adam" because one of his addressed matches the street and another one matches the country.
Note that the array index is maintained as a separate entry within the element, so in order to query different elements within an array, multiple JSON_MATCH
predicates are required. E.g. to find all persons who have first address on "main st" and second address on "second st":
See examples above.
To find the records with array element "item1" in "arrayCol":
To find the records with second array element "item2" in "arrayCol":
To find the records with value 123 in "valueCol":
To find the records with null in "nullableCol":
In release 0.7.1
, json string must be object (cannot be null
, value or array); multi-dimensional array is not supported.
The key (left-hand side) of the filter expression must be the leaf level of the JSON object, e.g. "$.addresses[*]"='main st'
won't work.
The following summarizes Pinot's releases, from the latest one to the earliest one.
Before upgrading from one version to another one, please read the release notes. While the Pinot committers strive to keep releases backward-compatible and introduce new features in a compatible manner, your environment may have a unique combination of configurations/data/schema that may have been somehow overlooked. Before you roll out a new release of Pinot on your cluster, it is best that you run the compatibility test suite that Pinot provides. The tests can be easily customized to suit the configurations and tables in your pinot cluster(s). As a good practice, you should build your own test suite, mirroring the table configurations, schema, sample data, and queries that are used in your cluster.
This page talks about geospatial support in Pinot.
Pinot supports SQL/MM geospatial data and is compliant with the Open Geospatial Consortium’s (OGC) OpenGIS Specifications. This includes:
Geospatial data types, such as point, line and polygon;
Geospatial functions, for querying of spatial properties and relationships.
Geospatial indexing, used for efficient processing of spatial operations
Geospatial data types abstract and encapsulate spatial structures such as boundary and dimension. In many respects, spatial data types can be understood simply as shapes. Pinot supports the Well-Known Text (WKT) and Well-Known Binary (WKB) form of geospatial objects, for example:
POINT (0, 0)
LINESTRING (0 0, 1 1, 2 1, 2 2)
POLYGON (0 0, 10 0, 10 10, 0 10, 0 0),(1 1, 1 2, 2 2, 2 1, 1 1)
MULTIPOINT (0 0, 1 2)
MULTILINESTRING ((0 0, 1 1, 1 2), (2 3, 3 2, 5 4))
MULTIPOLYGON (((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1)), ((-1 -1, -1 -2, -2 -2, -2 -1, -1 -1)))
GEOMETRYCOLLECTION(POINT(2 0),POLYGON((0 0, 1 0, 1 1, 0 1, 0 0)))
It is common to have data in which the coordinate are geographics
or latitude/longitude.
Unlike coordinates in Mercator or UTM, geographic coordinates are not Cartesian coordinates.
Geographic coordinates do not represent a linear distance from an origin as plotted on a plane. Rather, these spherical coordinates describe angular coordinates on a globe.
Spherical coordinates specify a point by the angle of rotation from a reference meridian (longitude), and the angle from the equator (latitude).
You can treat geographic coordinates as approximate Cartesian coordinates and continue to do spatial calculations. However, measurements of distance, length and area will be nonsensical. Since spherical coordinates measure angular distance, the units are in degrees.
Pinot supports both geometry and geography types, which can be constructed by the corresponding functions as shown in section. And for the geography types, the measurement functions such as ST_Distance
and ST_Area
calculate the spherical distance and area on earth respectively.
For manipulating geospatial data, Pinot provides a set of functions for analyzing geometric components, determining spatial relationships, and manipulating geometries. In particular, geospatial functions that begin with the ST_
prefix support the SQL/MM specification.
Following geospatial functions are available out of the box in Pinot-
ST_Union(geometry[] g1_array) → Geometry This aggregate function returns a MULTI geometry or NON-MULTI geometry from a set of geometries. it ignores NULL geometries.
ST_GeomFromText(String wkt) → Geometry Returns a geometry type object from WKT representation, with the optional spatial system reference.
ST_GeomFromWKB(bytes wkb) → Geometry Returns a geometry type object from WKB representation.
ST_Point(double x, double y) → Point Returns a geometry type point object with the given coordinate values.
ST_Polygon(String wkt) → Polygon Returns a geometry type polygon object from WKT representation.
ST_GeogFromWKB(bytes wkb) → Geography Creates a geography instance from a Well-Known Binary geometry representation (WKB)
ST_GeogFromText(String wkt) → Geography Return a specified geography value from Well-Known Text representation or extended (WKT).
ST_Area(Geometry/Geography g) → double For geometry type, it returns the 2D Euclidean area of a geometry. For geography, returns the area of a polygon or multi-polygon in square meters using a spherical model for Earth.
ST_Distance(Geometry/Geography g1, Geometry/Geography g2) → double For geometry type, returns the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units. For geography, returns the great-circle distance in meters between two SphericalGeography points. Note that g1, g2 shall have the same type.
ST_GeometryType(Geometry g) → String Returns the type of the geometry as a string. e.g.: ST_Linestring
, ST_Polygon
,ST_MultiPolygon
etc.
ST_AsBinary(Geometry/Geography g) → bytes Returns the WKB representation of the geometry.
ST_AsText(Geometry/Geography g) → string Returns the WKT representation of the geometry/geography.
ST_Contains(Geometry, Geometry) → boolean Returns true if and only if no points of the second geometry/geography lie in the exterior of the first geometry/geography, and at least one point of the interior of the first geometry lies in the interior of the second geometry.
ST_Equals(Geometry, Geometry) → boolean Returns true if the given geometries represent the same geometry/geography.
ST_Within(Geometry, Geometry) → boolean Returns true if first geometry is completely inside second geometry.
Geospatial functions are typically expensive to evaluate, and using geoindex can greatly accelebrate the query evaluation. Geoindexing in Pinot is based on Uber’s H3, a hexagons-based hierarchical gridding. A given geospatial location (longitude, latitude) can map to one hexagon (represented as H3Index). And its neighbors in H3 can be approximated by a ring of hexagons. To quickly identify the distance between any given two geospatial locations, we can convert the two locations in the H3Index, and then check the H3 distance between them. H3 distance is measured as the number of hexagons. For example, in the diagram below, the red hexagons are within the 1 distance of the central hexagon. Moreover, the size of the hexagon is determined by the resolution of the indexing. Please check this table for the level of resolutions and the corresponding precision (measured in km).
To use the geoindex, first declare the geolocation field as bytes in the schema, as in the example of the QuickStart example.
Note the use of transformFunction
that converts the created point into SphericalGeography
format, which is needed in the ST_Distance
function.
Next, declare the geospatial index in the table config:
So the query below will use the geoindex to filter the Starbucks stores within 5km of the given point in the bay area.
Geoindex in Pinot accelerates the query evaluation without compromising the correctness of the query result. Currently, geoindex supports the ST_Distance
function used in the range predicates in the WHERE
clause, as shown in the query example in the previous section.
At the high level, geoindex is used for retrieving the records within the nearby hexagons of the given location, and then use ST_Distance
to accurately filter the matched results.
As in the example diagram above, if we want to find all relevant points within a given distance at San Francisco (represented in the area within the red circle), then the algorithm with geoindex works as the following:
Find the H3 distance x
that contains the range (i.e. red circle)
For the points within the H3 distance (i.e. covered by the hexagons within kRing(x)
), we can directly take those points without filtering
For the points falling into the H3 distance (i.e. in the hexagons of kRing(x)
), we do filtering on them by evaluating the condition ST_Distance(loc1, loc2) < x
This page talks about support for text search functionality in Pinot.
Pinot supports super fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index and sorted index. An example:
In the above query, we are doing exact match on two columns of type STRING and INT respectively.
For arbitrary text data which falls into the BLOB/CLOB territory, we need more than exact matches. Users are interested in doing regex, phrase, fuzzy queries on BLOB like data. Before 0.3.0, one had to use regexp_like to achieve this. However, this was scan based which was not performant and features like fuzzy search (edit distance search) were not possible.
In version 0.3.0, we added support for text indexes to efficiently do arbitrary search on STRING columns where each column value is a large BLOB of text. This can be achieved by using the new built-in function TEXT_MATCH.
where <column_name> is the column text index is created on and <search_expression> can be:
Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.
Consider the following snippet from Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.
Let's say the following snippet of data is stored in ACCESS_LOG_COL column in Pinot table.
Few examples of search queries on this data:
Count the number of GET requests.
Count the number of POST requests that have administrator in the URL (administrator/index)
Count the number of POST requests that have a particular URL and handled by Firefox browser
Consider another example of simple resume text. Each line in the file represents skill-data from resumes of different candidates
Let's say the following snippet of data is stored in SKILLS_COL column in Pinot table. Each line in the input text represents a column value.
Few examples of search queries on this data:
Count the number of candidates that have "machine learning" and "gpu processing" - a phrase search (more on this further in the document) where we are looking for exact match of phrases "machine learning" and "gpu processing" not necessarily in the same order in original data.
Count the number of candidates that have "distributed systems" and either 'Java' or 'C++' - a combination of searching for exact phrase "distributed systems" along with other terms.
Consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in QUERY_LOG_COL column in Pinot table.
Few examples of search queries on this data:
Count the number of queries that have GROUP BY
Count the number of queries that have the SELECT count... pattern
Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY
Further sections in the document cover several concrete examples on each kind of query and step-by-step guide on how to write text search queries in Pinot.
Currently we support text search in a restricted manner. More specifically, we have the following constraints:
The column type should be STRING.
The column should be single-valued.
Co-existence of text index with other Pinot indexes is currently not supported.
The last two restrictions are going to be relaxed very soon in the upcoming releases.
Currently, a column in Pinot can be dictionary encoded or stored RAW. Furthermore, we can create inverted index on the dictionary encoded column. We can also create a sorted index on the dictionary encoded column.
Text index is an addition to the type of per-column indexes users can create in Pinot. However, the current implementation supports text index on RAW column. In other words, the column should not be dictionary encoded. As we relax this constraint in upcoming releases, text index can be created on a dictionary encoded column that also has other indexes (inverted, sorted etc).
Similar to other indexes, users can enable text index on a column through table config. As part of text-search feature, we have also introduced a new generic way of specifying the per-column encoding and index information. In the table config, there will be a new section with name "fieldConfigList".
fieldConfigList
is currently ONLY used for text indexes. Our plan is to migrate all other indexes to this model. We are going to do that in upcoming releases and accordingly modify user documentation. So please continue to specify other index info in table config as you have done till now and use the fieldConfigList
only for text indexes.
"fieldConfigList" will be a new section in table config. It is essentially a list of per-column encoding and index information. In the above example, the list contains text index information for two columns text_col_1 and text_col_2. Each object in fieldConfigList contains the following information
name - Name of the column text index is enabled on
encodingType - As mentioned earlier, we can store a column either as RAW or dictionary encoded. Since for now we have a restriction on the text index, this should always be RAW.
indexType - This should be TEXT.
Since we haven't yet removed the old way of specifying the index info, each column that has a text index should also be specified as noDictionaryColumns
in tableIndexConfig
:
The above mechanism can be used to configure text indexes in the following scenarios:
Adding a new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling text index on an existing column.
Once the text index is enabled on one or more columns through table config, our segment generation code will pick up the config and automatically create text index (per column). This is exactly how other indexes in Pinot are created.
Text index is supported for both offline and realtime segments.
The original text document (a value in the column with text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.
Pinot's text index is built on top of Lucene. Lucene's standard english text tokenizer generally works well for most classes of text. We might want to build custom text parser and tokenizer to suit particular user requirements. Accordingly, we can make this configurable for the user to specify on per column text index basis.
A new built-in function TEXT_MATCH has been introduced for using text search in SQL/PQL.
TEXT_MATCH(text_column_name, search_expression)
text_column_name - name of the column to do text search on.
search_expression - search query
We can use TEXT_MATCH function as part of our queries in the WHERE clause. Examples:
We can also use the TEXT_MATCH filter clause with other filter operators. For example:
Combining multiple TEXT_MATCH filter clauses
TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot
Selection query which projects one or more columns
User can also include the text column name in select list
Aggregation query
Aggregation GROUP BY query
The search expression (second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index. **Following expression types are supported
This query is used to do exact match of a given phrase. Exact match implies that terms in the user specified phrase should appear in the exact same order in the original text document. Note that document is referred to as the column value.
Let's take the example of resume text data containing 14 documents to walk through queries. The data is stored in column named SKILLS_COL and we have created a text index on this column.
Example 1 - Search in SKILL_COL column to look for documents where each matching document MUST contain phrase "distributed systems" as is
The search expression is '\"Distributed systems\"'
The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
'\"<your phrase>\"'
The above query will match the following documents:
But it won't match the following document:
This is because the phrase query looks for the phrase occurring in the original document "as is". The terms as specified by the user in phrase should be in the exact same order in the original document for the document to be considered as a match.
NOTE: Matching is always done in a case-insensitive manner.
Example 2 - Search in SKILL_COL column to look for documents where each matching document MUST contain phrase "query processing" as is
The above query will match the following documents:
Term queries are used to search for individual terms
Example 3 - Search in SKILL_COL column to look for documents where each matching document MUST contain the term 'java'
As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.
Boolean operators AND, OR are supported and we can use them to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner
Example 4 - Search in SKILL_COL column to look for documents where each matching document MUST contain phrases "distributed systems" and "tensor flow". This combines two phrases using AND boolean operator
The above query will match the following documents:
Example 5 - Search in SKILL_COL column to look for documents where each document MUST contain phrase "machine learning" and term 'gpu' and term 'python'. This combines a phrase and two terms using boolean operator
The above query will match the following documents:
When using boolean operators to combine term(s) and phrase(s) or both, please note that:
The matching document can contain the terms and phrases in any order.
The matching document may not have the terms adjacent to each other (if this is needed, please use appropriate phrase query for the concerned terms).
Use of OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using AND operator in the search expression, OR operator is used by default:
Example 6 - Search in SKILL_COL column to look for documents where each document MUST contain ANY one of:
phrase "distributed systems" OR
term 'java' OR
term 'C++'.
We can also do grouping using parentheses:
Example 7 - Search in SKILL_COL column to look for documents where each document MUST contain
phrase "distributed systems" AND
at least one of the terms Java or C++
In the below query, we group terms Java and C++ without any operator which implies the use of OR. The root operator AND is used to combine this with phrase "distributed systems"
Prefix searches can also be done in the context of a single term. We can't use prefix matches for phrases.
Example 8 - Search in SKILL_COL column to look for documents where each document MUST contain text like stream, streaming, streams etc
The above query will match the following documents:
Phrase and term queries work on the fundamental logic of looking up the terms (aka tokens) in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.
Based on the nature of original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.
Consider server log as an example and we want to look for exceptions. A regex query is suitable for this scenario as it is unlikely that 'exception' is present as an individual indexed token.
Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).
The above query will match any text document containing exception.
Generally, a combination of phrase and term queries using boolean operators and grouping should allow us to build a complex text search query expression.
The key thing to remember is that phrases should be used when the order of terms in the document is important and if separating the phrase into individual terms doesn't make sense from end user's perspective.
An example would be phrase "machine learning".
However, if we are searching for documents matching Java and C++ terms, using phrase query "Java C++" will actually result in in partial results (could be empty too) since now we are relying the on the user specifying these skills in the exact same order (adjacent to each other) in the resume text.
Term query using boolean AND operator is more appropriate for such cases
This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals
This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals in GROUP BY and ORDER BY clause, array transform functions, adding push job type of segment metadata only mode, and some new APIs like updating instance tags, new health check endpoint. It also contains many key bug fixes. See details below.
Add table level lock for segment upload ([#6165])
Brokers should be upgraded before servers in order to keep backward-compatible:
Pinot Components have to be deployed in the following order:
(PinotServiceManager -> Bootstrap services in role ServiceRole.CONTROLLER -> All remaining bootstrap services in parallel)
New settings introduced and old ones deprecated:
This aggregation function is still in beta version. This PR involves change on the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version:
This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations.
This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for pinot connections, and various performance optimizations and improvements.
It also adds several new APIs to better manage the segments and upload data to the offline table. It also contains many key bug fixes. See details below.
and the following cherry-picks:
First, configure alternate ingress ports for https/netty-tls on brokers, controllers, and servers. Restart the components with a rolling strategy to avoid cluster downtime.
Second, verify manually that https access to controllers and brokers is live. Then, configure all components to prefer TLS-enabled connections (while still allowing unsecured access). Restart the individual components.
Third, disable insecure connections via configuration. You may also have to set controller.vip.protocol and controller.vip.port and update the configuration files of any ingestion jobs. Restart components a final time and verify that insecure ingress via http is not available anymore.
Apache Pinot has adopted SQL syntax and semantics. Legacy PQL (Pinot Query Language) is deprecated and no longer supported. Please use SQL syntax to query Pinot on broker endpoint /query/sql and controller endpoint /sql
Fix license headers and plugin checks
This release introduced several new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins.
This release introduced several awesome new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins (AWS Kinesis, Apache Pulsar). It contains a lot of query enhancements such as new timestamp
and boolean
type support and flexible numerical column comparison. It also includes many key bug fixes. See details below.
The release was cut from the following commit: fe83e95aa9124ee59787c580846793ff7456eaa5
and the following cherry-picks:
))
— timeColumnTransformFunction
is removed (backward-incompatible, but rollup is not supported anyway)
— Deprecate collectorType
and replace it with mergeType
— Add roundBucketTimePeriod
and partitionBucketTimePeriod
to config the time bucket for round and partition
This release introduces a new features: Segment Merge and Rollup to simplify users day to day operational work. A new metrics plugin is added to support dropwizard. As usual, new functionalities and many UI/ Performance improvements.
LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This was leading to slow operations in Helix/Zookeeper, long running queries due to having too many tasks to process, as well as using more space because of a lack of compression.
To solve this problem they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.
At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.
Major Changes:
This release also sees improvements to Pinot’s query console UI.
There have also been improvements and additions to Pinot’s SQL implementation.
This release contains many performance improvement, you may sense it for you day to day queries. Thanks to all the great contributions listed below:
The release was cut from the following commit: and the following cherry-picks:
Tiered storage ()
Upsert feature (, , , , )
Pre-generate aggregation functions in QueryContext ()
Adding controller healthcheck endpoint: /health ()
Add pinot-spark-connector ()
Support multi-value non-dictionary group by ()
Support type conversion for all scalar functions ()
Add additional datetime functionality ()
Support post-aggregation in ORDER-BY ()
Support post-aggregation in SELECT ()
Add RANGE FilterKind to support merging ranges for SQL ()
Add HAVING support (5889)
Support for exact distinct count for non int data types ()
Add max qps bucket count ()
Add Range Indexing support for raw values ()
Add IdSet and IdSetAggregationFunction ()
[Deepstore by-pass]Add a Deepstore bypass integration test with minor bug fixes. ()
Add Hadoop counters for detecting schema mismatch ()
Add RawThetaSketchAggregationFunction ()
Instance API to directly updateTags ()
Add streaming query handler ()
Add InIdSetTransformFunction ()
Add ingestion descriptor in the header ()
Zookeeper put api ()
Feature/#5390 segment indexing reload status api ()
Segment processing framework ()
Support streaming query in QueryExecutor ()
Add list of allowed tables for emitting table level metrics ()
Add FilterOptimizer which supports optimizing both PQL and SQL query filter ()
Adding push job type of segment metadata only mode ()
Minion taskExecutor for RealtimeToOfflineSegments task (, )
Adding array transform functions: array_average, array_max, array_min, array_sum ()
Allow modifying/removing existing star-trees during segment reload ()
Implement off-heap bloom filter reader ()
Support for multi-threaded Group By reducer for SQL. ()
Add OnHeapGuavaBloomFilterReader ()
Support using ordinals in GROUP BY and ORDER BY clause ()
Merge common APIs for Dictionary ()
Added recursive functions validation check for group by ()
Add StrictReplicaGroupInstanceSelector ()
Add IN_SUBQUERY support ()
Add IN_PARTITIONED_SUBQUERY support ()
Some UI features (, , , )
Change group key delimiter from '\t' to '\0' ()
Support for exact distinct count for non int data types ()
Starts Broker and Server in parallel when using ServiceManager ()
Make realtime threshold property names less ambiguous ()
Change Signature of Broker API in Controller ()
Enhance DistinctCountThetaSketchAggregationFunction ()
Improve performance of DistinctCountThetaSketch by eliminating empty sketches and unions. ()
Enhance VarByteChunkSVForwardIndexReader to directly read from data buffer for uncompressed data ()
Fixing backward-compatible issue of schema fetch call ()
Fix race condition in MetricsHelper ()
Fixing the race condition that segment finished before ControllerLeaderLocator created. ()
Fix CSV and JSON converter on BYTES column ()
Fixing the issue that transform UDFs are parsed as function name 'OTHER', not the real function names ()
Incorporating embedded exception while trying to fetch stream offset ()
Use query timeout for planning phase ()
Add null check while fetching the schema ()
Validate timeColumnName when adding/updating schema/tableConfig ()
Handle the partitioning mismatch between table config and stream ()
Fix built-in virtual columns for immutable segment ()
Refresh the routing when realtime segment is committed ()
Add support for Decimal with Precision Sum aggregation ()
Fixing the calls to Helix to throw exception if zk connection is broken ()
Allow modifying/removing existing star-trees during segment reload ()
Add max length support in schema builder ()
Enhance star-tree to skip matching-all predicate on non-star-tree dimension ()
Make realtime threshold property names less ambiguous ()
Enhance DistinctCountThetaSketchAggregationFunction ()
Deep Extraction Support for ORC, Thrift, and ProtoBuf Records ()
The release was cut from the following commit:
Add a server metric: queriesDisabled
to check if queries disabled or not. ()
Optimization on GroupKey to save the overhead of ser/de the group keys () ()
Support validation for jsonExtractKey
and jsonExtractScalar
functions () ()
Real Time Provisioning Helper tool improvement to take data characteristics as input instead of an actual segment ()
Add the isolation level config isolation.level
to Kafka consumer (2.0) to ingest transactionally committed messages only ()
Enhance StarTreeIndexViewer to support multiple trees ()
Improves ADLSGen2PinotFS with service principal based auth, auto create container on initial run. It's backwards compatible with key based auth. ()
Add metrics for minion tasks status ()
Use minion data directory as tmp directory for SegmentGenerationAndPushTask to ensure directory is always cleaned up ()
Add optional HTTP basic auth to pinot broker, which enables user- and table-level authentication of incoming queries. ()
Add Access Control for REST endpoints of Controller ()
Add date_trunc to scalar functions to support date_trunc during ingestion ()
Allow tar gz with > 8gb size ()
Add Lookup UDF Join support (), (), () ()
Add cron scheduler metrics reporting ()
Support generating derived column during segment load, so that derived columns can be added on-the-fly ()
Support chained transform functions ()
Add scalar function JsonPathArray to extract arrays from json ()
Add a guard against multiple consuming segments for same partition ()
Remove the usage of deprecated range delimiter ()
Handle scheduler calls with proper response when it's disabled. ()
Simplify SegmentGenerationAndPushTask handling getting schema and table config ()
Add a cluster config to config number of concurrent tasks per instance for minion task: SegmentGenerationAndPushTaskGenerator ()
Replace BrokerRequestOptimizer with QueryOptimizer to also optimize the PinotQuery ()
Add additional string scalar functions ()
Add additional scalar functions for array type ()
Add CRON scheduler for Pinot tasks ()
Set default Data Type while setting type in Add Schema UI dialog ()
Add ImportData sub command in pinot admin ()
H3-based geospatial index () ()
Add JSON index support () () ()
Make minion tasks pluggable via reflection ()
Add compatibility test for segment operations upload and delete ()
Add segment reset API that disables and then enables the segment ()
Add Pinot minion segment generation and push task. ()
Add a version option to pinot admin to show all the component versions ()
Add FST index using lucene lib to speedup REGEXP_LIKE operator on text ()
Add APIs for uploading data to an offline table. ()
Allow the use of environment variables in stream configs ()
Enhance task schedule api for single type/table support ()
Add broker time range based pruner for routing. Query operators supported: RANGE, =, <, <=, >, >=, AND, OR
()
Add json path functions to extract values from json object ()
Create a pluggable interface for Table config tuner ()
Add a Controller endpoint to return table creation time ()
Add tooltips, ability to enable-disable table state to the UI ()
Add Pinot Minion client ()
Add more efficient use of RoaringBitmap in OnHeapBitmapInvertedIndexCreator and OffHeapBitmapInvertedIndexCreator ()
Add decimal percentile support. ()
Add API to get status of consumption of a table ()
Add support to add offline and realtime tables, individually able to add schema and schema listing in UI ()
Improve performance for distinct queries ()
Allow adding custom configs during the segment creation phase ()
Use sorted index based filtering only for dictionary encoded column ()
Enhance forward index reader for better performance ()
Support for text index without raw ()
Add api for cluster manager to get table state ()
Perf optimization for SQL GROUP BY ORDER BY ()
Add support using environment variables in the format of ${VAR_NAME:DEFAULT_VALUE}
in Pinot table configs. ()
Pinot controller metrics prefix is fixed to add a missing dot (). This is a backward-incompatible change that JMX query on controller metrics must be updated
Legacy group key delimiter (\t) was removed to be backward-compatible with release 0.5.0 ()
Upgrade zookeeper version to 3.5.8 to fix ZOOKEEPER-2184: Zookeeper Client should re-resolve hosts when connection attempts fail. ()
Add TLS-support for client-pinot and pinot-internode connections () Upgrades to a TLS-enabled cluster can be performed safely and without downtime. To achieve a live-upgrade, go through the following steps:
PQL endpoint on Broker is deprecated ()
Fix the SIGSEGV for large index ()
Handle creation of segments with 0 rows so segment creation does not fail if data source has 0 rows. ()
Fix QueryRunner tool for multiple runs ()
Use URL encoding for the generated segment tar name to handle characters that cannot be parsed to URI. ()
Fix a bug of miscounting the top nodes in StarTreeIndexViewer ()
Fix the raw bytes column in real-time segment ()
Fixes a bug to allow using JSON_MATCH predicate in SQL queries ()
Fix the overflow issue when loading the large dictionary into the buffer ()
Fix empty data table for distinct query ()
Fix the default map return value in DictionaryBasedGroupKeyGenerator ()
Fix log message in ControllerPeriodicTask ()
Fix bug : RealtimeTableDataManager shuts down SegmentBuildTimeLeaseExtender for all tables in the host ()
Extract time handling for SegmentProcessorFramework ()
Add Apache Pulsar low level and high level connector ()
Enable parallel builds for compat checker ()
Add controller/server API to fetch aggregated segment metadata ()
Support Dictionary Based Plan For DISTINCT ()
Provide HTTP client to kinesis builder ()
Add datetime function with 2 arguments ()
Adding ability to check ingestion status for Offline Pinot table ()
Add timestamp datatype support in JDBC ()
Allow updating controller and broker helix hostname ()
Cancel running Kinesis consumer tasks when timeout occurs ()
Implement Append merger for partial upsert ()
`* SegmentProcessorFramework Enhancement ()
Added TaskMetricsEmitted periodic controler job ()
Support json path expressions in query. ()
Support data preprocessing for AVRO and ORC formats ()
Add partial upsert config and mergers ()
Add support for range index rule recommendation(#7034) ()
Allow reloading consuming segment by default ()
Add LZ4 Compression Codec (#6804) ([#7035](
Make Pinot JDK 11 Compilable (\
Introduce in-Segment Trim for GroupBy OrderBy Query ()
Produce GenericRow file in segment processing mapper ()
Add ago() scalar transform function ()
Add Bloom Filter support for IN predicate(#7005) ()
Add genericRow file reader and writer ()
Normalize LHS and RHS numerical types for >, >=, <, and <= operators. ()
Add Kinesis Stream Ingestion Plugin ()
feature/#6766 JSON and Startree index information in API ()
Support null value fields in generic row ser/de ()
Implement PassThroughTransformOperator to optimize select queries(#6972) ()
Optimize TIME_CONVERT/DATE_TIME_CONVERT predicates ()
Prefetch call to fetch buffers of columns seen in the query ()
Enabling compatibility tests in the script ()
Add collectionToJsonMode to schema inference ()
Add the complex-type support to decoder/reader ()
Adding a new Controller API to retrieve ingestion status for realtime… ()
Add support for Long in Modulo partition function. ()
Enhance PinotSegmentRecordReader to preserve null values ()
add complex-type support to avro-to-pinot schema inference ()
Add correct yaml files for real time data(#6787) ()
Add complex-type transformation to offline segment creation ()
Add config File support(#6787) ()
Enhance JSON index to support nested array ()
Add debug endpoint for tables. ()
JSON column datatype support. ()
Allow empty string in MV column ()
Add Zstandard compression support with JMH benchmarking(#6804) ()
Normalize LHS and RHS numerical types for = and != operator. ()
Change ConcatCollector implementation to use off-heap ()
[PQL Deprecation] Clean up the old BrokerRequestOptimizer ()
[PQL Deprecation] Do not compile PQL broker request for SQL query ()
Add TIMESTAMP and BOOLEAN data type support ()
Add admin endpoint for Pinot Minon. ()
Remove the usage of PQL compiler ()
Add endpoints in Pinot Controller, Broker and Server to get system and application configs. ()
Support IN predicate in ColumnValue SegmentPruner(#6756) ()
Enable adding new segments to a upsert-enabled realtime table ()
Interface changes for Kinesis connector ()
Pinot Minion SegmentGenerationAndPush task: PinotFS configs inside taskSpec is always temporary and has higher priority than default PinotFS created by the minion server configs ()
DataTable V3 implementation and measure data table serialization cost on server ()
add uploadLLCSegment endpoint in TableResource ()
File-based SegmentWriter implementation ()
Basic Auth for pinot-controller ()
UI integration with Authentication API and added login page ()
Support data ingestion for offline segment in one pass ()
SumPrecision: support all data types and star-tree ()
complete compatibility regression testing ()
Kinesis implementation Part 1: Rename partitionId to partitionGroupId ()
Make Pinot metrics pluggable ()
Recover the segment from controller when LLC table cannot load it ()
Adding a new API for validating specified TableConfig and Schema ()
Introduce a metric for query/response size on broker. ()
Adding a controller periodic task to clean up dead minion instances ()
Adding new validation for Json, TEXT indexing ()
Always return a response from query execution. ()
After the 0.8.0 release, we will officially support jdk 11, and can now safely start to use jdk 11 features. Code is still compilable with jdk 8 ()
RealtimeToOfflineSegmentsTask config has some backward incompatible changes ()
Regex path for pluggable MinionEventObserverFactory
is changed from org.apache.pinot.*.event.*
to org.apache.pinot.*.plugin.minion.tasks.*
()
Moved all pinot built-in minion tasks to the pinot-minion-builtin-tasks
module and package them into a shaded jar ()
Reloading consuming segment flag pinot.server.instance.reload.consumingSegment
will be true by default ()
Move JSON decoder from pinot-kafka
to pinot-json
package. ()
Backward incompatible schema change through controller rest API PUT /schemas/{schemaName}
will be blocked. ()
Deprecated /tables/validateTableAndSchema
in favor of the new configs/validate API and introduced new APIs for /tableConfigs
to operate on the realtime table config, offline table config and schema in one shot. ()
Fix race condition in MinionInstancesCleanupTask ()
Fix custom instance id for controller/broker/minion ()
Fix UpsertConfig JSON deserialization. ()
Fix the memory issue for selection query with large limit ()
Fix the deleted segments directory not exist warning ()
Fixing docker build scripts by providing JDK_VERSION as parameter ()
Misc fixes for json data type ()
Fix handling of date time columns in query recommender(#7018) ()
fixing pinot-hadoop and pinot-spark test ()
Fixing HadoopPinotFS listFiles method to always contain scheme ()
fixed GenericRow compare for different _fieldToValueMap size ()
Fix NPE in NumericalFilterOptimizer due to IS NULL and IS NOT NULL operator. ()
Fix the race condition in realtime text index refresh thread (#6858) ()
Fix deep store directory structure ()
Fix NPE issue when consumed kafka message is null or the record value is null. ()
Mitigate calcite NPE bug. ()
Fix the exception thrown in the case that a specified table name does not exist (#6328) ()
Fix CAST transform function for chained transforms ()
Fixed failing pinot-controller npm build ()
The release was cut from the following commit: and the following cherry-picks: ,
Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor ()
Merge/Rollup task scheduler for offline tables. ()
Fix MergeRollupTask uploading segments not updating their metadata ()
MergeRollupTask integration tests ()
Add mergeRollupTask delay metrics ()
MergeRollupTaskGenerator enhancement: enable parallel buckets scheduling ()
Use maxEndTimeMs for merge/roll-up delay metrics. ()
Cmd+Enter shortcut to run query in query console ()
Showing tooltip in SQL Editor ()
Make the SQL Editor box expandable ()
Fix tables ordering by number of segments ()
IN ()
LASTWITHTIME ()
ID_SET on MV columns ()
Raw results for Percentile TDigest and Est (),
Add timezone as argument in function toDateTime ()
LIKE()
REGEXP_EXTRACT()
FILTER()
Infer data type for Literal ()
Support logical identifier in predicate ()
Support JSON queries with top-level array path expression. ()
Support configurable group by trim size to improve results accuracy ()
Reduce the disk usage for segment conversion task ()
Simplify association between Java Class and PinotDataType for faster mapping ()
Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation ()
Replace MINUS with STRCMP ()
Bit-sliced range index for int, long, float, double, dictionarized SV columns ()
Use MethodHandle to access vectorized unsigned comparison on JDK9+ ()
Add option to limit thread usage per query ()
Improved range queries ()
Faster bitmap scans ()
Optimize EmptySegmentPruner to skip pruning when there is no empty segments ()
Map bitmaps through a bounded window to avoid excessive disk pressure ()
Allow RLE compression of bitmaps for smaller file sizes ()
Support raw index properties for columns with JSON and RANGE indexes ()
Enhance BloomFilter rule to include IN predicate() ()
Introduce LZ4_WITH_LENGTH
chunk compression type ()
Enhance ColumnValueSegmentPruner and support bloom filter prefetch ()
Apply the optimization on dictIds within the segment to DistinctCountHLL aggregation func ()
During segment pruning, release the bloom filter after each segment is processed ()
Fix JSONPath cache inefficient issue ()
Optimize getUnpaddedString with SWAR padding search ()
Lighter weight LiteralTransformFunction, avoid excessive array fills ()
Inline binary comparison ops to prevent function call overhead ()
Memoize literals in query context in order to deduplicate them ()
Human Readable Controller Configs ()
Add the support of geoToH3 function ()
Add Apache Pulsar as Pinot Plugin () ()
Add dropwizard metrics plugin ()
Introduce OR Predicate Execution On Star Tree Index ()
Allow to extract values from array of objects with jsonPathArray ()
Add Realtime table metadata and indexes API. ()
Support array with mixing data types ()
Support force download segment in reload API ()
Show uncompressed znRecord from zk api ()
Add debug endpoint to get minion task status. ()
Validate CSV Header For Configured Delimiter ()
Add auth tokens and user/password support to ingestion job command ()
Add option to store the hash of the upsert primary key ()
Add null support for time column ()
Add mode aggregation function ()
Support disable swagger in Pinot servers ()
Delete metadata properly on table deletion ()
Add basic Obfuscator Support ()
Add AWS sts dependency to enable auth using web identity token. ()()
Mask credentials in debug endpoint /appconfigs ()
Fix /sql query endpoint now compatible with auth ()
Fix case sensitive issue in BasicAuthPrincipal permission check ()
Fix auth token injection in SegmentGenerationAndPushTaskExecutor ()
Add segmentNameGeneratorType config to IndexingConfig ()
Support trigger PeriodicTask manually ()
Add endpoint to check minion task status for a single task. ()
Showing partial status of segment and counting CONSUMING state as good segment status ()
Add "num rows in segments" and "num segments queried per host" to the output of Realtime Provisioning Rule ()
Check schema backward-compatibility when updating schema through addSchema with override ()
Optimize IndexedTable ()
Support indices remove in V3 segment format ()
Optimize TableResizer ()
Introduce resultSize in IndexedTable ()
Offset based realtime consumption status checker ()
Add causes to stack trace return ()
Create controller resource packages config key ()
Enhance TableCache to support schema name different from table name ()
Add validation for realtimeToOffline task ()
Unify CombineOperator multi-threading logic ()
Support no downtime rebalance for table with 1 replica in TableRebalancer ()
Introduce MinionConf, move END_REPLACE_SEGMENTS_TIMEOUT_MS to minion config instead of task config. ()
Adjust tuner api ()
Adding config for metrics library ()
Add geo type conversion scalar functions ()
Add BOOLEAN_ARRAY and TIMESTAMP_ARRAY types ()
Add MV raw forward index and MV BYTES
data type ()
Enhance TableRebalancer to offload the segments from most loaded instances first ()
Improve get tenant API to differentiate offline and realtime tenants ()
Refactor query rewriter to interfaces and implementations to allow customization ()
In ServiceStartable, apply global cluster config in ZK to instance config ()
Make dimension tables creation bypass tenant validation ()
Allow Metadata and Dictionary Based Plans for No Op Filters ()
Reject query with identifiers not in schema ()
Round Robin IP addresses when retry uploading/downloading segments ()
Support multi-value derived column in offline table reload ()
Support segmentNamePostfix in segment name ()
Add select segments API ()
Controller getTableInstance() call now returns the list of live brokers of a table. ()
Allow MV Field Support For Raw Columns in Text Indices ()
Allow override distinctCount to segmentPartitionedDistinctCount ()
Add a quick start with both UPSERT and JSON index ()
Add revertSegmentReplacement API ()
Smooth segment reloading with non blocking semantic ()
Clear the reused record in PartitionUpsertMetadataManager ()
Replace args4j with picocli ()
Handle datetime column consistently ()()
Allow to carry headers with query requests () ()
Allow adding JSON data type for dimension column types ()
Separate SegmentDirectoryLoader and tierBackend concepts ()
Implement size balanced V4 raw chunk format ()
Add presto-pinot-driver lib ()
Fix null pointer exception for non-existed metric columns in schema for JDBC driver ()
Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD ()
Fixed pinot java client to add zkClient close ()
Ignore query json parse errors ()
Fix shutdown hook for PinotServiceManager () ()
Make STRING to BOOLEAN data type change as backward compatible schema change ()
Replace gcp hardcoded values with generic annotations ()
Fix segment conversion executor for in-place conversion ()
Fix reporting consuming rate when the Kafka partition level consumer isn't stopped ()
Fix the issue with concurrent modification for segment lineage ()
Fix TableNotFound error message in PinotHelixResourceManager ()
Fix upload LLC segment endpoint truncated download URL ()
Fix task scheduling on table update ()
Fix metric method for ONLINE_MINION_INSTANCES metric ()
Fix JsonToPinotSchema behavior to be consistent with AvroSchemaToPinotSchema ()
Fix currentOffset volatility in consuming segment()
Fix misleading error msg for missing URI ()
Fix the correctness of getColumnIndices method ()
Fix SegmentZKMetadta time handling ()
Fix retention for cleaning up segment lineage ()
Fix segment generator to not return illegal filenames ()
Fix missing LLC segments in segment store by adding controller periodic task to upload them ()
Fix parsing error messages returned to FileUploadDownloadClient ()
Fix manifest scan which drives /version endpoint ()
Fix missing rate limiter if brokerResourceEV becomes null due to ZK connection ()
Fix race conditions between segment merge/roll-up and purge (or convertToRawIndex) tasks: ()
Fix pql double quote checker exception ()
Fix minion metrics exporter config ()
Fix segment unable to retry issue by catching timeout exception during segment replace ()
Add Exception to Broker Response When Not All Segments Are Available (Partial Response) ()
Fix segment generation commands ()
Return non zero from main with exception ()
Fix parquet plugin shading error ()
Fix the lowest partition id is not 0 for LLC ()
Fix star-tree index map when column name contains '.' ()
Fix cluster manager URLs encoding issue()
Fix fieldConfig nullable validation ()
Fix verifyHostname issue in FileUploadDownloadClient ()
Fix TableCache schema to include the built-in virtual columns ()
Fix DISTINCT with AS function ()
Fix SDF pattern in DataPreprocessingHelper ()
Fix fields missing issue in the source in ParquetNativeRecordReader ()
Search Expression Type
Example
Phrase query
TEXT_MATCH (<column_name>, '\"distributed system\"')
Term Query
TEXT_MATCH (<column_name>, 'Java')
Boolean Query
TEXT_MATCH (<column_name>, 'Java and c++')
Prefix Query
TEXT_MATCH (<column_name>, 'stream*')
Regex Query
TEXT_MATCH (<column_name>, '/Exception.*/')
This release includes many new features on Pinot ingestion and connectors, query capability and a revamped controller UI.
This release includes many new features on Pinot ingestion and connectors (e.g., support for filtering during ingestion which is configurable in table config; support for json during ingestion; proto buf input format support and a new Pinot JDBC client), query capability (e.g., a new GROOVY transform function UDF) and admin functions (a revamped Cluster Manager UI & Query Console UI). It also contains many key bug fixes. See details below.
The release was cut from the following commit: d1b4586 and the following cherry-picks:
Allowing update on an existing instance config: PUT /instances/{instanceName} with Instance object as the pay-load (#PR4952)
Add PinotServiceManager to start Pinot components (#PR5266)
Support for protocol buffers input format. (#PR5293)
Add GenericTransformFunction wrapper for simple ScalarFunctions (PR#5440) — Adding support to invoke any scalar function via GenericTransformFunction
Add Support for SQL CASE Statement (PR#5461)
Support distinctCountRawThetaSketch aggregation that returns serialized sketch. (PR#5465)
Add multi-value support to SegmentDumpTool (PR#5487) — add segment dump tool as part of the pinot-tool.sh script
Add json_format function to convert json object to string during ingestion. (PR#5492) — Can be used to store complex objects as a json string (which can later be queries using jsonExtractScalar)
Support escaping single quote for SQL literal (PR#5501) — This is especially useful for DistinctCountThetaSketch because it stores expression as literal E.g. DistinctCountThetaSketch(..., 'foo=''bar''', ...)
Support expression as the left-hand side for BETWEEN and IN clause (PR#5502)
Add a new field IngestionConfig in TableConfig — FilterConfig: ingestion level filtering of records, based on filter function. (PR#5597) — TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
Allow star-tree creation during segment load (#PR5641) — Introduced a new boolean config enableDynamicStarTreeCreation in IndexingConfig to enable/disable star-tree creation during segment load.
Support for Pinot clients using JDBC connection (#PR5602)
Support customized accuracy for distinctCountHLL, distinctCountHLLMV functions by adding log2m value as the second parameter in the function. (#PR5564) —Adding cluster config: default.hyperloglog.log2m to allow user set default log2m value.
Add segment encryption on Controller based on table config (PR#5617)
Add a constraint to the message queue for all instances in Helix, with a large default value of 100000. (PR#5631)
Support order-by aggregations not present in SELECT (PR#5637) — Example: "select subject from transcript group by subject order by count() desc" This is equivalent to the following query but the return response should not contain count(). "select subject, count() from transcript group by subject order by count() desc"
Add geo support for Pinot queries (PR#5654) — Added geo-spatial data model and geospatial functions
Add Controller API to explore Zookeeper (PR#5687)
Support for ingestion job spec in JSON format (#PR5729)
Improvements to RealtimeProvisioningHelper command (#PR5737) — Improved docs related to ingestion and plugins
Added GROOVY transform function UDF (#PR5748) — Ability to run a groovy script in the query as a UDF. e.g. string concatenation: SELECT GROOVY('{"returnType": "INT", "isSingleValue": true}', 'arg0 + " " + arg1', columnA, columnB) FROM myTable
TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
Config key enable.case.insensitive.pql in Helix cluster config is deprecated, and replaced with enable.case.insensitive. (#PR5546)
Change default segment load mode to MMAP. (PR#5539) —The load mode for segments currently defaults to heap
.
Fix bug in distinctCountRawHLL on SQL path (#5494)
Fix backward incompatibility for existing stream implementations (#5549)
Fix backward incompatibility in StreamFactoryConsumerProvider (#5557)
Fix logic in isLiteralOnlyExpression. (#5611)
Fix double memory allocation during operator setup (#5619)
Allow segment download url in Zookeeper to be deep store uri instead of hardcoded controller uri (#5639)
Fix a backward compatible issue of converting BrokerRequest to QueryContext when querying from Presto segment splits (#5676)
Fix the issue that PinotSegmentToAvroConverter does not handle BYTES data type. (#5789)
PQL queries with HAVING clause will no longer be accepted for the following reasons: (#PR5570) — HAVING clause does not apply to PQL GROUP-BY semantic where each aggregation column is ordered individually — The current behavior can produce inaccurate results without any notice — HAVING support will be added for SQL queries in the next release
Because of the standardization of the DistinctCountThetaSketch predicate strings, please upgrade Broker before Server. The new Broker can handle both standard and non-standard predicate strings for backward-compatibility. (#PR5613)
0.4.0 release introduced the theta-sketch based distinct count function, an S3 filesystem plugin, a unified star-tree index implementation, migration from TimeFieldSpec to DateTimeFieldSpec, etc.
0.4.0 release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.
Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
Used time column from table config instead of schema (#5320)
Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392
Used DATE_TIME as the primary time column for Pinot tables (#5399)
Supported range queries using indexes (#5240)
Supported complex aggregation functions
Supported Aggregation functions with multiple arguments (#5261)
Added api in AggregationFunction to get compiled input expressions (#5339)
Added a simple PinotFS benchmark driver (#5160)
Supported default star-tree (#5147)
Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)
One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregatinoFunction (#5382)
Added access control for Pinot server segment download api (#5260)
Added Pinot S3 Filesystem Plugin (#5249)
Text search improvement
Pruned stop words for text index (#5297)
Used 8byte offsets in chunk based raw index creator (#5285)
Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)
Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)
Made text index query cache a configurable option (#5176)
Added Lucene DocId to PinotDocId cache to improve performance (#5177)
Removed the construction of second bitmap in text index reader to improve performance (#5199)
Tooling/usability improvement
Added template support for Pinot Ingestion Job Spec (#5341)
Allowed user to specify zk data dir and don't do clean up during zk shutdown (#5295)
Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)
Update JVM settings for scripts (#5127)
Added Stream github events demo (#5189)
Moved docs link from gitbook to docs.pinot.apache.org (#5193)
Re-implemented ORCRecordReader (#5267)
Evaluated schema transform expressions during ingestion (#5238)
Handled count distinct query in selection list (#5223)
Enabled async processing in pinot broker query api (#5229)
Supported bootstrap mode for table rebalance (#5224)
Supported order-by on BYTES column (#5213)
Added Nightly publish to binary (#5190)
Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)
Supported inbuilt transform functions (#5312)
Added date time transform functions (#5326)
Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)
APIs Additions/Changes
Added a new server api for download of segments
/GET /segments/{tableNameWithType}/{segmentName}
Upgraded helix to 0.9.7 (#5411)
Added support to execute functions during query compilation (#5406)
Other notable refactoring
Moved table config into pinot-spi (#5194)
Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)
Added jsonExtractScalar function to extract field from json object (#4597)
Added template support for Pinot Ingestion Job Spec #5372
Cleaned up AggregationFunctionContext (#5364)
Optimized real-time range predicate when cardinality is high (#5331)
Made PinotOutputFormat use table config and schema to create segments (#5350)
Tracked unavailable segments in InstanceSelector (#5337)
Added a new best effort segment uploader with bounded upload time (#5314)
In SegmentPurger, used table config to generate the segment (#5325)
Decoupled schema from RecordReader and StreamMessageDecoder (#5309)
Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)
Improved GroupBy query performance (#5291)
Optimized ExpressionFilterOperator (#5132)
Do not release the PinotDataBuffer when closing the index (#5400)
Handled a no-arg function in query parsing and expression tree (#5375)
Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
Fixed missing error message from pinot-admin command (#5305)
Fixed HDFS copy logic (#5218)
Fixed spark ingestion issue (#5216)
Fixed the capacity of the DistinctTable (#5204)
Fixed various links in the Pinot website
Upsert: support overriding data in the real-time table (#4261).
Add pinot upsert features to pinot common (#5175)
Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc
TableConfig no longer support de-serialization from json string of nested json string (i.e. no \"
inside the json) (#5194)
The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):
0.3.0 release of Apache Pinot introduces the concept of plugins that makes it easy to extend and integrate with other systems.
The reason behind the architectural change from the previous release (0.2.0) and this release (0.3.0), is the possibility of extending Apache Pinot. The 0.2.0 release was not flexible enough to support new storage types nor new stream types. Basically, inserting a new functionality required to change too much code. Thus, the Pinot team went through an extensive refactoring and improvement of the source code.
For instance, the picture below shows the module dependencies of the 0.2.X or previous releases. If we wanted to support a new storage type, we would have had to change several modules. Pretty bad, huh?
In order to conquer this challenge, below major changes are made:
Refactored common interfaces to pinot-spi
module
Concluded four types of modules:
Pinot input format: How to read records from various data/file formats: e.g. Avro
/CSV
/JSON
/ORC
/Parquet
/Thrift
Pinot filesystem: How to operate files on various filesystems: e.g. Azure Data Lake
/Google Cloud Storage
/S3
/HDFS
Pinot stream ingestion: How to ingest data stream from various upstream systems, e.g. Kafka
/Kinesis
/Eventhub
Pinot batch ingestion: How to run Pinot batch ingestion jobs in various frameworks, like Standalone
, Hadoop
, Spark
.
Built shaded jars for each individual plugin
Added support to dynamically load pinot plugins at server startup time
Now the architecture supports a plug-and-play fashion, where new tools can be supported with little and simple extensions, without affecting big chunks of code. Integrations with new streaming services and data formats can be developed in a much more simple and convenient way.
SQL Support
Added support for DISTINCT
(#4535)
Added support default value for BYTES
column (#4583)
JDK 11
Support
Added support to tune size vs accuracy for approximation aggregation functions: DistinctCountHLL
, PercentileEst
, PercentileTDigest
(#4666)
Added Data Anonymizer Tool (#4747)
Deprecated pinot-hadoop
and pinot-spark
modules, replace with pinot-batch-ingestion-hadoop
and pinot-batch-ingestion-spark
Support STRING
and BYTES
for no dictionary columns in realtime consuming segments (#4791)
Make pinot-distribution
to build a pinot-all jar and assemble it (#4977)
Added support for PQL case insensitive (#4983)
Added experimental support for Text Search (#4993)
Upgraded Helix to version 0.9.4, task management now works as expected (#5020)
Added date_trunc
transformation function. (#4740)
Support schema evolution for consuming segment. (#4954)
APIs Additions/Changes
Pinot Controller Rest APIs
Get Table leader controller resource (#4545)
Support HTTP POST
/PUT
to upload JSON encoded schema (#4639)
Table rebalance API now requires both table name and type as parameters. (#4824)
Refactored Segments APIs (#4806)
Added segment batch deletion REST API (#4828)
Update schema API to reload table on schema change when applicable (#4838)
Enhance the task related REST APIs (#5054)
Added PinotClusterConfig REST APIs (#5073)
GET /cluster/configs
POST /cluster/configs
DELETE /cluster/configs/{configName}
Configurations Additions/Changes
Config: controller.host
is now optional in Pinot Controller
Added instance config: queriesDisabled
to disable query sending to a running server (#4767)
Added broker config: pinot.broker.enable.query.limit.override
configurable max query response size (#5040)
Removed deprecated server configs (#4903)
pinot.server.starter.enableSegmentsLoadingCheck
pinot.server.starter.timeoutInSeconds
pinot.server.instance.enable.shutdown.delay
pinot.server.instance.starter.maxShutdownWaitTime
pinot.server.instance.starter.checkIntervalTime
Decouple server instance id with hostname/port config. (#4995)
Add FieldConfig to encapsulate encoding, indexing info for a field.(#5006)
Fixed the bug of releasing the segment when there are still threads working on it. (#4764)
Fixed the bug of uneven task distribution for threads (#4793)
Fixed encryption for .tar.gz
segment file upload (#4855)
Fixed controller rest API to download segment from non local FS. (#4808)
Fixed the bug of not releasing segment lock if segment recovery throws exception (#4882)
Fixed the issue of server not registering state model factory before connecting the Helix manager (#4929)
Fixed the exception in server instance when Helix starts a new ZK session (#4976)
Fixed ThreadLocal DocIdSet issue in ExpressionFilterOperator (#5114)
Fixed the bug in default value provider classes (#5137)
Fixed the bug when no segment exists in RealtimeSegmentSelector (#5138)
We are in the process of supporting text search query functionalities.
It’s a disruptive upgrade from version 0.1.0 to this because of the protocol changes between Pinot Broker and Pinot Server. Please ensure that you upgrade to release 0.2.0 first, then upgrade to this version.
If you build your own startable or war without using scripts generated in Pinot-distribution module. For Java 8, an environment variable “plugins.dir” is required for Pinot to find out where to load all the Pinot plugin jars. For Java 11, plugins directory is required to be explicitly set into classpath. Please see pinot-admin.sh
as an example.
As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.
Kafka 0.9 is no longer included in the release distribution.
Pull request #4806 introduces a backward incompatible API change for segments management.
Removed segment toggle APIs
Removed list all segments in cluster APIs
Deprecated below APIs:
GET /tables/{tableName}/segments
GET /tables/{tableName}/segments/metadata
GET /tables/{tableName}/segments/crc
GET /tables/{tableName}/segments/{segmentName}
GET /tables/{tableName}/segments/{segmentName}/metadata
GET /tables/{tableName}/segments/{segmentName}/reload
POST /tables/{tableName}/segments/{segmentName}/reload
GET /tables/{tableName}/segments/reload
POST /tables/{tableName}/segments/reload
Pull request #5054 deprecated below task related APIs:
GET:
/tasks/taskqueues
: List all task queues
/tasks/taskqueuestate/{taskType}
-> /tasks/{taskType}/state
/tasks/tasks/{taskType}
-> /tasks/{taskType}/tasks
/tasks/taskstates/{taskType}
-> /tasks/{taskType}/taskstates
/tasks/taskstate/{taskName}
-> /tasks/task/{taskName}/taskstate
/tasks/taskconfig/{taskName}
-> /tasks/task/{taskName}/taskconfig
PUT:
/tasks/scheduletasks
-> POST
/tasks/schedule
/tasks/cleanuptasks/{taskType}
-> /tasks/{taskType}/cleanup
/tasks/taskqueue/{taskType}
: Toggle a task queue
DELETE:
/tasks/taskqueue/{taskType}
-> /tasks/{taskType}
Deprecated modules pinot-hadoop
and pinot-spark
and replaced with pinot-batch-ingestion-hadoop
and pinot-batch-ingestion-spark
.
Introduced new Pinot batch ingestion jobs and yaml based job specs to define segment generation jobs and segment push jobs.
You may see exceptions like below in pinot-brokers during cluster upgrade, but it's safe to ignore them.
Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.
The 0.2.0 release is the first release after the initial one and includes several improvements, reported following.
Added support for Kafka 2.0
Table rebalancer now supports a minimum number of serving replicas during rebalance
Added support for UDF in filter predicates and selection
Admin tool for listing segments with invalid intervals for offline tables
Added simple avro msg decoder
Added support for passing headers in Pinot client
Table rebalancer now supports a minimum number of serving replicas during rebalance
Configurations additions/changes
The following config variables are deprecated and will be removed in the next release:
pinot.broker.requestHandlerType will be removed, in favor of using the "singleConnection" broker request handler. If you have set this configuration, please remove it and use the default type ("singleConnection") for broker request handler.
We are in the process of separating Helix and Pinot controllers, so that administrators can have the option of running independent Helix controllers and Pinot controllers.
We are in the process of moving towards supporting SQL query format and results.
We are in the process of separating instance and segment assignment using instance pools to optimize the number of Helix state transitions in Pinot clusters with thousands of tables.
Task management does not work correctly in this release, due to bugs in Helix. We will upgrade to Helix 0.9.2 (or later) version to get this fixed.
You must upgrade to this release before moving onto newer versions of Pinot release. The protocol between Pinot-broker and Pinot-server has been changed and this release has the code to retain compatibility moving forward. Skipping this release may (depending on your environment) cause query errors if brokers are upgraded and servers are in the process of being upgraded.
As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.
If you used Pinot-admin command to start Pinot components, you don't need any change. If you used your own commands to start pinot components, you will need to pass the new log4j2 config as a jvm parameter (i.e. substitute -Dlog4j.configuration or -Dlog4j.configurationFile argument with -Dlog4j2.configurationFile=log4j2.xml).
Steps for setting up a Pinot cluster and a realtime table which consumes from the GitHub events stream.
In this recipe, we will
Set up a Pinot cluster, in the steps
a. Start zookeeper
b. Start controller
c. Start broker
d. Start server
Set up a Kafka cluster
Create a Kafka topic - pullRequestMergedEvents
Create a realtime table - pullRequestMergedEvents and a schema
Query the realtime data
Get the latest Docker image.
Zookeeper
Controller
Broker
Server
Kafka
Create a Kafka topic called pullRequestMergedEvents
for the demo.
The schema is present at examples/stream/githubEvents/pullRequestMergedEvents_schema.json
and is also pasted below
The table config is present at examples/stream/githubEvents/docker/pullRequestMergedEvents_realtime_table_config.json
and is also pasted below.
Note
If you're setting this up on a pre-configured cluster, set the properties stream.kafka.zk.broker.url
and stream.kafka.broker.list
correctly, depending on the configuration of your Kafka cluster.
Add the table and schema using the following command
Start streaming GitHub events into the Kafka topic
Prerequisites
For a single command to setup all the above steps, use the following command. Make sure to stop any previous running Pinot services.
Zookeeper
Controller
Broker
Server
Kafka
Create a Kafka topic called pullRequestMergedEvents
for the demo.
Schema can be found at /examples/stream/githubevents/
in the release, and is also pasted below:
Table config can be found at /examples/stream/githubevents/
in the release, and is also pasted below.
Note
If you're setting this up on a pre-configured cluster, set the properties stream.kafka.zk.broker.url
and stream.kafka.broker.list
correctly, depending on the configuration of your Kafka cluster.
Add the table and schema using the command
Start streaming GitHub events into the Kafka topic
Prerequisites
For a single command to setup all the above steps
You can use SuperSet to visualize this data. Some of the interesting insights we captures were
Repositories by number of commits in the Apache organization
Added support to use hex string as the representation of byte array for queries (see PR )
Added support for parquet reader (see PR )
Introduced interface stability and audience annotations (see PR )
Refactor HelixBrokerStarter to separate constructor and start() - backwards incompatible (see PR )
Migrated to log4j2 (see PR )
Support transform functions with AVG aggregation function (see PR )
Allow customized metrics prefix (see PR )
Controller.enable.batch.message.mode to false by default (see PR )
RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (see PR )
Config to control kafka fetcher size and increase default (see PR )
Added a percent threshold to consider startup of services (see PR )
Make SingleConnectionBrokerRequestHandler as default (see PR )
Always enable default column feature, remove the configuration (see PR )
Remove redundant default broker configurations (see PR )
Removed some config keys in server(see PR )
Add config to disable HLC realtime segment (see PR )
Make RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (see PR )
Pull Request introduces a backwards incompatible change to Pinot broker. If you use the Java constructor on HelixBrokerStarter class, then you will face a compilation error with this version. You will need to construct the object and call start() method in order to start the broker.
Pull Request introduces a backwards incompatible change for log4j configuration. If you used a custom log4j configuration (log4j.xml), you need to write a new log4j2 configuration (log4j2.xml). In addition, you may need to change the arguments on the command line to start Pinot components.
Start a task which reads from and publishes events about merged pull requests to the topic.
Follow the instructions in to setup the Pinot cluster with the components:
Generate a on GitHub.
Follow instructions in to get the latest Pinot code
Follow the instructions in to setup the Pinot cluster with the components:
Download release.
Generate a on GitHub.
If you already have a Kubernetes cluster with Pinot and Kafka (see ), first create the topic and then setup the table and streaming using
Head over to the to checkout the data!
To integrate with SuperSet you can check out the page.
Pinot currently supports two ways for you to implement your own functions:
Groovy Scripts
Scalar Functions
Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing Groovy script within the query is as follows:
GROOVY('result value metadata json', ''groovy script', arg0, arg1, arg2...)
This function will execute the groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:
Result value metadata json
- json string representing result value metadata. Must contain non-null keys resultType
and isSingleValue
.
Groovy script to execute
- groovy script string, which uses arg0
, arg1
, arg2
etc to refer to the arguments provided within the script
arguments
- pinot columns/other transform functions that are arguments to the groovy script
Examples
Add colA and colB and return a single-value INT
groovy( '{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)
Find the max element in mvColumn array and return a single-value INT
groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)
Find all elements of the array mvColumn and return as a multi-value LONG column
groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)
Multiply length of array mvColumn with colB and return a single-value DOUBLE
groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)
Find all indexes in mvColumnA which have value foo
, add values at those indexes in mvColumnB
groovy( '{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x' , mvColumnA, mvColumnB)
Switch case which returns a FLOAT value depending on length of mvCol array
groovy('{\"returnType\":\"FLOAT\", \"isSingleValue\":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)
Any Groovy script which takes no arguments
groovy('new Date().format( "yyyyMMdd" )', '{"returnType":"STRING","isSingleValue":true}')
Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions
Pinot automatically identifies and registers all the functions that have the @ScalarFunction
annotation.
Only Java methods are supported.
You can add new scalar functions as follows:
Create a new java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX
In your java project include the dependency
Annotate your methods with @ScalarFunction
annotation. Make sure the method is static
and returns only a single value output. The input and output can have one of the following types -
Integer
Long
Double
String
Place the compiled JAR in the /plugins
directory in pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the function in a query as follows:
Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.
Learn how to query Pinot using SQL
Pinot uses Calcite SQL Parser to parse queries and uses MYSQL_ANSI dialect. You can see the grammar here.
Pinot does not support Joins or nested Subqueries and we recommend using Presto for queries that span multiple tables. Read Engineering Full SQL support for Pinot at Uber for more info.
No DDL support. Tables can be created via the REST API.
In Pinot SQL:
Double quotes(") are used to force string identifiers, e.g. column name.
Single quotes(') are used to enclose string literals.
Mis-using those might cause unexpected query results:
E.g.
WHERE a='b'
means the predicate on the column a
equals to a string literal value 'b'
WHERE a="b"
means the predicate on the column a
equals to the value of the column b
Use single quotes for literals and double quotes (optional) for identifiers (column names)
If you name the columns as timestamp
, date
, or other reserved keywords, or the column name includes special characters, you need to use double quotes when you refer to them in the query.
For performant filtering of ids in a list, see Filtering with IdSet.
Note: results might not be consistent if column ordered by has same value in multiple rows.
To count rows where the column airlineName
starts with U
Pinot supports the CASE-WHEN-ELSE statement.
Example 1:
Example 2:
Functions have to be implemented within Pinot. Injecting functions is not yet supported. The example below demonstrate the use of UDFs. More examples in Transform Function in Aggregation Grouping
Pinot supports queries on BYTES column using HEX string. The query response also uses hex string to represent bytes value.
E.g. the query below fetches all the rows for a given UID.
Learn how to write fast queries for looking up ids in a list of values.
A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but this approach doesn't perform well with large lists of ids. In these cases, you can use an IdSet.
ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )
This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:
INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
Other types - Bloom Filter
The following parameters are used to configure the Bloom Filter:
expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0
Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
IN_ID_SET(columnName, base64EncodedIdSet)
This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.
IN_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.
IN_PARTITIONED_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.
This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
You can create an IdSet of the values in the yearID column by running the following:
When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:
We can also configure the fpp parameter:
We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:
To return rows for yearIDs not in the IdSet, run the following:
To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:
This document contains the list of all the transformation functions supported by Pinot SQL.
Multiple string functions are supported out of the box from release-0.5.0 .
Date time functions allow you to perform transformations on columns that contain timestamps or dates.
Usage
'jsonPath'
and
'results_type'
are literals. Pinot uses single quotes to distinguish them from identifiers.
e.g.
JSONEXTRACTSCALAR(profile_json_str, '$.name', 'STRING')
is valid.
JSONEXTRACTSCALAR(profile_json_str, "$.name", "STRING")
is invalid.
Transform functions can only be used in Pinot SQL. Scalar functions can be used for column transformation in table ingestion configs.
Examples
The examples below are based on these 3 sample profile JSON documents:
Query 1: Extract string values from the field 'name'
Results are
Query 2: Extract integer values from the field 'age'
Results are
Query 3: Extract Bob's age from the JSON profile.
Results are
Query 4: Extract all field keys of JSON profile.
Results are
Another example of extracting JSON fields from below JSON record:
Extract JSON fields:
All of the functions mentioned till now only support single value columns. You can use the following functions to do operations on multi-value columns.
Pinot supports Geospatial queries on columns containing text-based geographies. For more details on the queries and how to enable them, see Geospatial.
Pinot supports pattern matching on text-based columns. Only the columns mentioned as text columns in table config can be queried using this method. For more details on how to enable pattern matching, see Text search support.
dimTableName
Name of the dim table to perform the lookup on.
dimColToLookUp
The column name of the dim table to be retrieved to decorate our result.
dimJoinKey
The column name on which we want to perform the lookup i.e. the join column name for dim table.
factJoinKeyVal
The value of the dim table join column for which we will retrieve the dimColToLookUp for the scope and invocation.
Return type of the UDF will be that of the dimColToLookUp column type. There can also be multiple primary keys and corresponding values.
Note: If the dimension table uses a composite primary key i.e multiple primary keys, then ensure that the order of keys appearing in the lookup() UDF is same as the order defined for "primaryKeyColumns" in the dimension table schema.
Cardinality estimation is a classic problem. Pinot solves it with multiple ways each of which has a trade-off between accuracy and latency.
Functions:
DistinctCount(x) -> LONG
Returns accurate count for all unique values in a column.
The underlying implementation is using a IntOpenHashSet in library: it.unimi.dsi:fastutil:8.2.3
to hold all the unique values.
It usually takes a lot of resources and time to compute accurate results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.
Functions:
DistinctCountHLL(x)_ -> LONG_
For column type INT/LONG/FLOAT/DOUBLE/STRING , Pinot treats each value as an individual entry to add into HyperLogLog Object, then compute the approximation by calling method cardinality().
For column type BYTES, Pinot treats each value as a serialized HyperLogLog Object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog)
.
All deserialized HyperLogLog object will be merged into one then calling method **cardinality() **to get the approximated unique count.
Functions:
DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate**) **-> LONG
thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries
.
predicates (optional)_: _ These are individual predicates of form lhs <op> rhs
which are applied on rows selected by the where
clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn
that satisfies these predicates are unionized individually. For example, all filtered rows that match country=USA
are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) of individual predicates is supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT
, where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.
In the example query below, the where
clause is responsible for identifying the matching rows. Note, the where clause can be completely independent of the postAggregationExpression
. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA'
, device='mobile'
in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression
and returns the final cardinality of the resulting sketch.
DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate**)** -> HexEncoded Serialized Sketch Bytes
This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into sketch by using the library org.apache.commons.codec.binary
as Hex.decodeHex(stringValue.toCharArray())
.
Pinot can be queried via a broker endpoint as follows. This example assumes broker is running on localhost:8099
The Pinot REST API can be accessed by invoking POST
operation with a JSON body containing the parameter sql
to the /query/sql
endpoint on a broker.
Note
This endpoint is deprecated, and will soon be removed. The standard-SQL endpoint is the recommended endpoint.
The PQL endpoint can be accessed by invoking POST
operation with a JSON body containing the parameter pql
to the /query
endpoint on a broker.
Query Console can be used for running ad-hoc queries (checkbox available to query the PQL endpoint). The Query Console can be accessed by entering the <controller host>:<controller port>
in your browser
idset(yearID) |
---|
idset(playerName) |
---|
idset(playerName) |
---|
idset(playerName) |
---|
Function | Description | Example |
---|---|---|
Function | Description | Example |
---|---|---|
Function | Description | Example |
---|---|---|
Expression | Value |
---|---|
Function | Description | Example |
---|---|---|
Function | Description | Example |
---|---|---|
Lookup UDF is used to get dimension data via primary key from a dimension table allowing a decoration join functionality. Lookup UDF can only be used with in Pinot. The UDF signature is as below:
is an approximation algorithm for unique counting. It uses fixed number of bits to estimate the cardinality of given data set.
Pinot leverages in library com.clearspring.analytics:stream:2.7.0
as the data structure to hold intermediate results.
The framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating
to perform distinct counting as well as evaluating set operations.
You can also query using the pinot-admin
scripts. Make sure you follow instructions in to get Pinot locally, and then
Function
Description
Example
COUNT
Get the count of rows in a group
COUNT(*)
MIN
Get the minimum value in a group
MIN(playerScore)
MAX
Get the maximum value in a group
MAX(playerScore)
SUM
Get the sum of values in a group
SUM(playerScore)
AVG
Get the average of the values in a group
AVG(playerScore)
MODE
Get the most frequent value in a group. When multiple modes are present it gives the minimum of all the modes. This behavior can be overridden to get the maximum or the average mode.
MODE(playerScore)
MODE(playerScore, 'MIN')
MODE(playerScore, 'MAX')
MODE(playerScore, 'AVG')
MINMAXRANGE
Returns the max - min
value in a group
MINMAXRANGE(playerScore)
PERCENTILE(column, N)
Returns the Nth percentile of the group where N is a decimal number between 0 and 100 inclusive
PERCENTILE(playerScore, 50), PERCENTILE(playerScore, 99.9)
PERCENTILEEST(column, N)
Returns the Nth percentile of the group using Quantile Digest algorithm
PERCENTILEEST(playerScore, 50), PERCENTILEEST(playerScore, 99.9)
PERCENTILETDigest(column, N)
Returns the Nth percentile of the group using T-digest algorithm
PERCENTILETDIGEST(playerScore, 50), PERCENTILETDIGEST(playerScore, 99.9)
DISTINCT
Returns the distinct row values in a group
DISTINCT(playerName)
DISTINCTCOUNT
Returns the count of distinct row values in a group
DISTINCTCOUNT(playerName)
DISTINCTCOUNTBITMAP
Returns the count of distinct row values in a group. This function is accurate for INT column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions.
DISTINCTCOUNTBITMAP(playerName)
DISTINCTCOUNTHLL
Returns an approximate distinct count using HyperLogLog. It also takes an optional second argument to configure the log2m for the HyperLogLog.
DISTINCTCOUNTHLL(playerName, 12)
DISTINCTCOUNTRAWHLL
Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLL(playerName)
FASTHLL (Deprecated)
WARN: will be deprecated soon. FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format
FASTHLL(playerName)
DISTINCTCOUNTTHETASKETCH
DISTINCTCOUNTRAWTHETASKETCH
SEGMENTPARTITIONEDDISTINCTCOUNT
Returns the count of distinct values of a column when the column is pre-partitioned for each segment, where there is no common value within different segments. This function calculates the exact count of distinct values within the segment, then simply sums up the results from different segments to get the final result.
SEGMENTPARTITIONEDDISTINCTCOUNT(playerName)
Function
Description
Example
COUNTMV
Get the count of rows in a group
COUNTMV(playerName)
MINMV
Get the minimum value in a group
MINMV(playerScores)
MAXMV
Get the maximum value in a group
MAXMV(playerScores)
SUMMV
Get the sum of values in a group
SUMMV(playerScores)
AVGMV
Get the avg of values in a group
AVGMV(playerScores)
MINMAXRANGEMV
Returns the max - min
value in a group
MINMAXRANGEMV(playerScores)
PERCENTILEMV(column, N)
Returns the Nth percentile of the group where N is a decimal number between 0 and 100 inclusive
PERCENTILEMV(playerScores, 50),
PERCENTILEMV(playerScores, 99.9)
PERCENTILEESTMV(column, N)
Returns the Nth percentile of the group using Quantile Digest algorithm
PERCENTILEESTMV(playerScores, 50),
PERCENTILEESTMV(playerScores, 99.9)
PERCENTILETDIGESTMV(column, N)
Returns the Nth percentile of the group using T-digest algorithm
PERCENTILETDIGESTMV(playerScores, 50),
PERCENTILETDIGESTMV(playerScores, 99.9),
DISTINCTCOUNTMV
Returns the count of distinct row values in a group
DISTINCTCOUNTMV(playerNames)
DISTINCTCOUNTBITMAPMV
Returns the count of distinct row values in a group. This function is accurate for INT or dictionary encoded column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collision.
DISTINCTCOUNTBITMAPMV(playerNames)
DISTINCTCOUNTHLLMV
Returns an approximate distinct count using HyperLogLog in a group
DISTINCTCOUNTHLLMV(playerNames)
DISTINCTCOUNTRAWHLLMV
Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLLMV(playerNames)
FASTHLLMV (Deprecated)
stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format
FASTHLLMV(playerNames)
ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc=
AwIBBQAAAAL/////////////////////
AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8=
AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8=
ADD(col1, col2, col3...)
Sum of at least two values
ADD(score_maths, score_science, score_history)
SUB(col1, col2)
Difference between two values
SUB(total_score, score_science)
MULT(col1, col2, col3...)
Product of at least two values
MUL(score_maths, score_science, score_history)
DIV(col1, col2)
Quotient of two values
SUB(total_score, total_subjects)
MOD(col1, col2)
Modulo of two values
MOD(total_score, total_subjects)
ABS(col1)
Absolute of a value
ABS(score)
CEIL(col1)
Rounded up to the nearest integer.
CEIL(percentage)
FLOOR(col1)
Rounded down to the nearest integer.
FLOOR(percentage)
EXP(col1)
Euler’s number(e) raised to the power of col.
EXP(age)
LN(col1)
Natural log of value i.e. ln(col1)
LN(age)
SQRT(col1)
Square root of a value
SQRT(height)
UPPER(col)
convert string to upper case
UPPER(playerName)
LOWER(col)
convert string to lower case
LOWER(playerName)
REVERSE(col)
reverse the string
REVERSE(playerName)
SUBSTR(col, startIndex, endIndex)
get substring of the input string from start to endIndex. Index begins at 0. Set endIndex to -1 to calculate till end of the string
SUBSTR(playerName, 1, -1)
<code></code>
SUBSTR(playerName, 1, 4)
CONCAT(col1, col2, seperator)
Concatenate two input strings using the seperator
CONCAT(firstName, lastName, '-')
TRIM(col)
trim spaces from both side of the string
TRIM(playerName)
LTRIM(col)
trim spaces from left side of the string
LTRIM(playerName)
RTRIM(col)
trim spaces from right side of the string
RTRIM(playerName)
LENGTH(col)
calculate length of the string
LENGTH(playerName)
STRPOS(col, find, N)
find Nth instance of find
string in input.
Returns 0 if input string is empty. Returns -1 if the Nth instance is not found or input string is null.
STRPOS(playerName, 'david', 1)
STARTSWITH(col, prefix)
returns true
if columns starts with prefix string.
STARTSWITH(playerName, 'david')
REPLACE(col, find, substitute)
replace all instances of find
with replace
in input
REPLACE(playerName, 'david', 'henry')
RPAD(col, size, pad)
string padded from the right side with pad
to reach final size
RPAD(playerName, 20, 'foo')
LPAD(col, size, pad)
string padded from the left side with pad
to reach final size
LPAD(playerName, 20, 'foo')
CODEPOINT(col)
the Unicode codepoint of the first character of the string
CODEPOINT(playerName)
CHR(codepoint)
the character corresponding to the Unicode codepoint
CHR(68)
TIMECONVERT
(col, fromUnit, toUnit)
Converts the value into another time unit. the column should be an epoch timestamp. Supported units are
DAYS HOURS MINUTES SECONDS MILLISECONDS MICROSECONDS NANOSECONDS
TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')
This expression converts the value of column time
(taken to be in milliseconds) to the nearest seconds (i.e. the nearest seconds that is lower than the value of date
column)
DATETIMECONVERT
(columnName, inputFormat, outputFormat, outputGranularity)
Takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. Note that, for weeks/months/quarters/years, please use function: DateTrunc.
The format is expressed as <time size>:<time unit>:<time format>:<pattern>
where,
time size
- size of the time unit eg: 1, 10
time unit
- DAYS HOURS MINUTES SECONDS MILLISECONDS MICROSECONDS NANOSECONDS
time format
- EPOCH
or SIMPLE_DATE_FORMAT
pattern
- this is defined in case of SIMPLE_DATE_FORMAT
eg: yyyy-MM-dd
. A specific timezone can be passed using tz(timezone). Timezone can be long or short string format timezone. e.g. Asia/Kolkata
or PDT
granularity
- specified in the format<time size>:<time unit>
Date
from hoursSinceEpoch
to daysSinceEpoch
and bucket it to 1 day granularity
DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:EPOCH', '1:DAYS')
Date
to 15 minutes granularity
DATETIMECONVERT(Date, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '15:MINUTES')
Date
from hoursSinceEpoch
to format yyyyMdd
and bucket it to 1 days granularity
DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd', '1:DAYS')
Date
from milliseconds to format yyyyMdd
in timezone PST
DATETIMECONVERT(Date, '1:MILLISECONDS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd tz(America/Los_Angeles)', '1:DAYS')
DATETRUNC
(Presto) SQL compatible date truncation, equivalent to the Presto function date_trunc.
Takes at least 3 and upto 5 arguments, converts the value into a specified output granularity seconds since UTC epoch that is bucketed on a unit in a specified timezone.
DATETRUNC('week', time_in_seconds, 'SECONDS')
This expression converts the column time_in_seconds
, which is a long containing seconds since UTC epoch truncated at WEEK
(where a Week starts at Monday UTC midnight). The output is a long seconds since UTC epoch.
DATETRUNC('quarter', DIV(time_milliseconds/1000), 'SECONDS', 'America/Los_Angeles', 'HOURS')
This expression converts the expression time_in_milliseconds/1000
into hours that are truncated to QUARTER
at the Los Angeles time zone (where a Quarter begins on 1/1, 4/1, 7/1, 10/1 in Los Angeles timezone). The output is expressed as hours since UTC epoch (note that the output is not Los Angeles timezone)
ToEpoch<TIME_UNIT>(timeInMillis)
Convert epoch milliseconds to epoch <Time Unit>. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
ToEpochSeconds(tsInMillis):
Converts column tsInMillis
value from epoch milliseconds to epoch seconds.
ToEpochDays(tsInMillis):
Converts column tsInMillis
value from epoch milliseconds to epoch days.
ToEpoch<TIME_UNIT>Rounded(timeInMillis, bucketSize)
Convert epoch milliseconds to epoch <Time Unit>, round to nearest rounding bucket(Bucket size is defined in <Time Unit>). Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
ToEpochSecondsRound(tsInMillis, 10):
Converts column tsInMillis
value from epoch milliseconds to epoch seconds and round to the 10-minute bucket value. E.g.ToEpochSecondsRound(
1613472303000, 10) = 1613472300
ToEpochMinutesRound(tsInMillis, 1440):
Converts column tsInMillis
value from epoch milliseconds to epoch Minutes, but round to 1-day bucket value. E.g.ToEpochMinutesRound(
1613472303000, 1440) = 26890560
ToEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize)
Convert epoch milliseconds to epoch <Time Unit>, and divided by bucket size(Bucket size is defined in <Time Unit>). Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
ToEpochSecondsBucket(tsInMillis, 10):
Converts column tsInMillis
value from epoch milliseconds to epoch seconds then divide by 10 to get the 10 seconds since epoch value. E.g.
ToEpochSecondsBucket(
1613472303000, 10) = 161347230
ToEpochHoursBucket(tsInMillis, 24):
Converts column tsInMillis
value from epoch milliseconds to epoch Hours, then divide by 24 to get 24 hours since epoch value.
FromEpoch<TIME_UNIT>(timeIn<Time_UNIT>)
Convert epoch <Time Unit> to epoch milliseconds. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
FromEpochSeconds(tsInSeconds):
Converts column tsInSeconds
value from epoch seconds to epoch milliseconds. E.g.
FromEpochSeconds(
1613472303) = 1613472303000
FromEpoch<TIME_UNIT>Bucket(timeIn<Time_UNIT>, bucketSizeIn<Time_UNIT>)
Convert epoch <Bucket Size><Time Unit> to epoch milliseconds. E.g. 10 seconds since epoch or 5 minutes since Epoch. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
FromEpochSecondsBucket(tsInSeconds, 10):
Converts column tsInSeconds
value from epoch 10-seconds to epoch milliseconds. E.g.
FromEpochSeconds(161347231)= 1613472310000
ToDateTime(timeInMillis, pattern[, timezoneId])
Convert epoch millis value to DateTime string represented by pattern. Time zone will be set to UTC if timezoneId
is not specified.
ToDateTime(tsInMillis, 'yyyy-MM-dd')
converts tsInMillis value to date time pattern yyyy-MM-dd
ToDateTime(tsInMillis, 'yyyy-MM-dd ZZZ', 'America/Los_Angeles')
converts tsInMillis value to date time pattern yyyy-MM-dd ZZZ
in America/Los_Angeles time zone
FromDateTime(dateTimeString, pattern)
Convert DateTime string represented by pattern to epoch millis.
FromDateTime(dateTime, 'yyyy-MM-dd')
converts dateTime
string value to millis epoch value
round(timeValue, bucketSize)
Round the given time value to nearest bucket start value.
round(tsInSeconds, 60)
round seconds epoch value to the start value of the 60 seconds bucket it belongs to. E.g. round(161347231, 60)= 161347200
now()
Return current time as epoch millis
Typically used in predicate to filter on timestamp for recent data. E.g. filter data on recent 1 day(86400 seconds).WHERE tsInMillis > now() - 86400000
timezoneHour(timeZoneId)
Returns the hour of the time zone offset.
timezoneMinute(timeZoneId)
Returns the minute of the time zone offset.
year(tsInMillis)
Returns the year from the given epoch millis in UTC timezone.
year(tsInMillis, timeZoneId)
Returns the year from the given epoch millis and timezone id.
yearOfWeek(tsInMillis)
Returns the year of the ISO week from the given epoch millis in UTC timezone. Alias yow
is also supported.
yearOfWeek(tsInMillis, timeZoneId)
Returns the year of the ISO week from the given epoch millis and timezone id. Alias yow
is also supported.
quarter(tsInMillis)
Returns the quarter of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 4.
quarter(tsInMillis, timeZoneId)
Returns the quarter of the year from the given epoch millis and timezone id. The value ranges from 1 to 4.
month(tsInMillis)
Returns the month of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 12.
month(tsInMillis, timeZoneId)
Returns the month of the year from the given epoch millis and timezone id. The value ranges from 1 to 12.
week(tsInMillis)
Returns the ISO week of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 53. Alias weekOfYear
is also supported.
week(tsInMillis, timeZoneId)
Returns the ISO week of the year from the given epoch millis and timezone id. The value ranges from 1 to 53. Alias weekOfYear
is also supported.
dayOfYear(tsInMillis)
Returns the day of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 366. Alias doy
is also supported.
dayOfYear(tsInMillis, timeZoneId)
Returns the day of the year from the given epoch millis and timezone id. The value ranges from 1 to 366. Alias doy
is also supported.
day(tsInMillis)
Returns the day of the month from the given epoch millis in UTC timezone. The value ranges from 1 to 31. Alias dayOfMonth
is also supported.
day(tsInMillis, timeZoneId)
Returns the day of the month from the given epoch millis and timezone id. The value ranges from 1 to 31. Alias dayOfMonth
is also supported.
dayOfWeek(tsInMillis)
Returns the day of the week from the given epoch millis in UTC timezone. The value ranges from 1(Monday) to 7(Sunday). Alias dow
is also supported.
dayOfWeek(tsInMillis, timeZoneId)
Returns the day of the week from the given epoch millis and timezone id. The value ranges from 1(Monday) to 7(Sunday). Alias dow
is also supported.
hour(tsInMillis)
Returns the hour of the day from the given epoch millis in UTC timezone. The value ranges from 0 to 23.
hour(tsInMillis, timeZoneId)
Returns the hour of the day from the given epoch millis and timezone id. The value ranges from 0 to 23.
minute(tsInMillis)
Returns the minute of the hour from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
minute(tsInMillis, timeZoneId)
Returns the minute of the hour from the given epoch millis and timezone id. The value ranges from 0 to 59.
second(tsInMillis)
Returns the second of the minute from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
second(tsInMillis, timeZoneId)
Returns the second of the minute from the given epoch millis and timezone id. The value ranges from 0 to 59.
millisecond(tsInMillis)
Returns the millisecond of the second from the given epoch millis in UTC timezone. The value ranges from 0 to 999.
millisecond(tsInMillis, timeZoneId)
Returns the millisecond of the second from the given epoch millis and timezone id. The value ranges from 0 to 999.
Function
Type
Description
JSONEXTRACTSCALAR
(jsonField, 'jsonPath', 'resultsType', [defaultValue])
Transform
Evaluates the 'jsonPath'
on jsonField,
returns the result as the type 'resultsType'
, use optional defaultValue
for null or parsing error.
JSONEXTRACTKEY
(jsonField, 'jsonPath')
Transform
Extracts all matched JSON field keys based on 'jsonPath'
Into aSTRING_ARRAY.
TOJSONMAPSTR(map)
Scalar
Convert map to JSON String
JSONFORMAT(object)
Scalar
Convert object to JSON String
JSONPATH(jsonField, 'jsonPath')
Scalar
Extracts the object value from jsonField
based on 'jsonPath'
, the result type is inferred based on JSON value. Cannot be used in query because data type is not specified.
JSONPATHLONG(jsonField, 'jsonPath', [defaultValue])
Scalar
Extracts the Long value from jsonField
based on 'jsonPath'
, use optional defaultValue
for null or parsing error.
JSONPATHDOUBLE(jsonField, 'jsonPath', [defaultValue])
Scalar
Extracts the Double value from jsonField
based on 'jsonPath'
, use optional defaultValue
for null or parsing error.
JSONPATHSTRING(jsonField, 'jsonPath', [defaultValue])
Scalar
Extracts the String value from jsonField
based on 'jsonPath'
, use optional defaultValue
for null or parsing error.
JSONPATHARRAY(jsonField, 'jsonPath')
Scalar
Extracts an array from jsonField
based on 'jsonPath'
, the result type is inferred based on JSON value. Cannot be used in query because data type is not specified.
JSONPATHARRAYDEFAULTEMPTY(jsonField, 'jsonPath')
Scalar
Extracts an array from jsonField
based on 'jsonPath'
, the result type is inferred based on JSON value. Returns empty array for null or parsing error. Cannot be used in query because data type is not specified.
Arguments
Description
jsonField
An Identifier/Expression contains JSON documents.
'jsonPath'
Follows JsonPath Syntax to read values from JSON documents.
'results_type'
One of the Pinot supported data types:INT, LONG, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, STRING,
INT_ARRAY, LONG_ARRAY, FLOAT_ARRAY, DOUBLE_ARRAY, STRING_ARRAY
.
JSONPATH(myJsonRecord, '$.name')
"Pete"
JSONPATH(myJsonRecord, '$.age')
24
JSONPATHSTRING(myJsonRecord, '$.age')
"24"
JSONPATHARRAY(myJsonRecord, '$.subjects[*].name')
["maths", "english"]
JSONPATHARRAY(myJsonRecord, '$.subjects[*].score')
[90, 70]
JSONPATHARRAY(myJsonRecord, '$.subjects[*].homework_grades[1]')
[85, 65]
SHA(bytesCol)
Return SHA-1 digest of binary column(bytes
type) as hex string
SHA(rawData)
SHA256(bytesCol)
Return SHA-256 digest of binary column(bytes
type) as hex string
SHA256(rawData)
SHA512(bytesCol)
Return SHA-512 digest of binary column(bytes
type) as hex string
SHA512(rawData)
MD5(bytesCol)
Return MD5 digest of binary column(bytes
type) as hex string
MD5(rawData)
ARRAYLENGTH
Returns the length of a multi-value column
MAP_VALUE
Select the value for a key from Map stored in Pinot.
MAP_VALUE(mapColumn, 'myKey', valueColumn)
VALUEIN
Takes at least 2 arguments, where the first argument is a multi-valued column, and the following arguments are constant values. The transform function will filter the value from the multi-valued column with the given constant values. The VALUEIN
transform function is especially useful when the same multi-valued column is both filtering column and grouping column.
VALUEIN(mvColumn, 3, 5, 15)
The Pinot Admin UI contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management including health check, instances management, schema and table management, data segments management.
Note: The controller API's are primarily for admin tasks. Even though the UI console queries Pinot when running queries from the query console, please use the Broker Query API for querying Pinot.
Let's check out the tables in this cluster by going to Table -> List all tables in cluster and click on Try it out!
. We can see the baseballStats
table listed here. We can also see the exact curl
call made to the controller API.
You can look at the configuration of this table by going to Tables -> Get/Enable/Disable/Drop a table, type in baseballStats
in the table name, and click Try it out!
Let's check out the schemas in the cluster by going to Schema -> List all schemas in the cluster and click Try it out!
. We can see a schema called baseballStats
in this list.
Take a look at the schema by going to Schema -> Get a schema, type baseballStats
in the schema name, and click Try it out!
.
Finally, let's checkout the data segments in the cluster by going to List all segments, type in baseballStats
in the table name, and click Try it out!
. There's 1 segment for this table, called baseballStats_OFFLINE_0
.
You might have figured out by now, in order to get data into the Pinot cluster, we need a table, a schema and segments. Let's head over to Batch upload sample data, to find out more about these components and learn how to create them for your own data.
Response is returned in a SQL-like tabular structure. Note, this is the response returned from the standard-SQL endpoint. For PQL endpoint response, skip to PQL endpoint response
Note
PQL endpoint is deprecated, and will soon be removed. The standard sql endpoint is the recommended endpoint.
The response received from PQL endpoint is different depending on the type of the query.
To see how JSON data can be queried, assume that we have the following table:
We also assume that "jsoncolumn" has a Json Index on it. Note that the last two rows in the table have different structure than the rest of the rows. In keeping with JSON specification, a JSON column can contain any valid JSON data and doesn't need to adhere to a predefined schema. To pull out the entire JSON document for each row, we can run the query below:
To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name.
Note that the third column (jsoncolumn.data[1]) is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath jsoncolumn.data[1]. We can filter out these rows.
Notice that certain last names (duck and mouse for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression.
Also there is numerical information (jsconcolumn.score) embeded within the JSON document. We can extract those numerical values from JSON data into SQL and sum them up using the query below.
In short, JSON querying support in Pinot will allow you to use a JsonPath expression whereever you can use a column name with the only difference being that to query a column with data type JSON, you must append a JsonPath expression after the name of the column.
Pinot provides a native java client to execute queries on the cluster. The client makes it easier for user to query data. The client is also tenant-aware and thus is able to redirect the queries to the correct broker.
You can use the client by including the following dependency -
Here's an example of how to use the pinot-java-client
to query Pinot.
The client provides a ConnectionFactory
class to create connections to a Pinot cluster. The factory supports the following methods to create a connection -
Zookeeper (Recommended) - Comma seperated list of zookeeper of the cluster. This is the recommended method which can redirect queries to appropriate brokers based on tenant/table.
Broker list - Comma seperated list of the brokers in the cluster. This should only be used in standalone setups or for POC, unless you have a load balancer setup for brokers.
Properties file - You can also put the broker list as brokerList
in a properties file and provide the path to that file to the factory. This should only be used in standalone setups or for POC, unless you have a load balancer setup for brokers.
Here's an example demonstrating all methods of Connection factory -
You can run the query in both blocking as well as async manner. Use
Connection.execute(org.apache.pinot.client.Request)
for blocking queries
Connection.executeAsync(org.apache.pinot.client.Request)
for asynchronous queries that return a future object.
You can also use PreparedStatement
to escape query parameters. We don't store the Prepared Statement in the database and hence it won't increase the subsequent query performance.
Results can be obtained with the various get methods in the first ResultSet, obtained through the getResultSet(int)
method:
If queryFormat pql
is used in the Request
, there are some differences in how the results can be accessed, depending on the query.
In the case of aggregation, each aggregation function is within its own ResultSet. A query with multiple aggregation function will return one result set per aggregation function, as they are computed in parallel.
In case of aggregation with GROUP BY
, there will be as many ResultSets as the number of aggregations, each of which will contain multiple results grouped by a grouping key.
A lot of times the user wants to query data from an external application instead of using the inbuilt query explorer. Pinot provides external query client for this purpose. All of the clients have pretty standard interfaces so that the learning curve is minimum.
Currently Pinot provides the following clients
Pinot offers standard JDBC interface to query the database. This makes it easier to integrate Pinot with other applications such as Tableau.
You can include the JDBC dependency in your code as follows -
There is no need to register the driver manually as it will automatically register itself at the startup of the application.
Here's an example of how to use the pinot-jdbc-client
for querying. The client only requires the controller URL.
You can also use PreparedStatements. The placeholder parameters are represented using ?
** (question mark) symbol.
The JDBC client doesn't support INSERT
, DELETE
or UPDATE
statements due to the database limitations. You can only use the client to query the database.
The driver is also not completely ANSI SQL 92 compliant.
If you want to use JDBC driver to integrate Pinot with other applications, do make sure to check JDBC ConnectionMetadata in your code. This will help in determining which features cannot be supported by Pinot since it is an OLAP database.
id | jsoncolumn |
---|---|
id | jsoncolumn.name.last | jsoncolumn.name.first | jsoncolumn.data[1] |
---|---|---|---|
id | jsoncolumn.name.last | jsoncolumn.name.first | jsoncolumn.data[1] |
---|---|---|---|
jsoncolumn.name.last | count(*) |
---|---|
jsoncolumn.name.last | sum(jsoncolumn.score) |
---|---|
You can also build locally and use it.
This section is only applicable for PQL endpoint, which is deprecated and will be deleted soon. For more information about the endpoints, visit .
You can also compile the into a JAR and place the JAR in the Drivers directory of your application.
Response Field
Description
resultTable
This contains everything needed to process the response
resultTable.dataSchema
This describes schema of the response (columnNames and their dataTypes)
resultTable.dataSchema.columnNames
columnNames in the response.
resultTable.dataSchema.columnDataTypes
DataTypes for each column
resultTable.rows
Actual content with values. This is an array of arrays. number of rows depends on the limit value in the query. The number of columns in each row is equal to the length of (resultTable.dataSchema.columnNames)
timeUsedms
Total time taken as seen by the broker before sending the response back to the client
totalDocs
This is number of documents/records in the table
numServersQueried
represents the number of servers queried by the broker (note that this may be less than the total number of servers since broker can apply some optimizations to minimize the number of servers)
numServersResponded
This should be equal to the numServersQueried. If this is not the same, then one of more servers might have timed out. If numServersQueried != numServersResponded the results can be considered partial and clients can retry the query with exponential back off.
numSegmentsQueried
Total number of segmentsQueried for this query. it may be less than the total number of segments since broker can apply optimizations.
numSegmentsMatched
This is the number of segments actually processed. This indicates the effectiveness of pruning logic (based on partitioning, time etc).
numSegmentsProcessed
Actual number of segments that were processed. This is where the majority of the time is spent.
numDocScanned
The number of docs/records that were scanned to process the query. This includes the docs scanned in filter phase (this can be zero if columns in query are indexed) and post filter.
numEntriesScannedInFilter
This along with numEntriesScannedInPostFilter should give an idea on where most of the time is spent during query processing. If this is high, enabling indexing for columns in tableConfig can be one way to bring it down.
numEntriesScannedPostFilter
This along with numEntriesScannedInPostFilter should give an idea on where most of the time is spent during query processing. A high number for this means the selectivity is low (i.e. pinot needs to scan a lot of records to answer the query). If this is high, adding regular inverted/bitmap index will not help. However, consider using start-tree index.
numGroupsLimitReached
If the query has group by clause and top K, pinot drops new entries after the numGroupsLimit is reached. If this boolean is set to true then the query result may not be accurate. Note that the default value for numGroupsLimit is 100k and should be sufficient for most use cases.
exceptions
Will contain the stack trace if there is any exception processing the query.
segmentStatistics
N/A
traceInfo
If trace is enabled (can be enabled for each query), this will contain the timing for each stage and each segment. Advanced feature and intended for dev/debugging purposes
"101"
"{"name":{"first":"daffy","last":"duck"},"score":101,"data":["a","b","c","d"]}"
102"
"{"name":{"first":"donald","last":"duck"},"score":102,"data":["a","b","e","f"]}
"103"
"{"name":{"first":"mickey","last":"mouse"},"score":103,"data":["a","b","g","h"]}
"104"
"{"name":{"first":"minnie","last":"mouse"},"score":104,"data":["a","b","i","j"]}"
"105"
"{"name":{"first":"goofy","last":"dwag"},"score":104,"data":["a","b","i","j"]}"
"106"
"{"person":{"name":"daffy duck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
"107"
"{"person":{"name":"scrooge mcduck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
"101"
"duck"
"daffy"
"b"
"102"
"duck"
"donald"
"b"
"103"
"mouse"
"mickey"
"b"
"104"
"mouse"
"minnie"
"b"
"105"
"dwag"
"goofy"
"b"
"106"
"null"
"null"
"null"
"107"
"null"
"null"
"null"
"101"
"duck"
"daffy"
"b"
"102"
"duck"
"donald"
"b"
"103"
"mouse"
"mickey"
"b"
"104"
"mouse"
"minnie"
"b"
"105"
"dwag"
"goofy"
"b"
"mouse"
"2"
"duck"
"2"
"dwag"
"1"
"mouse"
"207"
"dwag"
"104"
"duck"
"203"
Pinot Client for Golang
Pinot also provides a native go client to query database directly from go application.
Please follow this Pinot Quickstart link to install and start Pinot batch QuickStart locally.
Check out Client library Github Repo
Build and run the example application to query from Pinot Batch Quickstart
Pinot client could be initialized through:
Please see this example for your reference.
Code snippet:
Query Response is defined as the struct of following:
Note that AggregationResults
and SelectionResults
are holders for PQL queries.
Meanwhile ResultTable
is the holder for SQL queries. ResultTable
is defined as:
RespSchema
is defined as:
There are multiple functions defined for ResultTable
, like:
Sample Usage is here
Configure AliCloud Object Storage Service (OSS) as Pinot deep storage
OSS can be used as HDFS deep storage for Apache Pinot without implementing the OSS file system plugin. You should follow the steps below;
1. Configure hdfs-site.xml and core-site.xml files. After that, put these configurations under any desired path, then set the value of pinot.<node>.storage.factory.oss.hadoop.conf
config on the controller/server configs to this path.
For hdfs-site.xml; you do not have to give any configuration;
For core-site.xml; you have to give OSS access/secret and bucket configurations like below;
2. In order to access OSS, find your HDFS jars related to OSS and put them under the PINOT_DIR/lib
. You can use jars below but be careful about versions to avoid conflict.
smartdata-aliyun-oss
smartdata-hadoop-common
guava
3. Set OSS deep storage configs on controller.conf and server.conf;
Controller config
Server config
Example Job Spec
Using the same HDFS deep storage configs and jars, you can read data from OSS, then create segments and push them to OSS again. An example standalone batch ingestion job can be like below;
Applications can use this python client library to query Apache Pinot.
Pypi Repo: https://pypi.org/project/pinotdb/
Source Code Repo: https://github.com/python-pinot-dbapi/pinot-dbapi
Please note:
pinotdb version >= 0.3.2 is using Pinot SQL API (added in Pinot >= 0.3.0) and drops support for PQL API. So this client requires Pinot server version >= 0.3.0 in order to access Pinot.
pinotdb version in 0.2.x is using Pinot PQL API, which works with pinot version <= 0.3.0, but may miss some new SQL query features added in newer Pinot version.
The db engine connection string is format as: pinot://:?controller=://:/
Run below command to start Pinot Batch Quickstart in docker and expose Pinot controller port 9000 and Pinot broker port 8000.
Once pinot batch quickstart is up, you can run below sample code snippet to query Pinot:
Sample Output:
Run below command to start Pinot Hybrid Quickstart in docker and expose Pinot controller port 9000 and Pinot broker port 8000.
Below is an example to query against Pinot Quickstart Hybrid:
In order to setup Pinot in Docker to use S3 as deep store, we need to put extra configs for Controller and Server.
Below sections will prepare 3 config files under /tmp/pinot-s3-docker
to mount to the container.
Below is a sample controller.conf
file.
Please config:
controller.data.dir
to your s3 bucket. All the uploaded segments will be stored there.
And add s3 as a pinot storage with configs:
You can specify AccessKey and Secret using:
Environment Variables - AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
(RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY
and AWS_SECRET_KEY
(only recognized by Java SDK)
Java System Properties - aws.accessKeyId
and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials
) shared by all AWS SDKs and the AWS CLI
Configure AWS credential in pinot config files, e.g. set pinot.controller.storage.factory.s3.accessKey
and pinot.controller.storage.factory.s3.secretKey
in the config file. (Not recommended)
Add s3
to pinot.controller.segment.fetcher.protocols
and set pinot.controller.segment.fetcher.s3.class
toorg.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Then start pinot controller with:
Broker is a simple one you can just start it with default:
Below is a sample server.conf
file
Similar to controller config, please also set s3 configs in pinot server.
Then start pinot server with:
In this demo, we just use airlineStats
table as an example which is already packaged inside the docker image.
You can also mount your table conf and schema files to the container and run it.
Below is a sample standalone ingestion job spec with certain notable changes:
jobType is SegmentCreationAndMetadataPush (this job will bypass controller download segment )
inputDirURI is set to a s3 location s3://my.bucket/batch/airlineStats/rawdata/
outputDirURI is set to a s3 location s3://my.bucket/output/airlineStats/segments
Add a new PinotFs under pinotFSSpecs
Sample ingestionJobSpec.yaml
Launch the data ingestion job:
Pinot segments can be created offline on Hadoop, or via command line from data files. Controller REST endpoint can then be used to add the segment to the table to which the segment belongs. Pinot segments can also be created by ingesting data from realtime resources (such as Kafka).
Offline Pinot workflow
To create Pinot segments on Hadoop, a workflow can be created to complete the following steps:
Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory
Create segments
Upload segments to the Pinot cluster
Step one can be done using your favorite tool (such as Pig, Hive or Spark), Pinot provides two MapReduce jobs to do step two and three.
Create a job properties configuration file, such as one below:
The Pinot Hadoop module contains a job that you can incorporate into your workflow to generate Pinot segments.
You can then use the SegmentTarPush job to push segments via the controller REST API.
Here is how you can create Pinot segments from standard formats like CSV/JSON/AVRO.
Create a top level directory containing all the CSV/JSON/AVRO files that need to be converted into segments.
The file name extensions are expected to be the same as the format name (i.e .csv
, .json
or .avro
), and are case insensitive. Note that the converter expects the .csv
extension even if the data is delimited using tabs or spaces instead.
Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.
Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc. A detailed description of this follows below.
Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within “[ ]” are optional. For -format, the default value is AVRO.
To configure various parameters for CSV a config file in JSON format can be provided. This file is optional, as are each of its parameters. When not provided, default values used for these parameters are described below:
fileFormat: Specify one of the following. Default is EXCEL.
EXCEL
MYSQL
RFC4180
TDF
header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
delimiter: Use this to specify a delimiter character. The default value is “,”.
multiValueDelimiter: Use this to specify a delimiter character for each value in multi-valued columns. The default value is “;”.
Below is a sample config file.
Sample Schema:
You can use curl to push a segment to pinot:
Alternatively you can use the pinot-admin.sh utility to upload one or more segments:
The command uploads all the segments found in segmentDirectoryPath
. The segments could be either tar-compressed (in which case it is a file under segmentDirectoryPath
) or uncompressed (in which case it is a directory under segmentDirectoryPath
).
Regarding AWS Credential, we also follow the convention of .
Follow the steps described in the section on to build pinot. Locate pinot-admin.sh
in pinot-tools/target/pinot-tools=pkg/bin/pinot-admin.sh
.
One of the primary advantage of using Pinot is its pluggable architecture. The plugins make it easy to add support for any third-party system which can be an execution framework, a filesystem or input format.
In this tutorial, we will use three such plugins to easily ingest data and push it to our pinot cluster. The plugins we will be using are -
pinot-batch-ingestion-spark
pinot-s3
pinot-parquet
You can check out Batch Ingestion, File systems and Input formats for all the available plugins.
We are using the following tools and frameworks for this tutorial -
Apache Spark 2.2.3 (Although any spark 2.X should work)
Apache Parquet 1.8.2
We need to get input data to ingest first. For our demo, we'll just create some small parquet files and upload them to our S3 bucket. The easiest way is to create CSV files and then convert them to parquet. CSV makes it human-readable and thus easier to modify input in case of some failure in our demo. We will call this file students.csv
Now, we'll create parquet files from the above CSV file using Spark. Since this is a small program, we will be using Spark shell instead of writing a full fledged Spark code.
The .parquet
files can now be found in /path/to/batch_input
directory. You can now upload this directory to S3 either using their UI or running the command
We need to create a table to query the data that will be ingested. All tables in pinot are associated with a schema. You can check out Table configuration and Schema configuration for more details on creating configurations.
For our demo, we will have the following schema and table configs
We can now upload these configurations to pinot and create an empty table. We will be using pinot-admin.sh
CLI for these purpose.
You can check out Command-Line Interface (CLI) for all the available commands.
Our table will now be available in the Pinot data explorer
Now that our data is available in S3 as well as we have the Tables in Pinot, we can start the process of ingesting the data. Data ingestion in Pinot involves the following steps -
Read data and generate compressed segment files from input
Upload the compressed segment files to output location
Push the location of the segment files to the controller
Once the location is available to the controller, it can notify the servers to download the segment files and populate the tables.
The above steps can be performed using any distributed executor of your choice such as Hadoop, Spark, Flink etc. For this demo we will be using Apache Spark to execute the steps.
Pinot provides runners for Spark out of the box. So as a user, you don't need to write a single line of code. You can write runners for any other executor using our provided interfaces.
Firstly, we will create a job spec configuration file for our data ingestion process.
In the job spec, we have kept execution framework as spark
and configured the appropriate runners for each of our steps. We also need a temporary stagingDir
for our spark job. This directory is cleaned up after our job has executed.
We also provide the S3 Filesystem and Parquet reader implementation in the config to use. You can refer Ingestion Job Spec for complete list of configuration.
We can now run our spark job to execute all the steps and populate data in pinot.
In the command , we have included the JARs of all the required plugins in the spark's driver classpath
. In practice, you only need to do this if you get a ClassNotFoundException
.
Voila! Now our data is successfully ingested. Let's try to query it from Pinot's broker
If everything went right, you should receive the following output
Below commands are based on pinot distribution binary.
In order to setup Pinot to use S3 as deep store, we need to put extra configs for Controller and Server.
Below is a sample controller.conf
file.
Please config:
controller.data.dir
to your s3 bucket. All the uploaded segments will be stored there.
And add s3 as a pinot storage with configs:
Regarding AWS Credential, we also follow the convention of DefaultAWSCredentialsProviderChain.
You can specify AccessKey and Secret using:
Environment Variables - AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
(RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY
and AWS_SECRET_KEY
(only recognized by Java SDK)
Java System Properties - aws.accessKeyId
and aws.secretKey
Credential profiles file at the default location (~/.aws/credentials
) shared by all AWS SDKs and the AWS CLI
Configure AWS credential in pinot config files, e.g. set pinot.controller.storage.factory.s3.accessKey
and pinot.controller.storage.factory.s3.secretKey
in the config file. (Not recommended)
Add s3
to pinot.controller.segment.fetcher.protocols
and set pinot.controller.segment.fetcher.s3.class
toorg.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
If you to grant full control to bucket owner, then add this to the config:
Then start pinot controller with:
Broker is a simple one you can just start it with default:
Below is a sample server.conf
file
Similar to controller config, please also set s3 configs in pinot server.
If you to grant full control to bucket owner, then add this to the config:
Then start pinot controller with:
In this demo, we just use airlineStats
table as an example.
Create table with below command:
Below is a sample standalone ingestion job spec with certain notable changes:
jobType is SegmentCreationAndUriPush
inputDirURI is set to a s3 location s3://my.bucket/batch/airlineStats/rawdata/
outputDirURI is set to a s3 location s3://my.bucket/output/airlineStats/segments
Add a new PinotFs under pinotFSSpecs
For library version < 0.6.0, please set segmentUriPrefix
to [scheme]://[bucket.name]
, e.g. s3://my.bucket
, from version 0.6.0, you can put empty string or just ignore segmentUriPrefix
.
Sample ingestionJobSpec.yaml
Below is a sample job output:
Please follow this page to setup a local spark cluster.
Below is a sample Spark Ingestion job
Submit spark job with the ingestion job:
Below is the sample snapshot of s3 location for controller:
Below is a sample download URI in PropertyStore, we expect the segment download uri is started with s3://
Pinot only allows adding new columns to the schema. In order to drop a column, change the column name or data type, a new table has to be created.
Let's begin by first fetching the existing schema. We can do this using the controller API:
Let's add a new column at the end of the schema, something like this (by editing baseballStats.schema
In this example, we're adding a new column called yearsOfExperience
with a default value of 1.
You can now update the schema using the following command
Please note: this will not be reflected immediately. You can use the following command to reload the table segments for this column to show up. This can be done as follows:
After the reload, now you can query the new column as shown below:
Note that the real values for the newly added columns won't be reflected within the current consuming segment(s). The next consuming segment(s) will start consuming the real values.
As you can observe, the current query returns the defaultNullValue
for the newly added column. In order to populate this column with real values, you will need to re-run the batch ingestion job for the past dates.
In practice, we need to run Pinot data ingestion as a pipeline or a scheduled job.
Assuming pinot-distribution is already built, inside examples directory, you could find several sample table layouts.
Usually each table deserves its own directory, like airlineStats.
Inside the table directory, rawdata is created to put all the input data.
Typically, for data events with timestamp, we partition those data and store them into a daily folder. E.g. a typically layout would follow this pattern: rawdata/%yyyy%/%mm%/%dd%/[daily_input_files]
.
Create a batch ingestion job spec file to describe how to ingest the data.
Below is an example (also located at examples/batch/airlineStats/ingestionJobSpec.yaml
)
Below command will create example table into Pinot cluster.
Below command will kick off the ingestion job to generate Pinot segments and push them into the cluster.
After job finished, segments are stored in examples/batch/airlineStats/segments
following same layout of input directory layout.
Below example is running in a spark local mode. You can download spark distribution and start it by running:
Below command shows how to use spark-submit command to submit a spark job using pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
.
Sample Spark ingestion job spec yaml, (also located at examples/batch/airlineStats/sparkIngestionJobSpec.yaml
):
Please ensure parameter PINOT_ROOT_DIR
and PINOT_VERSION
are set properly.
Please ensure you set
spark.driver.extraJavaOptions =>
-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins
Or put all the required plugins jars to CLASSPATH, then set -Dplugins.dir=${CLASSPATH}
spark.driver.extraClassPath =>
pinot-all-${PINOT_VERSION}-jar-with-depdencies.jar
Below command shows how to use Hadoop jar command to run a Hadoop job using pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
.
Sample Hadoop ingestion job spec yaml(also located at examples/batch/airlineStats/hadoopIngestionJobSpec.yaml
):
Please ensure parameter PINOT_ROOT_DIR
and PINOT_VERSION
are set properly.
You can set Environment Variable: JAVA_OPTS
to modify:
Log4j2 file location with -Dlog4j2.configurationFile
Plugin directory location with -Dplugins.dir=/opt/pinot/plugins
JVM props, like -Xmx8g -Xms4G
Please note that you need to config above three all together in JAVA_OPTS
. If you only config JAVA_OPTS="-Xmx4g"
then plugins.dir
is empty usually will cause job failure.
E.g.
You can also add your customized JAVA_OPTS
if necessary.
So far, you've seen how to for a Pinot table. In this tutorial, we'll see how to evolve the schema (e.g. add a new column to the schema). This guide assumes you have a Pinot cluster up and running (eg: as mentioned in ). We will also assume there's an existing table baseballStats
created as part of the .
Real-Time Pinot table: In case of real-time tables, make sure the "pinot.server.instance.reload.consumingSegment" config is set to true inside . Without this, the current consuming segment(s) will not reflect the default null value for newly added columns.
New columns can be added with . If all the source columns for the new column exist in the schema, the transformed values will be generated for the new column instead of filling default values.
Real-Time Pinot table: Backfilling data does not work for real-time tables. If you only have a real-time table, you can convert it to a hybrid table, by adding an offline counterpart that uses the same schema. Then you can backfill the offline table and fill in values for the newly added column. More on .
Build latest Pinot Distribution following this .