Pinot has many built-in aggregation functions such as MIN, MAX, SUM, and AVG. See the PQL page for the list of aggregation functions.
Adding a new AggregationFunction requires two things:
Implement the AggregationFunction interface and make it available on the classpath.
Register the function in AggregationFunctionFactory. As of today, this requires a code change in Pinot, but we plan to add the ability to plug in functions without having to change Pinot code.
To get an overall idea, see the MAX aggregation function implementation. All other implementations can be found here.
Let's look at the key methods to implement in AggregationFunction.
Before getting into the implementation, it's important to understand how aggregation works in Pinot.
This is an advanced topic and assumes you know Pinot concepts. All the data in Pinot is stored in segments across multiple nodes. At a high level, the query plan comprises three phases:
1. Map phase
This phase works on the individual segments in Pinot.
Initialization: Depending on the query type, one of the following methods is invoked to set up the result holder. Having different methods and return types adds complexity, but it helps performance.
AGGREGATION: createAggregationResultHolder
This must return an instance of type AggregationResultHolder. You can use either DoubleAggregationResultHolder or ObjectAggregationResultHolder.
GROUP BY: createGroupByResultHolder
This method must return an instance of type GroupByResultHolder. Depending on the type of result object, you might be able to use one of the existing implementations.
Callback: For the records that match the filter condition in the query, one of the following methods is invoked, depending on the query type (aggregation vs. group by) and the column type (single-value vs. multi-value). Note that the method is invoked for a batch of records rather than for every row. This improves performance and allows the JVM to vectorize parts of the execution where possible.
AGGREGATION: aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<String,BlockValSet> blockValSetMap)
length: This represents the length of the block. Typically < 10k
aggregationResultHolder: This is the object returned from createAggregationResultHolder
blockValSetMap: Map of blockValSets depending on the arguments to the AggFunction
Group By Single Value: aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<String,BlockValSet> blockValSetMap)
length: This represents the length of the block. Typically < 10k
groupKeyArray: Pinot internally maintains a value-to-int mapping, and this groupKeyArray maps to the internal mapping. These values together form a unique key.
groupByResultHolder: This is the object returned from createGroupByResultHolder
blockValSetMap: Map of blockValSets depending on the arguments to the AggFunction
Group By Multi Value: aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<String,BlockValSet> blockValSetMap)
length: This represents the length of the block. Typically < 10k
groupKeysArray: Pinot internally maintains a value-to-int mapping, and this groupKeysArray maps to the internal mapping. These values together form a unique key. Because a multi-value column can yield several group keys per record, this is an array of key arrays.
groupByResultHolder: This is the object returned from createGroupByResultHolder
blockValSetMap: Map of blockValSets depending on the arguments to the AggFunction
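The Map phase callbacks described above can be sketched with a simplified MAX function. This is a minimal illustration and not Pinot's code: DoubleHolderSketch, GroupByHolderSketch, and MaxSketch are hypothetical stand-ins for the Pinot result holders and AggregationFunction, and a plain double[] block replaces the blockValSetMap argument.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a result holder that stores a single double (not the Pinot class). */
final class DoubleHolderSketch {
    private double value;

    DoubleHolderSketch(double initial) {
        this.value = initial;
    }

    double get() {
        return value;
    }

    void set(double newValue) {
        this.value = newValue;
    }
}

/** Hypothetical stand-in for a group-by result holder: one double per internal group key. */
final class GroupByHolderSketch {
    private final double[] values;

    GroupByHolderSketch(int numGroups, double initial) {
        values = new double[numGroups];
        Arrays.fill(values, initial);
    }

    double get(int groupKey) {
        return values[groupKey];
    }

    void set(int groupKey, double newValue) {
        values[groupKey] = newValue;
    }
}

/** Simplified MAX aggregation that mirrors the shape of the callbacks described above. */
final class MaxSketch {
    DoubleHolderSketch createAggregationResultHolder() {
        return new DoubleHolderSketch(Double.NEGATIVE_INFINITY);
    }

    GroupByHolderSketch createGroupByResultHolder(int numGroups) {
        return new GroupByHolderSketch(numGroups, Double.NEGATIVE_INFINITY);
    }

    // Invoked once per block; only the first `length` entries of `block` are valid.
    void aggregate(int length, DoubleHolderSketch holder, double[] block) {
        double max = holder.get();
        for (int i = 0; i < length; i++) {
            max = Math.max(max, block[i]);
        }
        holder.set(max);
    }

    // groupKeyArray[i] is the internal int key of the group that record i belongs to.
    void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByHolderSketch holder, double[] block) {
        for (int i = 0; i < length; i++) {
            int key = groupKeyArray[i];
            holder.set(key, Math.max(holder.get(key), block[i]));
        }
    }
}
```

Because the holder carries state across calls, the same holder can be passed to aggregate for each block of a segment, and it ends up with the maximum over all blocks.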
2. Combine phase
In this phase, the results from all segments within a single Pinot server are combined into an IntermediateResult. The type of IntermediateResult is based on the generic type defined in the AggregationFunction implementation.
3. Reduce phase
There are two steps in the Reduce phase:
Merge all the IntermediateResults from the various servers using the merge function.
Extract the final result by invoking the extractFinalResult method. In most cases, FinalResult is the same type as IntermediateResult. AverageAggregationFunction is an example where IntermediateResult (AvgPair) is different from FinalResult (Double).
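To illustrate why AVG needs an intermediate type that differs from its final type, here is a minimal sketch of the merge and extractFinalResult steps. AvgPairSketch and AvgReduceSketch are hypothetical names used for illustration; they only mimic the idea of Pinot's AvgPair and are not its implementation.

```java
import java.util.List;

/** Hypothetical intermediate result for AVG: a running sum and a record count. */
final class AvgPairSketch {
    final double sum;
    final long count;

    AvgPairSketch(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }
}

final class AvgReduceSketch {
    // Merging adds sums and counts, so the order in which servers are merged does not matter.
    static AvgPairSketch merge(AvgPairSketch left, AvgPairSketch right) {
        return new AvgPairSketch(left.sum + right.sum, left.count + right.count);
    }

    // The final result is a Double, a different type from the intermediate pair.
    static Double extractFinalResult(AvgPairSketch merged) {
        if (merged.count == 0) {
            return null; // no matching records (a choice made for this sketch)
        }
        return merged.sum / merged.count;
    }

    // Reduce step: merge the per-server intermediate results, then extract the final value.
    static Double reduce(List<AvgPairSketch> perServerResults) {
        AvgPairSketch merged = new AvgPairSketch(0.0, 0L);
        for (AvgPairSketch result : perServerResults) {
            merged = merge(merged, result);
        }
        return extractFinalResult(merged);
    }
}
```

For two servers reporting (sum 10, count 4) and (sum 20, count 6), the merged pair gives 30 / 10 = 3.0. Averaging the two per-server averages instead would give about 2.92, which is why the pair is carried until the final step.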
To contribute to Pinot, follow the instructions below.
Pinot uses Git for source code management. If you are new to Git, it is worth reviewing the basics of Git and its common tasks first.
To limit the number of branches created on the Apache Pinot repository, we recommend that you create a fork by clicking the fork button.
Run the following maven command to set up the project.
To import the Pinot stylesheet, launch IntelliJ and navigate to Preferences (on Mac) or Settings (on Linux).
Navigate to Editor -> Code Style -> Java.
Select Import Scheme -> IntelliJ IDEA code style XML.
Choose codestyle-intellij.xml from the pinot/config folder of your workspace. Click Apply.
To import the Pinot stylesheet, launch Eclipse and navigate to Preferences (on Mac) or Settings (on Linux).
Navigate to Java -> Code Style -> Formatter.
Choose codestyle-eclipse.xml from the pinot/config folder of your workspace. Click Apply.
Batch Quickstart
Start all Pinot components (ZK, Controller, Server, Broker) in the same JVM.
Create the Baseball Stats table.
Go to localhost:9000 in your browser and play with the query console.
Real-time Quickstart
Start all Pinot components (ZK, Controller, Server, Broker) in the same JVM.
Start Kafka in the same JVM.
Create the MeetUpRSVP table.
Live stream meetup events into Kafka.
Go to localhost:9000 in your browser and play with the meetup RSVP table.
Pinot is a Maven project, and familiarity with Maven will help you work with the Pinot code. If you are new to Maven, it is worth reading an introduction to it first.
Import the project into your favorite IDE and set up the stylesheet for that IDE. We have provided instructions for IntelliJ and Eclipse. If you are using another IDE, make sure its code style matches the Pinot stylesheet.
Once the IDE is set up, you can run the Batch Quickstart for batch mode or the Real-time Quickstart for real-time mode.
When Pinot segment files are created in external systems (Hadoop, Spark, etc.), there are several ways to push that data to the Pinot Controller and Server:
Push the segment to a shared NFS and let Pinot pull the segment files from that NFS location.
Push the segment to a web server and let Pinot pull the segment files from the web server over an http/https link.
Push the segment to HDFS and let Pinot pull the segment files from HDFS using an hdfs location URI.
Push the segment to another system and implement your own segment fetcher to pull data from that system.
The first two options should be supported out of the box with the Pinot package. As long as your remote jobs send the Pinot controller the corresponding URI of the files, it will pick up the files and allocate them to the proper Pinot servers and brokers. To enable Pinot support for HDFS, you will need to provide the Pinot Hadoop configuration and the proper Hadoop dependencies.
In your Pinot controller/server configuration, you will need to provide the following configs:
or
This path should point to the local folder containing the core-site.xml and hdfs-site.xml files from your Hadoop installation.
or
These two configs should hold the corresponding Kerberos configuration if your Hadoop installation is secured with Kerberos. Check the Hadoop Kerberos guide on how to generate a Kerberos security identification.
You will also need to provide the proper Hadoop dependency jars from your Hadoop installation to your Pinot startup scripts.
To push HDFS segment files to the Pinot controller, you just need to ensure you have the proper Hadoop configuration described in the previous part. Your remote segment creation/push job can then send the HDFS path of your newly created segment files to the Pinot Controller and let it download the files.
For example, the following curl requests to the Controller will notify it to download segment files to the proper table:
You can also implement your own segment fetchers for other file systems and load them into Pinot with an external jar. All you need to do is implement a class that implements the SegmentFetcher interface and provide config to the Pinot Controller and Server as follows:
or
You can also provide other configs to your fetcher under the config root pinot.server.segment.fetcher.<protocol>.
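The idea of a protocol-specific fetcher can be sketched as follows. This is only an illustration under assumptions: SegmentFetcherSketch is a hypothetical stand-in whose single method is assumed for this example, and the real SegmentFetcher interface in Pinot may declare different methods. The registry class is likewise hypothetical and simply shows how a URI scheme could select a fetcher.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fetcher contract for this sketch; not the actual Pinot SegmentFetcher interface. */
interface SegmentFetcherSketch {
    void fetchSegmentToLocal(URI uri, File dest) throws IOException;
}

/** Example fetcher for file: URIs that copies the segment file to the destination. */
final class LocalFileFetcherSketch implements SegmentFetcherSketch {
    @Override
    public void fetchSegmentToLocal(URI uri, File dest) throws IOException {
        Files.copy(Paths.get(uri), dest.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}

/** Hypothetical registry that picks a fetcher based on the protocol (URI scheme). */
final class FetcherRegistrySketch {
    private final Map<String, SegmentFetcherSketch> fetchersByProtocol = new HashMap<>();

    void register(String protocol, SegmentFetcherSketch fetcher) {
        fetchersByProtocol.put(protocol, fetcher);
    }

    void fetch(URI uri, File dest) throws IOException {
        SegmentFetcherSketch fetcher = fetchersByProtocol.get(uri.getScheme());
        if (fetcher == null) {
            throw new IllegalArgumentException("No fetcher registered for protocol: " + uri.getScheme());
        }
        fetcher.fetchSegmentToLocal(uri, dest);
    }
}
```

A fetcher for another storage system would implement the same contract and be registered under its own protocol name, which corresponds to the <protocol> part of the config root mentioned above.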
This document outlines guidelines for managing external dependencies in the Apache Pinot project. By enforcing these guidelines, we can ensure that contributors continually apply dependency management best practices in a consistent and centralized manner across the code base. These practices lead to a more predictable dependency graph and allow third-party builds that depend on OSS to repeatably override dependencies as needed for compliance or other reasons.
When adding new dependencies or updating the version of an existing dependency, follow these guidelines.
For standard (non plugin) Pinot subprojects:
Define all versions inside of the top-level Pinot POM’s dependencyManagement section.
Use properties to define dependency versions and refer to the property when declaring your dependency in the Pinot POM. This allows the version to be easily overridden through Maven properties.
Include the dependency inside the dependencies section of the subproject but do not define a version.
If a BOM version of a dependency exists, favor the BOM so that any transitive dependencies in the dependency group are also pinned at the same version. For example, prefer com.fasterxml.jackson:jackson-bom over com.fasterxml.jackson.core:jackson-annotations to enforce that all transitive dependencies for the Jackson libraries are at the same version.
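As an illustration of the guidelines for standard subprojects, the fragments below show one way the pieces could fit together. They are a sketch, not a copy of the Pinot POM: the property name jackson.version and its value are placeholders chosen for this example.

```xml
<!-- Top-level pom.xml (illustrative fragment; property name and version are placeholders) -->
<properties>
  <jackson.version>2.15.2</jackson.version>
</properties>

<dependencyManagement>
  <dependencies>
    <!-- Importing the BOM pins every Jackson artifact to the same version -->
    <dependency>
      <groupId>com.fasterxml.jackson</groupId>
      <artifactId>jackson-bom</artifactId>
      <version>${jackson.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Subproject pom.xml: declare the dependency without a version -->
<dependencies>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-annotations</artifactId>
  </dependency>
</dependencies>
```

With the version held in a property, a downstream build can override it without editing the POM, for example by passing -Djackson.version=<version> on the mvn command line.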
For Pinot plugin subprojects:
If the required dependency (and version) is present in the Pinot POM, do not shade the dependency and reference it without a version in the subproject’s dependencies section.
If the dependency is not present in the Pinot POM, or is present at a different version, use the maven-shade-plugin to relocate the library so that you avoid class loading conflicts caused by duplicate classes from the Pinot POM or other plugins. In such cases, you can also define the dependency inside the subproject's POM to pin its version. It is the plugin's responsibility to ensure that no Pinot methods are called that reference conflicting libraries, and thus avoid failing with a NoSuchMethodError at runtime. For example, a plugin may be able to call JsonUtils.objectToString but not JsonUtils.objectToJsonNode when shading Jackson libraries.
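A relocation with the maven-shade-plugin could look roughly like the fragment below. It is a sketch for a plugin subproject's POM; the shaded package prefix org.example.plugin.shaded is a hypothetical name, and a real plugin would pick a prefix under its own package.

```xml
<!-- Plugin subproject pom.xml (illustrative fragment; shadedPattern is a hypothetical package) -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Rewrite Jackson classes into a private package to avoid clashing with Pinot's copy -->
          <relocation>
            <pattern>com.fasterxml.jackson</pattern>
            <shadedPattern>org.example.plugin.shaded.com.fasterxml.jackson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After relocation, the plugin's own code refers to the shaded Jackson classes, which is why Pinot methods whose signatures mention the original Jackson types can no longer be called safely from the plugin.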
For documentation contribution guidelines, see Contributing to the Apache Pinot documentation.
TODO: Deprecated
Before contributing changes to Pinot, review the contents of this section.
Pinot depends on a number of external projects, the most notable ones are:
Apache Zookeeper
Apache Helix
Apache Kafka
Apache Thrift
Netty
Google Guava
Yammer
Helix is used for cluster management, and Pinot code is tightly integrated with Helix and Zookeeper interfaces.
Kafka is the default real-time stream provider, but it can be replaced with others. See the customizations section for more info.
Thrift is used for message exchange between broker and server components, with Netty providing the server functionality for processing messages in a non-blocking fashion.
Guava is used for a number of auxiliary components such as Caches and RateLimiters. Yammer metrics is used to register and expose metrics from Pinot components.
In addition, Pinot relies on several key external libraries for some of its core functionality:
Roaring Bitmaps: Pinot's inverted indices are built using the RoaringBitmap library.
t-Digest: Pinot's digest-based percentile calculations are based on the T-Digest library.
Pinot is a multi-module project, with each module providing specific functionality that helps us build services from a combination of modules. This helps keep clean interface contracts between modules and reduces the overall executable size of individually deployable components.
Each module has a src/main/java folder where the code resides and a src/test/java folder where the unit tests for the module's code reside.
The following figure provides a high-level overview of the foundational Pinot modules.
pinot-common provides classes common to Pinot components. Some key classes you will find here are:
config: Definitions for various elements of Pinot's table config.
metrics: Definitions for base metrics provided by Controller, Broker and Server.
metadata: Definitions of metadata stored in Zookeeper.
pql.parsers: Code to compile PQL strings into corresponding AbstractSyntaxTrees (AST).
request: Autogenerated Thrift classes representing various parts of PQL requests.
response: Definitions of the response format returned by the Broker.
filesystem: Provides abstractions for working with segments on local or remote filesystems. This module allows users to plug in filesystems specific to their use case. Extensions to the base PinotFS should ideally be housed in their own modules so as not to pull in unnecessary dependencies for all users.
pinot-transport provides classes required to handle scatter-gather on the Pinot Broker and Netty wrapper classes used by the Server to handle connections from the Broker.
pinot-core provides the core functionality of Pinot, specifically for handling segments, various index structures, query execution (filters, transformations, aggregations, etc.) and support for real-time segments.
pinot-server provides server-specific functionality, including server startup and the REST APIs exposed by the server.
pinot-controller houses all the controller-specific functionality, including many cluster administration APIs, segment upload (for both offline and real-time), segment assignment, retention strategies, etc.
pinot-broker provides broker functionality that includes wiring the broker startup sequence, building broker routing tables, and PQL request handling.
pinot-minion provides functionality for running auxiliary/periodic tasks on a Pinot cluster, such as purging records for compliance with regulations like GDPR.
pinot-hadoop provides classes for segment generation jobs using Hadoop infrastructure.
In addition to the core modules described above, Pinot code provides the following modules:
pinot-tools: This module is a collection of tools useful for setting up a Pinot cluster and creating/updating segments. It also houses the Pinot quick start guide code.
pinot-perf: This module has a collection of benchmark test code used to evaluate design options.
pinot-client-api: This module houses the Java client API. See Executing queries via Java Client API for more info.
pinot-integration-tests: This module holds integration tests that test functionality across multiple classes or components. These tests typically do not rely on mocking and provide more end-to-end coverage for code.
pinot-hadoop-filesystem and pinot-azure-filesystem are modules added to support extensions to the Pinot filesystem. The functionality is broken down into modules of their own to avoid polluting the common modules with additional large libraries. These libraries bring in transitive dependencies of their own that can cause classpath conflicts at runtime. We would like to avoid this for the common usage of Pinot as much as possible.