Query

Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.

This section covers:

  • Querying Pinot

  • Supported Transformations

  • Supported Aggregations

  • User-Defined Functions (UDFs)

  • Cardinality Estimation

  • Lookup UDF Join

  • Querying JSON data

Lookup UDF Join

The lookUp UDF retrieves dimension data from a dimension table via primary key, enabling decoration-join functionality. The lookUp UDF can only be used with a dimension table in Pinot. The UDF signature is as follows:

lookUp('dimTableName', 'dimColToLookUp', 'dimJoinKey1', factJoinKeyVal1, 'dimJoinKey2', factJoinKeyVal2 ... )
  • dimTableName: Name of the dimension table to perform the lookup on.

  • dimColToLookUp: The column of the dimension table to be retrieved, used to decorate the result.

  • dimJoinKey: The column to perform the lookup on, i.e. the join column of the dimension table.

  • factJoinKeyVal: The value of the dimension-table join column for which dimColToLookUp is retrieved, for the scope and invocation.

The return type of the UDF is the type of the dimColToLookUp column. There can also be multiple primary keys and corresponding values.

Note: If the dimension table uses a composite primary key (i.e. multiple primary keys), ensure that the order of keys appearing in the lookUp() UDF is the same as the order defined in "primaryKeyColumns" in the dimension table schema.
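As a sketch, a decoration join with the UDF above could look like the following; the table and column names (baseballStats, dimBaseballTeams, teamID, teamName) are hypothetical:

```sql
-- Decorate each fact row with the team name, looked up from the
-- dimension table dimBaseballTeams by its primary key teamID.
SELECT playerName,
       teamID,
       lookUp('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName
FROM baseballStats
```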

Querying Pinot

Learn how to query Pinot using SQL

Dialect

Pinot uses the Calcite SQL parser to parse queries and uses the MYSQL_ANSI dialect. You can see the grammar in the Calcite documentation.

Cardinality Estimation

Cardinality estimation is a classic problem. Pinot solves it in multiple ways, each of which has a trade-off between accuracy and latency.

Accurate Results

Functions:

DistinctCount(x) -> LONG

Returns the accurate count of all unique values in a column.

The underlying implementation uses an IntOpenHashSet from the library it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.

Approximation Results

It usually takes a lot of resources and time to compute exact results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.

HyperLogLog

HyperLogLog is an approximation algorithm for unique counting. It uses a fixed number of bits to estimate the cardinality of a given data set.

Pinot leverages the HyperLogLog class in the library com.clearspring.analytics:stream:2.7.0 as the data structure to hold intermediate results.

Functions:

  • DistinctCountHLL(x) -> LONG

For column types INT/LONG/FLOAT/DOUBLE/STRING, Pinot treats each value as an individual entry to add into a HyperLogLog object, then computes the approximation by calling cardinality().

For column type BYTES, Pinot treats each value as a serialized HyperLogLog object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog).

All deserialized HyperLogLog objects are merged into one, then cardinality() is called to get the approximate unique count.
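As a minimal sketch of the trade-off described above, the two functions can be run side by side; myTable and userID are hypothetical names:

```sql
-- Exact distinct count (more memory/latency) vs. HyperLogLog approximation.
SELECT DistinctCount(userID)    AS exactCount,
       DistinctCountHLL(userID) AS approxCount
FROM myTable
```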

Theta Sketches

The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating to perform distinct counting as well as evaluating set operations.

Functions:

  • DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> LONG

    • thetaSketchColumn (required): Name of the column to aggregate on.

    • thetaSketchParams (required): Parameters for constructing the intermediate theta sketches. Currently, the only supported parameter is nominalEntries.

    • predicates (optional): Individual predicates of the form lhs <op> rhs which are applied to rows selected by the WHERE clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfy these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates created by combining (AND/OR) individual predicates are supported.

    • postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT, where SET_DIFF requires two arguments and SET_UNION/SET_INTERSECT allow more than two arguments.

In the example query below, the WHERE clause is responsible for identifying the matching rows. Note that the WHERE clause can be completely independent of the postAggregationExpression. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA' and device='mobile' in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression and returns the final cardinality of the resulting sketch.

  • DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded Serialized Sketch Bytes

This is the same as the previous function, except it returns the byte-serialized sketch instead of its cardinality. Since Pinot returns responses as JSON strings, bytes are returned as hex-encoded strings. The hex-encoded string can be deserialized into a sketch by using the library org.apache.commons.codec.binary as Hex.decodeHex(stringValue.toCharArray()).

Limitations

Pinot does not support joins or nested subqueries; we recommend using Presto for queries that span multiple tables. Read Engineering Full SQL support for Pinot at Uber for more info.

There is no DDL support. Tables can be created via the REST API.

Identifier vs Literal

In Pinot SQL:

  • Double quotes (") are used to force string identifiers, e.g. column names.

  • Single quotes (') are used to enclose string literals.

Misusing them may cause unexpected query results, e.g.:

  • WHERE a='b' means the predicate on the column a equals the string literal value 'b'

  • WHERE a="b" means the predicate on the column a equals the value of the column b
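A quick sketch of the difference; myTable, firstName, and lastName are hypothetical columns:

```sql
-- Matches rows whose firstName is the literal string 'lastName'.
SELECT COUNT(*) FROM myTable WHERE firstName = 'lastName'

-- Matches rows whose firstName column equals their lastName column.
SELECT COUNT(*) FROM myTable WHERE firstName = "lastName"
```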

Example Queries

  • Use single quotes for literals and double quotes (optional) for identifiers (column names).

  • If you name columns with reserved keywords such as timestamp or date, or the column name includes special characters, you need to use double quotes when you refer to them in a query.

Example DistinctCountThetaSketch query (for the Theta Sketches section above):

    select distinctCountThetaSketch(
      sketchCol,
      'nominalEntries=1024',
      'country=''USA'' AND state=''CA''',
      'device=''mobile''',
      'SET_INTERSECT($1, $2)'
    )
    from table
    where country = 'USA' or device = 'mobile'

Simple selection

    SELECT *
    FROM myTable
    -- defaults to LIMIT 10

    SELECT *
    FROM myTable
    LIMIT 100

Aggregation

    SELECT COUNT(*), MAX(foo), SUM(bar)
    FROM myTable

Grouping on Aggregation

    SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz
    FROM myTable
    GROUP BY bar, baz
    LIMIT 50

Ordering on Aggregation

    SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz
    FROM myTable
    GROUP BY bar, baz
    ORDER BY bar, MAX(foo) DESC
    LIMIT 50

Filtering

For performant filtering of ids in a list, see Filtering with IdSet.

    SELECT COUNT(*)
    FROM myTable
      WHERE foo = 'foo'
      AND bar BETWEEN 1 AND 20
      OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Filtering with NULL predicate

    SELECT COUNT(*)
    FROM myTable
      WHERE foo IS NOT NULL
      AND foo = 'foo'
      AND bar BETWEEN 1 AND 20
      OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Selection (Projection)

    SELECT *
    FROM myTable
      WHERE quux < 5
      LIMIT 50

Ordering on Selection

    SELECT foo, bar
    FROM myTable
      WHERE baz > 20
      ORDER BY bar DESC
      LIMIT 100

Pagination on Selection

Note: results might not be consistent if the column ordered by has the same value in multiple rows.

    SELECT foo, bar
    FROM myTable
      WHERE baz > 20
      ORDER BY bar DESC
      LIMIT 50, 100

Wild-card match (in WHERE clause only)

To count rows where the column airlineName starts with U:

    SELECT COUNT(*)
    FROM myTable
      WHERE REGEXP_LIKE(airlineName, '^U.*')
      GROUP BY airlineName LIMIT 10

Case-When Statement

Pinot supports the CASE-WHEN-ELSE statement.

Example 1:

    SELECT
        CASE
          WHEN price > 30 THEN 3
          WHEN price > 20 THEN 2
          WHEN price > 10 THEN 1
          ELSE 0
        END AS price_category
    FROM myTable

Example 2:

    SELECT
      SUM(
        CASE
          WHEN price > 30 THEN 30
          WHEN price > 20 THEN 20
          WHEN price > 10 THEN 10
          ELSE 0
        END) AS total_cost
    FROM myTable

UDF

Functions have to be implemented within Pinot. Injecting functions is not yet supported. The example below demonstrates the use of UDFs. More examples in Transform Function in Aggregation Grouping.

    SELECT COUNT(*)
    FROM myTable
    GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')

BYTES column

Pinot supports queries on BYTES columns using HEX strings. The query response also uses hex strings to represent bytes values.

E.g. the query below fetches all the rows for a given UID:

    SELECT *
    FROM myTable
    WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'

Filtering with IdSet

Learn how to write fast queries for looking up ids in a list of values.

A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but this approach doesn't perform well with large lists of ids. In these cases, you can use an IdSet.
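For contrast, the naive approach described above looks like the following (the yearID values are illustrative); with thousands of ids, the IN list becomes expensive to ship and evaluate:

```sql
-- Naive large-IN-list filter that IdSet is designed to replace.
SELECT yearID, COUNT(*)
FROM baseballStats
WHERE yearID IN (1901, 1902, 1903)  -- imagine thousands of ids here
GROUP BY yearID
```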

Functions

ID_SET

ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03')

This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:

  • INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.

  • LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.

  • Other types - Bloom Filter

The following parameters are used to configure the Bloom Filter:

  • expectedInsertions - Number of expected insertions for the BloomFilter, must be positive

  • fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0

Note that when a Bloom Filter is used, the filter results are approximate: you can get false-positive results (for membership in the set), leading to potentially unexpected results.

IN_ID_SET

IN_ID_SET(columnName, base64EncodedIdSet)

This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.

IN_SUBQUERY

IN_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.

IN_PARTITIONED_SUBQUERY

IN_PARTITIONED_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.

This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller, as it will only contain the ids for the partitions served by that server, which gives better performance.

Examples

Create IdSet

You can create an IdSet of the values in the yearID column by running the ID_SET(yearID) query shown below.

When creating an IdSet for values in non-INT/LONG columns, we can configure the expectedInsertions, e.g. ID_SET(playerName, 'expectedInsertions=10').

We can also configure the fpp parameter, e.g. ID_SET(playerName, 'expectedInsertions=100;fpp=0.01').

Filter by values in IdSet

We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the corresponding IN_ID_SET(...) = 1 query below.

Filter by values not in IdSet

To return rows for yearIDs not in the IdSet, run the corresponding IN_ID_SET(...) = 0 query below.

Filter on broker

To filter rows for yearIDs in the IdSet on a Pinot broker, run the IN_SUBQUERY query below with = 1; to filter rows for yearIDs not in the IdSet, use = 0.

Filter on server

To filter rows for yearIDs in the IdSet on a Pinot server, run the IN_PARTITIONED_SUBQUERY query below with = 1; to filter rows for yearIDs not in the IdSet, use = 0.
Example base64-encoded IdSets returned by the ID_SET queries above (the first for yearID, the others for playerName with different Bloom Filter parameters):

    ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc=

    AwIBBQAAAAL/////////////////////

    AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8=

    AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8=

User-Defined Functions (UDFs)

Pinot currently supports two ways for you to implement your own functions:

  • Groovy Scripts

  • Scalar Functions

Groovy Scripts

Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing a Groovy script within a query is as follows:

GROOVY('result value metadata json', 'groovy script', arg0, arg1, arg2...)

This function will execute the Groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:

  • Result value metadata json - a JSON string representing the result value metadata. Must contain non-null keys resultType and isSingleValue.

  • Groovy script to execute - the Groovy script string, which uses arg0, arg1, arg2, etc. to refer to the arguments provided within the script.

  • arguments - Pinot columns/other transform functions that are arguments to the Groovy script.

Examples

  • Add colA and colB and return a single-value INT

    groovy('{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)

  • Find the max element in mvColumn array and return a single-value INT

    groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)

  • Find all elements of the array mvColumn and return as a multi-value LONG column

    groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)

  • Multiply length of array mvColumn with colB and return a single-value DOUBLE

    groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)

  • Find all indexes in mvColumnA which have value foo, add values at those indexes in mvColumnB

    groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x', mvColumnA, mvColumnB)

  • Switch case which returns a FLOAT value depending on length of mvCol array

    groovy('{"returnType":"FLOAT", "isSingleValue":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)

  • Any Groovy script which takes no arguments

    groovy('{"returnType":"STRING","isSingleValue":true}', 'new Date().format( "yyyyMMdd" )')

Scalar Functions

Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions.

Pinot automatically identifies and registers all the functions that have the @ScalarFunction annotation.

Only Java methods are supported.

Adding user defined scalar functions

You can add new scalar functions as follows:

  • Create a new Java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX

  • In your Java project, include the dependency (shown below for Maven and Gradle).

  • Annotate your methods with the @ScalarFunction annotation. Make sure the method is static and returns only a single value as output. The input and output can have one of the following types:

    • Integer

    • Long

    • Double

    • String

  • Place the compiled JAR in the /plugins directory in Pinot. You will need to restart all Pinot instances if they are already running.

  • Now, you can use the function in a query (see the example query below).

Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.
Example queries for the Filtering with IdSet section above.

Create IdSet:

    SELECT ID_SET(yearID)
    FROM baseballStats
    WHERE teamID = 'WS1'

    SELECT ID_SET(playerName, 'expectedInsertions=10')
    FROM baseballStats
    WHERE teamID = 'WS1'

    SELECT ID_SET(playerName, 'expectedInsertions=100')
    FROM baseballStats
    WHERE teamID = 'WS1'

    SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
    FROM baseballStats
    WHERE teamID = 'WS1'

Filter by values in the IdSet (= 1) and not in the IdSet (= 0):

    SELECT yearID, count(*)
    FROM baseballStats
    WHERE IN_ID_SET(
      yearID,
      'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
      ) = 1
    GROUP BY yearID

    SELECT yearID, count(*)
    FROM baseballStats
    WHERE IN_ID_SET(
      yearID,
      'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
      ) = 0
    GROUP BY yearID

Filter on broker:

    SELECT yearID, count(*)
    FROM baseballStats
    WHERE IN_SUBQUERY(
      yearID,
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 1
    GROUP BY yearID

    SELECT yearID, count(*)
    FROM baseballStats
    WHERE IN_SUBQUERY(
      yearID,
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 0
    GROUP BY yearID

Filter on server:

    SELECT yearID, count(*)
    FROM baseballStats
    WHERE IN_PARTITIONED_SUBQUERY(
      yearID,
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 1
    GROUP BY yearID

    SELECT yearID, count(*)
    FROM baseballStats
    WHERE IN_PARTITIONED_SUBQUERY(
      yearID,
      'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
      ) = 0
    GROUP BY yearID

Dependency for user-defined scalar functions (Maven, then Gradle):

    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-common</artifactId>
      <version>0.5.0</version>
    </dependency>

    include 'org.apache.pinot:pinot-common:0.5.0'

Example scalar function and query:

    //Example Scalar function
    @ScalarFunction
    static String mySubStr(String input, Integer beginIndex) {
      return input.substring(beginIndex);
    }

    SELECT mysubstr(playerName, 4)
    FROM baseballStats

Querying JSON data

To see how JSON data can be queried, assume that we have the following table:

    Table myTable:
      id            INTEGER
      jsoncolumn    JSON

    Table data:
    101,{"name":{"first":"daffy"\,"last":"duck"}\,"score":101\,"data":["a"\,"b"\,"c"\,"d"]}
    102,{"name":{"first":"donald"\,"last":"duck"}\,"score":102\,"data":["a"\,"b"\,"e"\,"f"]}
    103,{"name":{"first":"mickey"\,"last":"mouse"}\,"score":103\,"data":["a"\,"b"\,"g"\,"h"]}
    104,{"name":{"first":"minnie"\,"last":"mouse"}\,"score":104\,"data":["a"\,"b"\,"i"\,"j"]}
    105,{"name":{"first":"goofy"\,"last":"dwag"}\,"score":104\,"data":["a"\,"b"\,"i"\,"j"]}
    106,{"person":{"name":"daffy duck"\,"companies":[{"name":"n1"\,"title":"t1"}\,{"name":"n2"\,"title":"t2"}]}}
    107,{"person":{"name":"scrooge mcduck"\,"companies":[{"name":"n1"\,"title":"t1"}\,{"name":"n2"\,"title":"t2"}]}}

We also assume that "jsoncolumn" has a JSON index on it. Note that the last two rows in the table have a different structure than the rest of the rows. In keeping with the JSON specification, a JSON column can contain any valid JSON document and doesn't need to adhere to a predefined schema. To pull out the entire JSON document for each row, we can run the query below:

    SELECT id, jsoncolumn
    FROM myTable

    id    jsoncolumn
    101   {"name":{"first":"daffy","last":"duck"},"score":101,"data":["a","b","c","d"]}
    102   {"name":{"first":"donald","last":"duck"},"score":102,"data":["a","b","e","f"]}
    103   {"name":{"first":"mickey","last":"mouse"},"score":103,"data":["a","b","g","h"]}
    104   {"name":{"first":"minnie","last":"mouse"},"score":104,"data":["a","b","i","j"]}
    105   {"name":{"first":"goofy","last":"dwag"},"score":104,"data":["a","b","i","j"]}
    106   {"person":{"name":"daffy duck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}
    107   {"person":{"name":"scrooge mcduck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}

To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name:

    SELECT id, jsoncolumn.name.last, jsoncolumn.name.first, jsoncolumn.data[1]
    FROM myTable

    id    jsoncolumn.name.last    jsoncolumn.name.first    jsoncolumn.data[1]
    101   duck                    daffy                    b
    102   duck                    donald                   b
    103   mouse                   mickey                   b
    104   mouse                   minnie                   b
    105   dwag                    goofy                    b
    106   null                    null                     null
    107   null                    null                     null

Note that the last column (jsoncolumn.data[1]) is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath jsoncolumn.data[1]. We can filter out these rows:

    SELECT id, jsoncolumn.name.last, jsoncolumn.name.first, jsoncolumn.data[1]
    FROM myTable
    WHERE jsoncolumn.data[1] IS NOT NULL

    id    jsoncolumn.name.last    jsoncolumn.name.first    jsoncolumn.data[1]
    101   duck                    daffy                    b
    102   duck                    donald                   b
    103   mouse                   mickey                   b
    104   mouse                   minnie                   b
    105   dwag                    goofy                    b

Notice that certain last names (duck and mouse, for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression:

    SELECT jsoncolumn.name.last, count(*)
    FROM myTable
    WHERE jsoncolumn.data[1] IS NOT NULL
    GROUP BY jsoncolumn.name.last
    ORDER BY 2 DESC

    jsoncolumn.name.last    count(*)
    mouse                   2
    duck                    2
    dwag                    1

Also, there is numerical information (jsoncolumn.score) embedded within the JSON document. We can extract those numerical values from the JSON data into SQL and sum them up using the query below:

    SELECT jsoncolumn.name.last, sum(jsoncolumn.score)
    FROM myTable
    WHERE jsoncolumn.name.last IS NOT NULL
    GROUP BY jsoncolumn.name.last

    jsoncolumn.name.last    sum(jsoncolumn.score)
    mouse                   207
    duck                    203
    dwag                    104

In short, JSON querying support in Pinot allows you to use a JsonPath expression wherever you can use a column name, with the only difference being that to query a column with data type JSON, you must append a JsonPath expression after the name of the column.

Supported Aggregations

Pinot provides support for aggregations using GROUP BY. You can use the following functions to get the aggregated value.

COUNT: Get the count of rows in a group. Example: COUNT(*)

MIN: Get the minimum value in a group. Example: MIN(playerScore)

MAX: Get the maximum value in a group. Example: MAX(playerScore)

SUM: Get the sum of values in a group. Example: SUM(playerScore)

AVG: Get the average of the values in a group. Example: AVG(playerScore)

MODE: Get the most frequent value in a group. When multiple modes are present, it gives the minimum of all the modes. This behavior can be overridden to get the maximum or the average mode. Examples: MODE(playerScore), MODE(playerScore, 'MIN'), MODE(playerScore, 'MAX'), MODE(playerScore, 'AVG')

MINMAXRANGE: Returns the max - min value in a group. Example: MINMAXRANGE(playerScore)

PERCENTILE(column, N): Returns the Nth percentile of the group, where N is a decimal number between 0 and 100 inclusive. Examples: PERCENTILE(playerScore, 50), PERCENTILE(playerScore, 99.9)

PERCENTILEEST(column, N): Returns the Nth percentile of the group using the Quantile Digest algorithm. Examples: PERCENTILEEST(playerScore, 50), PERCENTILEEST(playerScore, 99.9)

PERCENTILETDIGEST(column, N): Returns the Nth percentile of the group using the T-digest algorithm. Examples: PERCENTILETDIGEST(playerScore, 50), PERCENTILETDIGEST(playerScore, 99.9)

DISTINCT: Returns the distinct row values in a group. Example: DISTINCT(playerName)

DISTINCTCOUNT: Returns the count of distinct row values in a group. Example: DISTINCTCOUNT(playerName)

DISTINCTCOUNTBITMAP: Returns the count of distinct row values in a group. This function is accurate for an INT column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. Example: DISTINCTCOUNTBITMAP(playerName)

DISTINCTCOUNTHLL: Returns an approximate distinct count using HyperLogLog. It also takes an optional second argument to configure the log2m for the HyperLogLog. Example: DISTINCTCOUNTHLL(playerName, 12)

DISTINCTCOUNTRAWHLL: Returns the HLL response serialized as a string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching. Example: DISTINCTCOUNTRAWHLL(playerName)

FASTHLL (deprecated): FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format. Example: FASTHLL(playerName)

DISTINCTCOUNTTHETASKETCH: See the Cardinality Estimation section.

DISTINCTCOUNTRAWTHETASKETCH: See the Cardinality Estimation section.

SEGMENTPARTITIONEDDISTINCTCOUNT: Returns the count of distinct values of a column when the column is pre-partitioned for each segment, so that there is no common value across different segments. This function calculates the exact count of distinct values within each segment, then simply sums up the results from different segments to get the final result. Example: SEGMENTPARTITIONEDDISTINCTCOUNT(playerName)

Multi-value column functions

The following aggregation functions can be used for multi-value columns:

COUNTMV: Get the count of rows in a group. Example: COUNTMV(playerName)

MINMV: Get the minimum value in a group. Example: MINMV(playerScores)

MAXMV: Get the maximum value in a group. Example: MAXMV(playerScores)

SUMMV: Get the sum of values in a group. Example: SUMMV(playerScores)

AVGMV: Get the avg of values in a group. Example: AVGMV(playerScores)

MINMAXRANGEMV: Returns the max - min value in a group. Example: MINMAXRANGEMV(playerScores)

PERCENTILEMV(column, N): Returns the Nth percentile of the group, where N is a decimal number between 0 and 100 inclusive. Examples: PERCENTILEMV(playerScores, 50), PERCENTILEMV(playerScores, 99.9)

PERCENTILEESTMV(column, N): Returns the Nth percentile of the group using the Quantile Digest algorithm. Examples: PERCENTILEESTMV(playerScores, 50), PERCENTILEESTMV(playerScores, 99.9)

PERCENTILETDIGESTMV(column, N): Returns the Nth percentile of the group using the T-digest algorithm. Examples: PERCENTILETDIGESTMV(playerScores, 50), PERCENTILETDIGESTMV(playerScores, 99.9)

DISTINCTCOUNTMV: Returns the count of distinct row values in a group. Example: DISTINCTCOUNTMV(playerNames)

DISTINCTCOUNTBITMAPMV: Returns the count of distinct row values in a group. This function is accurate for an INT or dictionary-encoded column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. Example: DISTINCTCOUNTBITMAPMV(playerNames)

DISTINCTCOUNTHLLMV: Returns an approximate distinct count using HyperLogLog in a group. Example: DISTINCTCOUNTHLLMV(playerNames)

DISTINCTCOUNTRAWHLLMV: Returns the HLL response serialized as a string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching. Example: DISTINCTCOUNTRAWHLLMV(playerNames)

FASTHLLMV (deprecated): Stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLLMV, which supports serialized HyperLogLog in BYTES (byte array) format. Example: FASTHLLMV(playerNames)
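A quick usage sketch combining a few of the aggregations above in one GROUP BY query; the table and column names (baseballStats, teamID, playerScore, playerName) are illustrative:

```sql
SELECT teamID,
       MAX(playerScore),
       AVG(playerScore),
       DISTINCTCOUNTHLL(playerName)
FROM baseballStats
GROUP BY teamID
LIMIT 10
```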

Supported Transformations

This document contains the list of all the transformation functions supported by Pinot SQL.

Math Functions

ADD(col1, col2, col3...): Sum of at least two values.

String Functions

Multiple string functions are supported out of the box from release-0.5.0; see the string function reference below.

DateTime Functions

Date time functions allow you to perform transformations on columns that contain timestamps or dates; see the DateTime function reference below.

JSON Functions

Usage

Note: 'jsonPath' and 'results_type' are literals. Pinot uses single quotes to distinguish them from identifiers.

Note: Transform functions can only be used in Pinot SQL. Scalar functions can be used for column transformation in table ingestion configs.

Examples

The examples below are based on three sample profile JSON documents.

Query 1: Extract string values from the field 'name'.

Query 2: Extract integer values from the field 'age'.

Query 3: Extract Bob's age from the JSON profile.

Query 4: Extract all field keys of the JSON profile.
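As a sketch of the extraction queries above, Pinot's JSONEXTRACTSCALAR and JSONEXTRACTKEY transform functions can be used like this; the table name profiles and its column json are hypothetical:

```sql
-- Query 1-style extraction: pull the string field 'name' out of each document.
SELECT JSONEXTRACTSCALAR(json, '$.name', 'STRING')
FROM profiles

-- Query 4-style extraction: list all field keys of each document.
SELECT JSONEXTRACTKEY(json, '$.*')
FROM profiles
```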

Binary Functions

Multi-value Column Functions

All of the functions mentioned till now only support single-value columns. You can use the following functions to perform operations on multi-value columns.

Advanced Queries

Geospatial Queries

Pinot supports geospatial queries on columns containing text-based geographies. For more details on these queries and how to enable them, see the Geospatial documentation.

Text Queries

Pinot supports pattern matching on text-based columns. Only columns marked as text columns in the table config can be queried using this method. For more details on how to enable pattern matching, see the Text Search documentation.

SUBSTR(col, startIndex, endIndex)

get substring of the input string from startIndex to endIndex. Index begins at 0. Set endIndex to -1 to calculate till the end of the string

SUBSTR(playerName, 1, -1)

SUBSTR(playerName, 1, 4)

CONCAT(col1, col2, separator)

Concatenate the two input strings using the separator

    CONCAT(firstName, lastName, '-')

    TRIM(col)

    trim spaces from both side of the string

    TRIM(playerName)

    LTRIM(col)

    trim spaces from left side of the string

    LTRIM(playerName)

    RTRIM(col)

    trim spaces from right side of the string

    RTRIM(playerName)

    LENGTH(col)

    calculate length of the string

    LENGTH(playerName)

    STRPOS(col, find, N)

find the Nth instance of the find string in the input string. Returns 0 if the input string is empty, and -1 if the Nth instance is not found or the input string is null.

    STRPOS(playerName, 'david', 1)

    STARTSWITH(col, prefix)

returns true if the column starts with the prefix string.

    STARTSWITH(playerName, 'david')

    REPLACE(col, find, substitute)

replace all instances of find with substitute in the input string

    REPLACE(playerName, 'david', 'henry')

    RPAD(col, size, pad)

    string padded from the right side with pad to reach final size

    RPAD(playerName, 20, 'foo')

    LPAD(col, size, pad)

    string padded from the left side with pad to reach final size

    LPAD(playerName, 20, 'foo')

    CODEPOINT(col)

    the Unicode codepoint of the first character of the string

    CODEPOINT(playerName)

    CHR(codepoint)

    the character corresponding to the Unicode codepoint

    CHR(68)
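The STRPOS semantics described above (find the Nth occurrence, return -1 when it does not exist) can be sketched in plain Python. This is an illustration of the documented behavior, not Pinot's implementation; the function name and sample strings are made up for the example.

```python
# Sketch of STRPOS(col, find, N): 0-based index of the Nth occurrence
# of `find` in `s`, or -1 if there are fewer than N occurrences.
def strpos(s: str, find: str, n: int) -> int:
    idx = -1
    for _ in range(n):
        idx = s.find(find, idx + 1)  # search after the previous hit
        if idx == -1:
            return -1
    return idx

print(strpos("david beckham, david silva", "david", 2))  # 15
print(strpos("david beckham, david silva", "zidane", 1))  # -1
```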

ToEpoch<TIME_UNIT>(timeInMillis)

Convert epoch milliseconds to epoch <Time Unit>. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS

ToEpochSeconds(tsInMillis): Converts column tsInMillis value from epoch milliseconds to epoch seconds.

ToEpochDays(tsInMillis): Converts column tsInMillis value from epoch milliseconds to epoch days.

    ToEpoch<TIME_UNIT>Rounded(timeInMillis, bucketSize)

    Convert epoch milliseconds to epoch <Time Unit>, round to nearest rounding bucket(Bucket size is defined in <Time Unit>). Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS

ToEpochSecondsRounded(tsInMillis, 10): Converts column tsInMillis value from epoch milliseconds to epoch seconds and rounds down to the 10-second bucket value. E.g. ToEpochSecondsRounded(1613472303000, 10) = 1613472300

ToEpochMinutesRounded(tsInMillis, 1440): Converts column tsInMillis value from epoch milliseconds to epoch minutes and rounds down to the 1-day bucket value. E.g. ToEpochMinutesRounded(1613472303000, 1440) = 26890560

    ToEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize)

    Convert epoch milliseconds to epoch <Time Unit>, and divided by bucket size(Bucket size is defined in <Time Unit>). Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS

ToEpochSecondsBucket(tsInMillis, 10): Converts column tsInMillis value from epoch milliseconds to epoch seconds, then divides by 10 to get the number of 10-second intervals since epoch. E.g.

ToEpochSecondsBucket(1613472303000, 10) = 161347230

ToEpochHoursBucket(tsInMillis, 24): Converts column tsInMillis value from epoch milliseconds to epoch hours, then divides by 24 to get the number of 24-hour intervals since epoch.

    FromEpoch<TIME_UNIT>(timeIn<Time_UNIT>)

    Convert epoch <Time Unit> to epoch milliseconds. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS

FromEpochSeconds(tsInSeconds): Converts column tsInSeconds value from epoch seconds to epoch milliseconds. E.g.

    FromEpochSeconds(1613472303) = 1613472303000

    FromEpoch<TIME_UNIT>Bucket(timeIn<Time_UNIT>, bucketSizeIn<Time_UNIT>)

    Convert epoch <Bucket Size><Time Unit> to epoch milliseconds. E.g. 10 seconds since epoch or 5 minutes since Epoch. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS

FromEpochSecondsBucket(tsInSeconds, 10): Converts column tsInSeconds value from 10-second intervals since epoch to epoch milliseconds. E.g.

FromEpochSecondsBucket(161347231, 10) = 1613472310000

    ToDateTime(timeInMillis, pattern[, timezoneId])

    Convert epoch millis value to DateTime string represented by pattern. Time zone will be set to UTC if timezoneId is not specified.

    ToDateTime(tsInMillis, 'yyyy-MM-dd') converts tsInMillis value to date time pattern yyyy-MM-dd

    ToDateTime(tsInMillis, 'yyyy-MM-dd ZZZ', 'America/Los_Angeles') converts tsInMillis value to date time pattern yyyy-MM-dd ZZZ in America/Los_Angeles time zone

    FromDateTime(dateTimeString, pattern)

    Convert DateTime string represented by pattern to epoch millis.

    FromDateTime(dateTime, 'yyyy-MM-dd')converts dateTime string value to millis epoch value

    round(timeValue, bucketSize)

    Round the given time value to nearest bucket start value.

round(tsInSeconds, 60) rounds an epoch-seconds value to the start value of the 60-second bucket it belongs to. E.g. round(161347231, 60) = 161347200

    now()

    Return current time as epoch millis

Typically used in a predicate to filter on timestamp for recent data, e.g. to filter data from the most recent day (86400 seconds): WHERE tsInMillis > now() - 86400000

    timezoneHour(timeZoneId)

    Returns the hour of the time zone offset.

    timezoneMinute(timeZoneId)

    Returns the minute of the time zone offset.

    year(tsInMillis)

    Returns the year from the given epoch millis in UTC timezone.

    year(tsInMillis, timeZoneId)

    Returns the year from the given epoch millis and timezone id.

    yearOfWeek(tsInMillis)

Returns the year of the ISO week from the given epoch millis in UTC timezone. Alias yow is also supported.

    yearOfWeek(tsInMillis, timeZoneId)

Returns the year of the ISO week from the given epoch millis and timezone id. Alias yow is also supported.

    quarter(tsInMillis)

    Returns the quarter of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 4.

    quarter(tsInMillis, timeZoneId)

    Returns the quarter of the year from the given epoch millis and timezone id. The value ranges from 1 to 4.

    month(tsInMillis)

    Returns the month of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 12.

    month(tsInMillis, timeZoneId)

    Returns the month of the year from the given epoch millis and timezone id. The value ranges from 1 to 12.

    week(tsInMillis)

    Returns the ISO week of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 53. Alias weekOfYear is also supported.

    week(tsInMillis, timeZoneId)

    Returns the ISO week of the year from the given epoch millis and timezone id. The value ranges from 1 to 53. Alias weekOfYear is also supported.

    dayOfYear(tsInMillis)

    Returns the day of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 366. Alias doy is also supported.

    dayOfYear(tsInMillis, timeZoneId)

    Returns the day of the year from the given epoch millis and timezone id. The value ranges from 1 to 366. Alias doy is also supported.

    day(tsInMillis)

    Returns the day of the month from the given epoch millis in UTC timezone. The value ranges from 1 to 31. Alias dayOfMonth is also supported.

    day(tsInMillis, timeZoneId)

    Returns the day of the month from the given epoch millis and timezone id. The value ranges from 1 to 31. Alias dayOfMonth is also supported.

    dayOfWeek(tsInMillis)

    Returns the day of the week from the given epoch millis in UTC timezone. The value ranges from 1(Monday) to 7(Sunday). Alias dow is also supported.

    dayOfWeek(tsInMillis, timeZoneId)

    Returns the day of the week from the given epoch millis and timezone id. The value ranges from 1(Monday) to 7(Sunday). Alias dow is also supported.

    hour(tsInMillis)

    Returns the hour of the day from the given epoch millis in UTC timezone. The value ranges from 0 to 23.

    hour(tsInMillis, timeZoneId)

    Returns the hour of the day from the given epoch millis and timezone id. The value ranges from 0 to 23.

    minute(tsInMillis)

    Returns the minute of the hour from the given epoch millis in UTC timezone. The value ranges from 0 to 59.

    minute(tsInMillis, timeZoneId)

    Returns the minute of the hour from the given epoch millis and timezone id. The value ranges from 0 to 59.

    second(tsInMillis)

    Returns the second of the minute from the given epoch millis in UTC timezone. The value ranges from 0 to 59.

    second(tsInMillis, timeZoneId)

    Returns the second of the minute from the given epoch millis and timezone id. The value ranges from 0 to 59.

    millisecond(tsInMillis)

    Returns the millisecond of the second from the given epoch millis in UTC timezone. The value ranges from 0 to 999.

    millisecond(tsInMillis, timeZoneId)

    Returns the millisecond of the second from the given epoch millis and timezone id. The value ranges from 0 to 999.
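A minimal Python sketch of the epoch conversion and bucketing semantics described above, assuming floor (integer) arithmetic throughout. The helper names mirror the Pinot functions but are illustrative only, not Pinot source code.

```python
# Milliseconds per supported <Time Unit>.
MILLIS = {"SECONDS": 1_000, "MINUTES": 60_000, "HOURS": 3_600_000, "DAYS": 86_400_000}

def to_epoch(ts_millis, unit):                  # ToEpoch<TIME_UNIT>
    return ts_millis // MILLIS[unit]

def to_epoch_rounded(ts_millis, unit, bucket):  # ToEpoch<TIME_UNIT>Rounded
    v = to_epoch(ts_millis, unit)
    return v - v % bucket                       # round down to bucket start

def to_epoch_bucket(ts_millis, unit, bucket):   # ToEpoch<TIME_UNIT>Bucket
    return to_epoch(ts_millis, unit) // bucket  # number of buckets since epoch

def from_epoch_bucket(v, unit, bucket):         # FromEpoch<TIME_UNIT>Bucket
    return v * bucket * MILLIS[unit]

print(to_epoch_rounded(1613472303000, "SECONDS", 10))    # 1613472300
print(to_epoch_rounded(1613472303000, "MINUTES", 1440))  # 26890560
print(to_epoch_bucket(1613472303000, "SECONDS", 10))     # 161347230
print(from_epoch_bucket(161347231, "SECONDS", 10))       # 1613472310000
```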


    JSONPATH(jsonField, 'jsonPath')

    Scalar

Extracts the object value from jsonField based on 'jsonPath'; the result type is inferred from the JSON value. Cannot be used in queries because the data type is not specified.

    JSONPATHLONG(jsonField, 'jsonPath', [defaultValue])

    Scalar

Extracts the Long value from jsonField based on 'jsonPath'; uses the optional defaultValue for null or parsing errors.

    JSONPATHDOUBLE(jsonField, 'jsonPath', [defaultValue])

    Scalar

Extracts the Double value from jsonField based on 'jsonPath'; uses the optional defaultValue for null or parsing errors.

    JSONPATHSTRING(jsonField, 'jsonPath', [defaultValue])

    Scalar

Extracts the String value from jsonField based on 'jsonPath'; uses the optional defaultValue for null or parsing errors.

    JSONPATHARRAY(jsonField, 'jsonPath')

    Scalar

Extracts an array from jsonField based on 'jsonPath'; the result type is inferred from the JSON value. Cannot be used in queries because the data type is not specified.

    JSONPATHARRAYDEFAULTEMPTY(jsonField, 'jsonPath')

    Scalar

Extracts an array from jsonField based on 'jsonPath'; the result type is inferred from the JSON value. Returns an empty array for null or parsing errors. Cannot be used in queries because the data type is not specified.

  • JSONEXTRACTSCALAR(profile_json_str, '$.name', 'STRING') is valid.

  • JSONEXTRACTSCALAR(profile_json_str, "$.name", "STRING") is invalid.


MD5(bytesCol)

Return MD5 digest of binary column(bytes type) as hex string

MD5(rawData)

    Sum of at least two values

    ADD(score_maths, score_science, score_history)

    SUB(col1, col2)

    Difference between two values

    SUB(total_score, score_science)

    MULT(col1, col2, col3...)

    Product of at least two values

MULT(score_maths, score_science, score_history)

    DIV(col1, col2)

    Quotient of two values

DIV(total_score, total_subjects)

    MOD(col1, col2)

    Modulo of two values

    MOD(total_score, total_subjects)

    ABS(col1)

    Absolute of a value

    ABS(score)

    CEIL(col1)

    Rounded up to the nearest integer.

    CEIL(percentage)

    FLOOR(col1)

    Rounded down to the nearest integer.

    FLOOR(percentage)

    EXP(col1)

    Euler’s number(e) raised to the power of col.

    EXP(age)

    LN(col1)

    Natural log of value i.e. ln(col1)

    LN(age)

    SQRT(col1)

    Square root of a value

    SQRT(height)

    UPPER(col)

    convert string to upper case

    UPPER(playerName)

    LOWER(col)

    convert string to lower case

    LOWER(playerName)

    REVERSE(col)

    reverse the string

    REVERSE(playerName)

    TIMECONVERT

    (col, fromUnit, toUnit)

Converts the value into another time unit. The column should be an epoch timestamp. Supported units are DAYS HOURS MINUTES SECONDS MILLISECONDS MICROSECONDS NANOSECONDS

TIMECONVERT(time, 'MILLISECONDS', 'SECONDS') This expression converts the value of column time (taken to be in milliseconds) to seconds, i.e. the nearest second value that is not greater than the value of the time column.

    DATETIMECONVERT

    (columnName, inputFormat, outputFormat, outputGranularity)

Takes 4 arguments: converts the value into another date-time format, and buckets time based on the given time granularity. Note that for weeks/months/quarters/years, use the DATETRUNC function instead.

    The format is expressed as <time size>:<time unit>:<time format>:<pattern> where,

    time size - size of the time unit eg: 1, 10

    time unit - DAYS HOURS MINUTES SECONDS MILLISECONDS MICROSECONDS NANOSECONDS

    time format - EPOCH or SIMPLE_DATE_FORMAT

pattern - this is defined in case of SIMPLE_DATE_FORMAT, e.g. yyyy-MM-dd. A specific timezone can be passed using tz(timezone). The timezone can be given in long or short string format, e.g. Asia/Kolkata or PDT

granularity - specified in the format <time size>:<time unit>

    • Date from hoursSinceEpoch to daysSinceEpoch and bucket it to 1 day granularity DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:EPOCH', '1:DAYS')

    • Date to 15 minutes granularity DATETIMECONVERT(Date, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '15:MINUTES')

    • Date from hoursSinceEpoch to format yyyyMdd and bucket it to 1 days granularity DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd', '1:DAYS')

    • Date from milliseconds to format yyyyMdd in timezone PST DATETIMECONVERT(Date, '1:MILLISECONDS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd tz(America/Los_Angeles)', '1:DAYS')
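The output-granularity bucketing that DATETIMECONVERT applies for EPOCH formats can be sketched as follows, assuming the bucket start is the floor of the value to the granularity. This is an illustration of the described behavior, not Pinot code.

```python
# Bucket an epoch-millis value to <size>:<unit> granularity,
# e.g. size=15, unit_millis=60_000 for '15:MINUTES'.
def bucket_millis(ts_millis: int, size: int, unit_millis: int) -> int:
    granularity = size * unit_millis
    return ts_millis - ts_millis % granularity  # start of the bucket

# '1:MILLISECONDS:EPOCH' input bucketed to '15:MINUTES' granularity:
print(bucket_millis(1613472303000, 15, 60_000))  # 1613472300000
```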

    DATETRUNC

(Presto) SQL compatible date truncation, equivalent to the Presto function date_trunc.

Takes at least 3 and up to 5 arguments; converts the value into seconds since UTC epoch at a specified output granularity, bucketed on a unit in a specified timezone.

DATETRUNC('week', time_in_seconds, 'SECONDS') This expression converts the column time_in_seconds, a long containing seconds since UTC epoch, to the value truncated at WEEK (a week starts at Monday UTC midnight). The output is a long containing seconds since UTC epoch.

DATETRUNC('quarter', DIV(time_milliseconds, 1000), 'SECONDS', 'America/Los_Angeles', 'HOURS') This expression converts the expression time_milliseconds/1000 into hours that are truncated to QUARTER in the Los Angeles time zone (where a quarter begins on 1/1, 4/1, 7/1, 10/1 in Los Angeles time). The output is expressed as hours since UTC epoch (note that the output is not in the Los Angeles timezone).
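The week truncation above can be sketched with Python's standard datetime module: take an epoch-seconds value and return the epoch-seconds value of Monday 00:00 UTC of its week. This is an illustration of the documented semantics, not the Pinot/Presto implementation.

```python
from datetime import datetime, timedelta, timezone

# Sketch of DATETRUNC('week', ts, 'SECONDS'): truncate an epoch-seconds
# value to the Monday UTC midnight of the week it falls in.
def datetrunc_week_seconds(ts_seconds: int) -> int:
    dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    monday = (dt - timedelta(days=dt.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int(monday.timestamp())

# 2021-02-16 10:45:03 UTC (a Tuesday) truncates to 2021-02-15 00:00 UTC:
print(datetrunc_week_seconds(1613472303))  # 1613347200
```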

    Function

    Type

    Description

    JSONEXTRACTSCALAR

    (jsonField, 'jsonPath', 'resultsType', [defaultValue])

    Transform

Evaluates the 'jsonPath' on jsonField and returns the result as type 'resultsType'; uses the optional defaultValue for null or parsing errors.

    JSONEXTRACTKEY

    (jsonField, 'jsonPath')

    Transform

Extracts all matched JSON field keys based on 'jsonPath' into a STRING_ARRAY.

    TOJSONMAPSTR(map)

    Scalar

    Convert map to JSON String

JSONFORMAT(object)

Scalar

Convert object to JSON String

    Arguments

    Description

    jsonField

    An Identifier/Expression contains JSON documents.

    'jsonPath'

Follows JsonPath Syntax to read values from JSON documents.

    'results_type'

One of the Pinot supported data types: INT, LONG, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, STRING,

    INT_ARRAY, LONG_ARRAY, FLOAT_ARRAY, DOUBLE_ARRAY, STRING_ARRAY.

    JSONPATH(myJsonRecord, '$.name')

    "Pete"

    JSONPATH(myJsonRecord, '$.age')

    24

    JSONPATHSTRING(myJsonRecord, '$.age')

    "24"

    JSONPATHARRAY(myJsonRecord, '$.subjects[*].name')

    ["maths", "english"]

JSONPATHARRAY(myJsonRecord, '$.subjects[*].score')

[90, 70]

JSONPATHARRAY(myJsonRecord, '$.subjects[*].homework_grades[1]')

[85, 65]

    SHA(bytesCol)

    Return SHA-1 digest of binary column(bytes type) as hex string

    SHA(rawData)

    SHA256(bytesCol)

    Return SHA-256 digest of binary column(bytes type) as hex string

    SHA256(rawData)

    SHA512(bytesCol)

    Return SHA-512 digest of binary column(bytes type) as hex string

    SHA512(rawData)
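The digest functions above map a bytes column to a hex string. The same transformation can be reproduced with Python's standard hashlib module; rawData here is an arbitrary sample byte string, for illustration only.

```python
import hashlib

raw = b"hello pinot"  # stands in for the rawData bytes column

print(hashlib.md5(raw).hexdigest())     # like MD5(rawData), 32 hex chars
print(hashlib.sha1(raw).hexdigest())    # like SHA(rawData), 40 hex chars
print(hashlib.sha256(raw).hexdigest())  # like SHA256(rawData), 64 hex chars
print(hashlib.sha512(raw).hexdigest())  # like SHA512(rawData), 128 hex chars
```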

    ARRAYLENGTH

    Returns the length of a multi-value column

    MAP_VALUE

    Select the value for a key from Map stored in Pinot.

    MAP_VALUE(mapColumn, 'myKey', valueColumn)

    VALUEIN

Takes at least 2 arguments, where the first argument is a multi-valued column and the following arguments are constant values. The transform function filters values from the multi-valued column against the given constant values. VALUEIN is especially useful when the same multi-valued column is both a filtering column and a grouping column.

    VALUEIN(mvColumn, 3, 5, 15)
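Per row, VALUEIN(mvColumn, 3, 5, 15) keeps only the multi-value entries that match one of the constants. A minimal sketch of that row-level behavior (the function name and sample row are illustrative):

```python
# Sketch of VALUEIN on one row of a multi-valued column.
def value_in(mv_row, *constants):
    allowed = set(constants)
    return [v for v in mv_row if v in allowed]

print(value_in([1, 3, 7, 15, 20], 3, 5, 15))  # [3, 15]
```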


    {
      "name" : "Bob",
      "age" : 37,
      "gender": "male",
      "location": "San Francisco"
    },{
      "name" : "Alice",
      "age" : 25,
      "gender": "female",
      "location": "New York"
    },{
      "name" : "Mia",
      "age" : 18,
      "gender": "female",
      "location": "Chicago"
    }
    SELECT
        JSONEXTRACTSCALAR(profile_json_str, '$.name', 'STRING')
    FROM
        myTable
    ["Bob", "Alice", "Mia"]
    SELECT
        JSONEXTRACTSCALAR(profile_json_str, '$.age', 'INT')
    FROM
        myTable
    [37, 25, 18]
    SELECT
        JSONEXTRACTSCALAR(myMapStr,'$.age','INT')
    FROM
        myTable
    WHERE
        JSONEXTRACTSCALAR(myMapStr,'$.name','STRING') = 'Bob'
    [37]
    SELECT
        JSONEXTRACTKEY(myMapStr,'$.*')
    FROM
        myTable
    ["name", "age", "gender", "location"]
    {
            "name": "Pete",
            "age": 24,
            "subjects": [{
                            "name": "maths",
                            "homework_grades": [80, 85, 90, 95, 100],
                            "grade": "A",
                            "score": 90
                    },
                    {
                            "name": "english",
                            "homework_grades": [60, 65, 70, 85, 90],
                            "grade": "B",
                            "score": 70
                    }
            ]
    }
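The JSONPATH expression/value pairs listed for this record can be reproduced with plain Python over the same document, to confirm the expected values. An illustration of the JsonPath results, not a JsonPath engine.

```python
# The sample myJsonRecord document from above.
record = {
    "name": "Pete",
    "age": 24,
    "subjects": [
        {"name": "maths", "homework_grades": [80, 85, 90, 95, 100], "grade": "A", "score": 90},
        {"name": "english", "homework_grades": [60, 65, 70, 85, 90], "grade": "B", "score": 70},
    ],
}

print(record["name"])                                          # $.name -> "Pete"
print([s["name"] for s in record["subjects"]])                 # $.subjects[*].name
print([s["score"] for s in record["subjects"]])                # $.subjects[*].score
print([s["homework_grades"][1] for s in record["subjects"]])   # $.subjects[*].homework_grades[1]
```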