1 of 9

Query

Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.

Querying Pinot

Learn how to query Pinot using SQL

DIALECT

Pinot uses Calcite SQL Parser to parse queries and uses MYSQL_ANSI dialect. You can see the grammar here.

Limitations

Pinot does not support Joins or nested Subqueries and we recommend using Presto for queries that span multiple tables. Read Engineering Full SQL support for Pinot at Uber for more info.

No DDL support. Tables can be created via the REST API.

Identifier vs Literal

In Pinot SQL:

Double quotes(") are used to force string identifiers, e.g. column name.
Single quotes(') are used to enclose string literals.

Mis-using those might cause unexpected query results:

E.g.

WHERE a='b' means the predicate on the column a equals to a string literal value 'b'
WHERE a="b" means the predicate on the column a equals to the value of the column b

Example Queries

Use single quotes for literals and double quotes (optional) for identifiers (column names)
If you name the columns as timestamp, date, or other reserved keywords, or the column name includes special characters, you need to use double quotes when you refer to them in the query.

Simple selection

//default to limit 10
SELECT * 
FROM myTable 

SELECT * 
FROM myTable 
LIMIT 100

Aggregation

SELECT COUNT(*), MAX(foo), SUM(bar) 
FROM myTable

Grouping on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
FROM myTable
GROUP BY bar, baz 
LIMIT 50

Ordering on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
FROM myTable
GROUP BY bar, baz 
ORDER BY bar, MAX(foo) DESC 
LIMIT 50

Filtering

SELECT COUNT(*) 
FROM myTable
  WHERE foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

For performant filtering of ids in a list, see Filtering with IdSet.

Filtering with NULL predicate

SELECT COUNT(*) 
FROM myTable
  WHERE foo IS NOT NULL
  AND foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Selection (Projection)

SELECT * 
FROM myTable
  WHERE quux < 5
  LIMIT 50

Ordering on Selection

SELECT foo, bar 
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 100

Pagination on Selection

Note: results might not be consistent if column ordered by has same value in multiple rows.

SELECT foo, bar 
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100

Wild-card match (in WHERE clause only)

To count rows where the column airlineName starts with U

SELECT COUNT(*) 
FROM myTable
  WHERE REGEXP_LIKE(airlineName, '^U.*')
  GROUP BY airlineName LIMIT 10

Case-When Statement

Pinot supports the CASE-WHEN-ELSE statement.

Example 1:

SELECT
    CASE
      WHEN price > 30 THEN 3
      WHEN price > 20 THEN 2
      WHEN price > 10 THEN 1
      ELSE 0
    END AS price_category
FROM myTable

Example 2:

SELECT
  SUM(
    CASE
      WHEN price > 30 THEN 30
      WHEN price > 20 THEN 20
      WHEN price > 10 THEN 10
      ELSE 0
    END) AS total_cost
FROM myTable

UDF

Functions have to be implemented within Pinot. Injecting functions is not yet supported. The example below demonstrate the use of UDFs. More examples in Transform Function in Aggregation Grouping

SELECT COUNT(*)
FROM myTable
GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')

BYTES column

Pinot supports queries on BYTES column using HEX string. The query response also uses hex string to represent bytes value.

E.g. the query below fetches all the rows for a given UID.

SELECT * 
FROM myTable
WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'

Filtering with IdSet

Learn how to write fast queries for looking up ids in a list of values.

A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but this approach doesn't perform well with large lists of ids. In these cases, you can use an IdSet.

Functions

ID_SET

ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )

This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:

INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
Other types - Bloom Filter

The following parameters are used to configure the Bloom Filter:

expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0

Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.

IN_ID_SET

IN_ID_SET(columnName, base64EncodedIdSet)

This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.

IN_SUBQUERY

IN_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.

INPARTITIONEDSUBQUERY

IN_PARTITIONED_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.

This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.

Examples

Create IdSet

You can create an IdSet of the values in the yearID column by running the following:

When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:

We can also configure the fpp parameter:

Filter by values in IdSet

We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:

Filter by values not in IdSet

To return rows for yearIDs not in the IdSet, run the following:

Filter on broker

To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:

To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:

Filter on server

To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:

To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:

Supported Transformations

This document contains the list of all the transformation functions supported by Pinot SQL.

Math Functions

String Functions

Multiple string functions are supported out of the box from release-0.5.0 .

DateTime Functions

Date time functions allow you to perform transformations on columns that contain timestamps or dates.

JSON Functions

Usage

'jsonPath'and 'results_type'are literals. Pinot uses single quotes to distinguish them from identifiers.

e.g.

JSONEXTRACTSCALAR(profile_json_str, '$.name', 'STRING') is valid.
JSONEXTRACTSCALAR(profile_json_str, "$.name", "STRING") is invalid.

Transform functions can only be used in Pinot SQL. Scalar functions can be used for column transformation in table ingestion configs.

Examples

The examples below are based on these 3 sample profile JSON documents:

Query 1: Extract string values from the field 'name'

Results are

Query 2: Extract integer values from the field 'age'

Results are

Query 3: Extract Bob's age from the JSON profile.

Results are

Query 4: Extract all field keys of JSON profile.

Results are

Another example of extracting JSON fields from below JSON record:

Extract JSON fields:

Binary Functions

Multi-value Column Functions

All of the functions mentioned till now only support single value columns. You can use the following functions to do operations on multi-value columns.

Advanced Queries

Geospatial Queries

Text Queries

Supported Aggregations

Pinot provides support for aggregations using GROUP BY. You can use the following functions to get the aggregated value.

Multi-value column functions

The following aggregation functions can be used for multi-value columns

User-Defined Functions (UDFs)

Pinot currently supports two ways for you to implement your own functions:

Groovy Scripts
Scalar Functions

Groovy Scripts

Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing Groovy script within the query is as follows:

GROOVY('result value metadata json', ''groovy script', arg0, arg1, arg2...)

This function will execute the groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:

Result value metadata json - json string representing result value metadata. Must contain non-null keys resultType and isSingleValue.
Groovy script to execute- groovy script string, which uses arg0, arg1, arg2 etc to refer to the arguments provided within the script
arguments - pinot columns/other transform functions that are arguments to the groovy script

Examples

Add colA and colB and return a single-value INT groovy( '{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)
Find the max element in mvColumn array and return a single-value INT
groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)
Find all elements of the array mvColumn and return as a multi-value LONG column
groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)
Multiply length of array mvColumn with colB and return a single-value DOUBLE
groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)
Find all indexes in mvColumnA which have value foo, add values at those indexes in mvColumnB
groovy( '{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x' , mvColumnA, mvColumnB)
Switch case which returns a FLOAT value depending on length of mvCol array
groovy('{\"returnType\":\"FLOAT\", \"isSingleValue\":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)
Any Groovy script which takes no arguments
groovy('new Date().format( "yyyyMMdd" )', '{"returnType":"STRING","isSingleValue":true}')

Scalar Functions

Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions

Pinot automatically identifies and registers all the functions that have the @ScalarFunction annotation.

Only Java methods are supported.

Adding user defined scalar functions

You can add new scalar functions as follows:

Create a new java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX
In your java project include the dependency

<dependency>
  <groupId>org.apache.pinot</groupId>
  <artifactId>pinot-common</artifactId>
  <version>0.5.0</version>
 </dependency>

include 'org.apache.pinot:pinot-common:0.5.0'

Annotate your methods with @ScalarFunction annotation. Make sure the method is static and returns only a single value output. The input and output can have one of the following types -
- Integer
- Long
- Double
- String

//Example Scalar function

@ScalarFunction
static String mySubStr(String input, Integer beginIndex) {
  return input.substring(beginIndex);
}

Place the compiled JAR in the /plugins directory in pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the function in a query as follows:

SELECT mysubstr(playerName, 4) 
FROM baseballStats

Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.

Cardinality Estimation

Cardinality estimation is a classic problem. Pinot solves it with multiple ways each of which has a trade-off between accuracy and latency.

Accurate Results

Functions:

DistinctCount(x) -> LONG

Returns accurate count for all unique values in a column.

The underlying implementation is using a IntOpenHashSet in library: it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.

Approximation Results

It usually takes a lot of resources and time to compute accurate results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.

HyperLogLog

Functions:

DistinctCountHLL(x)_ -> LONG_

For column type INT/LONG/FLOAT/DOUBLE/STRING , Pinot treats each value as an individual entry to add into HyperLogLog Object, then compute the approximation by calling method cardinality().

For column type BYTES, Pinot treats each value as a serialized HyperLogLog Object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog).

All deserialized HyperLogLog object will be merged into one then calling method **cardinality() **to get the approximated unique count.

Theta Sketches

Functions:

DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate**) **-> LONG
- thetaSketchColumn (required): Name of the column to aggregate on.
- thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries.
- predicates (optional)_: _ These are individual predicates of form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfies these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) of individual predicates is supported.
- postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT , where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.

In the example query below, the where clause is responsible for identifying the matching rows. Note, the where clause can be completely independent of the postAggregationExpression. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA' , device='mobile' in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression and returns the final cardinality of the resulting sketch.

DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate**)** -> HexEncoded Serialized Sketch Bytes

This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into sketch by using the library org.apache.commons.codec.binaryas Hex.decodeHex(stringValue.toCharArray()).

Lookup UDF Join

Lookup UDF is used to get dimension data via primary key from a dimension table allowing a decoration join functionality. Lookup UDF can only be used with a dimension table in Pinot. The UDF signature is as below:

lookUp('dimTableName', 'dimColToLookUp', 'dimJoinKey1', factJoinKeyVal1, 'dimJoinKey2', factJoinKeyVal2 ... )

dimTableName Name of the dim table to perform the lookup on.
dimColToLookUp The column name of the dim table to be retrieved to decorate our result.
dimJoinKey The column name on which we want to perform the lookup i.e. the join column name for dim table.
factJoinKeyVal The value of the dim table join column for which we will retrieve the dimColToLookUp for the scope and invocation.

Return type of the UDF will be that of the dimColToLookUp column type. There can also be multiple primary keys and corresponding values.

Note: If the dimension table uses a composite primary key i.e multiple primary keys, then ensure that the order of keys appearing in the lookup() UDF is same as the order defined for "primaryKeyColumns" in the dimension table schema.

Querying JSON data

To see how JSON data can be queried, assume that we have the following table:

To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name.

Note that the third column (jsoncolumn.data[1]) is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath jsoncolumn.data[1]. We can filter out these rows.

Notice that certain last names (duck and mouse for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression.

Also there is numerical information (jsconcolumn.score) embeded within the JSON document. We can extract those numerical values from JSON data into SQL and sum them up using the query below.

In short, JSON querying support in Pinot will allow you to use a JsonPath expression whereever you can use a column name with the only difference being that to query a column with data type JSON, you must append a JsonPath expression after the name of the column.

Filtering with IdSet

Learn how to write fast queries for looking up ids in a list of values.

Functions

ID_SET

ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )

This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:

INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
Other types - Bloom Filter

The following parameters are used to configure the Bloom Filter:

expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0

Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.

IN_ID_SET

IN_ID_SET(columnName, base64EncodedIdSet)

This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.

IN_SUBQUERY

IN_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.

INPARTITIONEDSUBQUERY

IN_PARTITIONED_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.

Examples

Create IdSet

You can create an IdSet of the values in the yearID column by running the following:

SELECT ID_SET(yearID)
FROM baseballStats
WHERE teamID = 'WS1'

idset(yearID)

When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:

SELECT ID_SET(playerName, 'expectedInsertions=10')
FROM baseballStats
WHERE teamID = 'WS1'

idset(playerName)

SELECT ID_SET(playerName, 'expectedInsertions=100')
FROM baseballStats
WHERE teamID = 'WS1'

idset(playerName)

We can also configure the fpp parameter:

SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
FROM baseballStats
WHERE teamID = 'WS1'

idset(playerName)

Filter by values in IdSet

We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_ID_SET(
 yearID,   
 'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
  ) = 1 
GROUP BY yearID

Filter by values not in IdSet

To return rows for yearIDs not in the IdSet, run the following:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_ID_SET(
  yearID,   
  'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
  ) = 0 
GROUP BY yearID

Filter on broker

To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 1
GROUP BY yearID

To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 0
GROUP BY yearID

Filter on server

To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_PARTITIONED_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 1
GROUP BY yearID

To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:

SELECT yearID, count(*) 
FROM baseballStats 
WHERE IN_PARTITIONED_SUBQUERY(
  yearID, 
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 0
GROUP BY yearID

User-Defined Functions (UDFs)

Pinot currently supports two ways for you to implement your own functions:

Groovy Scripts
Scalar Functions

Groovy Scripts

Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing Groovy script within the query is as follows:

GROOVY('result value metadata json', ''groovy script', arg0, arg1, arg2...)

This function will execute the groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:

Result value metadata json - json string representing result value metadata. Must contain non-null keys resultType and isSingleValue.
Groovy script to execute- groovy script string, which uses arg0, arg1, arg2 etc to refer to the arguments provided within the script
arguments - pinot columns/other transform functions that are arguments to the groovy script

Examples

Add colA and colB and return a single-value INT groovy( '{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)
Find the max element in mvColumn array and return a single-value INT
groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)
Find all elements of the array mvColumn and return as a multi-value LONG column
groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)
Multiply length of array mvColumn with colB and return a single-value DOUBLE
groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)
Find all indexes in mvColumnA which have value foo, add values at those indexes in mvColumnB
groovy( '{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x' , mvColumnA, mvColumnB)
Switch case which returns a FLOAT value depending on length of mvCol array
groovy('{\"returnType\":\"FLOAT\", \"isSingleValue\":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)
Any Groovy script which takes no arguments
groovy('new Date().format( "yyyyMMdd" )', '{"returnType":"STRING","isSingleValue":true}')

Scalar Functions

Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions

Pinot automatically identifies and registers all the functions that have the @ScalarFunction annotation.

Only Java methods are supported.

Adding user defined scalar functions

You can add new scalar functions as follows:

Create a new java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX
In your java project include the dependency

<dependency>
  <groupId>org.apache.pinot</groupId>
  <artifactId>pinot-common</artifactId>
  <version>0.5.0</version>
 </dependency>

include 'org.apache.pinot:pinot-common:0.5.0'

Annotate your methods with @ScalarFunction annotation. Make sure the method is static and returns only a single value output. The input and output can have one of the following types -
- Integer
- Long
- Double
- String

//Example Scalar function

@ScalarFunction
static String mySubStr(String input, Integer beginIndex) {
  return input.substring(beginIndex);
}

Place the compiled JAR in the /plugins directory in pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the function in a query as follows:

SELECT mysubstr(playerName, 4) 
FROM baseballStats

Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.

Cardinality Estimation

Cardinality estimation is a classic problem. Pinot solves it with multiple ways each of which has a trade-off between accuracy and latency.

Accurate Results

Functions:

DistinctCount(x) -> LONG

Returns accurate count for all unique values in a column.

The underlying implementation is using a IntOpenHashSet in library: it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.

Approximation Results

HyperLogLog

is an approximation algorithm for unique counting. It uses fixed number of bits to estimate the cardinality of given data set.

Pinot leverages in library com.clearspring.analytics:stream:2.7.0as the data structure to hold intermediate results.

Functions:

DistinctCountHLL(x)_ -> LONG_

All deserialized HyperLogLog object will be merged into one then calling method **cardinality() **to get the approximated unique count.

Theta Sketches

The framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating to perform distinct counting as well as evaluating set operations.

Functions:

DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate**) **-> LONG
- thetaSketchColumn (required): Name of the column to aggregate on.
- thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries.
- predicates (optional)_: _ These are individual predicates of form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfies these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) of individual predicates is supported.
- postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT , where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.

select distinctCountThetaSketch(
  sketchCol, 
  'nominalEntries=1024', 
  'country'=''USA'' AND 'state'=''CA'', 'device'=''mobile'', 'SET_INTERSECT($1, $2)'
) 
from table 
where country = 'USA' or device = 'mobile...'

DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate**)** -> HexEncoded Serialized Sketch Bytes

Querying Pinot

Learn how to query Pinot using SQL

DIALECT

Pinot uses Calcite SQL Parser to parse queries and uses MYSQL_ANSI dialect. You can see the grammar here.

Limitations

Pinot does not support Joins or nested Subqueries and we recommend using Presto for queries that span multiple tables. Read Engineering Full SQL support for Pinot at Uber for more info.

No DDL support. Tables can be created via the REST API.

Identifier vs Literal

In Pinot SQL:

Double quotes(") are used to force string identifiers, e.g. column name.
Single quotes(') are used to enclose string literals.

Mis-using those might cause unexpected query results:

E.g.

WHERE a='b' means the predicate on the column a equals to a string literal value 'b'
WHERE a="b" means the predicate on the column a equals to the value of the column b

Example Queries

Use single quotes for literals and double quotes (optional) for identifiers (column names)
If you name the columns as timestamp, date, or other reserved keywords, or the column name includes special characters, you need to use double quotes when you refer to them in the query.

Simple selection

//default to limit 10
SELECT * 
FROM myTable 

SELECT * 
FROM myTable 
LIMIT 100

Aggregation

SELECT COUNT(*), MAX(foo), SUM(bar) 
FROM myTable

Grouping on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
FROM myTable
GROUP BY bar, baz 
LIMIT 50

Ordering on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz 
FROM myTable
GROUP BY bar, baz 
ORDER BY bar, MAX(foo) DESC 
LIMIT 50

Filtering

SELECT COUNT(*) 
FROM myTable
  WHERE foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

For performant filtering of ids in a list, see Filtering with IdSet.

Filtering with NULL predicate

SELECT COUNT(*) 
FROM myTable
  WHERE foo IS NOT NULL
  AND foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Selection (Projection)

SELECT * 
FROM myTable
  WHERE quux < 5
  LIMIT 50

Ordering on Selection

SELECT foo, bar 
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 100

Pagination on Selection

Note: results might not be consistent if column ordered by has same value in multiple rows.

SELECT foo, bar 
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100

Wild-card match (in WHERE clause only)

To count rows where the column airlineName starts with U

SELECT COUNT(*) 
FROM myTable
  WHERE REGEXP_LIKE(airlineName, '^U.*')
  GROUP BY airlineName LIMIT 10

Case-When Statement

Pinot supports the CASE-WHEN-ELSE statement.

Example 1:

SELECT
    CASE
      WHEN price > 30 THEN 3
      WHEN price > 20 THEN 2
      WHEN price > 10 THEN 1
      ELSE 0
    END AS price_category
FROM myTable

Example 2:

SELECT
  SUM(
    CASE
      WHEN price > 30 THEN 30
      WHEN price > 20 THEN 20
      WHEN price > 10 THEN 10
      ELSE 0
    END) AS total_cost
FROM myTable

UDF

Functions have to be implemented within Pinot. Injecting functions is not yet supported. The example below demonstrate the use of UDFs. More examples in Transform Function in Aggregation Grouping

SELECT COUNT(*)
FROM myTable
GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')

BYTES column

Pinot supports queries on BYTES column using HEX string. The query response also uses hex string to represent bytes value.

E.g. the query below fetches all the rows for a given UID.

SELECT * 
FROM myTable
WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'

Query

Querying Pinot

DIALECT

Limitations

Identifier vs Literal

Example Queries

Simple selection

Aggregation

Grouping on Aggregation

Ordering on Aggregation

Filtering

Filtering with NULL predicate

Selection (Projection)

Ordering on Selection

Pagination on Selection

Wild-card match (in WHERE clause only)

Case-When Statement

UDF

BYTES column

Filtering with IdSet

Functions

ID_SET

IN_ID_SET

IN_SUBQUERY

IN__PARTITIONED__SUBQUERY

Examples

Create IdSet

Filter by values in IdSet

Filter by values not in IdSet

Filter on broker

Filter on server

Supported Transformations

Math Functions

String Functions

DateTime Functions

JSON Functions

Binary Functions

Multi-value Column Functions

Advanced Queries

Geospatial Queries

Text Queries

Supported Aggregations

Multi-value column functions

User-Defined Functions (UDFs)

Groovy Scripts

Scalar Functions

Adding user defined scalar functions

Cardinality Estimation

Accurate Results

Approximation Results

HyperLogLog

Theta Sketches

Lookup UDF Join

Querying JSON data

Lookup UDF Join

Filtering with IdSet

Functions

ID_SET

IN_ID_SET

IN_SUBQUERY

IN__PARTITIONED__SUBQUERY

Examples

Create IdSet

Filter by values in IdSet

Filter by values not in IdSet

Filter on broker

Filter on server

User-Defined Functions (UDFs)

Groovy Scripts

Scalar Functions

Adding user defined scalar functions

Query

Supported Aggregations

Multi-value column functions

Querying JSON data

Supported Transformations

Math Functions

String Functions

DateTime Functions

JSON Functions

INPARTITIONEDSUBQUERY

INPARTITIONEDSUBQUERY