Aggregate functions return a single result for a group of rows. The following table shows supported aggregate functions in Pinot.
Project a column where the maxima appears in a series of measuring columns.
ARG_MAX(measuring1, measuring2, measuring3, projection)
Will return no result
Returns the count of the records as Long
COUNT(*)
0
Returns the population covariance of 2 numerical columns as Double
COVAR_POP(col1, col2)
Double.NEGATIVE_INFINITY
Returns the sample covariance of 2 numerical columns as Double
COVAR_SAMP(col1, col2)
Double.NEGATIVE_INFINITY
Calculate the histogram of a numeric column as Double[]
HISTOGRAM(numberOfGames,0,200,10)
0, 0, ..., 0
Returns the minimum value of a numeric column as Double
MIN(playerScore)
Double.POSITIVE_INFINITY
Returns the maximum value of a numeric column as Double
MAX(playerScore)
Double.NEGATIVE_INFINITY
Returns the sum of the values for a numeric column as Double
SUM(playerScore)
0
Returns the sum of the values for a numeric column with optional precision and scale as BigDecimal
SUMPRECISION(salary), SUMPRECISION(salary, precision, scale)
0.0
Returns the average of the values for a numeric column as Double
AVG(playerScore)
Double.NEGATIVE_INFINITY
Returns the most frequent value of a numeric column as Double. When multiple modes are present it gives the minimum of all the modes. This behavior can be overridden to get the maximum or the average mode.
MODE(playerScore)
MODE(playerScore, 'MIN')
MODE(playerScore, 'MAX')
MODE(playerScore, 'AVG')
Double.NEGATIVE_INFINITY
Returns the max - min value for a numeric column as Double
MINMAXRANGE(playerScore)
Double.NEGATIVE_INFINITY
Returns the Nth percentile of the values for a numeric column as Double. N is a decimal number between 0 and 100 inclusive.
PERCENTILE(playerScore, 50) PERCENTILE(playerScore, 99.9)
Double.NEGATIVE_INFINITY
PERCENTILEEST(playerScore, 50)
PERCENTILEEST(playerScore, 99.9)
Long.MIN_VALUE
PERCENTILETDIGEST(playerScore, 50)
PERCENTILETDIGEST(playerScore, 99.9)
Double.NaN
PERCENTILETDIGEST(playerScore, 50, 1000)
PERCENTILETDIGEST(playerScore, 99.9, 500)
Double.NaN
PERCENTILESMARTTDIGEST
Returns the Nth percentile of the values for a numeric column as Double. When there are too many values, automatically switch to approximate percentile using TDigest. The switch threshold (100_000 by default) and compression (100 by default) for the TDigest can be configured via the optional second argument.
PERCENTILESMARTTDIGEST(playerScore, 50)
PERCENTILESMARTTDIGEST(playerScore, 99.9, 'threshold=100;compression=50')
Double.NEGATIVE_INFINITY
Returns the count of distinct values of a column as Integer
DISTINCTCOUNT(playerName)
0
Returns the count of distinct values of a column as Integer. This function is accurate for INT column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions.
DISTINCTCOUNTBITMAP(playerName)
0
Returns an approximate distinct count using HyperLogLog as Long. It also takes an optional second argument to configure the log2m for the HyperLogLog.
DISTINCTCOUNTHLL(playerName, 12)
0
Returns HyperLogLog response serialized as String. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLL(playerName)
0
Returns an approximate distinct count using HyperLogLogPlus as Long. It also takes optional second and third arguments to configure the p and sp for the HyperLogLogPlus.
DISTINCTCOUNTHLLPLUS(playerName)
0
Returns HyperLogLogPlus response serialized as String. The serialized HLLPlus can be converted back into an HLLPlus and then aggregated with other HLLPluses. A common use case may be to merge HLLPlus responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLLPLUS(playerName)
0
DISTINCTCOUNTSMARTHLL
Returns the count of distinct values of a column as Integer. When there are too many distinct values, automatically switch to approximate distinct count using HyperLogLog. The switch threshold (100_000 by default) and log2m (12 by default) for the HyperLogLog can be configured via the optional second argument.
DISTINCTCOUNTSMARTHLL(playerName),
DISTINCTCOUNTSMARTHLL(playerName, 'threshold=100;log2m=8')
0
Returns the count of distinct values of a column as Long when the column is pre-partitioned for each segment, where there is no common value within different segments. This function calculates the exact count of distinct values within the segment, then simply sums up the results from different segments to get the final result.
SEGMENTPARTITIONEDDISTINCTCOUNT(playerName)
0
Get the last value of dataColumn where the timeColumn is used to define the time of dataColumn and the dataType specifies the type of dataColumn, which can be BOOLEAN, INT, LONG, FLOAT, DOUBLE, or STRING
LASTWITHTIME(playerScore, timestampColumn, 'BOOLEAN')
LASTWITHTIME(playerScore, timestampColumn, 'INT')
LASTWITHTIME(playerScore, timestampColumn, 'LONG')
LASTWITHTIME(playerScore, timestampColumn, 'FLOAT')
LASTWITHTIME(playerScore, timestampColumn, 'DOUBLE')
LASTWITHTIME(playerScore, timestampColumn, 'STRING')
INT: Int.MIN_VALUE LONG: Long.MIN_VALUE FLOAT: Float.NaN DOUBLE: Double.NaN STRING: ""
Get the first value of dataColumn where the timeColumn is used to define the time of dataColumn and the dataType specifies the type of dataColumn, which can be BOOLEAN, INT, LONG, FLOAT, DOUBLE, or STRING
FIRSTWITHTIME(playerScore, timestampColumn, 'BOOLEAN')
FIRSTWITHTIME(playerScore, timestampColumn, 'INT')
FIRSTWITHTIME(playerScore, timestampColumn, 'LONG')
FIRSTWITHTIME(playerScore, timestampColumn, 'FLOAT')
FIRSTWITHTIME(playerScore, timestampColumn, 'DOUBLE')
FIRSTWITHTIME(playerScore, timestampColumn, 'STRING')
INT: Int.MIN_VALUE LONG: Long.MIN_VALUE FLOAT: Float.NaN DOUBLE: Double.NaN STRING: ""
Deprecated functions:
FASTHLL
FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format
FASTHLL(playerName)
The following aggregation functions can be used for multi-value columns
Pinot supports a FILTER clause in aggregation queries, as follows:
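A minimal sketch of such a query, assuming a hypothetical table MyTable with numeric columns COL1, COL2, and COL3:

SELECT SUM(COL1) FILTER (WHERE COL2 > 300 AND COL3 > 50),
       AVG(COL2) FILTER (WHERE COL2 < 50 AND COL3 > 50)
FROM MyTable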
In the query above, COL1
is aggregated only for rows where COL2 > 300 and COL3 > 50
. Similarly, COL2
is aggregated where COL2 < 50 and COL3 > 50
.
With null value support enabled, you can filter out null values while performing aggregation, as follows:
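For example, a minimal sketch against the same hypothetical MyTable:

SELECT SUM(COL1) FILTER (WHERE COL1 IS NOT NULL)
FROM MyTable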
In the above query, COL1
is aggregated only for the non-null values. Without NULL value support, we would have to filter using the default null value.
Deprecated functions:
FASTHLLMV (deprecated) stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format
FASTHLLMV(playerNames)
Cardinality estimation is a classic problem. Pinot solves it in multiple ways, each of which has a trade-off between accuracy and latency.
Functions:
DistinctCount(x) -> LONG
Returns accurate count for all unique values in a column.
The underlying implementation uses an IntOpenHashSet from the library it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.
It usually takes a lot of resources and time to compute exact results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.
HyperLogLog is an approximation algorithm for unique counting. It uses a fixed number of bits to estimate the cardinality of a given data set.
Pinot leverages the HyperLogLog class in the library com.clearspring.analytics:stream:2.7.0 as the data structure to hold intermediate results.
Functions:
DistinctCountHLL(x) -> LONG
For column type INT/LONG/FLOAT/DOUBLE/STRING, Pinot treats each value as an individual entry to add into the HyperLogLog object, and then computes the approximation by calling the cardinality() method.
For column type BYTES, Pinot treats each value as a serialized HyperLogLog Object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog)
.
All deserialized HyperLogLog objects will be merged into one, and then cardinality() is called to get the approximate unique count.
A 64-bit hash function is used instead of the 32 bits used in the original paper. This reduces hash collisions for large cardinalities, allowing the large range correction to be removed.
Some bias is found for small cardinalities when switching from linear counting to the HLL counting. An empirical bias correction is proposed to mitigate the problem.
A sparse representation of the registers is implemented to reduce memory requirements for small cardinalities, which can be later transformed to a dense representation if the cardinality grows.
Functions:
DistinctCountHLLPlus(<HllPlusColumn>) -> LONG
DistinctCountHLLPlus(<HllPlusColumn>, <p>) -> LONG
DistinctCountHLLPlus(<HllPlusColumn>, <p>, <sp>) -> LONG
For column type INT/LONG/FLOAT/DOUBLE/STRING, Pinot treats each value as an individual entry to add into the HyperLogLogPlus object, and then computes the approximation by calling the cardinality() method.
For column type BYTES, Pinot treats each value as a serialized HyperLogLogPlus Object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.serialize(hyperLogLogPlus)
.
All deserialized HyperLogLogPlus objects will be merged into one, and then cardinality() is called to get the approximate unique count.
Functions:
DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> LONG
thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries.
predicates (optional): These are individual predicates of the form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfy these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) individual predicates are supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT
, where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.
In the example query below, the where
clause is responsible for identifying the matching rows. Note, the where clause can be completely independent of the postAggregationExpression
. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA'
, device='mobile'
in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression
and returns the final cardinality of the resulting sketch.
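A sketch of such a query, consistent with the description above; the table name, sketch column name, and nominalEntries value are illustrative, and the postAggregationExpression refers to the two predicates by position ($1, $2):

SELECT DistinctCountThetaSketch(
  sketchCol,
  'nominalEntries=4096',
  'country = ''USA''',
  'device = ''mobile''',
  'SET_INTERSECT($1, $2)'
) AS value
FROM events
WHERE country = 'USA' OR device = 'mobile'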
DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded Serialized Sketch Bytes
This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into sketch by using the library org.apache.commons.codec.binary
as Hex.decodeHex(stringValue.toCharArray())
.
Functions:
avgValueIntegerSumTupleSketch(<tupleSketchColumn>, <tupleSketchLgK>) -> Long
tupleSketchColumn (required): Name of the column to aggregate on.
tupleSketchLgK (optional): lgK, which is the log2 of K, which controls both the size and accuracy of the sketch.
This function can be used to combine the summary values from the random sample stored within the Tuple sketch and formulate an estimate for an average that applies to the entire dataset. The average should be interpreted as applying to each key tracked by the sketch and is rounded to the nearest whole number.
distinctCountTupleSketch(<tupleSketchColumn>, <tupleSketchLgK>) -> LONG
tupleSketchColumn (required): Name of the column to aggregate on.
tupleSketchLgK (optional): lgK, which is the log2 of K, which controls both the size and accuracy of the sketch.
This returns the cardinality estimate for a column where the values are already encoded as Tuple sketches, stored as BYTES.
distinctCountRawIntegerSumTupleSketch(<tupleSketchColumn>, <tupleSketchLgK>) -> HexEncoded Serialized Sketch Bytes
This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into sketch by using the library org.apache.commons.codec.binary
as Hex.decodeHex(stringValue.toCharArray())
.
sumValuesIntegerSumTupleSketch(<tupleSketchColumn>, <tupleSketchLgK>) -> Long
tupleSketchColumn (required): Name of the column to aggregate on.
tupleSketchLgK (optional): lgK, which is the log2 of K, which controls both the size and accuracy of the sketch.
This function can be used to combine the summary values (using sum
) from the random sample stored within the Tuple sketch and formulate an estimate that applies to the entire dataset. See avgValueIntegerSumTupleSketch
for extracting an average for integer summaries. If other merging options are required, it is best to extract the raw sketches directly or to implement a new Pinot aggregation function to support these.
Functions:
distinctCountCpcSketch(<cpcSketchColumn>, <cpcSketchLgK>) -> Long
cpcSketchColumn (required): Name of the column to aggregate on.
cpcSketchLgK (optional): lgK, which is the log2 of K, which controls both the size and accuracy of the sketch.
This returns the cardinality estimate for a column.
distinctCountRawCpcSketch(<cpcSketchColumn>, <cpcSketchLgK>) -> HexEncoded Serialized Sketch Bytes
cpcSketchColumn (required): Name of the column to aggregate on.
cpcSketchLgK (optional): lgK, which is the log2 of K, which controls both the size and accuracy of the sketch.
This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into sketch by using the library org.apache.commons.codec.binary
as Hex.decodeHex(stringValue.toCharArray())
.
Functions:
distinctCountULL(<ullSketchColumn>, <ullSketchPrecision>) -> Long
ullSketchColumn (required): Name of the column to aggregate on.
ullSketchPrecision (optional): p, the precision parameter, which controls both the size and accuracy of the sketch.
This returns the cardinality estimate for a column.
distinctCountRawULL(<ullSketchColumn>, <ullSketchPrecision>) -> HexEncoded Serialized Sketch Bytes
ullSketchColumn (required): Name of the column to aggregate on.
ullSketchPrecision (optional): p, the precision parameter, which controls both the size and accuracy of the sketch.
This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into sketch by using the library org.apache.commons.codec.binary
as Hex.decodeHex(stringValue.toCharArray())
.
Returns the Nth percentile of the values for a numeric column using Quantile Digest as Long
Returns the Nth percentile of the values for a numeric column using T-Digest as Double
Returns the Nth percentile (using compression factor of CF) of the values for a numeric column using T-Digest as Double
Returns the count of a multi-value column as Long
Returns the minimum value of a numeric multi-value column as Double
Returns the maximum value of a numeric multi-value column as Double
Returns the sum of the values for a numeric multi-value column as Double
Returns the average of the values for a numeric multi-value column as Double
Returns the max - min value for a numeric multi-value column as Double
Returns the Nth percentile of the values for a numeric multi-value column as Double
Returns the Nth percentile using Quantile Digest as Long
Returns the Nth percentile using T-Digest as Double
Returns the Nth percentile (using compression factor CF) using T-Digest as Double
Returns the count of distinct values for a multi-value column as Integer
Returns the count of distinct values for a multi-value column as Integer. This function is accurate for INT or dictionary encoded columns, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions.
Returns an approximate distinct count using HyperLogLog as Long
Returns HyperLogLog response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
Returns an approximate distinct count using HyperLogLogPlus as Long
Returns HyperLogLogPlus response serialized as string. The serialized HLLPlus can be converted back into an HLLPlus and then aggregated with other HLLPluses. A common use case may be to merge HLLPlus responses from different Pinot tables, or to allow aggregation after client-side batching.
The HyperLogLog++ algorithm proposes several improvements over the original HyperLogLog algorithm to reduce memory requirements and increase accuracy in some ranges of cardinalities.
Pinot leverages the HyperLogLogPlus class in the library com.clearspring.analytics:stream:2.7.0 as the data structure to hold intermediate results.
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:4.2.0 to perform distinct counting as well as evaluating set operations.
The Tuple Sketch is an extension of the Theta Sketch. Tuple sketches store an additional summary value with each retained entry, which makes the sketch ideal for summarizing attributes such as impressions or clicks. Tuple sketches are interoperable with the Theta Sketch and enable set operations over a stream of data, and can also be used for cardinality estimation.
The Compressed Probabilistic Counting (CPC) Sketch enables extremely space-efficient cardinality estimation. The stored CPC sketch can consume about 40% less space than an HLL sketch of comparable accuracy. Pinot can aggregate multiple existing CPC sketches together to get a total distinct count, or estimate it directly from raw values.
The UltraLogLog sketch from Dynatrace is a variant of HyperLogLog and is used for approximate distinct counts. The UltraLogLog sketch shares many of the same properties of a typical HyperLogLog sketch but requires less space and also provides a simpler and faster estimator.
Pinot uses a production-ready Java implementation, available under the Apache license.
Query Pinot using supported syntax.
This document describes EXPLAIN PLAN syntax for multi-stage engine (v2)
This page explains how to use EXPLAIN PLAN FOR
syntax to obtain different plans of a query in multi-stage engine. You can read more about how to interpret the plans in the Understanding multi-stage explain plans page.
Also remember that plans are logical representations of the query execution. Sometimes it is more useful to study the actual stats of the query execution, which are included on each query result. You can read more about how to interpret the stats in the Understanding multi-stage stats page.
In the single-stage engine EXPLAIN PLAN, we do not differentiate between logical and physical plans because the structure of the query is fixed. By default it explains the physical plan.
In multi-stage engine we support EXPLAIN PLAN syntax mostly following Apache Calcite's EXPLAIN PLAN syntax. Here are several examples:
Using SSB standard query example:
The result field contains 2 columns and 1 row:
Note that all the normal options for EXPLAIN PLAN in Apache Calcite also work in Pinot, with extra information including attributes, type, etc.
One of the most useful options is AS <format>, which supports the following formats:
JSON, which returns the plan in JSON format. This format is useful for parsing the plan in a program and it also provides some extra information that is not present in the default format.
XML, which is similar to JSON but in XML format.
DOT, which returns a DOT format that can be used to visualize the plan using tools like Graphviz. This format is understandable by different tools, including online stateless pages.
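For example, a minimal sketch following the Calcite-style syntax (the query and the orders table are illustrative):

EXPLAIN PLAN AS DOT FOR
SELECT customerId, COUNT(*)
FROM orders
GROUP BY customerId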
To gather the implementation plan specific to Pinot's internal multi-stage engine operator chain, you can use EXPLAIN IMPLEMENTATION PLAN:
Note that there is now information regarding how many servers were used and how data is shuffled between nodes, etc.
Learn how to query Pinot using SQL
Pinot provides a SQL interface for querying, which uses the Calcite SQL parser to parse queries and the MYSQL_ANSI dialect. For details on the syntax, see the Calcite documentation. To find supported SQL operators, see Class SqlLibraryOperators.
In Pinot 1.0, the multi-stage query engine supports inner join, left-outer, semi-join, and nested queries out of the box. It's optimized for in-memory processing and latency. For more information, see how to enable and use the multi-stage query engine.
Pinot also supports using simple Data Definition Language (DDL) to insert data into a table from file directly. For details, see programmatically access the multi-stage query engine. More DDL support will be added in the future, but for now the most common way to do data definition is using the Controller Admin API.
Note: For queries that require a large amount of data shuffling, require spill-to-disk, or are hitting any other limitations of the multi-stage query engine (v2), we still recommend using Presto.
In Pinot SQL:
Double quotes (") are used to force string identifiers, e.g. column names.
Single quotes (') are used to enclose string literals. If the string literal also contains a single quote, escape this with a single quote, e.g. '''Pinot''' to match the string literal 'Pinot'.
Misusing those might cause unexpected query results, like the following examples:
WHERE a='b' means the predicate compares the column a to the string literal value 'b'.
WHERE a="b" means the predicate compares the column a to the value of the column b.
If your column names use reserved keywords (e.g. timestamp
or date
) or special characters, you will need to use double quotes when referring to them in queries.
Note: Define decimal literals within quotes to preserve precision.
For performant filtering of IDs in a list, see Filtering with IdSet.
Note that results might not be consistent if the ORDER BY
column has the same value in multiple rows.
The example below counts rows where the column airlineName starts with U:
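A sketch of such a query (the table name is illustrative):

SELECT COUNT(*)
FROM myTable
WHERE REGEXP_LIKE(airlineName, '^U.*')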
Pinot supports the CASE-WHEN-ELSE
statement, as shown in the following two examples:
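Two minimal sketches, using the illustrative baseballStats columns that appear later on this page. The first categorizes players in the SELECT list; the second uses CASE inside an aggregation:

SELECT playerName,
       CASE
         WHEN numberOfGames >= 100 THEN 'regular'
         WHEN numberOfGames >= 10 THEN 'occasional'
         ELSE 'rare'
       END AS playerCategory
FROM baseballStats

SELECT SUM(
         CASE
           WHEN numberOfGames > 0 THEN runs
           ELSE 0
         END
       ) AS runsForActivePlayers
FROM baseballStats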
Pinot doesn't currently support injecting functions. Functions have to be implemented within Pinot, as shown below:
For more examples, see Transform Function in Aggregation Grouping.
Pinot supports queries on BYTES column using hex strings. The query response also uses hex strings to represent bytes values.
The query below fetches all the rows for a given UID:
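A minimal sketch, assuming a hypothetical table with a BYTES column named uid (the hex literal is illustrative):

SELECT *
FROM myTable
WHERE uid = 'c8b3bce0b378fc5ce8067fc271a34892'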
Learn how to write fast queries for looking up IDs in a list of values.
Filtering with IdSet is only supported with the single-stage query engine (v1).
A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but using IN doesn't perform well with large lists of IDs. For large lists of IDs, we recommend using an IdSet.
ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )
This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:
INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
Other types - Bloom Filter
The following parameters are used to configure the Bloom Filter:
expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - False positive probability to use for the BloomFilter. Must be positive and less than 1.0.
Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
IN_ID_SET(columnName, base64EncodedIdSet)
This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.
IN_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.
IN_PARTITIONED_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.
This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
The query passed to IN_SUBQUERY can be run on any table - it isn't restricted to the table used in the parent query.
The query passed to IN_PARTITIONED_SUBQUERY must be run on the same table as the parent query.
You can create an IdSet of the values in the yearID column by running the following:
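A sketch of such a query (baseballStats is the example table used elsewhere on this page):

SELECT ID_SET(yearID)
FROM baseballStats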
When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:
We can also configure the fpp parameter:
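A sketch combining both optional settings (the values are illustrative):

SELECT ID_SET(playerName, 'expectedInsertions=10000000;fpp=0.01')
FROM baseballStats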
We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:
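A sketch, where '<base64EncodedIdSet>' stands for the base64 string returned by ID_SET:

SELECT yearID, COUNT(*)
FROM baseballStats
WHERE IN_ID_SET(yearID, '<base64EncodedIdSet>') = 1
GROUP BY yearID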
To return rows for yearIDs not in the IdSet, run the following:
To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:
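A sketch using IN_SUBQUERY (the subquery is passed as a string literal; the teamID filter value is illustrative):

SELECT yearID, COUNT(*)
FROM baseballStats
WHERE IN_SUBQUERY(
        yearID,
        'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''BC1'''
      ) = 1
GROUP BY yearID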
To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:
In this guide we will learn about the heuristics used for trimming results in Pinot's grouping algorithm (used when processing GROUP BY
queries) to make sure that the server doesn't run out of memory.
When grouping rows within a segment, Pinot keeps a maximum of <numGroupsLimit>
groups per segment. This value is set to 100,000 by default and can be configured by the pinot.server.query.executor.num.groups.limit
property.
If the number of groups of a segment reaches this value, the extra groups will be ignored and the results returned may not be completely accurate. The numGroupsLimitReached
property will be set to true
in the query response if the value is reached.
After the inner segment groups have been computed, the Pinot query engine optionally trims tail groups. Tail groups are ones that have a lower rank based on the ORDER BY
clause used in the query.
This configuration is disabled by default, but can be enabled by configuring the pinot.server.query.executor.min.segment.group.trim.size
property.
When segment group trim is enabled, the query engine will trim the tail groups and keep max(<minSegmentGroupTrimSize>, 5 * LIMIT)
groups if it gets more groups. Pinot keeps at least 5 * LIMIT
groups when trimming tail groups to ensure the accuracy of results.
This value can be overridden on a query by query basis by passing the following option:
Once grouping has been done within a segment, Pinot will merge segment results and trim tail groups and keep max(<minServerGroupTrimSize>, 5 * LIMIT)
groups if it gets more groups.
<minServerGroupTrimSize>
is set to 5,000 by default and can be adjusted by configuring the pinot.server.query.executor.min.server.group.trim.size
property. Setting the configuration to -1 disables the cross-segment trim.
This value can be overridden on a query by query basis by passing the following option:
When cross segments trim is enabled, the server will trim the tail groups before sending the results back to the broker. It will also trim the tail groups when the number of groups reaches the <trimThreshold>
.
<trimThreshold>
is the upper bound of groups allowed in a server for each query to protect servers from running out of memory. To avoid too frequent trimming, the actual trim size is bounded to <trimThreshold> / 2
. Combining this with the above equation, the actual trim size for a query is calculated as min(max(<minServerGroupTrimSize>, 5 * LIMIT), <trimThreshold> / 2)
.
This configuration is set to 1,000,000 by default and can be adjusted by configuring the pinot.server.query.executor.groupby.trim.threshold
property.
A higher threshold reduces the amount of trimming done, but consumes more heap memory. If the threshold is set to more than 1,000,000,000, the server will only trim the groups once before returning the results to the broker.
This value can be overridden on a query by query basis by passing the following option:
When the broker performs the final merge of the groups returned by various servers, there is another level of trimming that takes place. The tail groups are trimmed and max(<minBrokerGroupTrimSize>, 5 * LIMIT) groups are retained.
The default value of <minBrokerGroupTrimSize> is 5,000. This can be adjusted by configuring the pinot.broker.min.group.trim.size property.
Pinot sets a default LIMIT
of 10 if one isn't defined and this applies to GROUP BY
queries as well. Therefore, if no limit is specified, Pinot will return 10 groups.
Pinot will trim tail groups based on the ORDER BY
clause to reduce the memory footprint and improve the query performance. It keeps at least 5 * LIMIT
groups so that the results give good enough approximation in most cases. The configurable min trim size can be used to increase the groups kept to improve the accuracy but has a larger extra memory footprint.
If the query has a HAVING
clause, it is applied on the merged GROUP BY
results that already have the tail groups trimmed. If the HAVING
clause is the opposite of the ORDER BY
order, groups matching the condition might already be trimmed and not returned. e.g.
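For example, a sketch of such a query (table and columns are illustrative): the ORDER BY keeps the largest groups, while the HAVING clause asks for the smallest ones, so qualifying groups may already have been trimmed away.

SELECT teamID, SUM(runs) AS totalRuns
FROM baseballStats
GROUP BY teamID
HAVING SUM(runs) < 100
ORDER BY SUM(runs) DESC
LIMIT 10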
Increase min trim size to keep more groups in these cases.
GapFill Function is only supported with the single-stage query engine (v1).
Many datasets are time series in nature, tracking the state change of an entity over time. The granularity of recorded data points might be sparse, or events could be missing due to network and other device issues in an IoT environment. But analytics applications that track the state change of these entities over time might query for values at a lower granularity than the metric interval.
Here is a sample data set tracking the status of parking lots in a parking space.
We want to find out the total number of parking lots that are occupied over a period of time, which would be a common use case for a company that manages parking spaces.
Let us take a 30-minute time bucket as an example:
If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied parking lots per time bucket, we need to gap-fill the missing data.
There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.
FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity, in this case, park lot, if the previous value exists. Otherwise, it will be filled with the default value.
FILL_DEFAULT_VALUE means that the missing data will be filled with the default value. For numeric columns, the default value is 0. For Boolean columns, the default value is false. For TIMESTAMP, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON and BYTES, it is the empty string. For array columns, it is an empty array.
We will leverage the following query to calculate the total occupied parking lots per time bucket.
The innermost SQL will convert the raw event table to the following table.
The second most nested SQL will gap-fill the returned data as follows:
The outermost query will aggregate the gap-filled data as follows:
One assumption made here is that the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation will not sort the data.
The above example just shows the use case where the three steps happen:
The raw data will be aggregated;
The aggregated data will be gapfilled;
The gapfilled data will be aggregated.
There are three more scenarios we can support.
If we want to gapfill the missing data per half an hour time bucket, here is the query:
At first the raw data will be transformed as follows:
Then it will be gapfilled as follows:
The nested SQL will convert the raw event table to the following table.
The outer SQL will gap-fill the returned data as follows:
The raw data will first be transformed as follows:
The transformed data will be gap-filled as follows:
The aggregation will generate the following table:
Lookup UDF Join is only supported with the single-stage query engine (v1). For more information about using JOINs with the multi-stage query engine, see JOINs.
Lookup UDF is used to get dimension data via primary key from a dimension table, enabling decoration join functionality. The lookup UDF can only be used with a dimension table in Pinot.
The UDF function syntax is listed as below:
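A sketch of the call shape, consistent with the parameter descriptions below (additional join key pairs can be appended for composite keys):

LOOKUP('dimTable', 'dimColToLookUp', 'dimJoinKey', factJoinKey [, 'dimJoinKey2', factJoinKey2, ...])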
dimTable
Name of the dim table to perform the lookup on.
dimColToLookUp
The column name of the dim table to be retrieved to decorate our result.
dimJoinKey
The column name on which we want to perform the lookup i.e. the join column name for dim table.
factJoinKey
The column name in the fact table on which we want to perform the lookup, i.e. the join column name for the fact table.
Note that:
All the dim-table-related expressions are expressed as literal strings. This is a limitation of the LOOKUP UDF syntax: we cannot express a column identifier that doesn't exist in the query's main table, which is the factTable table.
The syntax definition of [ '''dimJoinKey''', factJoinKey ]* indicates that if there are multiple dim partition columns, there should be multiple join key pairs expressed.
Here are some examples.
Consider the table baseballStats and the dimension table dimBaseballTeams; several acceptable queries are:
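A sketch of one such query, decorating each baseballStats row with the team name looked up from dimBaseballTeams by teamID:

SELECT playerName,
       teamID,
       LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName
FROM baseballStats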
Consider a single dimension table with schema:
BILLING SCHEMA
The data return type of the UDF will be that of the dimColToLookUp
column type.
When multiple primary key columns are used for the dimension table (e.g. composite primary key), ensure that the order of keys appearing in the lookup() UDF is the same as the order defined in the primaryKeyColumns
from the dimension table schema.
Use window aggregates to compute averages, sort, rank, or count items, calculate sums, and find minimum or maximum values across a window.
Important: To query using window functions, you must enable Pinot's multi-stage query engine (v2). See how to enable and use the multi-stage query engine.
This is an overview of the window aggregate feature.
Pinot's window function (windowedAggCall
) includes the following syntax definition:
windowAggCall
refers to the actual windowed agg operation.
The following query shows the complete components of the window function. Note, PARTITION BY
and ORDER BY
are optional.
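A sketch of the overall shape, using the placeholder names from this page (the column and table names are illustrative):

SELECT windowAggFunction(col) OVER (
         PARTITION BY partitionCol
         ORDER BY orderCol
       ) AS windowResult
FROM myTable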
If a PARTITION BY clause is specified, the intermediate results will be grouped into different partitions based on the values of the columns appearing in the PARTITION BY clause.
If the PARTITION BY clause isn’t specified, the whole result will be regarded as one big partition, i.e. there is only one partition in the result set.
If an ORDER BY clause is specified, all the rows within the same partition will be sorted based on the values of the columns appearing in the window ORDER BY
clause. The ORDER BY clause decides the order in which the rows within a partition are to be processed.
If no ORDER BY clause is specified while a PARTITION BY clause is specified, the order of the rows is undefined. To order the output, use a global ORDER BY
clause in the query.
Important note: in release 1.0.0, window aggregates only support UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, and CURRENT ROW. Custom frame and row count support has not been implemented yet.
{RANGE|ROWS} frame_start, or
{RANGE|ROWS} BETWEEN frame_start AND frame_end; frame_start and frame_end can be any of:
UNBOUNDED PRECEDING
expression PRECEDING (may only be allowed in ROWS mode; depends on the database, some support it and some don't)
CURRENT ROW
expression FOLLOWING (may only be allowed in ROWS mode; depends on the database, some support it and some don't)
UNBOUNDED FOLLOWING
If no FRAME clause is specified, then the default frame behavior depends on whether ORDER BY is present or not.
If an ORDER BY clause is specified, the default behavior is to calculate the aggregation from the beginning of the partition to the current row or UNBOUNDED PRECEDING to CURRENT ROW.
If only a PARTITION BY clause is present, the default frame behavior is to calculate the aggregation from UNBOUNDED PRECEDING to CURRENT ROW.
If there is no FRAME, no PARTITION BY, and no ORDER BY clause specified in the OVER clause (empty OVER), the whole result set is regarded as one partition, and there's one frame in the window.
Inside the over clause, there are three optional components: PARTITION BY clause, ORDER BY clause, and FRAME clause.
Window aggregate functions are commonly used to do the following:
Supported window aggregate functions are listed in the following table.
Calculate the rolling sum transaction amount ordered by the payment date for each customer ID (note, the default frame here is UNBOUNDED PRECEDING and CURRENT ROW).
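A sketch of such a query, assuming a hypothetical payment table with customer_id, payment_date, and amount columns:

SELECT customer_id,
       payment_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) AS rolling_sum
FROM payment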
Calculate the least (use MIN()
) or most expensive (use MAX()
) transaction made by each customer comparing all transactions made by the customer (default frame here is UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING). The following query shows how to find the least expensive transaction.
Calculate a customer’s average transaction amount for all transactions they’ve made (default frame here is UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING).
Use ROW_NUMBER()
to rank team members by their year-to-date sales (default frame here is UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING).
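A sketch, assuming a hypothetical sales table with salesperson and ytd_sales columns:

SELECT salesperson,
       ytd_sales,
       ROW_NUMBER() OVER (ORDER BY ytd_sales DESC) AS sales_rank
FROM sales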
Count the number of transactions made by each customer (default frame here is UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING).
windowAggFunction refers to the aggregation function used inside a windowed aggregate; see the supported window aggregate functions.
window is the window definition / windowing mechanism; see the supported window mechanism.
You can jump to the examples section to see more concrete use cases of window aggregates on Pinot.
The OVER clause applies a specified, supported window aggregate function to compute values over a group of rows and return a single result for each row. The OVER clause specifies how the rows are arranged and how the aggregation is done on those rows.
ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc=
AwIBBQAAAAL/////////////////////
AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8=
AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8=
P1
2021-10-01 09:01:00.000
1
P2
2021-10-01 09:17:00.000
1
P1
2021-10-01 09:33:00.000
0
P1
2021-10-01 09:47:00.000
1
P3
2021-10-01 10:05:00.000
1
P2
2021-10-01 10:06:00.000
0
P2
2021-10-01 10:16:00.000
1
P2
2021-10-01 10:31:00.000
0
P3
2021-10-01 11:17:00.000
0
P1
2021-10-01 11:54:00.000
0
2021-10-01 09:00:00.000
1
1
2021-10-01 09:30:00.000
0,1
2021-10-01 10:00:00.000
0,1
1
2021-10-01 10:30:00.000
0
2021-10-01 11:00:00.000
0
2021-10-01 11:30:00.000
0
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
2021-10-01 09:00:00.000
1
1
0
2021-10-01 09:30:00.000
1
1
0
2021-10-01 10:00:00.000
1
1
1
2021-10-01 10:30:00.000
1
0
1
2021-10-01 11:00:00.000
1
0
0
2021-10-01 11:30:00.000
0
0
0
2021-10-01 09:00:00.000
2
2021-10-01 09:30:00.000
2
2021-10-01 10:00:00.000
3
2021-10-01 10:30:00.000
2
2021-10-01 11:00:00.000
1
2021-10-01 11:30:00.000
0
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P3
2021-10-01 09:00:00.000
0
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P2
2021-10-01 09:30:00.000
1
P3
2021-10-01 09:30:00.000
0
P1
2021-10-01 10:00:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P1
2021-10-01 10:30:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 10:30:00.000
1
P1
2021-10-01 11:00:00.000
1
P2
2021-10-01 11:00:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
P2
2021-10-01 11:30:00.000
0
P3
2021-10-01 11:30:00.000
0
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
2021-10-01 09:00:00.000
1
1
0
2021-10-01 09:30:00.000
1
1
0
2021-10-01 10:00:00.000
1
1
1
2021-10-01 10:30:00.000
1
0
1
2021-10-01 11:00:00.000
1
0
0
2021-10-01 11:30:00.000
0
0
0
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P3
2021-10-01 09:00:00.000
0
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P2
2021-10-01 09:30:00.000
1
P3
2021-10-01 09:30:00.000
0
P1
2021-10-01 10:00:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P1
2021-10-01 10:30:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 10:30:00.000
1
P2
2021-10-01 10:30:00.000
0
P1
2021-10-01 11:00:00.000
1
P2
2021-10-01 11:00:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
P2
2021-10-01 11:30:00.000
0
P3
2021-10-01 11:30:00.000
0
2021-10-01 09:00:00.000
2
2021-10-01 09:30:00.000
2
2021-10-01 10:00:00.000
3
2021-10-01 10:30:00.000
2
2021-10-01 11:00:00.000
1
2021-10-01 11:30:00.000
0
pinot.server.query.executor.num.groups.limit
The maximum number of groups allowed per segment.
100,000
OPTION(numGroupsLimit=<numGroupsLimit>)
pinot.server.query.executor.min.segment.group.trim.size
The minimum number of groups to keep when trimming groups at the segment level.
-1 (trim disabled)
OPTION(minSegmentGroupTrimSize=<minSegmentGroupTrimSize>)
pinot.server.query.executor.min.server.group.trim.size
The minimum number of groups to keep when trimming groups at the server level.
5,000
OPTION(minServerGroupTrimSize=<minServerGroupTrimSize>)
pinot.server.query.executor.groupby.trim.threshold
The number of groups to trigger the server level trim.
1,000,000
OPTION(groupTrimThreshold=<groupTrimThreshold>)
pinot.server.query.executor.max.execution.threads
The maximum number of execution threads (parallelism of segment processing) used per query.
-1 (use all execution threads)
OPTION(maxExecutionThreads=<maxExecutionThreads>)
pinot.broker.min.group.trim.size
The minimum number of groups to keep when trimming groups at the broker.
5000
OPTION(minBrokerGroupTrimSize=<minBrokerGroupTrimSize>)
playerID
STRING
yearID
INT
teamID
STRING
league
STRING
playerName
STRING
playerStint
INT
numberOfGames
INT
numberOfGamesAsBatter
INT
AtBatting
INT
runs
INT
teamID
STRING
teamName
STRING
teamAddress
STRING
David Allan
BOS
Boston Red Caps/Beaneaters (from 1876–1900) or Boston Red Sox (since 1953)
4 Jersey Street, Boston, MA
David Allan
CHA
null
null
David Allan
SEA
Seattle Mariners (since 1977) or Seattle Pilots (1969)
1250 First Avenue South, Seattle, WA
David Allan
SEA
Seattle Mariners (since 1977) or Seattle Pilots (1969)
1250 First Avenue South, Seattle, WA
ANA
Anaheim Angels
Anaheim Angels
ARI
Arizona Diamondbacks
Arizona Diamondbacks
ATL
Atlanta Braves
Atlanta Braves
BAL
Baltimore Orioles (original- 1901–1902 current- since 1954)
Baltimore Orioles (original- 1901–1902 current- since 1954)
customerId
INT
creditHistory
STRING
firstName
STRING
lastName
STRING
isCarOwner
BOOLEAN
city
STRING
maritalStatus
STRING
buildingType
STRING
missedPayment
STRING
billingMonth
STRING
341
Paid
Palo Alto
374
Paid
Mountain View
398
Paid
Palo Alto
427
Paid
Cupertino
435
Paid
Cupertino
Returns the average of the values for a numeric column as a Double over the specified number of rows or partition (if applicable).
AVG(playerScore)
Double.NEGATIVE_INFINITY
BOOL_AND
Returns true if all input values are true, otherwise false
BOOL_OR
Returns true if at least one input value is true, otherwise false
Returns the count of the records as Long
COUNT(*)
0
Returns the minimum value of a numeric column as Double
MIN(playerScore)
Double.POSITIVE_INFINITY
Returns the maximum value of a numeric column as Double
MAX(playerScore)
Double.NEGATIVE_INFINITY
Assigns a unique row number to all the rows in a specified table.
ROW_NUMBER()
0
Returns the sum of the values for a numeric column as Double
SUM(playerScore)
0
The LEAD
function provides access to a subsequent row within the same result set, without the need for a self-join.
LEAD(column_name, offset, default_value)
The LAG
function provides access to a previous row within the same result set, without the need for a self-join.
LAG(column_name, offset, default_value)
FIRST_VALUE
The FIRST_VALUE
function returns the first value in an ordered set of values within the window frame.
FIRST_VALUE(salary)
LAST_VALUE
The LAST_VALUE
function returns the last value in an ordered set of values within the window frame.
LAST_VALUE(salary)
1
2023-02-14 23:22:38.996577
5.99
5.99
1
2023-02-15 16:31:19.996577
0.99
6.98
1
2023-02-15 19:37:12.996577
9.99
16.97
1
2023-02-16 13:47:23.996577
4.99
21.96
2
2023-02-17 19:23:24.996577
2.99
2.99
2
2023-02-17 19:23:24.996577
0.99
3.98
3
2023-02-16 00:02:31.996577
8.99
8.99
3
2023-02-16 13:47:36.996577
6.99
15.98
3
2023-02-17 03:43:41.996577
6.99
22.97
4
2023-02-15 07:59:54.996577
4.99
4.99
4
2023-02-16 06:37:06.996577
0.99
5.98
1
2023-02-14 23:22:38.996577
5.99
1
2023-02-15 16:31:19.996577
0.99
1
2023-02-15 19:37:12.996577
9.99
2
2023-04-30 04:34:36.996577
4.99
2
2023-04-30 12:16:09.996577
10.99
3
2023-03-23 05:38:40.996577
2.99
3
2023-04-07 08:51:51.996577
3.99
3
2023-04-08 11:15:37.996577
4.99
1
2023-02-14 23:22:38.996577
5.99
1
2023-02-15 16:31:19.996577
0.99
1
2023-02-15 19:37:12.996577
9.99
2
2023-04-30 04:34:36.996577
4.99
2
2023-04-30 12:16:09.996577
10.99
3
2023-03-23 05:38:40.996577
2.99
3
2023-04-07 08:51:51.996577
3.99
3
2023-04-08 11:15:37.996577
4.99
1
Joe
Smith
2
Alice
Davis
3
James
Jones
4
Dane
Scott
1
2023-02-14 23:22:38.996577
10.99
2
1
2023-02-15 16:31:19.996577
8.99
2
2
2023-04-30 04:34:36.996577
23.50
3
2
2023-04-07 08:51:51.996577
12.35
3
2
2023-04-08 11:15:37.996577
8.29
3
Pinot supports JOINs, including left, right, full, semi, anti, lateral, and equi JOINs. Use JOINs to connect two tables to generate a unified view, based on a related column between the tables.
Important: To query using JOINs, you must use Pinot's multi-stage query engine (v2).
Pinot 1.0 introduces support for all JOIN types. JOINs in Pinot significantly reduce query latency and simplify architecture, achieving the best performance currently available for an OLAP database.
Use JOINs to combine two tables (a left and right table) together, based on a related column between the tables, and other join filters. JOINs let you gain more insights from your data.
The inner join selects rows that have matching values in both tables.
Syntax:
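A sketch of the general form (table and column names are placeholders):

SELECT t1.col, t2.col
FROM table1 AS t1
JOIN table2 AS t2
  ON t1.key = t2.key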
Joins a table containing user transactions with a table containing promotions shown to the users, to show the spending for every userID.
A left join returns all values from the left relation and the matched values from the right table, or appends NULL if there is no match. Also referred to as a left outer join.
Syntax:
A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.
Syntax:
A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.
Syntax:
A cross join returns the Cartesian product of two relations. If no WHERE clause is used along with CROSS JOIN, this produces a result set that is the number of rows in the first table multiplied by the number of rows in the second table. If a WHERE clause is included with CROSS JOIN, it functions like an INNER JOIN.
Syntax:
A semi-join returns rows from the first table that have a match in the second table; an anti-join returns one copy of each row in the first table for which no match is found in the second table.
Syntax:
An equi join uses an equality operator to match a single or multiple column values of the relative tables.
Syntax:
Pinot JOINs include the following optimizations:
Predicate push-down to individual tables
Indexing and pruning to reduce scanning and speeds up query processing
Smart data layout considerations to minimize data shuffling
Query hints for fine-tuning JOIN operations.
Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The output of the EXPLAIN PLAN statement can be used to see how queries are being run or to further optimize queries.
EXPLAIN PLAN can be run in two modes: verbose and non-verbose (default), via the use of a query option. To enable verbose mode, the query option explainPlanVerbose=true
must be passed.
In the non-verbose EXPLAIN PLAN output above, the Operator
column describes the operator that Pinot will run, whereas the Operator_Id
and Parent_Id
columns show the parent-child relationship between operators.
This parent-child relationship shows the order in which operators execute. For example, FILTER_MATCH_ENTIRE_SEGMENT
will execute before and pass its output to PROJECT
. Similarly, PROJECT
will execute before and pass its output to TRANSFORM_PASSTHROUGH
operator and so on.
Although the EXPLAIN PLAN query produces tabular output, in this document we show a tree representation of the EXPLAIN PLAN output so that the parent-child relationships between operators are easy to see and the user can visualize the bottom-up flow of data in the operator tree execution.
Note a special node with the Operator_Id
and Parent_Id
called PLAN_START(numSegmentsForThisPlan:1)
. This node indicates the number of segments which match a given plan. The EXPLAIN PLAN query can be run with the verbose mode enabled using the query option explainPlanVerbose=true
which will show the varying deduplicated query plans across all segments across all servers.
EXPLAIN PLAN output should only be used for informational purposes because it is likely to change from version to version as Pinot is further developed and enhanced. Pinot uses a "Scatter Gather" approach to query evaluation (see Pinot Architecture for more details). At the Broker, an incoming query is split into several server-level queries for each backend server to evaluate. At each Server, the query is further split into segment-level queries that are evaluated against each segment on the server. The results of segment queries are combined and sent to the Broker. The Broker in turn combines the results from all the Servers and sends the final results back to the user. Note that if the EXPLAIN PLAN query runs without the verbose mode enabled, a single plan will be returned (the heuristic used is to return the deepest plan tree) and this may not be an accurate representation of all plans across all segments. Different segments may execute the plan in a slightly different way.
Reading the EXPLAIN PLAN output from bottom to top will show how data flows from a table to query results. In the example shown above, the FILTER_MATCH_ENTIRE_SEGMENT
operator shows that all 977889 records of the segment matched the query. The DOC_ID_SET
over the filter operator gets the set of document IDs matching the filter operator. The PROJECT
operator over the DOC_ID_SET
operator pulls only those columns that were referenced in the query. The TRANSFORM_PASSTHROUGH
operator just passes the column data from PROJECT
operator to the SELECT
operator. At SELECT
, the query has been successfully evaluated against one segment. Results from different data segments are then combined (COMBINE_SELECT
) and sent to the Broker. The Broker combines and reduces the results from different servers (BROKER_REDUCE
) into a final result that is sent to the user. The PLAN_START(numSegmentsForThisPlan:1)
indicates that a single segment matched this query plan. If verbose mode is enabled many plans can be returned and each will contain a node indicating the number of matched segments.
The rest of this document illustrates the EXPLAIN PLAN output with examples and describe the operators that show up in the output of the EXPLAIN PLAN.
Since verbose mode is enabled, the EXPLAIN PLAN output returns two plans matching one segment each (assuming 2 segments for this table). The first EXPLAIN PLAN output above shows that Pinot used an inverted index to evaluate the predicate "playerID = 'aardsda01'" (FILTER_INVERTED_INDEX
). The result was then fully scanned (FILTER_FULL_SCAN
) to evaluate the second predicate "playerName = 'David Allan'". Note that the two predicates are being combined using AND
in the query; hence, only the data that satisfied the first predicate needs to be scanned for evaluating the second predicate. However, if the predicates were being combined using OR
, the query would run very slowly because the entire "playerName" column would need to be scanned from top to bottom to look for values satisfying the second predicate. To improve query efficiency in such cases, one should consider indexing the "playerName" column as well. The second plan output shows a FILTER_EMPTY
indicating that no matching documents were found for one segment.
The EXPLAIN PLAN output above shows how GROUP BY queries are evaluated in Pinot. GROUP BY results are created on the server (AGGREGATE_GROUPBY_ORDERBY
) for each segment on the server. The server then combines segment-level GROUP BY results (COMBINE_GROUPBY_ORDERBY
) and sends the combined result to the Broker. The Broker combines the GROUP BY results from all the servers to produce the final result, which is sent to the user. Note that the COMBINE_SELECT
operator from the previous query was not used here, instead a different COMBINE_GROUPBY_ORDERBY
operator was used. Depending upon the type of query different combine operators such as COMBINE_DISTINCT
and COMBINE_ORDERBY
etc may be seen.
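A GROUP BY query of the kind described above might look like the following sketch (the table and column names are assumptions):

    EXPLAIN PLAN FOR
    SELECT playerName, COUNT(*)
    FROM baseballStats
    GROUP BY playerName
    ORDER BY COUNT(*) DESC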
The root operator of the EXPLAIN PLAN output is BROKER_REDUCE
. BROKER_REDUCE
indicates that Broker is processing and combining server results into final result that is sent back to the user. BROKER_REDUCE
has a COMBINE operator as its child. Combine operator combines the results of query evaluation from each segment on the server and sends the combined result to the Broker. There are several combine operators (COMBINE_GROUPBY_ORDERBY
, COMBINE_DISTINCT
, COMBINE_AGGREGATE
, etc.) that run depending upon the operations being performed by the query. Under the Combine operator, either a Select (SELECT
, SELECT_ORDERBY
, etc.) or an Aggregate (AGGREGATE
, AGGREGATE_GROUPBY_ORDERBY
, etc.) can appear. An Aggregate operator is present when the query performs aggregation (count(*)
, min
, max
, etc.); otherwise, a Select operator is present. If the query performs scalar transformations (Addition, Multiplication, Concat, etc.), then a TRANSFORM operator appears under the SELECT operator. Often a TRANSFORM_PASSTHROUGH
operator is present instead of the TRANSFORM operator. TRANSFORM_PASSTHROUGH
just passes results from operators that appear lower in the operator execution hierarchy to the SELECT operator. The DOC_ID_SET
operator usually appears above FILTER operators and indicates that a list of matching document IDs is assessed. FILTER operators usually appear at the bottom of the operator hierarchy and show index use. For example, the presence of FILTER_FULL_SCAN indicates that an index was not used (and hence the query is likely to run relatively slowly). However, if the query used an index, one of the indexed filter operators (FILTER_SORTED_INDEX
, FILTER_RANGE_INDEX
, FILTER_INVERTED_INDEX
, FILTER_JSON_INDEX
, etc.) will show up.
Describes the filter relation operator in the multi-stage query engine.
The filter operator is used to filter rows based on a condition.
This page describes the filter operator defined in the relational algebra used by multi-stage queries. This operator is generated by the multi-stage query engine when you use the where, having or sometimes on clauses.
Filter operations apply a predicate to each row and only keep the rows that satisfy the predicate.
It is important to notice that filter operators can only be optimized using indexes when they are executed in the leaf stage. The reason for that is that the intermediate stages don't have access to the actual segments. This is why the engine will try to push down the filter operation to the leaf stage whenever possible.
As explained in explain-plan-multiple-stages, the explain plan in the multi-stage query engine does not indicate whether indexes are used or not.
The filter operator is a streaming operator. It emits the blocks of rows as soon as they are received from the upstream operator.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1. This number is affected by the number of received rows and the complexity of the predicate.
Type: Long
The number of rows emitted by the operator. A large number of emitted rows may not be problematic, but it indicates that the predicate is not very selective.
The filter operator is represented in the explain plan as a LogicalFilter
explain node.
Type: Expression
The condition that is being applied to the rows. The expression may use indexed columns ($0
, $1
, etc), functions and literals. The indexed columns are always 0-based.
For example, the following explain plan:
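A minimal sketch of such a node (only the filter node is shown):

    LogicalFilter(condition=[>($5, 2)])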
Is saying that the filter is applying the condition $5 > 2
which means that only the rows where the 6th column is greater than 2 will be emitted. In order to know which column is the 6th, you need to look at the schema of the table scanned.
As explained in explain-plan-multiple-stages, the explain plan in the multi-stage query engine does not directly indicate whether indexes are used or not.
Apache Pinot contributors are working on improving this, but it is not yet available. Meanwhile, we need an indirect approach to get that information.
First, we need to know on which stage the filter is being used. If the filter is being used in an intermediate stage, then the filter is not using indexes. In order to know the stage, you can extract stages as explained in understanding-stages.
But what about the filters executed in the leaf stage? Not all filters in the leaf stage can use indexes. The only way to know whether a filter is using indexes is to use the single-stage explain plan. To do so, you need to transform the leaf stage into a single-stage query. This is a manual process that can be tedious, but it ends up not being too difficult once you get used to it.
See understanding-multi-stage-query for more information.
Describes the aggregate relation operator in the multi-stage query engine.
The aggregate operator is used to perform calculations on a set of rows and return a single row of results.
This page describes the aggregate operator defined in the relational algebra used by multi-stage queries. This operator is generated by the multi-stage query engine when you use aggregate functions in a query either with or without a group by
clause.
Aggregate operations may be expensive in terms of memory, CPU and network usage. As explained in understanding stages, the multi-stage query engine breaks down the query into multiple stages and each stage is then executed in parallel on different workers. Each worker processes a subset of the data and sends the results to the coordinator which then aggregates the results. When possible, the multi-stage query engine will try to apply a divide-and-conquer strategy to reduce the amount of data that needs to be processed in the coordinator stage.
For example if the aggregation function is a sum, the engine will try to sum the results of each worker before sending the partial result to the coordinator, which would then sum the partial results in order to get the final result. But some aggregation functions, like count(distinct)
, cannot be computed in this way and require all the data to be processed in the coordinator stage.
In Apache Pinot 1.1.0, the multi-stage query engine always keeps the data in memory. This means that the amount of memory used by the engine is proportional to the number of groups generated by the group by
clause and the amount of data that needs to be kept for each group (which depends on the aggregation function).
Even when the aggregation function is a simple count
, which only requires keeping a long for each group in memory, the amount of memory used can be high if the number of groups is high. This is why the engine limits the number of groups. By default, this limit is 100,000, but it can be changed by providing hints.
The aggregate operator is a blocking operator. It needs to consume all the input data before emitting the result.
Type: Integer
Default: 100,000
Defines the max number of groups that can be created by the group by
clause. If the number of groups exceeds this limit, the query will not fail, but it will stop creating new groups, so the result may be partial.
Example:
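A minimal sketch of how the limit might be raised, assuming the aggOptions hint syntax and a table named userAttributes:

    SELECT /*+ aggOptions(num_groups_limit='1000000') */
        userUUID, COUNT(*)
    FROM userAttributes
    GROUP BY userUUID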
Type: Boolean
Default: false
If set to true, the engine will consider that the data is already partitioned by the group by
keys. This means that the engine will not need to shuffle the data to group them by the group by
keys and the coordinator stage will be able to compute the final result without needing to merge the partial results.
Caution: This hint should only be used if the data is already partitioned by the group by
keys. There is no check to verify that the data is indeed partitioned by the group by
keys and using this hint when the data is not partitioned by the group by
keys will lead to incorrect results.
Example:
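A minimal sketch, assuming the aggOptions hint syntax and the is_partitioned_by_group_by_keys option name:

    SELECT /*+ aggOptions(is_partitioned_by_group_by_keys='true') */
        userUUID, COUNT(*)
    FROM userAttributes
    GROUP BY userUUID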
Type: Boolean
Default: false
If set to true, the engine will not push down the aggregate into the leaf stage. In some situations it could be wasted effort to do the group by on the leaf stage, e.g. when the cardinality of the group by column is very high.
Example:
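A minimal sketch, assuming the aggOptions hint syntax and the is_skip_leaf_stage_group_by option name:

    SELECT /*+ aggOptions(is_skip_leaf_stage_group_by='true') */
        userUUID, COUNT(*)
    FROM userAttributes
    GROUP BY userUUID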
Type: Integer
Default: 10,000
Defines the initial capacity of the result holder that stores the intermediate results of the aggregation. This hint can be used to reduce the memory usage of the engine by setting a value close to the expected number of groups. It is usually recommended to not change this hint unless you know that the expected number of groups is much lower than the default value.
Example:
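A minimal sketch, assuming the aggOptions hint syntax and the max_initial_result_holder_capacity option name:

    SELECT /*+ aggOptions(max_initial_result_holder_capacity='1000') */
        userUUID, COUNT(*)
    FROM userAttributes
    GROUP BY userUUID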
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1. Remember that this value is affected by the number of received rows and the complexity of the aggregation function.
Type: Long
The number of groups emitted by the operator. A large number of emitted rows can indicate that the query is not well optimized.
Remember that the number of groups is limited by the num_groups_limit
hint and a large number of groups can lead to high memory usage and slow queries.
Type: Boolean
This stat is set to true when the number of groups exceeds the limit defined by the num_groups_limit
hint. In that case, the query will not fail but will return partial results, which will be indicated by the global partialResponse
stat.
The aggregate operator is represented in the explain plan as a LogicalAggregate
explain node.
Remember that these nodes appear in pairs: First in one stage where the aggregation is done in parallel and then in the upstream stage where the partial results are merged.
Type: List of Integer
The list of columns used in the group by
clause. These numbers are 0-based column indexes on the virtual row projected by the upstream.
For example the explain plan:
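A minimal sketch of such a node (the aggregation call shown is illustrative):

    LogicalAggregate(group=[{6}], agg#0=[COUNT()])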
Is saying that the group by
clause is using the column with index 6 in the virtual row projected by the upstream. Given in this case that row is generated by a table scan, the column with index 6 is the 7th column in the table as defined in its schema.
Type: Expression
The aggregation functions applied to the columns. There may be multiple agg#N
attributes, each one representing a different aggregation function.
For example the explain plan:
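A minimal sketch of such a node (the group key shown is illustrative):

    LogicalAggregate(group=[{0}], agg#0=[COUNT()], agg#1=[MAX($5)])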
Has two aggregation functions: COUNT()
and MAX()
. The second is applied to the column with index 5 in the virtual row projected by the upstream. Given in this case that row is generated by a table scan, the column with index 5 is the 6th column in the table as defined in its schema.
For example, it is recommended to use one of the HyperLogLog-based functions instead of count(distinct)
when the cardinality of the data or the size of each value is large.
For example, it is cheaper to execute count(distinct)
on an int column with 1000 distinct values than on a column that stores very long strings, even if the number of distinct values is the same.
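A hedged sketch of the trade-off, assuming a userAttributes table with a userUUID column:

    -- approximate distinct count: keeps a small HyperLogLog sketch per group
    SELECT DISTINCTCOUNTHLL(userUUID) FROM userAttributes;
    -- exact distinct count: must keep and ship every distinct value
    SELECT COUNT(DISTINCT userUUID) FROM userAttributes;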
Describes the leaf operator in the multi-stage query engine.
The leaf operator is the operator that actually reads the data from the segments. Instead of being just a simple table scan, the leaf operator is a meta-operator that wraps the single-stage query engine and executes all the operators in the leaf stage of the query plan.
The leaf operator is not a relational operator itself but a meta-operator that is able to execute single-stage queries. When servers execute a leaf stage, they compile all operations in the stage except the send operator into the equivalent single-stage query and execute it using a slightly modified version of the single-stage engine.
As a result, leaf stage operators can use all the optimizations and indices that the single-stage engine can use but it also means that there may be slight differences when an operator is executed in a leaf stage compared to when it is executed in an intermediate stage. For example, operations pushed down to the leaf stage may use indexes (see how to know if indexes are used) or the semantics can be slightly different.
You can read Troubleshoot issues with the multi-stage query engine (v2) for more information on the differences between the leaf and intermediate stages, but the main ones are:
Null handling is different.
Some functions are only supported in multi-stage and some others only in single-stage.
Type coercion is different. While the single-stage engine always operates with generic types (i.e. it uses doubles when mathematical operations are involved), the multi-stage engine tries to keep the types (i.e. adding two integers will result in an integer).
One of the slight differences between the leaf and the normal single-stage engine is that the leaf engine tries to be non-blocking.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator.
Type: String
The name of the table that is scanned. This is the name without the type suffix (so without _REALTIME
or _OFFLINE
). This is very useful to understand which table is being scanned by this leaf stage in case of complex queries.
Type: Long
Similar to the same stat in single-stage queries, this stat indicates the number of rows selected after the filter phase.
If it is very high, that means the selectivity for the query is low and lots of rows need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.
Type: Long
Similar to the same stat in single-stage queries, this stat indicates the number of entries (aka scalar values) scanned in the filtering phase of query execution.
Can be larger than the total scanned doc count because of multiple filtering predicates or multi-value entries. Can also be smaller than the total scanned doc count if indexing is used for filtering.
This along with numEntriesScannedPostFilter
indicates where most of the time is spent during table scan processing. If this value is high, enabling indexing for affected columns is a way to bring it down. Another option is to partition the data based on the dimension most heavily used in your filter queries.
Type: Long
Similar to the same stat in single-stage queries, this stat indicates the number of entries (aka scalar values) scanned after the filtering phase of query execution, i.e. the aggregation and/or group-by phases. This is equivalent to numDocScanned * number of projected columns
.
This along with numEntriesScannedInFilter
indicates where most of the time is spent during table scan processing. A high number for this means the selectivity is low (that is, Pinot needs to scan a lot of records to answer the query). If this is high, consider using star-tree index, given a regular index won't improve performance.
Type: Integer
Similar to the same stat in single-stage queries, this stat indicates the total number of segments queried for a query. May be less than the total number of segments if the broker applies optimizations.
The broker decides how many segments to query on each server, based on broker pruning logic. The server decides how many of these segments to actually look at, based on server pruning logic. After processing, only some of these segments may contain matching records.
In general, numSegmentsQueried >= numSegmentsProcessed >= numSegmentsMatched
.
Type: Integer
Similar to the same stat in single-stage queries, this stat indicates the number of segments processed with at least one document matched in the query response.
The more segments are processed, the more IO has to be done. This is why selective queries where numSegmentsProcessed
is close to numSegmentsQueried
can be optimized by changing the data distribution.
Type: Integer
Similar to the same stat in single-stage queries, this stat indicates the number of segment operators used to process segments. Indicates the effectiveness of the pruning logic.
Type: Long
Similar to the same stat in single-stage queries, this stat indicates the number of rows in the table.
Type: Boolean
Similar to the same stat in single-stage queries and the same in aggregate operators, this stat indicates if the max group limit has been reached in a group by
aggregation operator executed in the leaf stage.
If this boolean is set to true, the query result may not be accurate. The default value for numGroupsLimit
is 100k, and should be sufficient for most use cases.
Type: Integer
Number of result resizes for queries
Type: Long
Time spent resizing results for the output, for example because of a LIMIT clause, the maximum allowed group by keys, or other criteria.
Type: Long
Aggregated thread cpu time in nanoseconds for query processing from servers. This metric is only available if Pinot is configured with pinot.server.instance.enableThreadCpuTimeMeasurement
.
Type: Long
Aggregated system activities CPU time in nanoseconds for query processing (e.g. GC, OS paging, etc.). This metric is only available if Pinot is configured with pinot.server.instance.enableThreadCpuTimeMeasurement
.
Type: Integer
The number of segments pruned by the server, for any reason.
Type: Integer
The number of segments pruned because they are invalid. Segments are invalid when the schema has changed and the segment has not been refreshed.
For example, if a column is added to the schema, the segment will be invalid for queries that use that column until it is refreshed.
Type: Integer
The number of segments pruned because they are not needed for the query due to the limit clause.
Pinot keeps a count of the number of rows returned by each segment. Once it's guaranteed that no more segments need to be read to satisfy the limit clause without breaking semantics, the remaining segments are pruned.
For example, a query like SELECT col1 FROM table2 LIMIT 10
can be pruned for this reason while a query like SELECT col1 FROM table2 ORDER BY col1 DESC LIMIT 10
cannot, because Pinot needs to read all segments to guarantee that the largest values of col1
are returned.
Type: Integer
The number of segments pruned because they are not needed for the query due to a value clause, usually a where
.
Pinot keeps the maximum and minimum values of each segment for each column. If the value clause is such that the segment cannot contain any rows that satisfy the clause, the segment is pruned.
Type: Integer
Like numSegmentsProcessed
but only for consuming segments.
Type: Integer
Like numSegmentsMatched
but only for consuming segments.
Type: Long
The time spent by the operator executing.
Type: Long
The instant in time when the operator started executing.
Given that the leaf operator is a meta-operator, it is not directly shown in the explain plan. But the leaf operator is the only one that can execute table scans, so here we list the attributes that can be found in the explain plan for a table scan.
Type: String array
Example: table=[[default, userGroups]]
The qualified name of the table that is scanned, which means it also contains the name of the database being used.
Leaf stage operators can use all the optimizations and indices that the single-stage engine can use. This means that it is usually better to push down as much as possible to the leaf stage.
The engine is smart enough to push down filters and aggregations without breaking semantics, but sometimes there are subtle differences between SQL semantics and what the domain expert writing the query actually intends.
Sometimes the engine is too paranoid about null handling, or the query includes an unnecessary limit clause that prevents the engine from pushing down the filter.
It is recommended to analyze your explain plan to be sure that the engine is able to push down as much logic as you expect.
Describes the literal relation operator in the multi-stage query engine.
The literal operator is used to define a constant value in the query. This operator may be generated by the multi-stage query engine when you use a constant value in a query.
The literal operator is a blocking operator, but given its trivial nature it should not matter.
The literal operator is a simple operator that does not require any computation.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator. It should always be one.
None
Take care when using very large literals (on the order of hundreds of KBs), as they may need to be sent from brokers to servers and in general may introduce latency in parsing and query optimization.
Describes the mailbox send operator in the multi-stage query engine.
The mailbox send operator is the operator that sends data to the mailbox receive operator. This is not an actual relational operator but a Pinot extension used to send data to other stages.
These operators are always the root of the intermediate and leaf stages.
Stages in the multi-stage query engine are executed in parallel by different workers. Workers send data to each other using mailboxes. The number of mailboxes depends on the send operator parallelism, the receive operator parallelism and the distribution being used. At worst, there is one mailbox per worker pair, so if the upstream send operator has a parallelism of S and the receive operator has a parallelism of R, there will be S * R mailboxes.
By default, these mailboxes are GRPC channels, but when both workers are in the same server, they can use shared memory and therefore a more efficient on heap mailbox is used.
The mailbox send operator wraps these mailboxes, offering a single logical mailbox to the stage. How data is distributed to the different workers of the downstream stage is determined by the distribution of the operator. The supported distributions are hash
, random
and broadcast
.
hash
means there are multiple instances of the stream, and each instance contains records whose keys hash to a particular hash value. Instances are disjoint; a given record appears on exactly one stream. The list of numbers in the bracket indicates the columns used to calculate the hash. These numbers are 0-based column indexes on the virtual row projected by the upstream.
random
means there are multiple instances of the stream, and each instance contains randomly chosen records. Instances are disjoint; a given record appears on exactly one stream.
broadcast
means there are multiple instances of the stream, and all records appear in each instance. This is the most expensive distribution, as it requires sending all the data to all the workers.
The mailbox send operator is a streaming operator. It emits the blocks of rows as soon as they are received from the upstream operator.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator. This operator should always emit as many rows as its upstream operator.
Type: number
The stage id of the operator. The root stage has id 0 and this number is incremented by 1 for each stage. The current implementation iterates over the stages in pre-order traversal, although this is not guaranteed.
Type: Int
Number of threads executing the stage. Although this stat is only reported in the send mailbox operator, it is the same for all operators in the stage.
Type: Int
The number of workers this operator is sending data to. A large fan-out means that the operator is sending data to many workers, which may be a bottleneck; this can sometimes be improved by using partitioning.
Type: Int
How many messages have been sent in heap format by this mailbox. Sending in heap messages is more efficient than sending in raw format, as the messages do not need to be serialized and deserialized and no network transfer is needed.
Type: Int
How many messages have been sent in raw format and therefore serialized by this mailbox. Sending in heap messages is more efficient than sending in raw format, as the messages do not need to be serialized and deserialized and no network transfer is needed.
Type: Long
How many bytes have been serialized by this mailbox. A high number here indicates that the mailbox is sending a lot of data to other servers, which is expensive in terms of CPU, memory and network.
Type: Long
How long it took to serialize the raw messages sent by this mailbox. This time is not wall time, but the sum of the time spent by all threads serializing messages.
Take into account that this time does not include the impact on the network or the GC.
Given that the mailbox send operator is not an actual relational operator, it is not directly shown in the explain plan. Instead, a single PinotLogicalExchange
or PinotLogicalSortExchange
is shown in the explain plan. This exchange explain node is the logical representation of a pair of send and receive operators.
Type: Expression
Example: distribution=[hash[0]]
, distribution=[random]
or distribution=[broadcast]
While broadcast
and random
distributions don't have any parameters, the hash
distribution includes a list of numbers in brackets. That list represents the columns used to calculate the hash and are the 0-based column indexes on the virtual row projected by the upstream operator.
For example, in the following explain plan:
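A minimal sketch of such a node:

    PinotLogicalExchange(distribution=[hash[0, 1]])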
Indicates that the data is distributed by the first and second columns of the projected row, which are groupUUID
and userUUID
respectively.
None
Describes the mailbox receive operator in the multi-stage query engine.
The mailbox receive operator is the operator that receives the data from the mailbox send operator. This is not an actual relational operator but a Pinot extension used to receive data from other stages.
Stages in the multi-stage query engine are executed in parallel by different workers. Workers send data to each other using mailboxes. The number of mailboxes depends on the send operator parallelism, the receive operator parallelism and the distribution being used. At worst, there is one mailbox per worker pair, so if the upstream send operator has a parallelism of S and the receive operator has a parallelism of R, there will be S * R mailboxes.
By default, these mailboxes are GRPC channels, but when both workers are in the same server, they can use shared memory and therefore a more efficient on heap mailbox is used.
The mailbox receive operator pulls data from these mailboxes and sends it to the downstream operator.
The mailbox receive operator is a streaming operator. It emits the blocks of rows as soon as they are received from the upstream operator.
It is important to notice that the mailbox receive operator tries to be fair when reading from multiple workers.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator. This operator should always emit as many rows as its upstream operator.
Type: Long
How many workers are sending data to this operator.
Type: Long
How many messages have been received in heap format by this mailbox. Receiving in heap messages is more efficient than receiving them in raw format, as the messages do not need to be serialized and deserialized and no network transfer is needed.
Type: Long
How many messages have been received in raw format and therefore serialized by this mailbox. Receiving in heap messages is more efficient than receiving them in raw format, as the messages do not need to be serialized and deserialized and no network transfer is needed.
Type: Long
How many bytes have been deserialized by this mailbox. A high number here indicates that the mailbox is receiving a lot of data from other servers, which is expensive in terms of CPU, memory and network.
Type: Long
How long it took to deserialize the raw messages sent to this mailbox. This time is not wall time, but the sum of the time spent by all threads deserializing messages.
Take into account that this time does not include the impact on the network or the GC.
Type: Long
How much time this operator has been blocked waiting while offering data to be consumed by the downstream operator. A high number here indicates that the downstream operator is slow and may be a bottleneck. For example, usually the receive operator that is the left input of a join operator has a high value here, as the join needs to consume all the messages from the right input before it can start consuming the left input.
Type: Long
How much time this operator has been blocked waiting for more data to be sent by the upstream (send) operator. A high number here indicates that the upstream operator is slow and may be a bottleneck. For example, blocking operators like aggregations, sorts, joins or window functions require all the data to be received before they can start emitting a result, so having them as upstream operators of a mailbox receive operator can lead to high values here.
Given that the mailbox receive operator is not an actual relational operator, it is not directly shown in the explain plan. Instead, a single PinotLogicalExchange
or PinotLogicalSortExchange
is shown in the explain plan. This exchange explain node is the logical representation of a pair of send and receive operators.
None
Apache Pinot supports a few funnel functions:
FunnelMaxStep
evaluates user interactions within a specified time window to determine the furthest step reached in a predefined sequence of actions. By analyzing event timestamps and conditions set for each step, it identifies the maximum progression point for each user, ensuring that the sequence follows the configured order or other specific rules like strict timestamp increases or event uniqueness. This function is instrumental in funnel analysis, helping businesses and analysts understand user behavior, measure conversion rates, and identify potential drop-offs in critical user journeys.
Similar to FunnelMaxStep
, this function returns an array which reflects the matching status for the steps.
This function evaluates all funnel events and returns how many times the user has completed the full sequence of steps.
Learn more about multi-stage query engine and how to troubleshoot issues.
The general explanation of the multi-stage query engine is provided in the reference documentation. This section provides a deep dive into the multi-stage query engine. Most of the concepts explained here are related to the internals of the multi-stage query engine and users don't need to know about them in order to write queries. However, understanding these concepts can help you to take advantage of the engine's capabilities and to troubleshoot issues.
This document contains all the available query options
After release 0.11.0, query options can be set using the SET
statement:
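For example, a minimal sketch using the timeoutMs option (the table name myTable is an assumption):

    SET timeoutMs = 20000;
    SELECT COUNT(*) FROM myTable;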
Before release 0.11.0, query options can be appended to the query with the OPTION
keyword:
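For example, the same query with the pre-0.11.0 syntax (table name is an assumption):

    SELECT COUNT(*) FROM myTable OPTION(timeoutMs=20000)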
Query options can be specified in the query API using queryOptions as the key and ';'-separated key-value pairs. Alternatively, we can also use the SET keyword in the SQL query.
To see how JSON data can be queried, assume that we have the following table:
We also assume that "jsoncolumn" has a JSON index on it. Note that the last two rows in the table have a different structure than the rest of the rows. In keeping with the JSON specification, a JSON column can contain any valid JSON data and doesn't need to adhere to a predefined schema. To pull out the entire JSON document for each row, we can run the query below:
To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name.
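A minimal sketch of such a drill-down query, assuming the table is named myTable:

    SELECT id,
           jsoncolumn.name.last AS last_name,
           jsoncolumn.data[1] AS value
    FROM myTable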
Note that the third column (value) is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath $.data[1]. We can filter out these rows.
Certain last names (duck and mouse for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression.
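A hedged sketch of such a GROUP BY query, assuming the same myTable and that last names live under the name.last path:

    SELECT jsoncolumn.name.last AS last_name, COUNT(*) AS num
    FROM myTable
    GROUP BY jsoncolumn.name.last
    ORDER BY num DESC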
Also, there is numerical information (jsoncolumn.$.id) embedded within the JSON document. We can extract those numerical values from JSON data into SQL and sum them up using the query below.
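A hedged sketch (the JsonPath of the id field and the grouping column are assumptions):

    SELECT jsoncolumn.name.last AS last_name, SUM(jsoncolumn.id) AS total
    FROM myTable
    GROUP BY jsoncolumn.name.last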
Note that the JSON_MATCH
function utilizes JsonIndex
and can only be used if a JsonIndex
is already present on the JSON column. As shown in the examples above, the second argument of JSON_MATCH
operator takes a predicate. This predicate is evaluated against the JsonIndex
and supports =
, !=
, IS NULL
, or IS NOT NULL
operators. Relational operators, such as >
, <
, >=
, and <=
are currently not supported. However, you can combine the use of JSON_MATCH
and JSON_EXTRACT_SCALAR
function (which supports >
, <
, >=
, and <=
operators) to get the necessary functionality as shown below.
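A hedged sketch of this combination, assuming the myTable example above:

    SELECT id
    FROM myTable
    WHERE JSON_MATCH(jsoncolumn, '"$.id" IS NOT NULL')
      AND JSON_EXTRACT_SCALAR(jsoncolumn, '$.id', 'INT', 0) > 102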
JSON_MATCH
function also provides the ability to use wildcard *
JsonPath expressions even though it doesn't support full JsonPath expressions.
While JSON_MATCH supports IS NULL
and IS NOT NULL
operators, these operators should only be applied to leaf-level path elements, i.e. the predicate JSON_MATCH(jsoncolumn, '"$.data[*]" IS NOT NULL')
is not valid since "$.data[*]"
does not address a "leaf" element of the path; however, "$.data[0]" IS NOT NULL')
is valid since "$.data[0]"
unambiguously identifies a leaf element of the path.
JSON_EXTRACT_SCALAR
does not utilize JsonIndex and therefore performs slower than JSON_MATCH
which utilizes JsonIndex. However, JSON_EXTRACT_SCALAR
supports a wider range of JsonPath expressions and operators. To make the best use of fast index access (JSON_MATCH
) along with JsonPath expressions (JSON_EXTRACT_SCALAR
) you can combine the use of these two functions in WHERE clause.
The second argument of the JSON_MATCH
function is a boolean expression in string form. This section shows how to correctly write the second argument of JSON_MATCH. Let's assume we want to search the JSON array data
for values k
and j
. This can be done by the following predicate:
To convert this predicate into string form for use in JSON_MATCH, we first turn the left side of the predicate into an identifier by enclosing it in double quotes:
Next, the literals in the predicate also need to be enclosed by '. Any existing ' need to be escaped as well. This gives us:
Finally, we need to create a string out of the entire expression above by enclosing it in ':
Now we have the string representation of the original predicate and this can be used in JSON_MATCH function:
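Putting it all together, a hedged sketch of the resulting call (the exact predicate shape is an assumption: the array is addressed with the wildcard path "$.data[*]" and both values are required):

    SELECT COUNT(*)
    FROM myTable
    WHERE JSON_MATCH(jsoncolumn, '"$.data[*]" = ''k'' AND "$.data[*]" = ''j''')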
This document contains the list of all the transformation functions supported by Pinot SQL.
Multiple string functions are supported out of the box from release 0.5.0.
Date time functions allow you to perform transformations on columns that contain timestamps or dates.
These functions can only be used in Pinot SQL queries.
These functions can be used for column transformation in table ingestion configs.
All of the functions mentioned so far only support single-value columns. You can use the following functions to do operations on multi-value columns.
Describes the multi-stage operators in general
The multi-stage query engine uses a set of operators to process the query. These operators are based on relational algebra, with some modifications to better fit the distributed nature of the engine.
These operators are the execution units that Pinot uses to execute a query. The operators are executed in a pipeline with tree structure, where each operator consumes the output of the previous operators (also known as upstreams).
Users do not directly specify these operators. Instead, they write SQL queries that are translated into a logical plan, which is then transformed into different operators. The logical plan can be obtained using EXPLAIN PLAN, while there is no way to get the operators directly. The closest thing to the operators that users can get is the stage stats output.
These operators are generated from the SQL query that you write, but even though they are similar, there is not a one-to-one mapping between the SQL clauses and the operators. Some SQL clauses generate multiple operators, while some operators are generated by multiple SQL clauses.
Operators and explain plan nodes are closer than SQL clauses and operators. Although most explain plan nodes can be directly mapped to an operator, there are some exceptions:
Each PinotLogicalExchange
and each PinotLogicalSortExchange
explain node is materialized into a pair of mailbox send and mailbox receive operators.
All plan nodes that belong to the same leaf stage are executed in the leaf operator.
In general terms, the operators are the execution units that Pinot uses to execute a query and are also known as the multi-stage physical plan, while the explain plan nodes are logical plans. The difference between the two is that the operators can be actually executed, while the explain plan nodes are the logical representation of the query plan.
The following is a list of operators that are used by the multi-stage query engine:
Describes the hash join relation operator in the multi-stage query engine.
The hash join operator is used to join two relations using a hash join algorithm. It is a binary operator that takes two inputs, the left and right relations, and produces a single output relation.
This is the only join operator in the multi-stage query engine. It is usually created as a result of a query that contains a join clause, but it can also be created by other SQL constructs, such as semi-joins.
There are different types of joins that can be performed using the hash join operator. Apache Pinot supports:
Inner join, where only the rows that have a match in both relations are returned.
Left join, where all the rows from the left relation are returned. The ones that have a match with the right relation are returned with the columns from the right relation, and the ones that do not have a match are returned with null values for the columns from the right relation.
Right join, like the left join but returning all the rows from the right relation, with the columns from the left relation filled with null values for the rows that do not have a match.
Full outer join, where all the rows from both relations are returned. If a row from any relation does not have a match in the other relation, the columns from the other relation are filled with null values.
Semi-join, where only the rows from the left relation that have a match in the right relation are returned. This is useful to filter the rows from the left relation based on the existence of a match in the right relation.
Anti-join, where only the rows from the left relation that do not have a match in the right relation are returned.
The hash join operator is one of the new operators introduced in the multi-stage query engine. The current implementation assumes that the right input relation is the smaller one, so it consumes this input first building a hash table that is then probed with the left input relation.
Future optimizations may include advanced heuristics to decide which input relation to consume first, but in the current implementation, it is important to specify the smaller relation as the right input.
Although the whole multi-stage query engine is designed to process the data in memory, the multi-stage query engine uses the ability to execute each stage in different workers (explained in understanding stages) to be able to process data that may not fit in the memory of a single node. Specifically, each worker processes a subset of the data. Inputs are by default partitioned by the join keys and each worker processes one partition of the data.
This means that data usually needs to be shuffled between workers, which is done by the engine using a mailbox system. The engine tries to minimize the amount of data that needs to be shuffled by partitioning the data, but some techniques can be used to reduce the amount of data that needs to be shuffled, like using co-located joins.
The hash join operator is a blocking operator. It needs to consume all the input data (from both inputs) before emitting the result.
Even using partitioning, the amount of data that needs to be stored in memory can be high, so the engine tries to protect itself from running out of memory by limiting the number of rows that can be stored in the join hash table.
Type: String
Default: THROW
Defines the behavior of the engine when the number of rows exceeds the limit defined by the max_rows_in_join
hint. The possible values are:
THROW
: The query will fail if the number of rows exceeds the limit.
BREAK
: The engine will stop processing the join and return the results that have been computed so far. In this case the stat maxRowsInJoinReached
will be true.
Type: Integer
Default: 1,048,576
The maximum number of rows that can be stored in the hash table built from the join keys. What happens when this limit is reached is defined by the join_overflow_mode
hint.
Take care when increasing this limit. If the number of rows is too high, the amount of memory used by the engine can be very high, which can lead to very large GC pauses and even out-of-memory errors.
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator. Joins can emit more rows than the input relations, so this value can be higher than the number of rows in the inputs. Remember that the number of rows in the join hash table is limited by the max_rows_in_join
hint, and a large number of rows can lead to high memory usage and long GC pauses, which can affect the performance of the whole system.
Type: Boolean
This stat is set to true when the number of rows exceeds the limit defined by the max_rows_in_join
hint.
Notice that by default the engine will throw an exception when this happens in which case no stat will be emitted. Therefore this stat is only emitted when the join_overflow_mode
hint is set to BREAK
.
Type: Long
The time spent building the hash table used to probe the join keys, in milliseconds.
A large number here can indicate that the right relation is too large or the right relation is taking too long to be processed.
The hash join operator is represented in the explain plan as a LogicalJoin
explain node.
Type: Expression
The condition that is being applied to the rows to join the relations. The expression may use indexed columns ($0
, $1
, etc), functions and literals. The indexed columns are always 0-based.
For example, the following explain plan:
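A minimal sketch of such a node:

    LogicalJoin(condition=[=($0, $1)], joinType=[inner])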
Is saying that the join condition is that the column with index 0 in the left relation is equal to the column with index 1 in the right relation. Given the rest of the explain plan, we can see that the column with index 0 is the userUUID
column in the userAttributes
table and the column with index 1 is the userUUID
column in the userGroups
table.
Type: String
Apache Pinot does not use table stats to determine the best order to consume the input relations. Instead, it assumes that the right input relation is the smaller one. That relation will always be fully consumed to build a hash table and sometimes it will be broadcasted to all workers. This means that it is important to specify the smaller relation as the right input.
Remember that left and right are relative to the order of the tables in the SQL query. It is less expensive to do a join between a large table and a small table than the other way around.
For example, this query:
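A hedged sketch, assuming userAttributes is the larger table and userGroups the smaller one:

    SELECT ua.userUUID, ug.groupUUID
    FROM userAttributes ua
    JOIN userGroups ug ON ua.userUUID = ug.userUUID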
is more efficient than:
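    -- same join, but with the assumed larger table on the right side
    SELECT ua.userUUID, ug.groupUUID
    FROM userGroups ug
    JOIN userAttributes ua ON ug.userUUID = ua.userUUID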
Describes the intersect relation operator in the multi-stage query engine.
The intersect operator is a relational operator that combines two relations and returns the common rows between them. The operator is used to find the intersection of two or more relations, usually by using the SQL INTERSECT
operator.
Although it is accepted by the parser, the ALL
modifier is currently ignored. Therefore INTERSECT
and INTERSECT ALL
are equivalent. This issue has been reported in
The current implementation consumes the whole right input relation first and stores the rows in a set. Then it consumes the left input relation one block at a time. Each time a block of rows is read from the left input relation, the operator checks if the rows are in the set of rows from the right input relation. All unique rows that are in the set are added to a new partial result block. Once the whole left input block is analyzed, the operator emits the partial result block.
This process is repeated until all rows from the left input relation are processed.
In pseudo-code, the algorithm looks like this:
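A hedged sketch of this algorithm in pseudo-code (names are illustrative):

    rightSet = empty set
    // blocking phase: consume the whole right input
    for each block in rightInput:
        for each row in block:
            rightSet.add(row)
    // streaming phase: consume the left input one block at a time
    for each block in leftInput:
        partialResult = empty block
        for each row in block:
            if rightSet.remove(row):   // true only the first time this row is matched
                partialResult.add(row)
        emit(partialResult)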
The intersect operator is a semi-blocking operator that first consumes the right input relation in a blocking fashion and then consumes the left input relation in a streaming fashion.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator.
The intersect operator is represented in the explain plan as a LogicalIntersect
explain node.
Type: Boolean
This attribute is used to indicate if the operator should return all the rows or only the distinct rows.
The intersect operator has a memory footprint that is proportional to the number of unique rows in the right input relation. It also consumes the right input relation in a blocking fashion while the left input relation is consumed in a streaming fashion.
This means that:
In case any of the input relations is significantly larger than the other, it is recommended to use the smaller relation as the right input relation.
In case one of the inputs is blocking and the other is not, it is recommended to use the blocking relation as the right input relation.
These two hints can be contradictory, so it is up to the user to decide which one to follow based on the specific query pattern. Remember that you can use the stage stats to check the number of rows emitted by each of the inputs and adjust the order of the inputs accordingly.
This stat is useful to understand how to extract stages from queries, as explained in understanding stages.
The distribution used by the mailbox receive operator. Values supported by Pinot are hash
, random
and broadcast
, as explained in the mailbox send operator section.
See the mailbox send operator section to understand the attributes of the exchange explain node.
Pinot supports Geospatial queries on columns containing text-based geographies. For more details on the queries and how to enable them, see the Geospatial support documentation.
Pinot supports pattern matching on text-based columns. Only the columns mentioned as text columns in the table config can be queried using this method. For more details on how to enable pattern matching, see the Text search support documentation.
The join_overflow_mode hint can be used to control the behavior of the engine when the number of rows in the join exceeds the limit. This limit can be defined using the max_rows_in_join hint. By default, this limit is slightly above 1 million rows and the default join overflow mode is THROW
, which means that the query will fail if the number of rows exceeds the limit.
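A hedged sketch of how these hints might be set, assuming the joinOptions hint syntax:

    SELECT /*+ joinOptions(join_overflow_mode='BREAK', max_rows_in_join='2000000') */
        ua.userUUID, ug.groupUUID
    FROM userAttributes ua
    JOIN userGroups ug ON ua.userUUID = ug.userUUID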
The type of join that is being performed. The possible values are: inner
, left
, right
, full
, semi
and anti
, as explained in the hash join operator section.
Although it is accepted in SQL, the all
attribute is not currently used in the intersect operator. The returned rows are always distinct. This issue has been reported in
timeoutMs
Timeout of the query in milliseconds
Use table/broker level timeout
enableNullHandling
Enables advanced null handling. See Null value support for more information.(introduced in 0.11.0)
false
(disabled)
explainPlanVerbose
Return verbose result for EXPLAIN
query (introduced in 0.11.0)
false
(not verbose)
useMultistageEngine
Use multi-stage engine to execute the query (introduced in 0.11.0)
false
(use single-stage engine)
maxExecutionThreads
Maximum threads to use to execute the query. Useful to limit the resource usage for expensive queries
Half of the CPU cores for non-group-by queries; all CPU cores for group-by queries
numReplicaGroupsToQuery
When replica-group based routing is enabled, use it to query multiple replica-groups (introduced in 0.11.0)
1
(only query servers within the same replica-group)
minSegmentGroupTrimSize
Minimum groups to keep when trimming groups at the segment level for group-by queries. See #configuration-parameters
Server level config
minServerGroupTrimSize
Minimum groups to keep when trimming groups at the server level for group-by queries. See #configuration-parameters
Server level config
skipIndexes
Which indexes to skip usage of (i.e. scan instead), per-column. This is useful for side-by-side comparison/debugging. There can be cases where the use of an index is actually more expensive than performing a scan of the docs which match other filters. One such example could be a low-selectivity inverted index used in conjunction with another highly selective filter.
Config can be specified using url parameter format: skipIndexes='col1=inverted,range&col2=inverted'
Possible index types to skip are: sorted, range, inverted, H3
. To find out which indexes are used to resolve a given query, use the EXPLAIN
query.
null/empty
(use all available indexes)
skipUpsert
For upsert-enabled table, skip the effect of upsert and query all the records. See Stream ingestion with Upsert
false
(exclude the replaced records)
useStarTree
Useful to debug the star-tree index (introduced in 0.11.0)
true
(use star-tree if available)
AndScanReordering
disabled
maxRowsInJoin
Configure maximum rows allowed in join hash-table creation phase
default value read from cluster config
if not set, the default will be
2^20 (1024*1024)
inPredicatePreSorted
(Only applies to STRING columns) Indicates that the values in the IN clause are already sorted, so that Pinot doesn't need to sort them again at query time
false
(values in the IN predicate are not pre-sorted)
inPredicateLookupAlgorithm
(Only applies to STRING columns) The algorithm to use to look up the dictionary ids for the IN clause values.
DIVIDE_BINARY_SEARCH
: Sort the IN clause values and do binary search on both dictionary and IN clause values at same time to reduce the value lookups
SCAN
: Sort the IN clause values and scan both dictionary and IN clause values to get the matching dictionary ids
PLAIN_BINARY_SEARCH
: Do not sort the IN clause values, but directly binary search each IN clause value in the dictionary
DIVIDE_BINARY_SEARCH
maxServerResponseSizeBytes
Long value config indicating the maximum length of the serialized response per server for a query.
Overriding priority order: 1. QueryOption -> maxServerResponseSizeBytes
2. QueryOption -> maxQueryResponseSizeBytes
3. TableConfig -> maxServerResponseSizeBytes
4. TableConfig -> maxQueryResponseSizeBytes
5. BrokerConfig -> maxServerResponseSizeBytes
6. BrokerConfig -> maxQueryResponseSizeBytes
maxQueryResponseSizeBytes
Long value config indicating the maximum serialized response size across all servers for a query. This value is equally divided across all servers processing the query.
Overriding priority order: 1. QueryOption -> maxServerResponseSizeBytes
2. QueryOption -> maxQueryResponseSizeBytes
3. TableConfig -> maxServerResponseSizeBytes
4. TableConfig -> maxQueryResponseSizeBytes
5. BrokerConfig -> maxServerResponseSizeBytes
6. BrokerConfig -> maxQueryResponseSizeBytes
101, duck, daffy, b
102, duck, donald, b
103, mouse, mickey, b
104, mouse, minnie, b
105, dwag, goofy, b
106, null, null, null
107, null, null, null
101, duck, daffy, b
102, duck, donald, b
103, mouse, mickey, b
104, mouse, minnie, b
105, dwag, goofy, b
"mouse"
"2"
"duck"
"2"
"dwag"
"1"
"mouse"
"207"
"dwag"
"104"
"duck"
"203"
"mouse"
"207"
"dwag"
"104"
"duck"
"102"
ADD(col1, col2, col3...) Sum of at least two values
SUB(col1, col2) Difference between two values
MULT(col1, col2, col3...) Product of at least two values
DIV(col1, col2) Quotient of two values
MOD(col1, col2) Modulo of two values
ABS(col1) Absolute of a value
CEIL(col1) Rounded up to the nearest integer.
FLOOR(col1) Rounded down to the nearest integer.
EXP(col1) Euler’s number(e) raised to the power of col.
LN(col1) Natural log of value i.e. ln(col1)
SQRT(col1) Square root of a value
UPPER(col) convert string to upper case
LOWER(col) convert string to lower case
REVERSE(col) reverse the string
SUBSTR(col, startIndex, endIndex) Gets substring of the input string from start to endIndex. Index begins at 0. Set endIndex to -1 to calculate till end of the string
CONCAT(col1, col2, separator) Concatenate two input strings using the separator
TRIM(col) trim spaces from both side of the string
LTRIM(col) trim spaces from left side of the string
RTRIM(col) trim spaces from right side of the string
LENGTH(col) calculate length of the string
STRPOS(col, find, N)
Find Nth instance of find
string in input. Returns 0 if input string is empty. Returns -1 if the Nth instance is not found or input string is null.
STARTSWITH(col, prefix)
returns true
if the column starts with the prefix string.
REPLACE(col, find, substitute)
replace all instances of find
with substitute
in input
RPAD(col, size, pad)
string padded from the right side with pad
to reach final size
LPAD(col, size, pad)
string padded from the left side with pad
to reach final size
CODEPOINT(col) the Unicode codepoint of the first character of the string
CHR(codepoint) the character corresponding to the Unicode codepoint
regexpExtract(value, regexp) Extracts values that match the provided regular expression
regexpReplace(input, matchRegexp, replaceRegexp, matchStartPos, occurrence, flag) Find and replace a string or regexp pattern with a target string or regexp pattern
remove(input, search) removes all instances of search from string
urlEncoding(string) url-encode a string with UTF-8 format
urlDecoding(string) decode a url to plaintext string
fromBase64(string) decode a Base64-encoded string to bytes represented as a hex string
toUtf8(string) decode a UTF8-encoded string to bytes represented as a hex string
isSubnetOf(ipPrefix, ipAddress) checks if ipAddress is in the subnet of the ipPrefix
TIMECONVERT(col, fromUnit, toUnit) Converts the value into another time unit. The column should be an epoch timestamp.
DATETIMECONVERT(columnName, inputFormat, outputFormat, outputGranularity) Converts the value into another date time format, and buckets time based on the given time granularity.
DATETRUNC Converts the value into a specified output granularity seconds since UTC epoch that is bucketed on a unit in a specified timezone.
ToEpoch<TIME_UNIT>(timeInMillis) Convert epoch milliseconds to epoch <Time Unit>.
ToEpoch<TIME_UNIT>Rounded(timeInMillis, bucketSize) Convert epoch milliseconds to epoch <Time Unit>, round to nearest rounding bucket(Bucket size is defined in <Time Unit>).
ToEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize) Convert epoch milliseconds to epoch <Time Unit>, and divided by bucket size(Bucket size is defined in <Time Unit>).
FromEpoch<TIME_UNIT>(timeIn<Time_UNIT>) Convert epoch <Time Unit> to epoch milliseconds.
FromEpoch<TIME_UNIT>Bucket(timeIn<Time_UNIT>, bucketSizeIn<Time_UNIT>) Convert epoch <Bucket Size><Time Unit> to epoch milliseconds.
ToDateTime(timeInMillis, pattern[, timezoneId]) Convert epoch millis value to DateTime string represented by pattern.
FromDateTime(dateTimeString, pattern) Convert DateTime string represented by pattern to epoch millis.
round(timeValue, bucketSize) Round the given time value to nearest bucket start value.
now() Return current time as epoch millis
ago() Return time as epoch millis before the given period (in ISO-8601 duration format)
timezoneHour(timeZoneId) Returns the hour of the time zone offset.
timezoneMinute(timeZoneId) Returns the minute of the time zone offset.
year(tsInMillis) Returns the year from the given epoch millis in UTC timezone.
year(tsInMillis, timeZoneId) Returns the year from the given epoch millis and timezone id.
yearOfWeek(tsInMillis)
Returns the year of the ISO week from the given epoch millis in UTC timezone. Alias yow
is also supported.
yearOfWeek(tsInMillis, timeZoneId)
Returns the year of the ISO week from the given epoch millis and timezone id. Alias yow
is also supported.
quarter(tsInMillis) Returns the quarter of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 4.
quarter(tsInMillis, timeZoneId) Returns the quarter of the year from the given epoch millis and timezone id. The value ranges from 1 to 4.
month(tsInMillis) Returns the month of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 12.
month(tsInMillis, timeZoneId) Returns the month of the year from the given epoch millis and timezone id. The value ranges from 1 to 12.
week(tsInMillis) Returns the ISO week of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 53. Alias weekOfYear is also supported.
week(tsInMillis, timeZoneId) Returns the ISO week of the year from the given epoch millis and timezone id. The value ranges from 1 to 53. Alias weekOfYear is also supported.
dayOfYear(tsInMillis) Returns the day of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 366. Alias doy is also supported.
dayOfYear(tsInMillis, timeZoneId) Returns the day of the year from the given epoch millis and timezone id. The value ranges from 1 to 366. Alias doy is also supported.
day(tsInMillis) Returns the day of the month from the given epoch millis in UTC timezone. The value ranges from 1 to 31. Alias dayOfMonth is also supported.
day(tsInMillis, timeZoneId) Returns the day of the month from the given epoch millis and timezone id. The value ranges from 1 to 31. Alias dayOfMonth is also supported.
dayOfWeek(tsInMillis) Returns the day of the week from the given epoch millis in UTC timezone. The value ranges from 1 (Monday) to 7 (Sunday). Alias dow is also supported.
dayOfWeek(tsInMillis, timeZoneId) Returns the day of the week from the given epoch millis and timezone id. The value ranges from 1 (Monday) to 7 (Sunday). Alias dow is also supported.
hour(tsInMillis) Returns the hour of the day from the given epoch millis in UTC timezone. The value ranges from 0 to 23.
hour(tsInMillis, timeZoneId) Returns the hour of the day from the given epoch millis and timezone id. The value ranges from 0 to 23.
minute(tsInMillis) Returns the minute of the hour from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
minute(tsInMillis, timeZoneId) Returns the minute of the hour from the given epoch millis and timezone id. The value ranges from 0 to 59.
second(tsInMillis) Returns the second of the minute from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
second(tsInMillis, timeZoneId) Returns the second of the minute from the given epoch millis and timezone id. The value ranges from 0 to 59.
millisecond(tsInMillis) Returns the millisecond of the second from the given epoch millis in UTC timezone. The value ranges from 0 to 999.
millisecond(tsInMillis, timeZoneId) Returns the millisecond of the second from the given epoch millis and timezone id. The value ranges from 0 to 999.
JSONEXTRACTSCALAR(jsonField, 'jsonPath', 'resultsType', [defaultValue]) Evaluates the 'jsonPath' on jsonField, returns the result as the type 'resultsType', use optional defaultValue for null or parsing error.
JSONEXTRACTKEY(jsonField, 'jsonPath') Extracts all matched JSON field keys based on 'jsonPath' into a STRING_ARRAY.
EXTRACT(dateTimeField FROM dateTimeExpression) Extracts the field from the DATETIME expression of the format 'YYYY-MM-DD HH:MM:SS'. Currently, this transformation function supports YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND fields.
JSONFORMAT(object) Convert object to JSON String
JSONPATH(jsonField, 'jsonPath') Extracts the object value from jsonField based on 'jsonPath', the result type is inferred based on JSON value. Cannot be used in query because data type is not specified.
JSONPATHLONG(jsonField, 'jsonPath', [defaultValue]) Extracts the Long value from jsonField based on 'jsonPath', use optional defaultValue for null or parsing error.
JSONPATHDOUBLE(jsonField, 'jsonPath', [defaultValue]) Extracts the Double value from jsonField based on 'jsonPath', use optional defaultValue for null or parsing error.
JSONPATHSTRING(jsonField, 'jsonPath', [defaultValue]) Extracts the String value from jsonField based on 'jsonPath', use optional defaultValue for null or parsing error.
JSONPATHARRAY(jsonField, 'jsonPath') Extracts an array from jsonField based on 'jsonPath', the result type is inferred based on JSON value. Cannot be used in query because data type is not specified.
JSONPATHARRAYDEFAULTEMPTY(jsonField, 'jsonPath') Extracts an array from jsonField based on 'jsonPath', the result type is inferred based on JSON value. Returns empty array for null or parsing error. Cannot be used in query because data type is not specified.
SHA(bytesCol) Return SHA-1 digest of binary column (bytes type) as hex string
SHA256(bytesCol) Return SHA-256 digest of binary column (bytes type) as hex string
SHA512(bytesCol) Return SHA-512 digest of binary column (bytes type) as hex string
MD5(bytesCol) Return MD5 digest of binary column (bytes type) as hex string
toBase64(bytesCol) Return the Base64-encoded string of binary column (bytes type)
fromUtf8(bytesCol) Return the UTF8-encoded string of binary column (bytes type)
ARRAYLENGTH Returns the length of a multi-value column
MAP_VALUE(mapColumn, 'myKey', valueColumn) Select the value for a key from a Map stored in Pinot.
VALUEIN The transform function will filter the value from the multi-valued column with the given constant values. The VALUEIN transform function is especially useful when the same multi-valued column is both a filtering column and a grouping column.
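The rows that follow are sample id/JSON pairs. As a quick illustration of how the JSON functions above might query such data (the table name myTable and the columns id and jsonColumn are assumptions, not part of the reference), a query could look like:
-- illustrative only: myTable, id and jsonColumn are assumed names
SELECT id,
       JSONEXTRACTSCALAR(jsonColumn, '$.name.last', 'STRING', 'unknown') AS lastName,
       JSONEXTRACTSCALAR(jsonColumn, '$.score', 'INT', 0) AS score
FROM myTable
WHERE JSONEXTRACTSCALAR(jsonColumn, '$.score', 'INT', 0) >= 102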
"101"
"{"name":{"first":"daffy","last":"duck"},"score":101,"data":["a","b","c","d"]}"
"102"
"{"name":{"first":"donald","last":"duck"},"score":102,"data":["a","b","e","f"]}"
"103"
"{"name":{"first":"mickey","last":"mouse"},"score":103,"data":["a","b","g","h"]}"
"104"
"{"name":{"first":"minnie","last":"mouse"},"score":104,"data":["a","b","i","j"]}"
"105"
"{"name":{"first":"goofy","last":"dwag"},"score":104,"data":["a","b","i","j"]}"
"106"
"{"person":{"name":"daffy duck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
"107"
"{"person":{"name":"scrooge mcduck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
Describes the transform relation operator in the multi-stage query engine.
The transform operator is used to apply a transformation to the input data. It may filter out columns or add new ones by applying functions to the existing columns. This operator is generated by the multi-stage query engine when you use a SELECT
clause in a query, but can also be used to implement other transformations.
Transform operators apply transformation functions to the input data received from upstream. The cost of the transformation usually depends on the complexity of the functions applied, but compared to other operators it is usually not very high.
The transform operator is a streaming operator. It emits the blocks of rows as soon as they are received from the upstream operator.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator.
The transform operator is represented in the explain plan as a LogicalProject
explain node.
This explain node has a list of attributes that represent the transformations applied to the input data. Each attribute has a name and a value, which is the expression used to generate the column.
For example:
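(an illustrative explain node consistent with the description that follows)
LogicalProject(userUUID=[$6], deviceOS=[$4], EXPR$2=[SUBSTRING($4, 0, 2)])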
Is saying that the output of the operator has three columns:
userUUID is the 7th column in the virtual row projected by LogicalTableScan, which corresponds to the userUUID column in the table.
deviceOS is the 5th column in the virtual row projected by LogicalTableScan, which corresponds to the deviceOS column in the table.
EXPR$2 is the result of the SUBSTRING($4, 0, 2) expression applied to the 5th column in the virtual row projected by LogicalTableScan. Given we know that the 5th column is deviceOS, we can infer that EXPR$2 is the first two characters of the deviceOS column.
None
Describes the union relation operator in the multi-stage query engine.
The union operator combines the results of two or more queries into a single result set. The result set contains all the rows from the queries. Contrary to other set operations (intersect and minus), the union operator does not remove duplicates from the result set. Therefore its semantics are similar to those of the SQL UNION ALL operator.
There is no guarantee on the order of the rows in the result set.
While EXCEPT
and INTERSECT
SQL clauses do not support the ALL
modifier, the UNION
clause does.
The current implementation consumes input relations one by one. It first returns all rows from the first input relation, then all rows from the second input relation, and so on.
The union operator is a streaming operator that consumes the input relations one by one. The current implementation fully consumes the inputs in order. See the section on the order of input relations below for more details.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator.
The union operator is represented in the explain plan as a LogicalUnion
explain node.
Type: Boolean
Whether the union operator should return all rows (including duplicates) or only the distinct rows.
Although Pinot supports both the SQL UNION and UNION ALL clauses, the union operator only supports the UNION ALL semantics. In order to implement the UNION semantics, the multi-stage query engine adds an extra aggregate to compute the distinct rows.
For example the plan of:
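(an illustrative query; the table and column names are assumptions)
SELECT col1 FROM tableA
UNION ALL
SELECT col1 FROM tableB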
Is expected to be:
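(an illustrative plan shape, not verbatim engine output)
LogicalUnion(all=[true])
  PinotLogicalExchange(distribution=[hash])
    LogicalProject(col1=[$0])
      LogicalTableScan(table=[[default, tableA]])
  PinotLogicalExchange(distribution=[hash])
    LogicalProject(col1=[$0])
      LogicalTableScan(table=[[default, tableB]])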
While the plan of:
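(the same illustrative query using UNION instead of UNION ALL)
SELECT col1 FROM tableA
UNION
SELECT col1 FROM tableB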
Is a bit more complex
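(an illustrative plan shape, not verbatim engine output)
LogicalAggregate(group=[{0}])
  LogicalUnion(all=[true])
    PinotLogicalExchange(distribution=[hash])
      LogicalProject(col1=[$0])
        LogicalTableScan(table=[[default, tableA]])
    PinotLogicalExchange(distribution=[hash])
      LogicalProject(col1=[$0])
        LogicalTableScan(table=[[default, tableB]])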
Notice that LogicalUnion
is still using all=[true]
but the LogicalAggregate
is used to remove the duplicates. This also means that while the union operator is always streaming, the union clause results in a blocking plan (given the aggregate operator is blocking).
The current implementation of the union operator consumes the input relations one by one, starting from the first one. This means that the second input relation is not consumed until the first one is fully consumed, and so on. Therefore it is recommended to put the fastest input relation first to reduce the overall latency.
Usually a good way to choose the order of the input relations is to try different orders and minimize the value of the downstreamWaitMs stat across all the inputs.
Describes the minus relation operator in the multi-stage query engine.
The minus operator is used to subtract the result of one query from another query. This operator is used to find the difference between two sets of rows, usually by using the SQL EXCEPT
operator.
Although it is accepted by the parser, the ALL
modifier is currently ignored. Therefore EXCEPT
and EXCEPT ALL
are equivalent. This issue has been reported in #13127
The minus operator is a semi-blocking operator that first consumes the right input relation in a blocking fashion and then consumes the left input relation in a streaming fashion.
The current implementation consumes the whole right input relation first and stores the rows in a set. Then it consumes the left input relation one block at a time. Each time a block of rows is read from the left input relation, the operator checks if the rows are in the set of rows from the right input relation. All unique rows that are not in the set are added to a new partial result block. Once the whole left input block is analyzed, the operator emits the partial result block.
This process is repeated until all rows from the left input relation are processed.
The minus operator is a semi-blocking operator that first consumes the right input relation in a blocking fashion and then consumes the left input relation in a streaming fashion.
In pseudo-code, the algorithm looks like this:
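(a sketch based on the algorithm described above, not the actual implementation)
rightRows = empty set
for each block in right input:        # blocking: the right input is fully consumed first
    for each row in block:
        add row to rightRows
for each block in left input:         # streaming: the left input is consumed one block at a time
    partialResult = empty block
    for each row in block:
        if row is not in rightRows:   # keep only rows not present in the right input
            add row to rightRows      # also guarantees each emitted row is unique
            add row to partialResult
    emit partialResult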
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator.
The minus operator is represented in the explain plan as a LogicalMinus
explain node.
Type: Boolean
This attribute is used to indicate if the operator should return all the rows or only the distinct rows.
Although it is accepted in SQL, the all
attribute is not currently used in the minus operator. The returned rows are always distinct. This issue has been reported in #13127
The minus operator ends up having to store all unique rows from both input relations in memory. This can lead to memory pressure if the input relations are large and have a high number of unique rows.
Although the minus operator ends up adding all unique rows from both input relations to a set, the order of input relations matters. While the right input relation is consumed in a blocking fashion, the left input relation is consumed in a streaming fashion. Therefore the latency of the whole query could be improved if the left input relation is producing values in streaming fashion.
In case one of the inputs is blocking and the other is not, it is recommended to use the blocking relation as the right input relation.
Describes the sort or limit relation operator in the multi-stage query engine.
The sort or limit operator is used to sort the input data, limit the number of rows emitted by the operator or both. This operator is generated by the multi-stage query engine when you use an order by
, limit
or offset
operation in a query.
None
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1.
Type: Long
The number of rows emitted by the operator.
The sort or limit operator is represented in the explain plan as a LogicalSort
explain node.
Type: Expression
The sort expressions used by the operator. There is one of these attributes per sort expression. The first one is sort0
, the second one is sort1
, and so on.
The value of this attribute is the expression used to sort the data and may contain indexed columns ($0
, $1
, etc) that represent the columns of the virtual row generated by the upstream.
For example, the following plan:
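(an illustrative plan snippet consistent with the description that follows; the sort directions and the table name are assumptions)
LogicalSort(sort0=[$0], sort1=[$2], dir0=[ASC], dir1=[ASC])
  LogicalProject(userUUID=[$6], deviceOS=[$4], EXPR$2=[SUBSTRING($4, 0, 2)])
    LogicalTableScan(table=[[default, userAttributes]])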
Is saying that the rows are sorted first by the column with index 0 and then by the column with index 2 in the virtual row generated by the upstream. That virtual row is generated by a projection whose first column (index 0) is userUUID
, the second (index 1) is deviceOS
and third (index 2) is the result of the SUBSTRING($4, 0, 2)
expression. As we know $4
in this project is deviceOS
, we can infer that the third column is the first two characters of the deviceOS
column.
Type: ASC or DESC
The direction of the sort. There is one of these attributes per sort expression.
Type: Long
The number of rows to emit. This is the equivalent of LIMIT
in SQL. Remember that the limit can be applied without sorting, in which case the order in which the rows are emitted is undefined.
Type: Long
The number of rows to skip before emitting the rows. This is the equivalent of OFFSET
in SQL.
In SQL, limit and offset are usually used in the last stage of a query. But when they are used in the middle of the query (like in a subquery or a CTE), they can prevent the filter pushdown optimization.
For example, imagine the following query:
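(an illustrative query; the exact column list is an assumption)
SELECT userUUID, deviceOS
FROM userAttributes
WHERE deviceOS = 'windows'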
This query may generate the plan:
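(an illustrative plan shape, not verbatim engine output)
LogicalProject(userUUID=[$6], deviceOS=[$4])
  LogicalFilter(condition=[=($4, 'windows')])
    LogicalTableScan(table=[[default, userAttributes]])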
We can see that the filter deviceOS = 'windows'
is pushed down to the leaf stage. This reduces the amount of data that needs to be scanned and can improve the query performance, especially if there is an inverted index on the deviceOS
column.
But if we modify the query to add a limit
to the userAttributes
table scan:
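(the same illustrative query with a limit added to the userAttributes scan)
SELECT userUUID, deviceOS
FROM (
  SELECT userUUID, deviceOS FROM userAttributes LIMIT 10
) AS t
WHERE deviceOS = 'windows'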
The generated plan will be:
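(an illustrative plan shape, not verbatim engine output)
LogicalFilter(condition=[=($1, 'windows')])
  LogicalSort(offset=[0], fetch=[10])
    PinotLogicalSortExchange(distribution=[hash], collation=[[]])
      LogicalSort(fetch=[10])
        LogicalProject(userUUID=[$6], deviceOS=[$4])
          LogicalTableScan(table=[[default, userAttributes]])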
Here we can see that the filter deviceOS = 'windows'
is not pushed down to the leaf stage, which means that the engine will need to scan all the data in the userAttributes
table and then apply the filter.
The reason why the filter is not pushed down is that the limit
operation must be applied before the filter in order not to break the semantics, which in this case say that we want 10 rows of the userAttributes
table without considering their deviceOS
value.
In cases where you actually want to apply the filter before the limit
, you can specify the where clause in the subquery. For example:
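(the same illustrative query with the filter moved into the subquery)
SELECT userUUID, deviceOS
FROM (
  SELECT userUUID, deviceOS FROM userAttributes WHERE deviceOS = 'windows' LIMIT 10
) AS t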
Which will produce the following plan:
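(an illustrative plan shape, not verbatim engine output)
LogicalSort(offset=[0], fetch=[10])
  PinotLogicalSortExchange(distribution=[hash], collation=[[]])
    LogicalSort(fetch=[10])
      LogicalProject(userUUID=[$6], deviceOS=[$4])
        LogicalFilter(condition=[=($4, 'windows')])
          LogicalTableScan(table=[[default, userAttributes]])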
As you can see, the filter is pushed down to the leaf stage, which will reduce the amount of data that needs to be scanned.
Offset pagination
Although OFFSET and LIMIT are a very simple way to paginate results, they can be very inefficient. It is almost always better to paginate using a WHERE clause that uses a range of values instead of using OFFSET.
The reason is that in order to apply an OFFSET
the engine must generate these rows and then discard them. Instead, if you use a WHERE
clause with a range of values, the engine can apply different techniques like indexes or pruning to avoid reading the rows that are not needed.
This is not a Pinot specific issue, but a general one. See for example Paging Through Results (external link) or Pagination, You Are Probably Doing It Wrong (external link).
Learn more about multi-stage stages and how to extract stages from query plans.
As explained in the Multi-stage query engine reference documentation, the multi-stage query engine breaks down a query into multiple stages. Each stage corresponds to a subset of the query plan and is executed independently. Stages are connected in a tree-like structure where the output of one stage is the input to another stage. The stage that is at the root of the tree sends the final results to the client. The stages that are at the leaves of the tree read from the tables. The intermediate stages process the data and send it to the next stage.
When the broker receives a query, it generates a query plan. This is a tree-like structure where each node is an operator. The plan is then optimized, moving and changing nodes to generate a plan that is semantically equivalent (it returns the same rows) but more efficient. During this phase the broker colors the nodes of the plan, assigning them to a stage. The broker also assigns a parallelism to each stage and defines which servers are going to execute each stage. For example, if a stage has a parallelism of 10, then at most 10 servers will execute that stage in parallel. One single server can execute multiple stages in parallel and it can even execute multiple instances of the same stage in parallel.
Stages are identified by their stage ID, which is a unique identifier for each stage. In the current implementation the stage ID is a number and the root stage has a stage ID of 0, although this may change in the future.
The current implementation has some properties that are worth mentioning:
The leaf stages execute a slightly modified version of the single-stage query engine. Therefore these stages cannot execute joins or aggregations, which are always executed in the intermediate stages.
Intermediate stages execute operations using a new query execution engine that has been created for the multi-stage query engine. This is why some of the functions that are supported in the single-stage query engine are not supported in the multi-stage query engine and vice versa.
An intermediate stage can only have one join, one window function or one set operation. If a query has more than one of these operations, the broker will create multiple stages, each with one of these operations.
As explained in Explain Plan (Multi-Stage), you can use the EXPLAIN PLAN
syntax to obtain the logical plan of a query. This logical plan can be used to extract the stages of the query.
For example, if the query is:
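(an illustrative query; the table and column names are assumptions)
SELECT a.col1, b.col2
FROM tableA AS a
JOIN tableB AS b
  ON a.id = b.id
ORDER BY a.col1
LIMIT 10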
A possible output of the EXPLAIN PLAN
command is:
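(an illustrative plan shape matching the stages described below, not verbatim engine output)
LogicalSort(sort0=[$0], dir0=[ASC], offset=[0], fetch=[10])
  PinotLogicalSortExchange(distribution=[hash], collation=[[0]])
    LogicalSort(sort0=[$0], dir0=[ASC], fetch=[10])
      LogicalProject(col1=[$1], col2=[$3])
        LogicalJoin(condition=[=($0, $2)], joinType=[inner])
          PinotLogicalExchange(distribution=[hash[0]])
            LogicalProject(id=[$0], col1=[$1])
              LogicalTableScan(table=[[default, tableA]])
          PinotLogicalExchange(distribution=[hash[0]])
            LogicalProject(id=[$0], col2=[$2])
              LogicalTableScan(table=[[default, tableB]])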
As it happens with all queries, the logical plan forms a tree-like structure. In this default explain format, the tree-like structure is represented with indentation. The root of the tree is the first line, which is the last operator to be executed and marks the root stage. The boundaries between stages are the PinotLogicalExchange operators. In the example above, there are four stages:
The root stage starts with the LogicalSort
operator at the root of the plan and ends with the PinotLogicalSortExchange
operator. This is the last stage to be executed and the only one that is executed in the broker, which will directly send the result to the client once it is computed.
The next stage starts with this PinotLogicalSortExchange
operator and includes the LogicalSort
operator, the LogicalProject
operator, the LogicalJoin
operator and the two PinotLogicalExchange
operators. This stage is clearly not the root stage, and it is not reading data from the segments, so it is not a leaf stage. Therefore it has to be an intermediate stage.
The join has two children, which are the PinotLogicalExchange
operators. In this specific case, both sides are very similar. They start with a PinotLogicalExchange
operator and end with a LogicalTableScan
operator. All stages that end with a LogicalTableScan
operator are leaf stages.
Now that we have identified the stages, we can understand what each stage is doing by understanding multi-stage explain plans.
Describes the window relational operator in the multi-stage query engine.
The window operator is used to define a window over which to perform calculations.
This page describes the window operator defined in the relational algebra used by multi-stage queries. This operator is generated by the multi-stage query engine when you use window functions in a query. You can read more about window functions in the window functions reference documentation.
Unlike the aggregate operator, which will output one row per group, the window operator will output as many rows as input rows.
Window operators take a single input relation and apply window functions to it. For each input row, a window of rows is calculated and one or many aggregations are applied to it.
In general, window operators are expensive in terms of CPU and memory usage, but they open the door to a wide range of analytical queries.
The window operator is a blocking operator. It needs to consume all the input data before emitting the result.
Window hints are configured with the windowOptions
hint, which accepts as argument a map of options and values.
For example:
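(an illustrative hint; the windowOptions hint name comes from the text above, while the option keys maxRowsInWindow and windowOverflowMode, as well as the table myTable, are assumptions matching the options described below)
SELECT /*+ windowOptions(maxRowsInWindow='100000', windowOverflowMode='BREAK') */
       playerName,
       SUM(playerScore) OVER (PARTITION BY playerName) AS totalScore
FROM myTable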
Type: Integer
Default: 1048576
Maximum number of rows that can be cached in the window for further processing.
Type: THROW or BREAK
Default: 'THROW'
Mode used when the window overflows. Supported values:
THROW: stop building the window cache and throw an exception; no further WINDOW operation is performed.
BREAK: stop building the window cache but continue to perform the WINDOW operation; results might be partial.
Type: Long
The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller than this value if the parallelism is larger than 1. This number is affected by the number of received rows and the complexity of the window function.
Type: Long
The number of rows emitted by the operator. A large number of emitted rows can indicate that the query is not well optimized.
Unlike the aggregate operator, which will output one row per group, the window operator will output as many rows as input rows.
Type: Boolean
This attribute is set to true
if the maximum number of rows in the window has been reached.
The window operator is represented in the explain plan as a LogicalWindow
explain node.
Type: Expression
The window expressions used by the operator. There may be more than one of these attributes depending on the number of window functions used in the query, although sometimes multiple window function clauses in SQL can be combined into a single window operator.
The expression may use indexed columns ($0
, $1
, etc) that represent the columns of the virtual row generated by the upstream.
None
Learn more about multi-stage stats and how to use them to improve your queries.
Multi-stage stats are more complex but also more expressive than single-stage stats. While in single-stage stats Apache Pinot returns a single set of statistics for the query, in multi-stage stats Apache Pinot returns a set of statistics for each operator of the query execution.
These stats can be seen in the Pinot controller UI by running the query and clicking on the Show JSON format
button. Then the whole JSON response will be shown and the multi-stage stats will be in a field called stageStats
. Different drivers may provide different ways to see the stats.
For example the following query:
Returns the following stageStats
:
Each node in the tree represents an operation that is executed and the tree structure form is similar (but not equal) to the logical plan of the query that can be obtained with the EXPLAIN PLAN
command.
Learn more about multi-stage explain plans and how to interpret them.
Multi-stage plans are a bit more complex than single-stage plans. This page explains how to interpret multi-stage explain plans.
As explained in Explain Plan (Multi-Stage), you can use the EXPLAIN PLAN
syntax to obtain the logical plan of a query. There are different formats for the output of the EXPLAIN PLAN
command, but all of them represent the logical plan of the query.
The query
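(an illustrative query consistent with the walkthrough below; the table names customer and orders are assumptions)
SELECT c.c_address, o.o_shippriority
FROM customer AS c
JOIN orders AS o
  ON c.c_custkey = o.o_custkey
LIMIT 10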
Can produce the following output:
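(an illustrative plan shape matching the walkthrough below, not verbatim engine output; the column indexes on the customer side are assumptions)
LogicalSort(offset=[0], fetch=[10])
  PinotLogicalSortExchange(distribution=[hash], collation=[[]])
    LogicalSort(fetch=[10])
      LogicalProject(c_address=[$0], o_shippriority=[$3])
        LogicalJoin(condition=[=($1, $2)], joinType=[inner])
          PinotLogicalExchange(distribution=[hash[1]])
            LogicalProject(c_address=[$2], c_custkey=[$0])
              LogicalTableScan(table=[[default, customer]])
          PinotLogicalExchange(distribution=[hash[0]])
            LogicalProject(o_custkey=[$5], o_shippriority=[$10])
              LogicalTableScan(table=[[default, orders]])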
We can see that each node in the tree represents an operation that is executed in the query and each operator has some attributes. For example the LogicalJoin
operator has a condition
attribute that specifies the join condition and a joinType
. Although some of the attributes shown are easy to understand, some of them may require a bit more explanation.
In our example we can see that the LogicalTableScan
operator has a table attribute that indicates the table being scanned. The table is represented as a list with two elements: the first one is the schema name (default
by default) and the second one is the table name. Attributes like offset
and fetch
in the LogicalSort
operator are also easy to understand. But once we start to see expressions and references like $2
things start to be more complex.
These indexed references are used to reference the positions into the input row for each operator. In order to understand these references we need to look at the operator's children and see which attributes are being referenced. That usually requires going to the leaf operators and seeing which attributes are being generated.
For example, the LogicalTableScan
always returns the whole row of the table, so the attributes are the columns of the table. In our example:
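(the relevant fragment of the illustrative plan above)
PinotLogicalExchange(distribution=[hash[0]])
  LogicalProject(o_custkey=[$5], o_shippriority=[$10])
    LogicalTableScan(table=[[default, orders]])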
We can see that the result of the LogicalTableScan
operator is processed by a LogicalProject
operator that is selecting the columns o_custkey
and o_shippriority
. This LogicalProject
operator is generating a row with two columns. $5
and $10
are the indexes of the column o_custkey
and o_shippriority
in the row generated by the LogicalTableScan
. Then we can see a PinotLogicalExchange
operator that is sending the result to the LogicalJoin
operator in the stage downstream. That PinotLogicalExchange
is distributing the rows using hash[0]
, which means to use the hash of the first column returned by LogicalProject
. As we saw before, that first column is the o_custkey
column, so the rows are distributed by the o_custkey
column.
The LogicalJoin
operator is receiving the rows from the two stages upstream. It is not clearly said anywhere, but the virtual row seen by the join operator is the concatenation of the rows sent by the first stage (aka left hand side) plus the rows sent by the second stage (aka right hand side).
The first stage is sending the c_address
and c_custkey
columns and the second stage is sending the o_custkey
and o_shippriority
columns. Therefore the join operator is consuming a row with the columns [c_address, c_custkey, o_custkey, o_shippriority]
. The LogicalJoin
operator is joining the rows using the condition =($1, $2)
, which means that it is joining the rows using the c_custkey
and o_custkey
columns and comparing them by equality. LogicalJoin
can generate new rows, but does not modify the virtual columns. Therefore this join is sending rows with the columns [c_address, c_custkey, o_custkey, o_shippriority]
to its downstream.
This downstream is the LogicalProject
operator that is selecting the columns $0
and $3
from the rows sent by the join operator. Therefore the resulting row contains the columns c_address
and o_shippriority
.
The rest of the operators are easier to read. Something that can be surprising is the LogicalSort
operator. In the SQL query used as an example there was no ORDER BY, but the LogicalSort
operator is present in the plan. This is because in relational algebra a sort is always needed to limit the rows. In this case the LogicalSort
operator is limiting the rows to 10 without specifying a sort condition, so it is not really sorting the rows (which may be expensive). The corollary is that a LogicalSort
operator does not imply that an actual sort is being executed.
As you can see, each operator has a type, and the stats carried on the node depend on that type. You can learn more about each operator type and its stats in the corresponding operator sections.
Pinot currently supports two ways for you to implement your own functions:
Groovy Scripts
Scalar Functions
Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing Groovy script within the query is as follows:
GROOVY('result value metadata json', 'groovy script', arg0, arg1, arg2...)
This function will execute the groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:
Result value metadata json
- json string representing result value metadata. Must contain non-null keys returnType
and isSingleValue
.
Groovy script to execute
- groovy script string, which uses arg0
, arg1
, arg2
etc to refer to the arguments provided within the script
arguments
- pinot columns/other transform functions that are arguments to the groovy script
Examples
Add colA and colB and return a single-value INT
groovy( '{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)
Find the max element in mvColumn array and return a single-value INT
groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)
Find all elements of the array mvColumn and return as a multi-value LONG column
groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)
Multiply length of array mvColumn with colB and return a single-value DOUBLE
groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)
Find all indexes in mvColumnA which have value foo
, add values at those indexes in mvColumnB
groovy( '{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x' , mvColumnA, mvColumnB)
Switch case which returns a FLOAT value depending on length of mvCol array
groovy('{\"returnType\":\"FLOAT\", \"isSingleValue\":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)
Any Groovy script which takes no arguments
groovy('{"returnType":"STRING","isSingleValue":true}', 'new Date().format( "yyyyMMdd" )')
Allowing executable Groovy in queries can be a security vulnerability. Use caution and be aware of the security risks if you decide to allow Groovy. If you would like to enable Groovy in Pinot queries, you can set the following broker config.
pinot.broker.disable.query.groovy=false
If not set, Groovy in queries is disabled by default.
The above configuration applies across the entire Pinot cluster. If you want a table-level override to enable or disable Groovy queries, the following property can be set in the table's query config.
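(a minimal sketch of such an override, assuming the table config exposes a query section with a disableGroovy flag; verify the exact field names against your Pinot version)
{
  "query": {
    "disableGroovy": false
  }
}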
Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions
Pinot automatically identifies and registers all the functions that have the @ScalarFunction
annotation.
Only Java methods are supported.
You can add new scalar functions as follows:
Create a new java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX
In your java project include the dependency
Annotate your methods with @ScalarFunction
annotation. Make sure the method is static
and returns only a single value output. The input and output can have one of the following types -
Integer
Long
Double
String
Place the compiled JAR in the /plugins
directory in Pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the function in a query as follows:
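(an illustrative query; myScalarFunction, myTable and colA are hypothetical names standing in for your own function and data)
SELECT myScalarFunction(colA) FROM myTable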
Note that Groovy scripts don't accept built-in scalar functions that are specific to Pinot queries. See the section below for more information.
Enabling Groovy
Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.