Learn how to query Pinot using SQL
Pinot provides a SQL interface for querying. It uses the Calcite SQL parser to parse queries and the MYSQL_ANSI dialect. For details on the syntax, see the Calcite documentation. To find specific supported SQL operators, see SqlLibraryOperators.
In Pinot 1.0, the multi-stage query engine supports inner join, left-outer join, semi-join, and nested queries out of the box. It's optimized for in-memory processing and low latency. For more information, see how to enable and use the multi-stage query engine.
Pinot also supports simple Data Definition Language (DDL) statements to insert data into a table directly from a file. For details, see programmatically access the multi-stage query engine. More DDL support will be added in the future; for now, the most common way to define data is through the Controller Admin API.
Note: For queries that require a large amount of data shuffling, require spill-to-disk, or are hitting any other limitations of the multi-stage query engine (v2), we still recommend using Presto.
In Pinot SQL:
Double quotes (") are used to force string identifiers, e.g. column names.
Single quotes (') are used to enclose string literals. If the string literal itself contains a single quote, escape it with another single quote, e.g. '''Pinot''' matches the string literal 'Pinot'.
Misusing these quotes can cause unexpected query results, as in the following examples:
WHERE a='b' means the predicate compares the column a to the string literal value 'b'.
WHERE a="b" means the predicate compares the column a to the value of the column b.
If your column names use reserved keywords (e.g. timestamp or date) or special characters, you will need to use double quotes when referring to them in queries.
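For example (a minimal sketch; myTable and the column names are illustrative):

```sql
-- "timestamp" and "date" are hypothetical columns that collide with reserved keywords
SELECT "timestamp", "date"
FROM myTable
WHERE "timestamp" > 1609459200000
```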
Note: Define decimal literals within quotes to preserve precision.
For performant filtering of IDs in a list, see Filtering with IdSet.
Note that results might not be consistent if the ORDER BY
column has the same value in multiple rows.
The example below counts rows where the column airlineName starts with U:
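The original example isn't reproduced here; a minimal equivalent sketch, assuming a table named myTable, might look like this:

```sql
SELECT COUNT(*)
FROM myTable
WHERE REGEXP_LIKE(airlineName, '^U.*')
```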
Pinot supports the CASE-WHEN-ELSE statement, as shown in the following two examples:
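Minimal sketches of two such queries (the table and column names are illustrative):

```sql
-- Searched CASE: bucket players by home runs
SELECT playerName,
       CASE
         WHEN homeRuns >= 30 THEN 'power hitter'
         WHEN homeRuns >= 10 THEN 'average'
         ELSE 'light hitter'
       END AS hitterType
FROM baseballStats
```

```sql
-- CASE inside an aggregation: count rows per condition
SELECT SUM(CASE WHEN homeRuns >= 30 THEN 1 ELSE 0 END) AS powerHitters,
       SUM(CASE WHEN homeRuns < 30 THEN 1 ELSE 0 END) AS others
FROM baseballStats
```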
Pinot doesn't currently support injecting functions. Functions have to be implemented within Pinot, as shown below:
For more examples, see Transform Function in Aggregation Grouping.
Pinot supports queries on BYTES column using hex strings. The query response also uses hex strings to represent bytes values.
The query below fetches all the rows for a given UID:
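A sketch of such a query, assuming uid is a BYTES column and the hex literal is illustrative:

```sql
SELECT *
FROM myTable
WHERE uid = 'c8b6b2'
LIMIT 10
```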
Cardinality estimation is a classic problem. Pinot offers multiple ways to solve it, each with a trade-off between accuracy and latency.
Functions:
DistinctCount(x) -> LONG
Returns the accurate count of all unique values in a column.
The underlying implementation uses an IntOpenHashSet from the library it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.
It usually takes a lot of resources and time to compute accurate results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.
HyperLogLog is an approximation algorithm for unique counting. It uses a fixed number of bits to estimate the cardinality of a given data set.
Pinot leverages the HyperLogLog class in the library com.clearspring.analytics:stream:2.7.0 as the data structure to hold intermediate results.
Functions:
DistinctCountHLL(x) -> LONG
For column types INT/LONG/FLOAT/DOUBLE/STRING, Pinot treats each value as an individual entry to add to the HyperLogLog object, then computes the approximation by calling cardinality().
For column type BYTES, Pinot treats each value as a serialized HyperLogLog object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog).
All deserialized HyperLogLog objects are merged into one, and cardinality() is called on the result to get the approximate unique count.
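For example (a minimal sketch; the table and column names are illustrative):

```sql
SELECT DISTINCTCOUNTHLL(userID) AS approxDistinctUsers
FROM myTable
```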
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating to perform distinct counting as well as evaluating set operations.
Functions:
DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> LONG
thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries.
predicates (optional): Individual predicates of the form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfy these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates created by combining (AND/OR) individual predicates are supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT, where DIFF requires two arguments and UNION/INTERSECT allow more than two arguments.
In the example query below, the where clause is responsible for identifying the matching rows. Note that the where clause can be completely independent of the postAggregationExpression. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA' and device='mobile' in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression and returns the final cardinality of the resulting sketch.
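A sketch of such a query is shown below; the table, sketch column, and nominalEntries value are illustrative, and $1/$2 in the post-aggregation expression refer to the first and second predicate:

```sql
SELECT DISTINCTCOUNTTHETASKETCH(
         sketchCol,
         'nominalEntries=4096',
         'country=''USA''',
         'device=''mobile''',
         'SET_INTERSECT($1, $2)'
       ) AS approxDistinctCount
FROM myTable
WHERE country = 'USA' OR device = 'mobile'
```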
DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded Serialized Sketch Bytes
This is the same as the previous function, except it returns the byte-serialized sketch instead of its cardinality. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into a sketch by using the library org.apache.commons.codec.binary as Hex.decodeHex(stringValue.toCharArray()).
Learn how to look up IDs in a list of values. Filtering with IdSet is only supported with the single-stage query engine (v1).
A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but using IN doesn't perform well with large lists of IDs. For large lists of IDs, we recommend using an IdSet.
ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03' )
This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:
INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case Bloom Filter.
Other types - Bloom Filter
The following parameters are used to configure the Bloom Filter:
expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0
Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
IN_ID_SET(columnName, base64EncodedIdSet)
This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.
IN_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.
IN_PARTITIONED_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.
This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
The query passed to IN_SUBQUERY can be run on any table - it isn't restricted to the table used in the parent query.
The query passed to IN_PARTITIONED_SUBQUERY must be run on the same table as the parent query.
You can create an IdSet of the values in the yearID column by running the following:
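For example (a sketch using the baseballStats quickstart table referenced later on this page):

```sql
SELECT ID_SET(yearID)
FROM baseballStats
```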
When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:
We can also configure the fpp parameter:
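Sketches of both variants (the parameter values are illustrative):

```sql
SELECT ID_SET(playerName, 'expectedInsertions=10000000')
FROM baseballStats
```

```sql
SELECT ID_SET(playerName, 'expectedInsertions=10000000;fpp=0.01')
FROM baseballStats
```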
We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:
To return rows for yearIDs not in the IdSet, run the following:
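A sketch of the "in the IdSet" query is shown below, using the base64 IdSet produced by the earlier ID_SET(yearID) query (shown in the results later on this page). For the "not in" variant, compare the result of IN_ID_SET to 0 instead of 1.

```sql
SELECT yearID, COUNT(*) AS cnt
FROM baseballStats
WHERE IN_ID_SET(
        yearID,
        'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
      ) = 1
GROUP BY yearID
```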
To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:
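A sketch of the broker-side query is shown below (the subquery filter teamID = 'BS1' is illustrative); the "not in" variant compares the result to 0 instead of 1.

```sql
SELECT yearID, COUNT(*) AS cnt
FROM baseballStats
WHERE IN_SUBQUERY(
        yearID,
        'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''BS1'''
      ) = 1
GROUP BY yearID
```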
To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:
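A sketch of the server-side query is shown below (same illustrative filter); the "not in" variant compares the result to 0 instead of 1.

```sql
SELECT yearID, COUNT(*) AS cnt
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
        yearID,
        'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''BS1'''
      ) = 1
GROUP BY yearID
```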
To see how JSON data can be queried, assume that we have the following table:
We also assume that "jsoncolumn" has a JSON index on it. Note that the last two rows in the table have a different structure than the rest of the rows. In keeping with the JSON specification, a JSON column can contain any valid JSON data and doesn't need to adhere to a predefined schema. To pull out the entire JSON document for each row, we can run the query below:
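(A minimal sketch, assuming the table is named myTable:)

```sql
SELECT id, jsoncolumn
FROM myTable
```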
id | jsoncolumn |
---|---|
To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name.
Note that the third column (value) is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath $.data[1]. We can filter out these rows.
Certain last names (duck and mouse for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression.
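A sketch of such a query (the table name is illustrative):

```sql
SELECT jsoncolumn.name.last AS last_name, COUNT(*) AS total
FROM myTable
GROUP BY jsoncolumn.name.last
ORDER BY COUNT(*) DESC
```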
There is also numerical information (e.g. the score field) embedded within the JSON document. We can extract those numerical values from the JSON data into SQL and sum them up using the query below.
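A sketch using the score field from the documents above (the table name is illustrative):

```sql
SELECT jsoncolumn.name.last AS last_name, SUM(jsoncolumn.score) AS total_score
FROM myTable
GROUP BY jsoncolumn.name.last
```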
Note that the JSON_MATCH function utilizes the JsonIndex and can only be used if a JsonIndex is already present on the JSON column. As shown in the examples above, the second argument of the JSON_MATCH operator takes a predicate. This predicate is evaluated against the JsonIndex and supports the =, !=, IS NULL, and IS NOT NULL operators. Relational operators such as >, <, >=, and <= are currently not supported. However, you can combine JSON_MATCH with the JSON_EXTRACT_SCALAR function (which supports the >, <, >=, and <= operators) to get the necessary functionality, as shown below.
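A sketch of such a combination, using the score field from the example documents (the table name is illustrative):

```sql
SELECT jsoncolumn.name.last AS last_name, COUNT(*) AS total
FROM myTable
WHERE JSON_MATCH(jsoncolumn, '"$.score" IS NOT NULL')
  AND JSON_EXTRACT_SCALAR(jsoncolumn, '$.score', 'INT', 0) > 102
GROUP BY jsoncolumn.name.last
```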
The JSON_MATCH function also provides the ability to use wildcard * JsonPath expressions, even though it doesn't support full JsonPath expressions.
While JSON_MATCH supports the IS NULL and IS NOT NULL operators, these operators should only be applied to leaf-level path elements. For example, the predicate JSON_MATCH(jsoncolumn, '"$.data[*]" IS NOT NULL') is not valid since "$.data[*]" does not address a "leaf" element of the path; however, JSON_MATCH(jsoncolumn, '"$.data[0]" IS NOT NULL') is valid since "$.data[0]" unambiguously identifies a leaf element of the path.
JSON_EXTRACT_SCALAR does not utilize the JsonIndex and therefore performs slower than JSON_MATCH, which does. However, JSON_EXTRACT_SCALAR supports a wider range of JsonPath expressions and operators. To make the best use of fast index access (JSON_MATCH) along with flexible JsonPath expressions (JSON_EXTRACT_SCALAR), you can combine the two functions in the WHERE clause.
The second argument of the JSON_MATCH function is a boolean expression in string form. This section shows how to correctly write the second argument of JSON_MATCH. Let's assume we want to search the JSON array data for the values k and j. This can be done by the following predicate:
To convert this predicate into string form for use in JSON_MATCH, we first turn the left side of the predicate into an identifier by enclosing it in double quotes:
Next, the literals in the predicate also need to be enclosed by '. Any existing ' need to be escaped as well. This gives us:
Finally, we need to create a string out of the entire expression above by enclosing it in ':
Now we have the string representation of the original predicate and this can be used in JSON_MATCH function:
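Assuming the leaf path $.data[0] is the element being matched, the resulting query might look like this sketch:

```sql
SELECT COUNT(*)
FROM myTable
WHERE JSON_MATCH(jsoncolumn, '"$.data[0]" IN (''k'', ''j'')')
```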
Many datasets are time series in nature, tracking the state change of an entity over time. The granularity of recorded data points might be sparse, or events could be missing due to network and other device issues in an IoT environment. But analytics applications that track the state change of these entities over time might query for values at a lower granularity than the metric interval.
Here is a sample data set tracking the occupancy status of parking lots in a parking space.
lotId | event_time | is_occupied |
---|
We want to find the total number of parking lots that are occupied over a period of time, which is a common use case for a company that manages parking spaces.
Let us take a 30-minute time bucket as an example:
timeBucket/lotId | P1 | P2 | P3 |
---|
If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied parking lots per time bucket, we need to gap-fill the missing data.
There are two ways of gap-filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.
FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity (in this case, the parking lot) if a previous value exists. Otherwise, it will be filled with the default value.
FILL_DEFAULT_VALUE means the missing data will be filled with the default value. For numeric columns, the default value is 0. For BOOLEAN columns, the default value is false. For TIMESTAMP, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON and BYTES columns, it is the empty string. For array columns, it is an empty array.
We will leverage the following query to calculate the total occupied parking lots per time bucket.
The innermost SQL will convert the raw event table to the following table.
The second innermost SQL will gap-fill the returned data as follows:
The outermost query will aggregate the gap-filled data as follows:
There is one assumption we made here: that the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation will not sort the data.
The above example just shows the use case where the three steps happen:
The raw data will be aggregated;
The aggregated data will be gapfilled;
The gapfilled data will be aggregated.
There are three more scenarios we can support.
If we want to gapfill the missing data per half an hour time bucket, here is the query:
At first the raw data will be transformed as follows:
Then it will be gapfilled as follows:
The nested SQL will convert the raw event table to the following table.
The outer SQL will gap-fill the returned data as follows:
The raw data will first be transformed as follows:
The transformed data will be gap-filled as follows:
The aggregation will generate the following table:
Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The output of the EXPLAIN PLAN statement can be used to see how queries are being run or to further optimize queries.
EXPLAIN PLAN can be run in two modes: verbose and non-verbose (the default), via the use of a query option. To enable verbose mode, the query option explainPlanVerbose=true must be passed.
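For example, one way to pass the option is with a SET statement in front of the query (a sketch using the baseballStats quickstart table):

```sql
SET explainPlanVerbose = true;
EXPLAIN PLAN FOR
SELECT playerID, playerName
FROM baseballStats
WHERE playerID = 'aardsda01'
```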
In the non-verbose EXPLAIN PLAN output above, the Operator column describes the operator that Pinot will run, whereas the Operator_Id and Parent_Id columns show the parent-child relationship between operators.
This parent-child relationship shows the order in which operators execute. For example, FILTER_MATCH_ENTIRE_SEGMENT will execute before and pass its output to PROJECT. Similarly, PROJECT will execute before and pass its output to the TRANSFORM_PASSTHROUGH operator, and so on.
Although the EXPLAIN PLAN query produces tabular output, in this document we show a tree representation of the EXPLAIN PLAN output so that the parent-child relationships between operators are easy to see and the user can visualize the bottom-up flow of data in the operator tree execution.
Note a special node with Operator_Id and Parent_Id called PLAN_START(numSegmentsForThisPlan:1). This node indicates the number of segments which match a given plan. The EXPLAIN PLAN query can be run with verbose mode enabled using the query option explainPlanVerbose=true, which will show the varying deduplicated query plans across all segments across all servers.
Reading the EXPLAIN PLAN output from bottom to top shows how data flows from a table to the query results. In the example shown above, the FILTER_MATCH_ENTIRE_SEGMENT operator shows that all 977889 records of the segment matched the query. The DOC_ID_SET operator over the filter operator gets the set of document IDs matching the filter operator. The PROJECT operator over the DOC_ID_SET operator pulls only those columns that were referenced in the query. The TRANSFORM_PASSTHROUGH operator just passes the column data from the PROJECT operator to the SELECT operator. At SELECT, the query has been successfully evaluated against one segment. Results from different data segments are then combined (COMBINE_SELECT) and sent to the Broker. The Broker combines and reduces the results from different servers (BROKER_REDUCE) into a final result that is sent to the user. The PLAN_START(numSegmentsForThisPlan:1) node indicates that a single segment matched this query plan. If verbose mode is enabled, many plans can be returned and each will contain a node indicating the number of matched segments.
The rest of this document illustrates the EXPLAIN PLAN output with examples and describes the operators that show up in the output of EXPLAIN PLAN.
Since verbose mode is enabled, the EXPLAIN PLAN output returns two plans matching one segment each (assuming 2 segments for this table). The first EXPLAIN PLAN output above shows that Pinot used an inverted index to evaluate the predicate "playerID = 'aardsda01'" (FILTER_INVERTED_INDEX). The result was then fully scanned (FILTER_FULL_SCAN) to evaluate the second predicate "playerName = 'David Allan'". Note that the two predicates are combined using AND in the query; hence, only the data that satisfied the first predicate needs to be scanned for evaluating the second predicate. However, if the predicates were combined using OR, the query would run very slowly because the entire "playerName" column would need to be scanned from top to bottom to look for values satisfying the second predicate. To improve query efficiency in such cases, one should consider indexing the "playerName" column as well. The second plan output shows a FILTER_EMPTY, indicating that no matching documents were found for one segment.
The EXPLAIN PLAN output above shows how GROUP BY queries are evaluated in Pinot. GROUP BY results are created on the server (AGGREGATE_GROUPBY_ORDERBY) for each segment on the server. The server then combines segment-level GROUP BY results (COMBINE_GROUPBY_ORDERBY) and sends the combined result to the Broker. The Broker combines the GROUP BY results from all the servers to produce the final result, which is sent to the user. Note that the COMBINE_SELECT operator from the previous query was not used here; instead, a different COMBINE_GROUPBY_ORDERBY operator was used. Depending upon the type of query, different combine operators such as COMBINE_DISTINCT and COMBINE_ORDERBY may be seen.
The root operator of the EXPLAIN PLAN output is BROKER_REDUCE. BROKER_REDUCE indicates that the Broker is processing and combining server results into the final result that is sent back to the user. BROKER_REDUCE has a COMBINE operator as its child. The combine operator combines the results of query evaluation from each segment on the server and sends the combined result to the Broker. There are several combine operators (COMBINE_GROUPBY_ORDERBY, COMBINE_DISTINCT, COMBINE_AGGREGATE, etc.) that run depending upon the operations being performed by the query. Under the combine operator, either a Select (SELECT, SELECT_ORDERBY, etc.) or an Aggregate (AGGREGATE, AGGREGATE_GROUPBY_ORDERBY, etc.) operator can appear. The Aggregate operator is present when the query performs aggregation (count(*), min, max, etc.); otherwise, a Select operator is present. If the query performs scalar transformations (addition, multiplication, concat, etc.), then a TRANSFORM operator appears under the SELECT operator. Often a TRANSFORM_PASSTHROUGH operator is present instead of the TRANSFORM operator. TRANSFORM_PASSTHROUGH just passes results from operators that appear lower in the operator execution hierarchy to the SELECT operator. The DOC_ID_SET operator usually appears above FILTER operators and indicates that a list of matching document IDs is assessed. FILTER operators usually appear at the bottom of the operator hierarchy and show index use. For example, the presence of FILTER_FULL_SCAN indicates that an index was not used (and hence the query is likely to run relatively slowly). However, if the query used an index, one of the indexed filter operators (FILTER_SORTED_INDEX, FILTER_RANGE_INDEX, FILTER_INVERTED_INDEX, FILTER_JSON_INDEX, etc.) will show up.
This document contains all the available query options
Aggregate functions return a single result for a group of rows.
Aggregate functions return a single result for a group of rows. The following table shows supported aggregate functions in Pinot.
Function | Description | Example | Default Value When No Record Selected |
---|
Deprecated functions:
The following aggregation functions can be used for multi-value columns
Pinot supports FILTER clause in aggregation queries as follows:
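A sketch of such a query (COL1, COL2, COL3 and myTable are the placeholder names used in the explanation below):

```sql
SELECT SUM(COL1) FILTER (WHERE COL2 > 300 AND COL3 > 50) AS total_col1,
       SUM(COL2) FILTER (WHERE COL2 < 50 AND COL3 > 50) AS total_col2
FROM myTable
```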
In the query above, COL1 is aggregated only for rows where COL2 > 300 and COL3 > 50. Similarly, COL2 is aggregated only for rows where COL2 < 50 and COL3 > 50.
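With null handling enabled, a FILTER clause can also restrict the aggregation to non-null values, as in this sketch:

```sql
SELECT SUM(COL1) FILTER (WHERE COL1 IS NOT NULL) AS total_col1
FROM myTable
```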
In the above query, COL1 is aggregated only for the non-null values. Without NULL value support, we would have to filter using the default null value.
NOTE: The FILTER clause is currently supported for aggregation-only queries, i.e., GROUP BY is not supported.
Deprecated functions:
In this guide we will learn about the heuristics used for trimming results in Pinot's grouping algorithm (used when processing GROUP BY
queries) to make sure that the server doesn't run out of memory.
When grouping rows within a segment, Pinot keeps a maximum of <numGroupsLimit> groups per segment. This value is set to 100,000 by default and can be configured with the pinot.server.query.executor.num.groups.limit property.
If the number of groups in a segment reaches this value, the extra groups will be ignored and the results returned may not be completely accurate. The numGroupsLimitReached property will be set to true in the query response if the limit is reached.
After the inner segment groups have been computed, the Pinot query engine optionally trims tail groups. Tail groups are ones that have a lower rank based on the ORDER BY clause used in the query.
This configuration is disabled by default, but can be enabled by configuring the pinot.server.query.executor.min.segment.group.trim.size property.
When segment group trim is enabled, the query engine will trim the tail groups and keep max(<minSegmentGroupTrimSize>, 5 * LIMIT) groups if it gets more groups. Pinot keeps at least 5 * LIMIT groups when trimming tail groups to ensure the accuracy of results.
This value can be overridden on a query by query basis by passing the following option:
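For example (the value shown is illustrative):

```sql
SET minSegmentGroupTrimSize = 10000;
SELECT COL1, COUNT(*)
FROM myTable
GROUP BY COL1
ORDER BY COUNT(*) DESC
LIMIT 10
```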
Once grouping has been done within a segment, Pinot will merge segment results and trim tail groups, keeping max(<minServerGroupTrimSize>, 5 * LIMIT) groups if it gets more groups.
<minServerGroupTrimSize> is set to 5,000 by default and can be adjusted by configuring the pinot.server.query.executor.min.server.group.trim.size property. Setting this configuration to -1 disables the cross-segment trim.
This value can be overridden on a query by query basis by passing the following option:
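For example (the value shown is illustrative):

```sql
SET minServerGroupTrimSize = 10000;
SELECT COL1, COUNT(*)
FROM myTable
GROUP BY COL1
ORDER BY COUNT(*) DESC
LIMIT 10
```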
When cross-segment trim is enabled, the server will trim the tail groups before sending the results back to the broker. It will also trim the tail groups when the number of groups reaches <trimThreshold>.
This threshold is set to 1,000,000 by default and can be adjusted by configuring the pinot.server.query.executor.groupby.trim.threshold property.
A higher threshold reduces the amount of trimming done, but consumes more heap memory. If the threshold is set to more than 1,000,000,000, the server will only trim the groups once before returning the results to the broker.
Pinot sets a default LIMIT
of 10 if one isn't defined and this applies to GROUP BY
queries as well. Therefore, if no limit is specified, Pinot will return 10 groups.
Pinot will trim tail groups based on the ORDER BY clause to reduce the memory footprint and improve query performance. It keeps at least 5 * LIMIT groups so that the results give a good enough approximation in most cases. The configurable min trim size can be used to increase the number of groups kept to improve accuracy, at the cost of a larger memory footprint.
If the query has a HAVING clause, it is applied on the merged GROUP BY results that already have the tail groups trimmed. If the HAVING clause is the opposite of the ORDER BY order, groups matching the condition might already have been trimmed and not returned, e.g.:
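In the following sketch, the ORDER BY keeps the largest groups while the HAVING clause asks for small ones, so qualifying groups may already have been trimmed away:

```sql
SELECT COL1, COUNT(*)
FROM myTable
GROUP BY COL1
HAVING COUNT(*) <= 5
ORDER BY COUNT(*) DESC
LIMIT 10
```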
Increase min trim size to keep more groups in these cases.
idset(yearID) |
---|
idset(playerName) |
---|
idset(playerName) |
---|
idset(playerName) |
---|
id | last_name | first_name | value |
---|---|---|---|
id | last_name | first_name | value |
---|---|---|---|
jsoncolumn.name.last | count(*) |
---|---|
jsoncolumn.name.last | sum(jsoncolumn.score) |
---|---|
jsoncolumn.name.last | sum(jsoncolumn.score) |
---|---|
last_name | total |
---|---|
lotId | event_time | is_occupied |
---|
timeBucket/lotId | P1 | P2 | P3 |
---|
timeBucket | totalNumOfOccuppiedSlots |
---|
lotId | event_time | is_occupied |
---|
lotId | event_time | is_occupied |
---|
lotId | event_time | is_occupied |
---|
timeBucket/lotId | P1 | P2 | P3 |
---|
lotId | event_time | is_occupied |
---|
lotId | event_time | is_occupied |
---|
timeBucket | totalNumOfOccuppiedSlots |
---|
EXPLAIN PLAN output should only be used for informational purposes because it is likely to change from version to version as Pinot is further developed and enhanced. Pinot uses a "scatter-gather" approach to query evaluation. At the Broker, an incoming query is split into several server-level queries for each backend server to evaluate. At each server, the query is further split into segment-level queries that are evaluated against each segment on the server. The results of segment queries are combined and sent to the Broker. The Broker in turn combines the results from all the servers and sends the final results back to the user. Note that if the EXPLAIN PLAN query runs without verbose mode enabled, a single plan will be returned (the heuristic used is to return the deepest plan tree), and this may not be an accurate representation of all plans across all segments. Different segments may execute the plan in a slightly different way.
Function | Description | Example |
---|
Function |
---|
With null handling enabled, this allows you to filter out the null values while performing aggregation as follows:
Function | Description | Example |
---|
Parameter | Default | Query Override | Description |
---|
"101"
"{"name":{"first":"daffy","last":"duck"},"score":101,"data":["a","b","c","d"]}"
102"
"{"name":{"first":"donald","last":"duck"},"score":102,"data":["a","b","e","f"]}
"103"
"{"name":{"first":"mickey","last":"mouse"},"score":103,"data":["a","b","g","h"]}
"104"
"{"name":{"first":"minnie","last":"mouse"},"score":104,"data":["a","b","i","j"]}"
"105"
"{"name":{"first":"goofy","last":"dwag"},"score":104,"data":["a","b","i","j"]}"
"106"
"{"person":{"name":"daffy duck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
"107"
"{"person":{"name":"scrooge mcduck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
101
duck
daffy
b
102
duck
donald
b
103
mouse
mickey
b
104
mouse
minnie
b
105
dwag
goofy
b
106
null
null
null
107
null
null
null
101
duck
daffy
b
102
duck
donald
b
103
mouse
mickey
b
104
mouse
minnie
b
105
dwag
goofy
b
"mouse"
"2"
"duck"
"2"
"dwag"
"1"
"mouse"
"207"
"dwag"
"104"
"duck"
"203"
"mouse"
"207"
"dwag"
"104"
"duck"
"102"
P1 | 2021-10-01 09:01:00.000 | 1 |
P2 | 2021-10-01 09:17:00.000 | 1 |
P1 | 2021-10-01 09:33:00.000 | 0 |
P1 | 2021-10-01 09:47:00.000 | 1 |
P3 | 2021-10-01 10:05:00.000 | 1 |
P2 | 2021-10-01 10:06:00.000 | 0 |
P2 | 2021-10-01 10:16:00.000 | 1 |
P2 | 2021-10-01 10:31:00.000 | 0 |
P3 | 2021-10-01 11:17:00.000 | 0 |
P1 | 2021-10-01 11:54:00.000 | 0 |
2021-10-01 09:00:00.000 | 1 | 1 |
2021-10-01 09:30:00.000 | 0,1 |
2021-10-01 10:00:00.000 | 0,1 | 1 |
2021-10-01 10:30:00.000 | 0 |
2021-10-01 11:00:00.000 | 0 |
2021-10-01 11:30:00.000 | 0 |
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
2021-10-01 09:00:00.000 | 1 | 1 | 0 |
2021-10-01 09:30:00.000 | 1 | 1 | 0 |
2021-10-01 10:00:00.000 | 1 | 1 | 1 |
2021-10-01 10:30:00.000 | 1 | 0 | 1 |
2021-10-01 11:00:00.000 | 1 | 0 | 0 |
2021-10-01 11:30:00.000 | 0 | 0 | 0 |
2021-10-01 09:00:00.000 | 2 |
2021-10-01 09:30:00.000 | 2 |
2021-10-01 10:00:00.000 | 3 |
2021-10-01 10:30:00.000 | 2 |
2021-10-01 11:00:00.000 | 1 |
2021-10-01 11:30:00.000 | 0 |
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P3 | 2021-10-01 09:00:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P2 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 10:00:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P1 | 2021-10-01 10:30:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 10:30:00.000 | 1 |
P1 | 2021-10-01 11:00:00.000 | 1 |
P2 | 2021-10-01 11:00:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
P2 | 2021-10-01 11:30:00.000 | 0 |
P3 | 2021-10-01 11:30:00.000 | 0 |
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
2021-10-01 09:00:00.000 | 1 | 1 | 0 |
2021-10-01 09:30:00.000 | 1 | 1 | 0 |
2021-10-01 10:00:00.000 | 1 | 1 | 1 |
2021-10-01 10:30:00.000 | 1 | 0 | 1 |
2021-10-01 11:00:00.000 | 1 | 0 | 0 |
2021-10-01 11:30:00.000 | 0 | 0 | 0 |
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P3 | 2021-10-01 09:00:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P2 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 10:00:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P1 | 2021-10-01 10:30:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 10:30:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P1 | 2021-10-01 11:00:00.000 | 1 |
P2 | 2021-10-01 11:00:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
P2 | 2021-10-01 11:30:00.000 | 0 |
P3 | 2021-10-01 11:30:00.000 | 0 |
2021-10-01 09:00:00.000 | 2 |
2021-10-01 09:30:00.000 | 2 |
2021-10-01 10:00:00.000 | 3 |
2021-10-01 10:30:00.000 | 2 |
2021-10-01 11:00:00.000 | 1 |
2021-10-01 11:30:00.000 | 0 |
ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc= |
AwIBBQAAAAL///////////////////// |
AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8= |
AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8= |
FASTHLL | FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format |
|
FASTHLLMV (Deprecated) | stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format |
|
| 100,000 | N/A |
| -1 (trim disabled) |
|
| 5,000 |
|
| 1,000,000 | N/A |
| -1 (use all execution threads) |
|
Pinot supports JOINs, including left, right, full, semi, anti, lateral, and equi JOINs. Use JOINs to connect two tables to generate a unified view, based on a related column between the tables.
Important: To query using JOINs, you must use Pinot's multi-stage query engine (v2).
Pinot 1.0 introduces support for all JOIN types. JOINs in Pinot significantly reduce query latency and simplify architecture, achieving the best performance currently available for an OLAP database.
Use JOINs to combine two tables (a left and right table) together, based on a related column between the tables, and other join filters. JOINs let you gain more insights from your data.
The inner join selects rows that have matching values in both tables.
Syntax:
Joins a table containing user transactions with a table containing promotions shown to the users, to show the spending for every userID.
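A sketch of such a query; the userTransactions/userPromotions tables and their columns are illustrative:

```sql
SELECT t.userID, SUM(t.amount) AS totalSpend
FROM userTransactions AS t
JOIN userPromotions AS p
  ON t.userID = p.userID
GROUP BY t.userID
```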
A left join returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also referred to as a left outer join.
Syntax:
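A sketch of the left join form, using the same illustrative tables:

```sql
SELECT t.userID, t.amount, p.promotionName
FROM userTransactions AS t
LEFT JOIN userPromotions AS p
  ON t.userID = p.userID
```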
A right join returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also referred to as a right outer join.
Syntax:
A full join returns all values from both relations, appending NULL values on the side that does not have a match. It is also referred to as a full outer join.
Syntax:
A cross join returns the Cartesian product of two relations. If no WHERE clause is used along with CROSS JOIN, this produces a result set that is the number of rows in the first table multiplied by the number of rows in the second table. If a WHERE clause is included with CROSS JOIN, it functions like an INNER JOIN.
Syntax:
A semi join returns rows from the first table that have matches in the second table, while an anti join returns one copy of each row in the first table for which no match is found in the second table.
Syntax:
An equi join uses an equality operator to match a single or multiple column values of the relative tables.
Syntax:
Pinot JOINs include the following optimizations:
Predicate push-down to individual tables
Indexing and pruning to reduce scanning and speed up query processing
Smart data layout considerations to minimize data shuffling
Query hints for fine-tuning JOIN operations.
timeoutMs | Timeout of the query in milliseconds | Use table/broker level timeout |
enableNullHandling | Enable the null handling of the query (introduced in 0.11.0) |
|
explainPlanVerbose | Return verbose result for |
|
useMultistageEngine | Use multi-stage engine to execute the query (introduced in 0.11.0) |
|
maxExecutionThreads | Maximum threads to use to execute the query. Useful to limit the resource usage for expensive queries | Half of the CPU cores for non-group-by queries; all CPU cores for group-by queries |
numReplicaGroupsToQuery | When replica-group based routing is enabled, use it to query multiple replica-groups (introduced in 0.11.0) |
|
minSegmentGroupTrimSize | Server level config |
minServerGroupTrimSize | Server level config |
skipUpsert |
|
useStarTree | Useful to debug the star-tree index (introduced in 0.11.0) |
|
AndScanReordering | disabled |
Project a column where the maxima appears in a series of measuring columns. | ARG_MAX(measuring1, measuring2, measuring3, projection) | Will return no result |
Returns the count of the records as |
|
|
Returns the population covariance between of 2 numerical columns as |
|
|
Returns the sample covariance between of 2 numerical columns as |
|
|
Calculate the histogram of a numeric column as |
|
|
Returns the minimum value of a numeric column as |
|
|
Returns the maximum value of a numeric column as |
|
|
Returns the sum of the values for a numeric column as |
|
|
Returns the sum of the values for a numeric column with optional precision and scale as |
|
|
Returns the average of the values for a numeric column as |
|
|
Returns the most frequent value of a numeric column as |
|
|
Returns the |
|
|
Returns the Nth percentile of the values for a numeric column as |
|
|
|
|
|
|
|
|
PERCENTILESMARTTDIGEST | Returns the Nth percentile of the values for a numeric column as |
|
|
Returns the count of distinct values of a column as |
|
|
Returns the count of distinct values of a column as |
|
|
Returns an approximate distinct count using HyperLogLog as |
|
|
Returns HyperLogLog response serialized as |
|
|
DISTINCTCOUNTSMARTHLL | Returns the count of distinct values of a column as |
|
|
|
|
Returns the count of distinct values of a column as |
|
|
LASTWITHTIME(dataColumn, timeColumn, 'dataType') | Get the last value of dataColumn where the timeColumn is used to define the time of dataColumn and the dataType specifies the type of dataColumn, which can be |
|
|
FIRSTWITHTIME(dataColumn, timeColumn, 'dataType') | Get the first value of dataColumn where the timeColumn is used to define the time of dataColumn and the dataType specifies the type of dataColumn, which can be |
|
|
Use window functions to compute averages, sort, rank, or count items, calculate sums, and find minimum or maximum values.
Important: To query using window functions, you must use Pinot's multi-stage query engine (v2).
Use window functions to do the following:
Window functions consist of a supported function and an OVER clause.
To learn more about syntax, see window function query syntax, and read about the OVER clause.
To see which window functions you can use in Pinot, see Supported Pinot window functions.
To query with window functions in Pinot, see examples for the supported window functions.
For information about the window function query syntax, see the Calcite documentation.
The following query shows the complete components of the window function. Note, PARTITION BY
and ORDER BY
are optional.
If a PARTITION BY clause is specified, the intermediate results will be grouped into different partitions based on the values of the columns appearing in the PARTITION BY clause.
If the PARTITION BY clause isn’t specified, the whole result will be regarded as one big partition, i.e. there is only one partition in the result set.
If an ORDER BY clause is specified, all the rows within the same partition will be sorted based on the values of the columns appearing in the window ORDER BY clause. The ORDER BY clause decides the order in which the rows within a partition are to be processed.
If no ORDER BY clause is specified while a PARTITION BY clause is specified, the order of the rows is undefined. If output ordering is desired a global ORDER BY clause should be used in the query.
The FRAME clause can be specified as either:
{RANGE|ROWS} frame_start, or
{RANGE|ROWS} BETWEEN frame_start AND frame_end
frame_start and frame_end can be any of:
UNBOUNDED PRECEDING
expression PRECEDING (may only be allowed in ROWS mode, depending on the database)
CURRENT ROW
expression FOLLOWING (may only be allowed in ROWS mode, depending on the database)
UNBOUNDED FOLLOWING
If no FRAME clause is specified, the default frame behavior depends on whether ORDER BY is present or not.
If an ORDER BY clause is specified, the default behavior is to calculate the aggregation from the beginning of the partition to the current row, i.e. UNBOUNDED PRECEDING to CURRENT ROW.
If only a PARTITION BY clause is present (no ORDER BY), the default frame is the whole partition, i.e. UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING.
If there is no FRAME, no PARTITION BY, and no ORDER BY clause specified in the OVER clause (empty OVER), the whole result set is regarded as one partition, and there's one frame in the window.
The OVER clause applies a specified supported window function to compute values over a group of rows, and returns a single result for each row. The OVER clause specifies how the rows are arranged and how the aggregation is done on those rows.
Inside the OVER clause, there are three optional components: the PARTITION BY clause, the ORDER BY clause, and the FRAME clause.
Calculate the rolling sum transaction amount ordered by the payment date for each customer ID (note, the default frame here is UNBOUNDED PRECEDING and CURRENT ROW).
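A sketch of such a query (the payments table and its columns are illustrative):

```sql
SELECT customer_id, payment_date, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) AS running_total
FROM payments
```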
Calculate the least (use MIN()) or most expensive (use MAX()) transaction made by each customer, comparing all transactions made by the customer (the default frame here is UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING). The following query shows how to find the least expensive transaction.
Calculate a customer’s average transaction amount for all transactions they’ve made (default frame here is UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING).
Use ROW_NUMBER() to rank team members by their year-to-date sales (the default frame here is UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING).
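A sketch of such a query (the sales_team table and its columns are illustrative):

```sql
SELECT ROW_NUMBER() OVER (ORDER BY total_sales_ytd DESC) AS row_num,
       first_name, last_name, total_sales_ytd
FROM sales_team
```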
Count the number of transactions made by each customer (default frame here is UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING).
Lookup UDF Join is only supported with the single-stage query engine (v1). For more information about using JOINs with the multi-stage query engine, see JOINs.
The lookup UDF is used to get dimension data via a primary key from a dimension table, enabling decoration-join functionality. The lookup UDF can only be used with a dimension table in Pinot.
The UDF syntax is listed below:
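A sketch of the signature, where the key/value pairs repeat for composite keys:

```sql
LOOKUP('dimTableName', 'dimColToLookUp', 'dimJoinKey1', factJoinKeyVal1, 'dimJoinKey2', factJoinKeyVal2, ...)
```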
dimTable: Name of the dimension table to perform the lookup on.
dimColToLookUp: The column name of the dimension table to be retrieved to decorate our result.
dimJoinKey: The column name on which we want to perform the lookup, i.e. the join column name for the dimension table.
factJoinKey: The column name on which we want to perform the lookup against, e.g. the join column name for the fact table.
Note that:
All the dim-table-related expressions are expressed as literal strings. This is a limitation of the LOOKUP UDF syntax: we cannot express a column identifier which doesn't exist in the query's main table, which is the factTable table.
The syntax definition [ '''dimJoinKey''', factJoinKey ]* indicates that if there are multiple dim partition columns, multiple join key pairs should be expressed.
Here are some examples. Consider the table baseballStats and the dimension table dimBaseballTeams; several acceptable queries are:
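For example (a sketch; teamName is assumed to be a column of dimBaseballTeams keyed by teamID):

```sql
SELECT playerName, teamID,
       LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName
FROM baseballStats
```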
Consider a single dimension table with schema:
BILLING SCHEMA
The data return type of the UDF will be that of the dimColToLookUp column type.
When multiple primary key columns are used for the dimension table (e.g. a composite primary key), ensure that the order of keys appearing in the lookup() UDF is the same as the order defined in primaryKeyColumns in the dimension table schema.
Pinot currently supports two ways for you to implement your own functions:
Groovy Scripts
Scalar Functions
Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing a Groovy script within a query is as follows:
GROOVY('result value metadata json', 'groovy script', arg0, arg1, arg2...)
This function will execute the Groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:
Result value metadata json - a JSON string representing the result value metadata. Must contain non-null keys resultType and isSingleValue.
Groovy script to execute - the Groovy script string, which uses arg0, arg1, arg2, etc. to refer to the arguments provided within the script.
arguments - Pinot columns/other transform functions that are arguments to the Groovy script.
Examples
Add colA and colB and return a single-value INT
groovy( '{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)
Find the max element in mvColumn array and return a single-value INT
groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)
Find all elements of the array mvColumn and return as a multi-value LONG column
groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)
Multiply length of array mvColumn with colB and return a single-value DOUBLE
groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)
Find all indexes in mvColumnA which have value foo, and add the values at those indexes in mvColumnB
groovy( '{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x' , mvColumnA, mvColumnB)
Switch case which returns a FLOAT value depending on length of mvCol array
groovy('{\"returnType\":\"FLOAT\", \"isSingleValue\":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)
Any Groovy script which takes no arguments
groovy('{"returnType":"STRING","isSingleValue":true}', 'new Date().format( "yyyyMMdd" )')
Allowing executable Groovy in queries can be a security vulnerability. Use caution and be aware of the security risks if you decide to allow Groovy. If you would like to enable Groovy in Pinot queries, you can set the following broker config:
pinot.broker.disable.query.groovy=false
If not set, Groovy in queries is disabled by default.
The above configuration applies across the entire Pinot cluster. If you want a table-level override to enable/disable Groovy queries, the following property can be set in the query table config.
Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions
Pinot automatically identifies and registers all the functions that have the @ScalarFunction
annotation.
Only Java methods are supported.
You can add new scalar functions as follows:
Create a new java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX
In your java project include the dependency
Annotate your methods with the @ScalarFunction annotation. Make sure the method is static and returns only a single value output. The input and output can have one of the following types:
Integer
Long
Double
String
Place the compiled JAR in the /plugins directory in Pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the function in a query as follows:
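For example, if the registered method is named myFunc, the query might look like this sketch (the function and column names are illustrative):

```sql
SELECT MYFUNC(playerName) AS transformedName
FROM baseballStats
LIMIT 10
```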
This document contains the list of all the transformation functions supported by Pinot SQL.
Function |
---|
Multiple string functions are supported out of the box from release-0.5.0 .
Function |
---|
Date time functions allow you to perform transformations on columns that contain timestamps or dates.
These functions can only be used in Pinot SQL queries.
These functions can be used for column transformation in table ingestion configs.
All of the functions mentioned so far only support single-value columns. You can use the following functions to perform operations on multi-value columns.
Pinot supports Geospatial queries on columns containing text-based geographies. For more details on the queries and how to enable them, see Geospatial.
Pinot supports pattern matching on text-based columns. Only the columns mentioned as text columns in table config can be queried using this method. For more details on how to enable pattern matching, see Text search support.
Minimum groups to keep when trimming groups at the segment level for group-by queries. See
Minimum groups to keep when trimming groups at the server level for group-by queries. See
For upsert-enabled table, skip the effect of upsert and query all the records. See
Returns the Nth percentile of the values for a numeric column using as Long
Returns the Nth percentile of the values for a numeric column using as Double
Returns the Nth percentile (using compression factor of CF) of the values for a numeric column using as Double
See
See
Returns the count of a multi-value column as Long
Returns the minimum value of a numeric multi-value column as Double
Returns the maximum value of a numeric multi-value column as Double
Returns the sum of the values for a numeric multi-value column as Double
Returns the average of the values for a numeric multi-value column as Double
Returns the max - min
value for a numeric multi-value column as Double
Returns the Nth percentile of the values for a numeric multi-value column as Double
Returns the Nth percentile using as Long
Returns the Nth percentile using as Double
Returns the Nth percentile (using compression factor CF) using as Double
Returns the count of distinct values for a multi-value column as Integer
Returns the count of distinct values for a multi-value column as Integer
. This function is accurate for INT or dictionary encoded column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collision.
Returns an approximate distinct count using HyperLogLog as Long
Returns HyperLogLog response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
Function | Description | Example | Default Value When No Record Selected |
---|---|---|---|
customer_id | payment_date | amount | sum |
---|---|---|---|
customer_id | payment_date | amount | min |
---|---|---|---|
customer_id | payment_date | amount | avg |
---|---|---|---|
Row | FirstName | LastName | Total sales YTD |
---|---|---|---|
customer_id | payment_date | amount | count |
---|---|---|---|
Column | Type |
---|---|
Column | Type |
---|---|
playerName | teamID | teamName | teamAddress |
---|---|---|---|
teamID | nameFromLocal | nameFromLookup |
---|---|---|
Column | Type |
---|---|
customerId | missedPayment | lookedupCity |
---|---|---|
Note that Groovy scripts do not accept built-in scalar functions that are specific to Pinot queries. See the section below for more information.
Enabling Groovy
Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.
Function |
---|
Function |
---|
Function |
---|
Function |
---|
Function |
---|
ADD(col1, col2, col3...) Sum of at least two values
SUB(col1, col2) Difference between two values
MULT(col1, col2, col3...) Product of at least two values
DIV(col1, col2) Quotient of two values
MOD(col1, col2) Modulo of two values
ABS(col1) Absolute of a value
CEIL(col1) Rounded up to the nearest integer.
FLOOR(col1) Rounded down to the nearest integer.
EXP(col1) Euler’s number(e) raised to the power of col.
LN(col1) Natural log of value i.e. ln(col1)
SQRT(col1) Square root of a value
UPPER(col) convert string to upper case
LOWER(col) convert string to lower case
REVERSE(col) reverse the string
SUBSTR(col, startIndex, endIndex) Gets substring of the input string from start to endIndex. Index begins at 0. Set endIndex to -1 to calculate till end of the string
CONCAT(col1, col2, separator) Concatenate two input strings using the separator
TRIM(col) trim spaces from both side of the string
LTRIM(col) trim spaces from left side of the string
RTRIM(col) trim spaces from right side of the string
LENGTH(col) calculate length of the string
STRPOS(col, find, N) Find the Nth instance of find string in input. Returns 0 if input string is empty. Returns -1 if the Nth instance is not found or input string is null.
STARTSWITH(col, prefix) returns true if the column starts with the prefix string.
REPLACE(col, find, substitute) replace all instances of find with substitute in input
RPAD(col, size, pad) string padded from the right side with pad to reach final size
LPAD(col, size, pad) string padded from the left side with pad to reach final size
CODEPOINT(col) the Unicode codepoint of the first character of the string
CHR(codepoint) the character corresponding to the Unicode codepoint
regexpExtract(value, regexp) Extracts values that match the provided regular expression
regexpReplace(input, matchRegexp, replaceRegexp, matchStartPos, occurrence, flag) Find and replace a string or regexp pattern with a target string or regexp pattern
remove(input, search) removes all instances of search from string
urlEncoding(string) url-encode a string with UTF-8 format
urlDecoding(string) decode a url to plaintext string
fromBase64(string) decode a Base64-encoded string to bytes represented as a hex string
toUtf8(string) decode a UTF8-encoded string to bytes represented as a hex string
isSubnetOf(ipPrefix, ipAddress) checks if ipAddress is in the subnet of the ipPrefix
TIMECONVERT(col, fromUnit, toUnit) Converts the value into another time unit. the column should be an epoch timestamp.
DATETIMECONVERT(columnName, inputFormat, outputFormat, outputGranularity) Converts the value into another date time format, and buckets time based on the given time granularity.
DATETRUNC Converts the value into a specified output granularity seconds since UTC epoch that is bucketed on a unit in a specified timezone.
ToEpoch<TIME_UNIT>(timeInMillis) Convert epoch milliseconds to epoch <Time Unit>.
ToEpoch<TIME_UNIT>Rounded(timeInMillis, bucketSize) Convert epoch milliseconds to epoch <Time Unit>, round to nearest rounding bucket(Bucket size is defined in <Time Unit>).
ToEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize) Convert epoch milliseconds to epoch <Time Unit>, and divided by bucket size(Bucket size is defined in <Time Unit>).
FromEpoch<TIME_UNIT>(timeIn<TIME_UNIT>) Convert epoch <Time Unit> to epoch milliseconds.
FromEpoch<TIME_UNIT>Bucket(timeIn<Time_UNIT>, bucketSizeIn<Time_UNIT>) Convert epoch <Bucket Size><Time Unit> to epoch milliseconds.
ToDateTime(timeInMillis, pattern[, timezoneId]) Convert epoch millis value to DateTime string represented by pattern.
FromDateTime(dateTimeString, pattern) Convert DateTime string represented by pattern to epoch millis.
round(timeValue, bucketSize) Round the given time value to nearest bucket start value.
now() Return current time as epoch millis
ago() Return time as epoch millis before the given period (in ISO-8601 duration format)
timezoneHour(timeZoneId) Returns the hour of the time zone offset.
timezoneMinute(timeZoneId) Returns the minute of the time zone offset.
year(tsInMillis) Returns the year from the given epoch millis in UTC timezone.
year(tsInMillis, timeZoneId) Returns the year from the given epoch millis and timezone id.
yearOfWeek(tsInMillis) Returns the year of the ISO week from the given epoch millis in UTC timezone. Alias yow is also supported.
yearOfWeek(tsInMillis, timeZoneId) Returns the year of the ISO week from the given epoch millis and timezone id. Alias yow is also supported.
quarter(tsInMillis) Returns the quarter of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 4.
quarter(tsInMillis, timeZoneId) Returns the quarter of the year from the given epoch millis and timezone id. The value ranges from 1 to 4.
month(tsInMillis) Returns the month of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 12.
month(tsInMillis, timeZoneId) Returns the month of the year from the given epoch millis and timezone id. The value ranges from 1 to 12.
week(tsInMillis) Returns the ISO week of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 53. Alias weekOfYear is also supported.
week(tsInMillis, timeZoneId) Returns the ISO week of the year from the given epoch millis and timezone id. The value ranges from 1 to 53. Alias weekOfYear is also supported.
dayOfYear(tsInMillis) Returns the day of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 366. Alias doy is also supported.
dayOfYear(tsInMillis, timeZoneId) Returns the day of the year from the given epoch millis and timezone id. The value ranges from 1 to 366. Alias doy is also supported.
day(tsInMillis) Returns the day of the month from the given epoch millis in UTC timezone. The value ranges from 1 to 31. Alias dayOfMonth is also supported.
day(tsInMillis, timeZoneId) Returns the day of the month from the given epoch millis and timezone id. The value ranges from 1 to 31. Alias dayOfMonth is also supported.
dayOfWeek(tsInMillis) Returns the day of the week from the given epoch millis in UTC timezone. The value ranges from 1 (Monday) to 7 (Sunday). Alias dow is also supported.
dayOfWeek(tsInMillis, timeZoneId) Returns the day of the week from the given epoch millis and timezone id. The value ranges from 1 (Monday) to 7 (Sunday). Alias dow is also supported.
hour(tsInMillis) Returns the hour of the day from the given epoch millis in UTC timezone. The value ranges from 0 to 23.
hour(tsInMillis, timeZoneId) Returns the hour of the day from the given epoch millis and timezone id. The value ranges from 0 to 23.
minute(tsInMillis) Returns the minute of the hour from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
minute(tsInMillis, timeZoneId) Returns the minute of the hour from the given epoch millis and timezone id. The value ranges from 0 to 59.
second(tsInMillis) Returns the second of the minute from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
second(tsInMillis, timeZoneId) Returns the second of the minute from the given epoch millis and timezone id. The value ranges from 0 to 59.
millisecond(tsInMillis) Returns the millisecond of the second from the given epoch millis in UTC timezone. The value ranges from 0 to 999.
millisecond(tsInMillis, timeZoneId) Returns the millisecond of the second from the given epoch millis and timezone id. The value ranges from 0 to 999.
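As a hedged sketch of the date-time functions above, the query below assumes a hypothetical events table with an epoch-millis column tsInMillis:

```sql
-- Hedged sketch: the table events and column tsInMillis are assumptions
SELECT
  ToDateTime(tsInMillis, 'yyyy-MM-dd HH:mm:ss') AS readableTime,
  DATETIMECONVERT(tsInMillis, '1:MILLISECONDS:EPOCH',
                  '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', '1:DAYS') AS dayBucket,
  year(tsInMillis) AS yr,
  month(tsInMillis, 'America/Los_Angeles') AS monthInPT,
  ToEpochDays(tsInMillis) AS epochDays
FROM events
WHERE tsInMillis > ago('P3D')   -- rows from the last three days
LIMIT 10
```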
JSONEXTRACTSCALAR(jsonField, 'jsonPath', 'resultsType', [defaultValue]) Evaluates the 'jsonPath' on jsonField and returns the result as the type 'resultsType', using the optional defaultValue for null or parsing errors.
JSONEXTRACTKEY(jsonField, 'jsonPath') Extracts all matched JSON field keys based on 'jsonPath' into a STRING_ARRAY.
EXTRACT(dateTimeField FROM dateTimeExpression) Extracts the field from the DATETIME expression of the format 'YYYY-MM-DD HH:MM:SS'. Currently, this transformation function supports the YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND fields.
JSONFORMAT(object) Converts an object to a JSON string.
JSONPATH(jsonField, 'jsonPath') Extracts the object value from jsonField based on 'jsonPath'. The result type is inferred from the JSON value. Cannot be used in queries because the data type is not specified.
JSONPATHLONG(jsonField, 'jsonPath', [defaultValue]) Extracts the Long value from jsonField based on 'jsonPath', using the optional defaultValue for null or parsing errors.
JSONPATHDOUBLE(jsonField, 'jsonPath', [defaultValue]) Extracts the Double value from jsonField based on 'jsonPath', using the optional defaultValue for null or parsing errors.
JSONPATHSTRING(jsonField, 'jsonPath', [defaultValue]) Extracts the String value from jsonField based on 'jsonPath', using the optional defaultValue for null or parsing errors.
JSONPATHARRAY(jsonField, 'jsonPath') Extracts an array from jsonField based on 'jsonPath'. The result type is inferred from the JSON value. Cannot be used in queries because the data type is not specified.
JSONPATHARRAYDEFAULTEMPTY(jsonField, 'jsonPath') Extracts an array from jsonField based on 'jsonPath'. The result type is inferred from the JSON value. Returns an empty array for null or parsing errors. Cannot be used in queries because the data type is not specified.
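To illustrate the JSON functions above, here is a sketch assuming a hypothetical clicks table with a STRING column payload that stores JSON such as {"user": {"id": 123}}:

```sql
-- Hedged sketch: the table clicks and column payload are assumptions
SELECT
  JSONEXTRACTSCALAR(payload, '$.user.id', 'LONG', 0) AS userId,  -- default 0 on null/parse error
  JSONEXTRACTKEY(payload, '$.*') AS topLevelKeys                 -- all top-level keys as STRING_ARRAY
FROM clicks
WHERE JSONEXTRACTSCALAR(payload, '$.user.id', 'LONG', 0) > 0
LIMIT 10
```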
SHA(bytesCol) Returns the SHA-1 digest of a binary column (bytes type) as a hex string.
SHA256(bytesCol) Returns the SHA-256 digest of a binary column (bytes type) as a hex string.
SHA512(bytesCol) Returns the SHA-512 digest of a binary column (bytes type) as a hex string.
MD5(bytesCol) Returns the MD5 digest of a binary column (bytes type) as a hex string.
toBase64(bytesCol) Returns the Base64-encoded string of a binary column (bytes type).
fromUtf8(bytesCol) Returns the UTF-8 string decoded from a binary column (bytes type).
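A quick sketch of the binary-column functions above, assuming a hypothetical files table with a BYTES column content:

```sql
-- Hedged sketch: the table files and column content are assumptions
SELECT
  SHA256(content) AS contentSha256,    -- hex string digest of the raw bytes
  toBase64(content) AS contentBase64,  -- Base64 representation of the raw bytes
  fromUtf8(content) AS contentText     -- bytes interpreted as a UTF-8 string
FROM files
LIMIT 10
```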
ARRAYLENGTH Returns the length of a multi-value column.
MAP_VALUE(mapColumn, 'myKey', valueColumn) Selects the value for a key from a Map stored in Pinot.
VALUEIN The transform function filters values from a multi-valued column against the given constant values. The VALUEIN transform function is especially useful when the same multi-valued column is both a filtering column and a grouping column, as shown in the sketch below.
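A sketch of VALUEIN used as both a filter and a grouping expression; the meetups table and its multi-value STRING column interests are assumptions:

```sql
-- Hedged sketch: the table meetups and column interests are assumptions
SELECT
  VALUEIN(interests, 'sports', 'music') AS interest,
  COUNT(*) AS cnt
FROM meetups
WHERE interests IN ('sports', 'music')
GROUP BY VALUEIN(interests, 'sports', 'music')
LIMIT 10
```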
AVG(playerScore) Returns the average of the values for a numeric column as a Double over the specified number of rows or partition (if applicable). Default value when no record is selected: Double.NEGATIVE_INFINITY.
BOOL_AND Returns true if all input values are true, otherwise false.
BOOL_OR Returns true if at least one input value is true, otherwise false.
COUNT(*) Returns the count of the records as a Long. Default value when no record is selected: 0.
MIN(playerScore) Returns the minimum value of a numeric column as a Double. Default value when no record is selected: Double.POSITIVE_INFINITY.
MAX(playerScore) Returns the maximum value of a numeric column as a Double. Default value when no record is selected: Double.NEGATIVE_INFINITY.
ROW_NUMBER() Assigns a unique row number to all the rows in a specified table. Default value when no record is selected: 0.
SUM(playerScore) Returns the sum of the values for a numeric column as a Double. Default value when no record is selected: 0.
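As a hedged sketch of these functions in a window context, the query below assumes a payment table with customer_id, payment_date, and amount columns (window functions require the multi-stage query engine); it would produce per-customer running totals like the rows shown next:

```sql
-- Hedged sketch: the table payment and its columns are assumptions
SELECT
  customer_id,
  payment_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) AS running_total
FROM payment
LIMIT 20
```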
1 | 2023-02-14 23:22:38.996577 | 5.99 | 5.99
1 | 2023-02-15 16:31:19.996577 | 0.99 | 6.98
1 | 2023-02-15 19:37:12.996577 | 9.99 | 16.97
1 | 2023-02-16 13:47:23.996577 | 4.99 | 21.96
2 | 2023-02-17 19:23:24.996577 | 2.99 | 2.99
2 | 2023-02-17 19:23:24.996577 | 0.99 | 3.98
3 | 2023-02-16 00:02:31.996577 | 8.99 | 8.99
3 | 2023-02-16 13:47:36.996577 | 6.99 | 15.98
3 | 2023-02-17 03:43:41.996577 | 6.99 | 22.97
4 | 2023-02-15 07:59:54.996577 | 4.99 | 4.99
4 | 2023-02-16 06:37:06.996577 | 0.99 | 5.98
1 | 2023-02-14 23:22:38.996577 | 5.99
1 | 2023-02-15 16:31:19.996577 | 0.99
1 | 2023-02-15 19:37:12.996577 | 9.99
2 | 2023-04-30 04:34:36.996577 | 4.99
2 | 2023-04-30 12:16:09.996577 | 10.99
3 | 2023-03-23 05:38:40.996577 | 2.99
3 | 2023-04-07 08:51:51.996577 | 3.99
3 | 2023-04-08 11:15:37.996577 | 4.99
1 | 2023-02-14 23:22:38.996577 | 5.99
1 | 2023-02-15 16:31:19.996577 | 0.99
1 | 2023-02-15 19:37:12.996577 | 9.99
2 | 2023-04-30 04:34:36.996577 | 4.99
2 | 2023-04-30 12:16:09.996577 | 10.99
3 | 2023-03-23 05:38:40.996577 | 2.99
3 | 2023-04-07 08:51:51.996577 | 3.99
3 | 2023-04-08 11:15:37.996577 | 4.99
1 | Joe | Smith
2 | Alice | Davis
3 | James | Jones
4 | Dane | Scott
1 | 2023-02-14 23:22:38.99657 | 10.99 | 2
1 | 2023-02-15 16:31:19.996577 | 8.99 | 2
2 | 2023-04-30 04:34:36.996577 | 23.50 | 3
2 | 2023-04-07 08:51:51.996577 | 12.35 | 3
2 | 2023-04-08 11:15:37.996577 | 8.29 | 3
playerID STRING
yearID INT
teamID STRING
league STRING
playerName STRING
playerStint INT
numberOfGames INT
numberOfGamesAsBatter INT
AtBatting INT
runs INT

teamID STRING
teamName STRING
teamAddress STRING
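Given these two column lists, here is a sketch of a join that would produce rows like the ones below; the table names baseballStats and dimBaseballTeams are assumptions, and a join like this requires the multi-stage query engine:

```sql
-- Hedged sketch: table names baseballStats and dimBaseballTeams are assumptions
SELECT
  a.playerName,
  a.teamID,
  b.teamName,
  b.teamAddress
FROM baseballStats AS a
LEFT JOIN dimBaseballTeams AS b
  ON a.teamID = b.teamID
WHERE a.playerName = 'David Allan'
LIMIT 10
```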
David Allan | BOS | Boston Red Caps/Beaneaters (from 1876–1900) or Boston Red Sox (since 1953) | 4 Jersey Street, Boston, MA
David Allan | CHA | null | null
David Allan | SEA | Seattle Mariners (since 1977) or Seattle Pilots (1969) | 1250 First Avenue South, Seattle, WA
David Allan | SEA | Seattle Mariners (since 1977) or Seattle Pilots (1969) | 1250 First Avenue South, Seattle, WA
ANA | Anaheim Angels | Anaheim Angels
ARI | Arizona Diamondbacks | Arizona Diamondbacks
ATL | Atlanta Braves | Atlanta Braves
BAL | Baltimore Orioles (original- 1901–1902 current- since 1954) | Baltimore Orioles (original- 1901–1902 current- since 1954)
customerId INT
creditHistory STRING
firstName STRING
lastName STRING
isCarOwner BOOLEAN
city STRING
maritalStatus STRING
buildingType STRING
missedPayment STRING
billingMonth STRING
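A hedged sketch of a query that would return rows like the ones below; the table name customers and the selected columns are assumptions based on the column list above:

```sql
-- Hedged sketch: the table name customers is an assumption
SELECT customerId, missedPayment, city
FROM customers
WHERE missedPayment = 'Paid'
LIMIT 5
```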
341 | Paid | Palo Alto
374 | Paid | Mountain View
398 | Paid | Palo Alto
427 | Paid | Cupertino
435 | Paid | Cupertino