Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.
Pinot currently supports two ways for you to implement your own functions:
Groovy Scripts
Scalar Functions
Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing a Groovy script within a query is as follows:
GROOVY('result value metadata json', 'groovy script', arg0, arg1, arg2...)
This function executes the Groovy script using the arguments provided and returns a result that matches the provided result value metadata. The function requires the following arguments:
Result value metadata json - JSON string representing the result value metadata. Must contain non-null keys resultType and isSingleValue.
Groovy script to execute - Groovy script string, which uses arg0, arg1, arg2 etc. to refer to the arguments provided within the script.
Arguments - Pinot columns or other transform functions that are arguments to the Groovy script.
Examples
Add colA and colB and return a single-value INT
groovy( '{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)
Find the max element in mvColumn array and return a single-value INT
groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)
Find all elements of the array mvColumn and return as a multi-value LONG column
groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)
Multiply length of array mvColumn with colB and return a single-value DOUBLE
groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)
Find all indexes in mvColumnA which have value foo, and add values at those indexes in mvColumnB
groovy( '{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x' , mvColumnA, mvColumnB)
Switch case which returns a FLOAT value depending on length of mvCol array
groovy('{\"returnType\":\"FLOAT\", \"isSingleValue\":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)
Any Groovy script which takes no arguments
groovy('{"returnType":"STRING","isSingleValue":true}', 'new Date().format( "yyyyMMdd" )')
Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions
Pinot automatically identifies and registers all the functions that have the @ScalarFunction annotation.
Only Java methods are supported.
You can add new scalar functions as follows:
Create a new Java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX
In your Java project, include the dependency
Annotate your methods with the @ScalarFunction annotation. Make sure the method is static and returns only a single value output. The input and output can have one of the following types:
Integer
Long
Double
String
Place the compiled JAR in the /plugins directory in Pinot. You will need to restart all Pinot instances if they are already running.
Now, you can use the function in a query as follows:
Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.
Cardinality estimation is a classic problem. Pinot solves it in multiple ways, each of which trades off accuracy against latency.
Functions:
DistinctCount(x) -> LONG
Returns accurate count for all unique values in a column.
The underlying implementation uses an IntOpenHashSet from the library it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.
It usually takes a lot of resources and time to compute accurate results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.
HyperLogLog is an approximation algorithm for unique counting. It uses a fixed number of bits to estimate the cardinality of a given data set.
Pinot leverages the HyperLogLog class from the library com.clearspring.analytics:stream:2.7.0 as the data structure to hold intermediate results.
Functions:
DistinctCountHLL(x) -> LONG
For column types INT/LONG/FLOAT/DOUBLE/STRING, Pinot treats each value as an individual entry to add into the HyperLogLog object, then computes the approximation by calling the method cardinality().
For column type BYTES, Pinot treats each value as a serialized HyperLogLog object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog).
All deserialized HyperLogLog objects are merged into one, then cardinality() is called to get the approximate unique count.
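The estimation idea can be illustrated with a toy HyperLogLog (a simplified Python sketch of the algorithm, not Pinot's clearspring-backed implementation; the register count and hash choice here are arbitrary):

```python
import hashlib
import math

class ToyHyperLogLog:
    """Toy HyperLogLog: hash each value, use the first p bits to pick a
    register, and keep the longest run of leading zeros seen per register."""

    def __init__(self, p=8):
        self.p = p
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)          # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:           # small-range (linear counting) correction
            zeros = self.registers.count(0)
            if zeros:
                raw = self.m * math.log(self.m / zeros)
        return int(raw)
```

With 2^8 registers the standard error is roughly 1.04/sqrt(256), about 6.5%, which is exactly the accuracy-for-memory trade-off this section describes.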
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating to perform distinct counting as well as evaluating set operations.
Functions:
DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> LONG
thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries.
predicates (optional): These are individual predicates of the form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfy these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates created by combining (AND/OR) individual predicates are supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT, where DIFF requires two arguments and UNION/INTERSECT allow more than two arguments.
In the example query below, the where clause is responsible for identifying the matching rows. Note that the where clause can be completely independent of the postAggregationExpression. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA' and device='mobile' in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression and returns the final cardinality of the resulting sketch.
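With exact sets standing in for the approximate sketches, the broker-side post-aggregation reduces to ordinary set algebra. A Python sketch of just those semantics (the ids and predicates below are made up for illustration):

```python
# Stand-ins for the per-predicate sketches each server builds: for every
# predicate, the union of column values from rows matching both the WHERE
# clause and that predicate (exact sets here; real theta sketches trade
# exactness for bounded memory).
usa_ids = {1, 2, 3, 4}       # rows matching country = 'USA'
mobile_ids = {3, 4, 5}       # rows matching device = 'mobile'

# Broker-side post-aggregation over the intermediate "sketches":
set_union = usa_ids | mobile_ids          # SET_UNION
set_intersect = usa_ids & mobile_ids      # SET_INTERSECT
set_diff = usa_ids - mobile_ids           # SET_DIFF (exactly two arguments)

# The query result is the cardinality of the resulting set
distinct_count = len(set_intersect)
```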
DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded Serialized Sketch Bytes
This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality of the sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into a sketch by using the library org.apache.commons.codec.binary as Hex.decodeHex(stringValue.toCharArray()).
To see how JSON data can be queried, assume that we have the following table:
We also assume that "jsoncolumn" has a JSON index on it. Note that the last two rows in the table have a different structure than the rest of the rows. In keeping with the JSON specification, a JSON column can contain any valid JSON data and doesn't need to adhere to a predefined schema. To pull out the entire JSON document for each row, we can run the query below:
"101", "{"name":{"first":"daffy","last":"duck"},"score":101,"data":["a","b","c","d"]}"
"102", "{"name":{"first":"donald","last":"duck"},"score":102,"data":["a","b","e","f"]}"
"103", "{"name":{"first":"mickey","last":"mouse"},"score":103,"data":["a","b","g","h"]}"
"104", "{"name":{"first":"minnie","last":"mouse"},"score":104,"data":["a","b","i","j"]}"
"105", "{"name":{"first":"goofy","last":"dwag"},"score":104,"data":["a","b","i","j"]}"
"106", "{"person":{"name":"daffy duck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
"107", "{"person":{"name":"scrooge mcduck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}}"
To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name.
"101", "duck", "daffy", "b"
"102", "duck", "donald", "b"
"103", "mouse", "mickey", "b"
"104", "mouse", "minnie", "b"
"105", "dwag", "goofy", "b"
"106", "null", "null", "null"
"107", "null", "null", "null"
Note that the third column (jsoncolumn.data[1]) is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath jsoncolumn.data[1]. We can filter out these rows.
"101", "duck", "daffy", "b"
"102", "duck", "donald", "b"
"103", "mouse", "mickey", "b"
"104", "mouse", "minnie", "b"
"105", "dwag", "goofy", "b"
Notice that certain last names (duck and mouse for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression.
"mouse", "2"
"duck", "2"
"dwag", "1"
There is also numerical information (jsoncolumn.score) embedded within the JSON document. We can extract those numerical values from the JSON data and sum them up using the query below.
"mouse", "207"
"dwag", "104"
"duck", "203"
In short, JSON querying support in Pinot allows you to use a JsonPath expression wherever you can use a column name, with the only difference being that to query a column with data type JSON, you must append a JsonPath expression after the name of the column.
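To make this behavior concrete outside Pinot, here is a Python sketch that supports only simple dotted paths (a simplification of full JsonPath) and reproduces the last-name GROUP BY counts from the sample data above:

```python
import json
from collections import Counter

# Sample documents mirroring the table above (trimmed to the relevant fields)
rows = [
    '{"name":{"first":"daffy","last":"duck"},"score":101}',
    '{"name":{"first":"donald","last":"duck"},"score":102}',
    '{"name":{"first":"mickey","last":"mouse"},"score":103}',
    '{"name":{"first":"minnie","last":"mouse"},"score":104}',
    '{"name":{"first":"goofy","last":"dwag"},"score":104}',
    '{"person":{"name":"daffy duck"}}',
]

def extract(doc_str, dotted_path):
    """Walk a simple dotted path such as 'name.last'; return None when any
    key is missing, mirroring the null cells Pinot returns for absent paths."""
    node = json.loads(doc_str)
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# SELECT jsoncolumn.name.last, count(*) FROM t GROUP BY jsoncolumn.name.last
last_names = [extract(r, "name.last") for r in rows]
counts = Counter(v for v in last_names if v is not None)
```

The document without a name key yields None, just as the Pinot results show null for rows 106 and 107.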
Pinot provides support for aggregations using GROUP BY. You can use the following functions to get the aggregated value.
Function
Description
Example
COUNT
Get the count of rows in a group
COUNT(*)
MIN
Get the minimum value in a group
MIN(playerScore)
MAX
Get the maximum value in a group
MAX(playerScore)
SUM
Get the sum of values in a group
SUM(playerScore)
AVG
Get the average of the values in a group
AVG(playerScore)
MODE
Get the most frequent value in a group. When multiple modes are present it gives the minimum of all the modes. This behavior can be overridden to get the maximum or the average mode.
MODE(playerScore)
MODE(playerScore, 'MIN')
MODE(playerScore, 'MAX')
MODE(playerScore, 'AVG')
MINMAXRANGE
Returns the max - min value in a group
MINMAXRANGE(playerScore)
PERCENTILE(column, N)
Returns the Nth percentile of the group where N is a decimal number between 0 and 100 inclusive
PERCENTILE(playerScore, 50), PERCENTILE(playerScore, 99.9)
PERCENTILEEST(column, N)
Returns the Nth percentile of the group using the Quantile Digest algorithm
PERCENTILEEST(playerScore, 50), PERCENTILEEST(playerScore, 99.9)
PERCENTILETDIGEST(column, N)
Returns the Nth percentile of the group using the T-Digest algorithm
PERCENTILETDIGEST(playerScore, 50), PERCENTILETDIGEST(playerScore, 99.9)
DISTINCT
Returns the distinct row values in a group
DISTINCT(playerName)
DISTINCTCOUNT
Returns the count of distinct row values in a group
DISTINCTCOUNT(playerName)
DISTINCTCOUNTBITMAP
Returns the count of distinct row values in a group. This function is accurate for INT columns, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions.
DISTINCTCOUNTBITMAP(playerName)
DISTINCTCOUNTHLL
Returns an approximate distinct count using HyperLogLog. It also takes an optional second argument to configure the log2m for the HyperLogLog.
DISTINCTCOUNTHLL(playerName, 12)
DISTINCTCOUNTRAWHLL
Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLL(playerName)
FASTHLL (Deprecated)
WARN: will be deprecated soon. FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format
FASTHLL(playerName)
DISTINCTCOUNTTHETASKETCH
See the Theta Sketch functions described above.
DISTINCTCOUNTRAWTHETASKETCH
See the Theta Sketch functions described above.
SEGMENTPARTITIONEDDISTINCTCOUNT
Returns the count of distinct values of a column when the column is pre-partitioned for each segment, where there is no common value within different segments. This function calculates the exact count of distinct values within the segment, then simply sums up the results from different segments to get the final result.
SEGMENTPARTITIONEDDISTINCTCOUNT(playerName)
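The MODE tie-breaking options above can be stated precisely in a short sketch (a Python stand-in for the documented semantics; the function and argument names are illustrative):

```python
from collections import Counter

def mode(values, reducer="MIN"):
    """Most frequent value in a group; when several values tie for the
    highest frequency, reduce the tied set with MIN (default), MAX, or AVG."""
    counts = Counter(values)
    top = max(counts.values())
    modes = [v for v, c in counts.items() if c == top]
    if reducer == "MIN":
        return min(modes)
    if reducer == "MAX":
        return max(modes)
    if reducer == "AVG":
        return sum(modes) / len(modes)
    raise ValueError(f"unknown reducer {reducer!r}")
```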
The following aggregation functions can be used for multi-value columns
Function
Description
Example
COUNTMV
Get the count of rows in a group
COUNTMV(playerName)
MINMV
Get the minimum value in a group
MINMV(playerScores)
MAXMV
Get the maximum value in a group
MAXMV(playerScores)
SUMMV
Get the sum of values in a group
SUMMV(playerScores)
AVGMV
Get the avg of values in a group
AVGMV(playerScores)
MINMAXRANGEMV
Returns the max - min value in a group
MINMAXRANGEMV(playerScores)
PERCENTILEMV(column, N)
Returns the Nth percentile of the group where N is a decimal number between 0 and 100 inclusive
PERCENTILEMV(playerScores, 50),
PERCENTILEMV(playerScores, 99.9)
PERCENTILEESTMV(column, N)
PERCENTILEESTMV(playerScores, 50),
PERCENTILEESTMV(playerScores, 99.9)
PERCENTILETDIGESTMV(column, N)
PERCENTILETDIGESTMV(playerScores, 50),
PERCENTILETDIGESTMV(playerScores, 99.9)
DISTINCTCOUNTMV
Returns the count of distinct row values in a group
DISTINCTCOUNTMV(playerNames)
DISTINCTCOUNTBITMAPMV
Returns the count of distinct row values in a group. This function is accurate for INT or dictionary-encoded columns, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions.
DISTINCTCOUNTBITMAPMV(playerNames)
DISTINCTCOUNTHLLMV
Returns an approximate distinct count using HyperLogLog in a group
DISTINCTCOUNTHLLMV(playerNames)
DISTINCTCOUNTRAWHLLMV
Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLLMV(playerNames)
FASTHLLMV (Deprecated)
stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format
FASTHLLMV(playerNames)
Learn how to write fast queries for looking up ids in a list of values.
A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but this approach doesn't perform well with large lists of ids. In these cases, you can use an IdSet.
ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03')
This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:
INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case a Bloom Filter is used.
LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case a Bloom Filter is used.
Other types - Bloom Filter
The following parameters are used to configure the Bloom Filter:
expectedInsertions - Number of expected insertions for the BloomFilter, must be positive
fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0
Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.
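The no-false-negatives / possible-false-positives behavior can be demonstrated with a toy Bloom filter (a Python sketch, not the filter implementation Pinot uses; the bit count and hash scheme here are arbitrary):

```python
import hashlib

class ToyBloomFilter:
    """Toy Bloom filter: k hash positions per value over a fixed bit array.
    Inserted values are always reported present; other values may collide
    with set bits and be reported present too (false positives)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit array packed into an int

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))
```

expectedInsertions and fpp play the analogous sizing role for the real filter: they determine how many bits and hash functions are needed to keep the false-positive rate near the requested value.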
IN_ID_SET(columnName, base64EncodedIdSet)
This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.
IN_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.
IN_PARTITIONED_SUBQUERY(columnName, subQuery)
This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.
This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.
You can create an IdSet of the values in the yearID column by running the following:
When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:
We can also configure the fpp parameter:
We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:
To return rows for yearIDs not in the IdSet, run the following:
To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:
To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:
To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:
The lookup UDF is used to get dimension data via primary key from a dimension table, enabling decoration join functionality. The lookup UDF can only be used with dimension tables in Pinot. The UDF signature is as below:
dimTableName
Name of the dim table to perform the lookup on.
dimColToLookUp
The column name of the dim table to be retrieved to decorate our result.
dimJoinKey
The column name on which we want to perform the lookup i.e. the join column name for dim table.
factJoinKeyVal
The value of the dim table join column for which we will retrieve the dimColToLookUp for the scope and invocation.
Return type of the UDF will be that of the dimColToLookUp column type. There can also be multiple primary keys and corresponding values.
Note: If the dimension table uses a composite primary key, i.e. multiple primary keys, ensure that the order of keys appearing in the lookup() UDF is the same as the order defined for "primaryKeyColumns" in the dimension table schema.
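In effect, lookup() is a hash probe into the dimension table keyed on the join column. A Python sketch with hypothetical dimension data (the table and column names below are made up for illustration):

```python
# Hypothetical dimension table, keyed on its primary key column "teamID"
dim_team = {
    "NYA": {"teamName": "Yankees", "teamCity": "New York"},
    "BOS": {"teamName": "Red Sox", "teamCity": "Boston"},
}

def lookup(dim_table, dim_col_to_look_up, fact_join_key_val):
    """Return dimColToLookUp from the dimension row whose join key equals
    the fact-side value; None when no dimension row matches."""
    row = dim_table.get(fact_join_key_val)
    return row[dim_col_to_look_up] if row is not None else None

# Decoration join, analogous to:
#   SELECT playerName, lookup('dimTeams', 'teamName', 'teamID', teamID) ...
team_name = lookup(dim_team, "teamName", "NYA")
```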
This document contains the list of all the transformation functions supported by Pinot SQL.
Multiple string functions are supported out of the box from release 0.5.0.
Date time functions allow you to perform transformations on columns that contain timestamps or dates.
Usage
'jsonPath' and 'results_type' are literals. Pinot uses single quotes to distinguish them from identifiers.
e.g. JSONEXTRACTSCALAR(profile_json_str, '$.name', 'STRING') is valid, while JSONEXTRACTSCALAR(profile_json_str, "$.name", "STRING") is invalid.
Transform functions can only be used in Pinot SQL. Scalar functions can be used for column transformation in table ingestion configs.
Examples
The examples below are based on these 3 sample profile JSON documents:
Query 1: Extract string values from the field 'name'
Results are
Query 2: Extract integer values from the field 'age'
Results are
Query 3: Extract Bob's age from the JSON profile.
Results are
Query 4: Extract all field keys of JSON profile.
Results are
Another example of extracting JSON fields from below JSON record:
Extract JSON fields:
All of the functions mentioned so far support only single-value columns. You can use the following functions to perform operations on multi-value columns.
Pinot supports Geospatial queries on columns containing text-based geographies. For more details on these queries and how to enable them, see the Geospatial documentation.
Pinot supports pattern matching on text-based columns. Only columns configured as text columns in the table config can be queried using this method. For more details on how to enable pattern matching, see the text search documentation.
ADD(col1, col2, col3...)
Sum of at least two values
ADD(score_maths, score_science, score_history)
SUB(col1, col2)
Difference between two values
SUB(total_score, score_science)
MULT(col1, col2, col3...)
Product of at least two values
MULT(score_maths, score_science, score_history)
DIV(col1, col2)
Quotient of two values
DIV(total_score, total_subjects)
MOD(col1, col2)
Modulo of two values
MOD(total_score, total_subjects)
ABS(col1)
Absolute of a value
ABS(score)
CEIL(col1)
Rounded up to the nearest integer.
CEIL(percentage)
FLOOR(col1)
Rounded down to the nearest integer.
FLOOR(percentage)
EXP(col1)
Euler's number (e) raised to the power of the column value.
EXP(age)
LN(col1)
Natural log of value i.e. ln(col1)
LN(age)
SQRT(col1)
Square root of a value
SQRT(height)
UPPER(col)
Convert the string to uppercase
UPPER(playerName)
LOWER(col)
Convert the string to lowercase
LOWER(playerName)
REVERSE(col)
Reverse the string
REVERSE(playerName)
SUBSTR(col, startIndex, endIndex)
Get the substring of the input string from startIndex to endIndex. Index begins at 0. Set endIndex to -1 to read to the end of the string
SUBSTR(playerName, 1, -1)
SUBSTR(playerName, 1, 4)
CONCAT(col1, col2, separator)
Concatenate two input strings using the separator
CONCAT(firstName, lastName, '-')
TRIM(col)
Trim spaces from both sides of the string
TRIM(playerName)
LTRIM(col)
Trim spaces from the left side of the string
LTRIM(playerName)
RTRIM(col)
Trim spaces from the right side of the string
RTRIM(playerName)
LENGTH(col)
Calculate the length of the string
LENGTH(playerName)
STRPOS(col, find, N)
Find the Nth instance of the find string in the input string. Returns 0 if the input string is empty. Returns -1 if the Nth instance is not found or the input string is null
STRPOS(playerName, 'david', 1)
STARTSWITH(col, prefix)
Returns true if the column starts with the prefix string
STARTSWITH(playerName, 'david')
REPLACE(col, find, substitute)
Replace all instances of find with substitute in the input string
REPLACE(playerName, 'david', 'henry')
RPAD(col, size, pad)
String padded from the right side with pad to reach the final size
RPAD(playerName, 20, 'foo')
LPAD(col, size, pad)
String padded from the left side with pad to reach the final size
LPAD(playerName, 20, 'foo')
CODEPOINT(col)
The Unicode codepoint of the first character of the string
CODEPOINT(playerName)
CHR(codepoint)
The character corresponding to the Unicode codepoint
CHR(68)
TIMECONVERT(col, fromUnit, toUnit)
Converts the value into another time unit. The column should be an epoch timestamp. Supported units are
DAYS HOURS MINUTES SECONDS MILLISECONDS MICROSECONDS NANOSECONDS
TIMECONVERT(time, 'MILLISECONDS', 'SECONDS')
This expression converts the value of the column time (taken to be in milliseconds) to the nearest second (i.e. the nearest second that is lower than the value of the time column).
DATETIMECONVERT(columnName, inputFormat, outputFormat, outputGranularity)
Takes 4 arguments, converts the value into another date time format, and buckets time based on the given time granularity. Note that for weeks/months/quarters/years, please use the function DATETRUNC.
The format is expressed as <time size>:<time unit>:<time format>:<pattern> where:
time size - size of the time unit, e.g. 1, 10
time unit - DAYS HOURS MINUTES SECONDS MILLISECONDS MICROSECONDS NANOSECONDS
time format - EPOCH or SIMPLE_DATE_FORMAT
pattern - defined in the case of SIMPLE_DATE_FORMAT, e.g. yyyy-MM-dd. A specific timezone can be passed using tz(timezone). The timezone can be in long or short string format, e.g. Asia/Kolkata or PDT
granularity - specified in the format <time size>:<time unit>
Convert Date from hoursSinceEpoch to daysSinceEpoch and bucket it to 1 day granularity
DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:EPOCH', '1:DAYS')
Bucket Date to 15 minutes granularity
DATETIMECONVERT(Date, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '15:MINUTES')
Convert Date from hoursSinceEpoch to format yyyyMMdd and bucket it to 1 day granularity
DATETIMECONVERT(Date, '1:HOURS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd', '1:DAYS')
Convert Date from milliseconds to format yyyyMMdd in timezone PST
DATETIMECONVERT(Date, '1:MILLISECONDS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd tz(America/Los_Angeles)', '1:DAYS')
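The granularity argument in the examples above is a truncation to bucket boundaries. A Python sketch of just the bucketing arithmetic for the 15-minute example (the format conversion is omitted; the helper name is illustrative):

```python
def bucket_millis(ts_millis, bucket_minutes):
    """Truncate an epoch-millis timestamp to the start of its
    bucket_minutes-sized bucket, returning epoch millis."""
    bucket_ms = bucket_minutes * 60 * 1000
    return ts_millis - ts_millis % bucket_ms

# 2021-02-16T10:45:03Z falls in the 15-minute bucket starting at 10:45:00
bucketed = bucket_millis(1613472303000, 15)
```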
DATETRUNC
(Presto) SQL compatible date truncation, equivalent to the Presto function date_trunc.
Takes at least 3 and up to 5 arguments, converts the value into a specified output granularity in seconds since UTC epoch that is bucketed on a unit in a specified timezone.
DATETRUNC('week', time_in_seconds, 'SECONDS')
This expression converts the column time_in_seconds, which is a long containing seconds since UTC epoch, truncated at WEEK (where a week starts at Monday UTC midnight). The output is a long of seconds since UTC epoch.
DATETRUNC('quarter', DIV(time_milliseconds, 1000), 'SECONDS', 'America/Los_Angeles', 'HOURS')
This expression converts the expression time_milliseconds/1000 into hours that are truncated to QUARTER at the Los Angeles time zone (where a quarter begins on 1/1, 4/1, 7/1, 10/1 in the Los Angeles timezone). The output is expressed as hours since UTC epoch (note that the output is not in the Los Angeles timezone).
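The WEEK case from the first example can be reproduced with standard library datetimes (a sketch of the UTC path only; the timezone-aware variants additionally need the zone rules Pinot applies internally):

```python
from datetime import datetime, timedelta, timezone

def date_trunc_week_seconds(ts_seconds):
    """DATETRUNC('week', ts, 'SECONDS') for UTC: snap an epoch-seconds
    value back to the most recent Monday 00:00:00 UTC."""
    dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    monday = (dt - timedelta(days=dt.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int(monday.timestamp())
```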
ToEpoch<TIME_UNIT>(timeInMillis)
Convert epoch milliseconds to epoch <Time Unit>. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
ToEpochSeconds(tsInMillis): Converts the column tsInMillis value from epoch milliseconds to epoch seconds.
ToEpochDays(tsInMillis): Converts the column tsInMillis value from epoch milliseconds to epoch days.
ToEpoch<TIME_UNIT>Rounded(timeInMillis, bucketSize)
Convert epoch milliseconds to epoch <Time Unit>, rounded down to the nearest bucket (the bucket size is defined in <Time Unit>). Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
ToEpochSecondsRounded(tsInMillis, 10): Converts the column tsInMillis value from epoch milliseconds to epoch seconds and rounds to the 10-second bucket value. E.g. ToEpochSecondsRounded(1613472303000, 10) = 1613472300
ToEpochMinutesRounded(tsInMillis, 1440): Converts the column tsInMillis value from epoch milliseconds to epoch minutes and rounds to the 1-day (1440-minute) bucket value. E.g. ToEpochMinutesRounded(1613472303000, 1440) = 26890560
ToEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize)
Convert epoch milliseconds to epoch <Time Unit>, then divide by the bucket size (the bucket size is defined in <Time Unit>). Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
ToEpochSecondsBucket(tsInMillis, 10): Converts the column tsInMillis value from epoch milliseconds to epoch seconds, then divides by 10 to get the 10-seconds-since-epoch value. E.g. ToEpochSecondsBucket(1613472303000, 10) = 161347230
ToEpochHoursBucket(tsInMillis, 24): Converts the column tsInMillis value from epoch milliseconds to epoch hours, then divides by 24 to get the 24-hours-since-epoch value.
FromEpoch<TIME_UNIT>(timeIn<Time_UNIT>)
Convert epoch <Time Unit> to epoch milliseconds. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
FromEpochSeconds(tsInSeconds): Converts the column tsInSeconds value from epoch seconds to epoch milliseconds. E.g. FromEpochSeconds(1613472303) = 1613472303000
FromEpoch<TIME_UNIT>Bucket(timeIn<Time_UNIT>, bucketSizeIn<Time_UNIT>)
Convert epoch <Bucket Size><Time Unit> to epoch milliseconds, e.g. 10-seconds-since-epoch or 5-minutes-since-epoch. Supported <Time Unit>: SECONDS/MINUTES/HOURS/DAYS
FromEpochSecondsBucket(tsInSeconds, 10): Converts the column tsInSeconds value from epoch 10-seconds to epoch milliseconds. E.g. FromEpochSecondsBucket(161347231, 10) = 1613472310000
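The Rounded, Bucket, and From variants above are plain integer arithmetic. A Python sketch reproducing the worked values (the helper names are illustrative, not Pinot API):

```python
MILLIS_PER = {
    "SECONDS": 1_000,
    "MINUTES": 60_000,
    "HOURS": 3_600_000,
    "DAYS": 86_400_000,
}

def to_epoch_rounded(ts_millis, unit, bucket_size):
    """ToEpoch<Unit>Rounded: convert millis to the unit, then snap down
    to the nearest bucket_size boundary (still expressed in the unit)."""
    in_unit = ts_millis // MILLIS_PER[unit]
    return in_unit - in_unit % bucket_size

def to_epoch_bucket(ts_millis, unit, bucket_size):
    """ToEpoch<Unit>Bucket: convert millis to the unit, then divide by the
    bucket size, yielding the bucket ordinal since epoch."""
    return ts_millis // MILLIS_PER[unit] // bucket_size

def from_epoch_bucket(bucket_value, unit, bucket_size):
    """FromEpoch<Unit>Bucket: the inverse direction, back to epoch millis."""
    return bucket_value * bucket_size * MILLIS_PER[unit]
```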
ToDateTime(timeInMillis, pattern[, timezoneId])
Convert an epoch millis value to a DateTime string represented by the pattern. The time zone will be set to UTC if timezoneId is not specified.
ToDateTime(tsInMillis, 'yyyy-MM-dd') converts the tsInMillis value to the date time pattern yyyy-MM-dd
ToDateTime(tsInMillis, 'yyyy-MM-dd ZZZ', 'America/Los_Angeles') converts the tsInMillis value to the date time pattern yyyy-MM-dd ZZZ in the America/Los_Angeles time zone
FromDateTime(dateTimeString, pattern)
Convert a DateTime string represented by the pattern to epoch millis.
FromDateTime(dateTime, 'yyyy-MM-dd') converts the dateTime string value to an epoch millis value
round(timeValue, bucketSize)
Round the given time value to nearest bucket start value.
round(tsInSeconds, 60)
round seconds epoch value to the start value of the 60 seconds bucket it belongs to. E.g. round(161347231, 60)= 161347200
now()
Return the current time as epoch millis
Typically used in a predicate to filter on timestamp for recent data, e.g. to filter data from the most recent day (86400 seconds): WHERE tsInMillis > now() - 86400000
timezoneHour(timeZoneId)
Returns the hour of the time zone offset.
timezoneMinute(timeZoneId)
Returns the minute of the time zone offset.
year(tsInMillis)
Returns the year from the given epoch millis in UTC timezone.
year(tsInMillis, timeZoneId)
Returns the year from the given epoch millis and timezone id.
yearOfWeek(tsInMillis)
Returns the year of the ISO week from the given epoch millis in UTC timezone. Alias yow
is also supported.
yearOfWeek(tsInMillis, timeZoneId)
Returns the year of the ISO week from the given epoch millis and timezone id. Alias yow
is also supported.
quarter(tsInMillis)
Returns the quarter of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 4.
quarter(tsInMillis, timeZoneId)
Returns the quarter of the year from the given epoch millis and timezone id. The value ranges from 1 to 4.
month(tsInMillis)
Returns the month of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 12.
month(tsInMillis, timeZoneId)
Returns the month of the year from the given epoch millis and timezone id. The value ranges from 1 to 12.
week(tsInMillis)
Returns the ISO week of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 53. Alias weekOfYear
is also supported.
week(tsInMillis, timeZoneId)
Returns the ISO week of the year from the given epoch millis and timezone id. The value ranges from 1 to 53. Alias weekOfYear
is also supported.
dayOfYear(tsInMillis)
Returns the day of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 366. Alias doy
is also supported.
dayOfYear(tsInMillis, timeZoneId)
Returns the day of the year from the given epoch millis and timezone id. The value ranges from 1 to 366. Alias doy
is also supported.
day(tsInMillis)
Returns the day of the month from the given epoch millis in UTC timezone. The value ranges from 1 to 31. Alias dayOfMonth
is also supported.
day(tsInMillis, timeZoneId)
Returns the day of the month from the given epoch millis and timezone id. The value ranges from 1 to 31. Alias dayOfMonth
is also supported.
dayOfWeek(tsInMillis)
Returns the day of the week from the given epoch millis in UTC timezone. The value ranges from 1(Monday) to 7(Sunday). Alias dow
is also supported.
dayOfWeek(tsInMillis, timeZoneId)
Returns the day of the week from the given epoch millis and timezone id. The value ranges from 1(Monday) to 7(Sunday). Alias dow
is also supported.
hour(tsInMillis)
Returns the hour of the day from the given epoch millis in UTC timezone. The value ranges from 0 to 23.
hour(tsInMillis, timeZoneId)
Returns the hour of the day from the given epoch millis and timezone id. The value ranges from 0 to 23.
minute(tsInMillis)
Returns the minute of the hour from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
minute(tsInMillis, timeZoneId)
Returns the minute of the hour from the given epoch millis and timezone id. The value ranges from 0 to 59.
second(tsInMillis)
Returns the second of the minute from the given epoch millis in UTC timezone. The value ranges from 0 to 59.
second(tsInMillis, timeZoneId)
Returns the second of the minute from the given epoch millis and timezone id. The value ranges from 0 to 59.
millisecond(tsInMillis)
Returns the millisecond of the second from the given epoch millis in UTC timezone. The value ranges from 0 to 999.
millisecond(tsInMillis, timeZoneId)
Returns the millisecond of the second from the given epoch millis and timezone id. The value ranges from 0 to 999.
| Function | Type | Description |
| --- | --- | --- |
| JSONEXTRACTSCALAR(jsonField, 'jsonPath', 'resultsType', [defaultValue]) | Transform | Evaluates the 'jsonPath' on jsonField and returns the result as the type 'resultsType'; uses the optional defaultValue for null or parsing errors. |
| JSONEXTRACTKEY(jsonField, 'jsonPath') | Transform | Extracts all matched JSON field keys based on 'jsonPath' into a STRING_ARRAY. |
| TOJSONMAPSTR(map) | Scalar | Converts a map to a JSON string. |
| JSONFORMAT(object) | Scalar | Converts an object to a JSON string. |
| JSONPATH(jsonField, 'jsonPath') | Scalar | Extracts the object value from jsonField based on 'jsonPath'; the result type is inferred from the JSON value. Cannot be used in a query because the data type is not specified. |
| JSONPATHLONG(jsonField, 'jsonPath', [defaultValue]) | Scalar | Extracts the Long value from jsonField based on 'jsonPath'; uses the optional defaultValue for null or parsing errors. |
| JSONPATHDOUBLE(jsonField, 'jsonPath', [defaultValue]) | Scalar | Extracts the Double value from jsonField based on 'jsonPath'; uses the optional defaultValue for null or parsing errors. |
| JSONPATHSTRING(jsonField, 'jsonPath', [defaultValue]) | Scalar | Extracts the String value from jsonField based on 'jsonPath'; uses the optional defaultValue for null or parsing errors. |
| JSONPATHARRAY(jsonField, 'jsonPath') | Scalar | Extracts an array from jsonField based on 'jsonPath'; the result type is inferred from the JSON value. Cannot be used in a query because the data type is not specified. |
| JSONPATHARRAYDEFAULTEMPTY(jsonField, 'jsonPath') | Scalar | Extracts an array from jsonField based on 'jsonPath'; the result type is inferred from the JSON value. Returns an empty array for null or parsing errors. Cannot be used in a query because the data type is not specified. |

| Argument | Description |
| --- | --- |
| jsonField | An identifier/expression that contains JSON documents. |
| 'jsonPath' | Follows JsonPath syntax to read values from JSON documents. |
| 'resultsType' | One of the Pinot-supported data types: INT, LONG, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, STRING, INT_ARRAY, LONG_ARRAY, FLOAT_ARRAY, DOUBLE_ARRAY, STRING_ARRAY. |
| Expression | Value |
| --- | --- |
| JSONPATH(myJsonRecord, '$.name') | "Pete" |
| JSONPATH(myJsonRecord, '$.age') | 24 |
| JSONPATHSTRING(myJsonRecord, '$.age') | "24" |
| JSONPATHARRAY(myJsonRecord, '$.subjects[*].name') | ["maths", "english"] |
| JSONPATHARRAY(myJsonRecord, '$.subjects[*].score') | [90, 70] |
| JSONPATHARRAY(myJsonRecord, '$.subjects[*].homework_grades[1]') | [85, 65] |
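The same extractions can be sketched in plain Python against a record consistent with the values above. The exact myJsonRecord document is an assumption reconstructed from those outputs (in particular, the first homework grade of each subject is made up), so treat this as an analogue of the JsonPath evaluation, not Pinot's implementation:

```python
# Hypothetical record, reconstructed from the example outputs above
my_json_record = {
    "name": "Pete",
    "age": 24,
    "subjects": [
        {"name": "maths", "score": 90, "homework_grades": [80, 85]},
        {"name": "english", "score": 70, "homework_grades": [60, 65]},
    ],
}

# JSONPATH(myJsonRecord, '$.name') -> "Pete"
print(my_json_record["name"])
# JSONPATHSTRING(myJsonRecord, '$.age') -> "24" (value coerced to a string)
print(str(my_json_record["age"]))
# JSONPATHARRAY(myJsonRecord, '$.subjects[*].name') -> ["maths", "english"]
print([s["name"] for s in my_json_record["subjects"]])
# JSONPATHARRAY(myJsonRecord, '$.subjects[*].homework_grades[1]') -> [85, 65]
print([s["homework_grades"][1] for s in my_json_record["subjects"]])
```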
| Function | Description | Example |
| --- | --- | --- |
| SHA(bytesCol) | Returns the SHA-1 digest of a binary column (bytes type) as a hex string. | SHA(rawData) |
| SHA256(bytesCol) | Returns the SHA-256 digest of a binary column (bytes type) as a hex string. | SHA256(rawData) |
| SHA512(bytesCol) | Returns the SHA-512 digest of a binary column (bytes type) as a hex string. | SHA512(rawData) |
| MD5(bytesCol) | Returns the MD5 digest of a binary column (bytes type) as a hex string. | MD5(rawData) |
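Assuming these correspond to the standard digest algorithms of the same names, the hex strings Pinot returns should match what Python's hashlib produces for the same bytes. An illustrative sketch (the rawData value is made up):

```python
import hashlib

raw = b"hello pinot"  # stand-in for the contents of a bytes column

print(hashlib.sha1(raw).hexdigest())    # analogue of SHA(rawData)
print(hashlib.sha256(raw).hexdigest())  # analogue of SHA256(rawData)
print(hashlib.sha512(raw).hexdigest())  # analogue of SHA512(rawData)
print(hashlib.md5(raw).hexdigest())     # analogue of MD5(rawData)
```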
| Function | Description | Example |
| --- | --- | --- |
| ARRAYLENGTH | Returns the length of a multi-value column. | |
| MAP_VALUE | Selects the value for a key from a map stored in Pinot. | MAP_VALUE(mapColumn, 'myKey', valueColumn) |
| VALUEIN | Takes at least two arguments: the first is a multi-value column, and the rest are constant values. The transform function filters the values of the multi-value column down to the given constants. VALUEIN is especially useful when the same multi-value column is used as both a filtering column and a grouping column. | VALUEIN(mvColumn, 3, 5, 15) |
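For intuition, the per-row filtering that VALUEIN performs can be sketched in Python (a simplified analogue with made-up data, not Pinot's implementation):

```python
def value_in(mv_cell, *constants):
    """Keep only the values of one multi-value cell that appear in constants."""
    allowed = set(constants)
    return [v for v in mv_cell if v in allowed]

# Analogue of VALUEIN(mvColumn, 3, 5, 15) applied to a single row
print(value_in([1, 3, 5, 7, 15], 3, 5, 15))  # [3, 5, 15]
```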
Learn how to query Pinot using SQL
Pinot uses the Calcite SQL parser to parse queries and uses the MYSQL_ANSI dialect. You can see the grammar here.
Pinot does not support joins or nested subqueries; we recommend using Presto for queries that span multiple tables. Read Engineering Full SQL support for Pinot at Uber for more info.
There is no DDL support. Tables can be created via the REST API.
In Pinot SQL:
Double quotes (") are used to force string identifiers, e.g. column names.
Single quotes (') are used to enclose string literals.
Misusing them might cause unexpected query results:
E.g. WHERE a='b' means the predicate on the column a equals the string literal value 'b', while WHERE a="b" means the predicate on the column a equals the value of the column b.
Use single quotes for literals and double quotes (optional) for identifiers (column names).
If you name a column timestamp, date, or another reserved keyword, or the column name includes special characters, you need to use double quotes when you refer to it in the query.
For performant filtering of ids in a list, see Filtering with IdSet.
Note: results might not be consistent if the column being ordered by has the same value in multiple rows.
To count rows where the column airlineName starts with U
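In plain-Python terms, that count is analogous to the following (the row values are hypothetical):

```python
# Hypothetical values of the airlineName column, one per row
airline_names = ["United", "Delta", "US Airways", "Lufthansa"]

# Count rows whose airlineName starts with "U"
count = sum(1 for name in airline_names if name.startswith("U"))
print(count)  # 2 ("United" and "US Airways")
```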
Pinot supports the CASE-WHEN-ELSE statement.
Example 1:
Example 2:
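For intuition, a CASE-WHEN-ELSE expression evaluates its WHEN branches in order and falls through to ELSE, analogous to a chained if/elif in Python (the fare thresholds and labels here are hypothetical):

```python
def fare_bucket(fare):
    """Analogue of a hypothetical CASE WHEN fare < 50 ... ELSE ... END expression."""
    if fare < 50:
        return "cheap"
    elif fare < 200:
        return "medium"
    else:
        return "expensive"

print([fare_bucket(f) for f in [10, 100, 500]])  # ['cheap', 'medium', 'expensive']
```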
Functions have to be implemented within Pinot; injecting functions is not yet supported. The example below demonstrates the use of UDFs. More examples in Transform Function in Aggregation Grouping.
Pinot supports queries on BYTES columns using HEX strings. The query response also uses hex strings to represent bytes values.
E.g. the query below fetches all the rows for a given UID.
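A sketch of preparing such a lookup from a client, assuming the bytes literal is sent as a hex string (the UID value and the table/column names are hypothetical):

```python
# Hypothetical UID value stored in a BYTES column
uid = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Pinot represents BYTES literals and results as hex strings
hex_literal = uid.hex().upper()
print(hex_literal)  # DEADBEEF

# Hypothetical table and column names
query = f"SELECT * FROM myTable WHERE uid = '{hex_literal}'"
print(query)
```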