Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
This section contains reference documentation for the ago function.
Return time as epoch millis before the given period in ISO-8601 duration format (PnDTnHnMn.nS with days considered to be exactly 24 hours).
Examples:
"PT20.345S" -- parses as "20.345 seconds"
"PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
"PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
"P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
"P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
"P-6H3M" -- parses as "-6 hours and +3 minutes"
"-P6H3M" -- parses as "-6 hours and -3 minutes"
"-P-6H+3M" -- parses as "+6 hours and -3 minutes"
ago()
1639150454255
This function is typically used in the predicate to filter on timestamps for recent data. e.g. filter data on recent 1 day.
This section contains reference documentation for the EXPR_MIN and EXPR_MAX function.
This function scans the given dataset to identify the maximum and minimum values in the specified measuring columns. Once these extreme values (the maxima and minima) are found, the function locates the corresponding entries in the projection column. These entries are associated with the rows where the extreme values were found in the measuring columns. The function then returns these projection column values, providing a way to link the extreme measurements with their corresponding data in another part of the dataset.
This function has to be used with the following configuration on the broker:
EXPR_MIN (projectionCol, measuringCol1, measuringCol2, measuringCol3)
EXPR_MAX (projectionCol, measuringCol1, measuringCol2, measuringCol3)
Find the user with maximum activity. If there are multiple users, break the tie with their last_activity_date. If still a tie, break with user_id. And project user_id.
More useful is that this multiple such aggregation function can be used with GROUP BY
Note:
In cases where multiple rows share the same extreme values in the measuring columns, all such rows will be returned by the function.
If the goal is to project multiple different columns that correspond to the same set of measuring columns, you can achieve this by invoking the function multiple times, each time specifying a different projection column.
This impl does not work with AS clause (e.g. SELECT exprmin(longCol, doubleCol) AS exprmin
won't work)
Putting exprmin/exprmax
column inside order by clause (e.g. SELECT intCol, exprmin(longCol, doubleCol) FROM table GROUP BY intCol ORDER BY exprmin(longCol, doubleCol)
) is not supported as semantically ordering multi-column multi-row exprmin/exprmax
results doesn't make sense
Currently projecting MV bytes column doesn't work for now due to an issue
For more detailed examples, see: https://github.com/apache/pinot/pull/10636
This page contains reference documentation for functions in Apache Pinot.
This page contains reference documentation for functions in Apache Pinot.
This section contains reference documentation for the arrayConcatDouble function.
This section contains reference documentation for the arrayConcatFloat function.
These examples are based on the .
26
37
63
This section contains reference documentation for the arrayConcatInt function.
Concatenates two arrays of ints.
arrayConcatInt('colName1', 'colName2')
These examples are based on the Hybrid Quick Start.
1453,1731
1453,1731,1415,1623
1908,1758
1908,1758,1339,2310
1453,1731
1453,1731,1415,1623
1908,1758
1908,1758,1339,2310
This section contains reference documentation for the arrayConcatLong function.
These examples are based on the .
These examples are based on the .
These examples are based on the .
These examples are based on the .
N7713A,N7713A
N7713A,N7713A,N7713A,N7713A
N344AA,N344AA
N344AA,N344AA,N344AA,N344AA
N344AA,N344AA
N344AA,N344AA,N344AA,N344AA
N7713A,N7713A
N7713A,N7713A,N7713A,N7713A
15016,11066
15016,11066
10620,14869
10620,14869
13891,12892
13891,12892
12264,10397
12264,10397
11066,12892
11066,12892
N7713A,N7713A
true
N344AA,N344AA
false
N7713A,N7713A
true
N7713A,N7713A
0
N344AA,N344AA
-1
N7713A,N7713A
0
This section contains reference documentation for the arrayDistinctString function.
Returns unique values in an array of strings.
arrayDistinctString('colName')
These examples are based on the Hybrid Quick Start.
N7713A,N7713A
N7713A
N344AA,N344AA
N344AA
N344AA,N344AA
N344AA
N7713A,N7713A
N7713A
This section contains reference documentation for the ARRAYLENGTH function.
Returns the length of a multi-value column
ARRAYLENGTH('colName')
These examples are based on the Hybrid Quick Start.
1
5382
37
267
33
223
17
166
22
160
The count(*)
values will increase each time we execute the query as data is constantly being ingested by the Hybrid Quick Start.
This section contains reference documentation for the arrayRemoveString function.
Removes value from array of strings.
arrayRemoveString('colName', value)
These examples are based on the Hybrid Quick Start.
SEA,PSC
PSC
SEA,PSC,PHX,MSY
PSC,PHX,MSY
SEA,PSC,PHX,MSY
PSC,PHX,MSY
SEA,PSC
PSC
SEA,PSC
PSC
These examples are based on the .
These examples are based on the .
13891,12892
13891
13198,12892
13198
11066,12892
11066
13198,12892
13198
13891,12892
13891
13891,12892
12892,13891
14683,14683
14683,14683
12339,12339
12339,12339
13487,13930
13930,13487
13029,11292
11292,13029
This section contains reference documentation for the arrayIndexOfInt function.
Finds the last index of the given value in the array starting at the given index.
arrayIndexOfInt('colName', valueToFind)
These examples are based on the Hybrid Quick Start.
13891,12892
-1
14683,14683
0
12339,12339
-1
13487,13930
-1
13029,11292
-1
This section contains reference documentation for the arraySliceInt function.
Returns the values in the array between the start and end positions.
arraySliceInt('colName', start, end)
These examples are based on the Hybrid Quick Start.
1531
13891
13891,12892
19
14683
14683,14683
829
12339
12339,12339
24
13198
13198,10721
548
10721
10721,12478
This section contains reference documentation for the arrayUnionInt function.
Create a union of two arrays of ints.
arrayUnionInt('colName1', 'colName2')
These examples are based on the Hybrid Quick Start.
1453,1731
1415,1623
1453,1731,1415,1623
1908,1758
1339,2310
1908,1758,1339,2310
1453,1731
1415,1623
1453,1731,1415,1623
1908,1758
1339,2310
1908,1758,1339,2310
This section contains reference documentation for the arrayContainsInt function.
Checks if int value exists in array.
arrayContainsInt('colName', valueToFind)
These examples are based on the Hybrid Quick Start.
13891,12892
false
14683,14683
true
12339,12339
false
13487,13930
false
13029,11292
false
This section contains reference documentation for the arraySortString function.
Sorts array of strings.
arraySortString('colName')
These examples are based on the Hybrid Quick Start.
3846
PSC,SEA
SEA,PSC
3635
MSY,PHX,PSC,SEA
SEA,PSC,PHX,MSY
429
MSY,PHX,PSC,SEA
SEA,PSC,PHX,MSY
1206
PSC,SEA
SEA,PSC
5300
PSC,SEA
SEA,PSC
This section contains reference documentation for the arrayUnionString function.
Create a union of two arrays of strings.
arrayUnionString('colName1', 'colName2')
These examples are based on the Hybrid Quick Start.
N7713A,N7713A
IND,IND
N7713A,IND
N344AA,N344AA
MCI,BOS
N344AA,MCI,BOS
N7713A,N7713A
IND,IND
N7713A,IND
N344AA,N344AA
MCI,BOS
N344AA,MCI,BOS
This section contains reference documentation for the arrayReverseString function.
Reverses array of strings.
arrayReverseString('colName')
These examples are based on the Hybrid Quick Start.
1206
PSC,SEA
SEA,PSC
5300
PSC,SEA
SEA,PSC
3359
MSY,PHX,PSC,SEA
SEA,PSC,PHX,MSY
1023
PHX,PSC,SEA
SEA,PSC,PHX
963
MSY,PHX,PSC,SEA
SEA,PSC,PHX,MSY
This section contains reference documentation for base64 encode and decode functions.
Encoding scheme follows java.util.Base64.Encoder
toBase64
returns Base64 encoded string of input binary data (bytes
type).
fromBase64
returns binary data (represented as a Hex string) from Base64-encoded string.
toBase64(bytesCol)
fromBase64(stringCol)
aGVsbG8h
hello!
Note that without UTF8 string conversion, returned BYTES will be represented as a Hex string following Pinot's BYTES column representation. See the example below.
68656c6c6f21
Note that the following query will throw compilation error as string is not a valid input type for toBase64
.
This section contains reference documentation for the arraySliceString function.
Returns the values in the array between the start and end positions.
arraySliceString('colName', start, end)
These examples are based on the Hybrid Quick Start.
671
SEA,PSC
SEA,PSC,PHX,MSY
1767
SEA,PSC
SEA,PSC,PHX
2522
SEA,PSC
SEA,PSC
424
SEA,PSC
SEA,PSC,PHX,MSY
3162
SEA,PSC
SEA,PSC,PHX,MSY
This section contains reference documentation for the arraySortInt function.
Sorts array of ints.
arraySortInt('colName')
These examples are based on the Hybrid Quick Start.
13891,12892
12892,13891
14683,14683
14683,14683
12339,12339
12339,12339
13198,10721
10721,13198
10721,12478
10721,12478
This section contains reference documentation for the AVGMV function.
Get the avg of values in a group
AVGMV(colName)
These examples are based on the Hybrid Quick Start.
18.465753424657535
This section contains reference documentation for the count function.
Get the count of rows in a group
COUNT(colName)
These examples are based on the Batch Quick Start.
97889
This section contains reference documentation for the COUNTMV function.
Get the count of rows in a group
COUNTMV(colName)
These examples are based on the Hybrid Quick Start.
The following query returns the documents that have a DivTailNums
with more than one value:
N7713A,N7713A
N344AA,N344AA
N344AA,N344AA
N7713A,N7713A
You can count the number of items in these rows by running the following query:
8
This section contains reference documentation for the COVAR_SAMP function.
Returns the sample covariance between of 2 numerical columns.
COVAR_SAMP(col1, col2) -> double
These examples are based on the Batch Quick Start.
8270.973200974102
This section contains reference documentation for the caseWhen function.
Returns values depending on boolean expressions. This function can only be used in an ingestion transformation function.
caseWhen(booleanExpr1, valueIfExpr1True, booleanExpr2, valueIfExpr2True) caseWhen(booleanExpr1, valueIfExpr1True, booleanExpr2, valueIfExpr2True, ... ,valueIfFalse)
booleanExpr1
A boolean expression
valueIfExpr1True
, valueIfExpr2True
A value to return.
The usage examples are based on extracting fields from the following JSON documents:
CASEWHEN(latitude > 0, 'North', 'South')
North
CASEWHEN(latitude > 0, 1, 0)
1
This function can be used in the table config to add northernHemisphere
column:
13
-12
This section contains reference documentation for the day function.
This section contains reference documentation for the dayOfYear function.
This section contains reference documentation for the COVAR_POP function.
Returns the population covariance between of 2 numerical columns.
COVAR_POP(col1, col2) -> double
These examples are based on the Batch Quick Start.
2314.249154477403
This section contains reference documentation for the dayOfWeek function.
This section contains reference documentation for the DISTINCTCOUNTBITMAP function.
Returns the count of distinct row values in a group. This function is accurate for INT column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. For accurate distinct counting on all column types, see DISTINCTCOUNT.
DISTINCTCOUNTBITMAP(colName)
These examples are based on the Batch Quick Start.
7
148
This section contains reference documentation for the DISTINCTCOUNT function.
Returns the count of distinct row values in a group
DISTINCTCOUNT(colName)
These examples are based on the Batch Quick Start.
7
149
This section contains reference documentation for the DISTINCTCOUNTBITMAPMV function.
Returns the count of distinct row values in a group. This function is accurate for an INT or dictionary encoded column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collision.
DISTINCTCOUNTBITMAPMV(colName)
These examples are based on the Hybrid Quick Start.
34
2
This section contains reference documentation for the DISTINCTAVG function.
Returns the average of distinct row values in a group
DISTINCTAVG(colName) or avg(distinct col)
These examples are based on the Batch Quick Start.
83.36526946107784
349.1158798283262
This section contains reference documentation for the DISTINCTCOUNTHLL function.
Returns an approximate distinct count using HyperLogLog. It also takes an optional second argument to configure the log2m for the HyperLogLog. For accurate distinct counting, see DISTINCTCOUNT.
DISTINCTCOUNTHLL(colName, log2m)
These examples are based on the Batch Quick Start.
158
149
This section contains reference documentation for the DISTINCTAVGMV function.
Returns the average of distinct row values in a group
DISTINCTAVGMV(colName)
These examples are based on the Hybrid Quick Start.
32.4
This section contains reference documentation for the DISTINCT function.
Returns the distinct row values in a group
DISTINCT(colName)
These examples are based on the Batch Quick Start.
NL
UA
AL
NA
PL
AA
FL
NL
UA
AL
NA
PL
AA
FL
This section contains reference documentation for the DISTINCTCOUNTMV function.
Returns the count of distinct row values in a group
DISTINCTCOUNTMV(colName)
These examples are based on the Hybrid Quick Start.
The following query returns the documents that have a DivTailNums
with more than one value:
N7713A,N7713A
N344AA,N344AA
N344AA,N344AA
N7713A,N7713A
You can count the distinct number of items in these rows by running the following query:
2
This section contains reference documentation for the DISTINCTCOUNTRAWHLLMV function.
Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLLMV(colName, log2m)
These examples are based on the Hybrid Quick Start.
00000008000000ac00000000000000000000000500000020000000000030000202108000040000010000000300010400000000000000000000000463000000000000000000010001041000200000002000000000000000000a00000000028001000000010800000000010000001008000000804000000000020000040000880000000000000000000000000000000000000000000000800000000800020004000000840000000002000000000000000000001400
0000000100000004000000e4
This section contains reference documentation for the DISTINCTCOUNTRAWTHETASKETCH function.
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch Class and its extensions from the library org.apache.datasketches:datasketches-java:4.2.0
to perform distinct counting as well as evaluating set operations.
distinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded
thetaSketchColumn
(required): Name of the column to aggregate on.
thetaSketchParams
(required): Semicolon-separated parameter string for constructing the intermediate theta-sketches.
The supported parameters are:
nominalEntries
: The nominal entries used to create the sketch. (Default 4096)
samplingProbability
: Sets the upfront uniform sampling probability, p. (Default 1.0)
accumulatorThreshold
: How many sketches should be kept in memory before merging. (Default 2)
Currently, the only supported parameter is nominalEntries
(defaults to 4096).
predicates
(optional)_: _ These are individual predicates of form lhs <op> rhs
which are applied on rows selected by the where
clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn
that satisfies these predicates are unionized individually. For example, all filtered rows that match country=USA
are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) of individual predicates is supported.
postAggregationExpressionToEvaluate
(required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT
, where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.
These examples are based on the Batch Quick Start.
AgMDAAAKzJOVAAAAAACAPwDAATj...
AwMDAAAKzJMQAAAAAACAP4vpfPBbbQsO5N1zYV2c...
We can also provide predicates and a post aggregation expression to compute more complicated cardinalities:
AQMDAAA6zJN8QPYIsvHMNQ==
This section contains reference documentation for the DISTINCTCOUNTHLLMV function.
Returns an approximate distinct count using HyperLogLog in a group.
DISTINCTCOUNTHLLMV(colName)
These examples are based on the Hybrid Quick Start.
34
This section contains reference documentation for the DISTINCTCOUNTRAWHLL function.
Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.
DISTINCTCOUNTRAWHLL(colName, log2m)
These examples are based on the Batch Quick Start.
00000008000000ac00000800000084000210000000000020001020220030042002100420002010020210000300008020040180400001300310001863024004220870800004400421040104610220080000020000040000030000800002108420000110400800000106000060000080020000082000218c0002000000020000010200100000018c0006000400022004a0000088000200800000320820021000000221842000000000025088000220080100009420
000000010000000400000106
These examples are based on the .
13922
244032
This section contains reference documentation for the DISTINCTCOUNTTHETASKETCH function.
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch Class and its extensions from the library org.apache.datasketches:datasketches-java:4.2.0
to perform distinct counting as well as evaluating set operations.
distinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> Long
thetaSketchColumn
(required): Name of the column to aggregate on.
thetaSketchParams
(required): Semicolon-separated parameter string for constructing the intermediate theta-sketches.
The supported parameters are:
nominalEntries
: The nominal entries used to create the sketch. (Default 4096)
samplingProbability
: Sets the upfront uniform sampling probability, p. (Default 1.0)
accumulatorThreshold
: How many sketches should be kept in memory before merging. (Default 2)
predicates
(optional)_: _ These are individual predicates of form lhs <op> rhs
which are applied on rows selected by the where
clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn
that satisfies these predicates are unionized individually. For example, all filtered rows that match country=USA
are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) of individual predicates is supported.
postAggregationExpressionToEvaluate
(required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT
, where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.
These examples are based on the Batch Quick Start.
149
146
We can also provide predicates and a post aggregation expression to compute more complicated cardinalities. For example, we could can find the intersection of the following queries:
1986
1985
1937
2003
1979
1900
1986
1978
2012
(the yearId 1986
is the only one in common)
By running the following query:
1
This section contains reference documentation for the DIV function.
Quotient of two values
DIV(col1, col2)
These examples are based on the Batch Quick Start.
26
153
0.16993464052287582
This section contains reference documentation for the DATETIMECONVERT function.
Converts the value from a column that contains an epoch timestamp into another time unit and buckets based on the given time granularity.
DATETIMECONVERT(columnName, inputFormat, outputFormat, outputGranularity)
inputFormat
and outputFormat
are defined using the following structure:
<time size>:<time unit>:<time format>:<pattern>
where:
time size
- size of the time unit eg: 1, 10
time unit
- DAYS
, HOURS
, MINUTES
, SECONDS
, MILLISECONDS
, MICROSECONDS
, NANOSECONDS
time format
EPOCH
SIMPLE_DATE_FORMAT
pattern - defined in case of SIMPLE_DATE_FORMAT
e.g. yyyy-MM-dd
. A specific timezone can be passed using tz(timezone)
. Timezone can be long or short string format timezone. e.g. Asia/Kolkata
or PDT
granularity
is specified in the format <time size>:<time unit>
.
These examples are based on the Batch JSON Quick Start.
created_at_timestamp
from milliseconds since epoch to days since epoch, bucketed to 1 day granularity:
7044874134
2018-01-01 11:00:00.0
1514804402000
17532
created_at_timestamp
bucketed to 15 minutes granularity:
7044874134
2018-01-01 11:00:00.0
1514804402000
1514804400000
created_at_timestamp
to format yyyy-MM-dd
, bucketed to 1 days granularity:
7044874134
2018-01-01 11:00:00.0
1514804402000
2018-01-01
created_at_timestamp
to format yyyy-MM-dd HH:mm
, in timezone Pacific/Kiritimati
:
7044874134
2018-01-01 11:00:00.0
1514804402000
2018-01-02 01:00
created_at_timestamp
to format yyyy-MM-dd
, in timezone Pacific/Kiritimati
and bucketed to 1 day granularity:
7044874134
2018-01-01 11:00:00.0
1514804402000
2018-01-02 00:00
This section contains reference documentation for the DATETRUNC function.
(Presto) SQL compatible date truncation, equivalent to the Presto function date_trunc
.
Converts the value into a specified output granularity seconds since UTC epoch that is bucketed on a unit in a specified timezone.
DATETRUNC(unit, timeValue)
DATETRUNC(unit, timeValue, inputTimeUnitStr)
DATETRUNC(unit, timeValue, inputTimeUnitStr, timeZone)
DATETRUNC(unit, timeValue, inputTimeUnitStr, timeZone, outputTimeUnitStr)
unit
supports the following values:
millisecond
second
minute
hour
day
week
month
quarter
year
inputTimeUnitStr
and outputTimeUnitStr
support the following values:
NANOSECONDS
MICROSECONDS
MILLISECONDS
SECONDS
MINUTES
HOURS
DAYS
Truncates an epoch in milliseconds at WEEK
(where a Week starts at Monday UTC midnight):
or
Truncates an epoch in milliseconds at WEEK
(where a Week starts at Monday UTC midnight) in the UTC
time zone, returning a result in epoch in seconds in UTC timezone:
Truncates an epoch in milliseconds at WEEK
(where a Week starts at Monday UTC midnight) in the CET
time zone, returning a result in epoch in seconds in UTC timezone:
Truncates an epoch in milliseconds at QUARTER
in the Los Angeles time zone (where a Quarter begins on Jan 1st, April 1st, July 1st, October 1st in Los Angeles timezone), returning a result in hours since UTC epoch:
These examples are based on the .
1639353600000
1639353600
1639350000
453631
1134
This section contains reference documentation for the FREQUENTSTRINGSSKETCH function.
FREQUENTSTRINGSSKETCH
is an estimation data-sketch function which can be used to estimate the frequency of an item. It is based on Apache Datasketches library and returns a serialized sketch object which can be merged with other sketches.
FREQUENTSTRINGSSKETCH(column, maxMapSize=256) -> Base64 encoded sketch object
column
(required): Name of the column to aggregate on. Needs to be a type which can be cast into 'STRING'.
maxMapSize
: This value specifies the maximum physical length of the internal hash map. The maxMapSize must be a power of 2 and the default value is 256.
BAEKCAUAAAAOAA...
Which can be used, for example in Java as:
For more examples on the sketch API, refer to the Datasketches documentation.
This section contains reference documentation for the firstwithtime function.
Returns the value of dataColumn
with the smallest timeColumn
value where:
timeColumn
is used to define the time of dataColumn
, which can be of type TIMESTAMP
, INT
, LONG
dataType
specifies the type for dataColumn
, which can be BOOLEAN
, INT
, LONG
, FLOAT
, DOUBLE
, STRING
FIRSTWITHTIME(dataColumn, timeColumn, 'dataType')
This example is based on the .
Converts a formatted date-time string to milliseconds, based on the provided .
group_name1016303453
1565136000000
1565190733000
1565190733000
1565190733000
This section contains reference documentation for the FREQUENTLONGSSKETCH function.
FREQUENTLONGSSKETCH
is an estimation data-sketch function which can be used to estimate the frequency of an item. It is based on Apache Datasketches library and returns a serialized sketch object which can be merged with other sketches.
FREQUENTLONGSSKETCH(column, maxMapSize=256) -> Base64 encoded sketch object
column
(required): Name of the column to aggregate on. Needs to be a type which can be cast into 'LONG'.
maxMapSize
: This value specifies the maximum physical length of the internal hash map. The maxMapSize must be a power of 2 and the default value is 256.
BAEKCAUAAAAOAAAAAA...
Which can be used, for example in Java as:
For more examples on the sketch API, refer to the Datasketches documentation.
The FunnelMaxStep
function in Pinot is designed to track user progress through a predefined series of steps or stages in a funnel, such as user interactions on a website from page views to purchases. This function is particularly useful for analyzing how far users progress through a conversion process within a specified time window.
This function returns the Integer value of the max steps that window funnel could proceed forward.
timestampExpression
:
Type: Expression in TIMESTAMP
or LONG
Description: This is an expression that evaluates to the timestamp of each event. It's used to determine the order of events for a particular user or session. The timestamp is crucial for evaluating whether subsequent actions fall within the specified window.
windowSize
:
Type: LONG
Description: Specifies the size of the time window in which the sequence of funnel steps must occur. The window is defined in milliseconds. This parameter sets the maximum allowed time between the first and the last step in the funnel for them to be considered as part of the same user journey.
numberSteps
:
Type: Integer
Description: Defines the total number of distinct steps in the funnel. This count should match the number of stepExpression
parameters provided.
stepExpression
:
Type: Boolean Expression
Description: These are expressions that define each step in the funnel. Typically, these are conditions that evaluate whether a specific event type or action has occurred. Multiple step expressions are separated by commas, with each expression corresponding to a step in the funnel sequence.
mode
(optional):
Type: String
Description: Defines additional modes or options that alter how the funnel analysis is calculated. Common modes might include settings to handle overlapping events, reset the window upon each step, or other custom behaviors specific to the needs of the funnel analysis. If unspecified, the default behavior as defined by Pinot is used.
The STRICT_DEDUPLICATION
mode ensures that repeating occurrences of the same event condition within a funnel sequence disrupt further processing of the funnel for that user session. This mode is crucial when it's important to identify and measure unique, non-repeated actions in a sequence, ensuring each step of the funnel represents a distinct action.
Event Sequence Interruption: When an event that satisfies a current step condition occurs repeatedly without progression to the next step, strict_deduplication
interrupts and essentially ends the analysis of the funnel for that sequence. This prevents the funnel from incorrectly advancing if the same action is merely repeated instead of moving through the intended steps.
Enhanced Accuracy in Funnel Progression: This mode is useful for scenarios where the continuity and progression of distinct steps are critical for accurate conversion analysis. It avoids the misinterpretation of user engagement where repeated similar actions might otherwise suggest a false progression through the funnel.
For instance, if a funnel is designed to track user progression from a homepage visit, to a search, to adding an item to a cart, and then to checkout, the strict_deduplication
mode would stop processing the funnel sequence if the user performs multiple searches without proceeding to add an item to the cart. This ensures that only a linear, non-repetitive progression through these steps is considered as valid funnel movement.
This mode helps maintain the integrity of each step in the user's journey, ensuring that the data reflects true user behavior without overcounting repetitive actions that do not lead to actual progression.
The strict_order
mode enforces a stringent sequence order for events within a funnel. This mode ensures that the progression through the steps follows the exact specified order without any intervening events that are not part of the defined sequence.
strict_order
Sequence Adherence: The strict_order
mode requires that the events occur in the exact order specified without any other types of events intervening. If an event occurs that is not the next expected step in the defined sequence, the analysis of the funnel for that user session is halted.
Early Termination: In the presence of an out-of-sequence event, the analysis stops, and the maximum event level is determined as the last correct step in the sequence before the interruption. For instance, in a specified sequence of A -> B -> C, if the sequence is A -> B -> D, then the funnel analysis terminates after B because D is not the expected next step (C).
Enhanced Precision in Path Analysis: This mode is particularly valuable when the precise order of actions is critical for the analysis, such as in strict process flows where each step must be followed in a specific order to be considered successful.
Avoids Misinterpretation: It prevents the misinterpretation of funnel progress where intervening or unordered events could suggest a misleading path through the funnel.
Consider a scenario where a funnel is set up to track user progression through the following steps: logging in (A), searching for products (B), adding a product to the cart (C), and completing a purchase (D). Using the strict_order
mode, if the sequence goes A -> B -> E -> C, the analysis will terminate after B because E (an unexpected event like viewing account details) intervenes before C, the expected next step. Therefore, the maximum step reached is reported as 2, representing the successful completion of steps A and B only.
This mode is crucial for scenarios requiring strict compliance to process steps, ensuring that only users who follow the exact intended sequence are counted in the funnel analysis.
The strict_increase
is designed to ensure that the sequence of events being analyzed has strictly increasing timestamps. This mode is crucial for accurately tracking and analyzing user behavior in scenarios where the chronological order of events directly impacts the interpretation of user actions within a funnel.
strict_increase
Timestamp Order: This mode requires that each subsequent event in the funnel must have a timestamp greater than the previous event. It ensures that the user's actions are not only in the correct sequence but also follow a temporal progression without any backtracking or simultaneous actions.
Analysis Integrity: If any event in the sequence does not adhere to the strictly increasing order by timestamp, the analysis for that sequence either stops at that point or ignores the out-of-order event, depending on how critical the temporal sequence is to the funnel's logic.
Temporal Validation: This mode is particularly useful in scenarios where the timing of events is crucial, such as in sessions where actions must follow one another in real-time to be considered valid. It validates the sequence not just by the type of event, but also by ensuring that these events are progressively happening over time.
Avoiding Data Errors: It helps in avoiding potential data errors or anomalies where timestamps might not have been recorded correctly, or events may appear out of order due to system errors or delays in logging events.
Consider a funnel designed to analyze a user's journey from visiting a website to making a purchase, defined by the following steps: page visit (A), item addition (B), checkout initiation (C), and payment completion (D). Using the strict_increase
mode, the funnel will only consider sequences where each action occurs later than the previous. If a user's sequence is A (t1) -> B (t2) -> A (t3) -> C (t4) with t3 being less than or equal to t2, then the analysis will ignore the second occurrence of A or terminate, depending on the specific implementation and requirements of the analysis.
This mode helps ensure that the funnel analysis reflects true, linear progress through the intended actions, with each step occurring in a timely, sequential manner.
The KEEP_ALL
mode is designed to ensure that all events in the data set are considered in the analysis, even if they do not match any of the specified step conditions in the funnel sequence. This mode is particularly useful for comprehensive data analysis where the context of non-matching events may still provide valuable insights about user behavior or system performance.
KEEP_ALL
Inclusive Analysis: In the KEEP_ALL
mode, the funnel function includes every event within the specified time window in the analysis, regardless of whether these events correspond to the predefined steps in the funnel. This allows for a more holistic view of the user's actions during the session.
Context Retention: By including all events, this mode helps retain the full context of a user's session, capturing activities that may not be directly related to the funnel but could influence or explain the user's behavior and decisions at other points.
Enhanced Insight: This mode is invaluable for scenarios where understanding the entirety of user interactions is crucial, such as in complex user journeys where additional actions between the main funnel steps might influence the outcomes or indicate other patterns of interest.
Data Completeness: It prevents data loss from filtering out non-matching events, which can be important when analyzing sessions for comprehensive patterns, troubleshooting issues, or performing detailed user journey analysis.
Consider a scenario where a funnel is set up to track user progress through steps like logging in, searching for a product, and making a purchase. With KEEP_ALL
mode enabled, if a user performs additional actions such as updating profile information or viewing terms and conditions, these events are also included in the analysis. This comprehensive inclusion allows analysts to see a fuller picture of what the user did during their session, not just the actions that directly relate to the funnel. This can reveal if other activities are detracting from the main conversion goals, or if they are part of a broader user engagement that doesn't neatly fit into the primary funnel steps.
This mode helps to ensure that no potential insights are lost by excluding events, making it a powerful option for detailed analysis and understanding of user interactions beyond the strict confines of the predefined funnel steps.
screen_viewed
1718112402
1
screen_clicked
1718112403
1
purchased
1718112404
1
screen_viewed
1718112405
1
screen_clicked
1718112406
1
purchased
1718112407
1
screen_viewed
1718112405
2
screen_clicked
1718112406
2
purchased
1718112407
2
screen_viewed
1718112404
3
screen_clicked
1718112405
3
cart_viewed
1718112406
3
purchased
1718112407
3
screen_viewed
1717939609
4
screen_clicked
1718112405
4
purchased
1718112405
4
Response
user_id
steps
1
2
2
2
3
4
4
2
Response
user_id
steps
1
3
2
3
3
3
4
1
Response
user_id
steps
1
3
2
3
3
2
4
1
Response
user_id
steps
1
3
2
3
3
3
4
3
This section contains reference documentation for the fromEpochBucket functions.
Convert epoch to epoch milliseconds. e.g. 10 seconds since epoch or 5 minutes since Epoch. The following time units are supported:
SECONDS
MINUTES
HOURS
DAYS
FromEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize)
1613472303000
3226944606000
1613472000000
1613466000000
1613088000000