Functions

This page contains reference documentation for functions in Apache Pinot.

ADD

This section contains reference documentation for the ADD function.

Returns the sum of at least two values.

Signature

ADD(col1, col2, col3...)

Usage Examples

These examples are based on the Batch Quick Start.

select homeRuns, baseOnBalls, ADD(homeRuns, baseOnBalls) AS total
from baseballStats 
WHERE teamID = 'ML1' 
AND yearID = 1956 
AND playerName = 'Henry Louis'

ago

This section contains reference documentation for the ago function.

Returns the time, in epoch milliseconds, that lies the given period before the current time. The period is given in ISO-8601 duration format.

Examples:

"PT20.345S" -- parses as "20.345 seconds"
"PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
"PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
"P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
"P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
"PT-6H3M" -- parses as "-6 hours and +3 minutes"
"-PT6H3M" -- parses as "-6 hours and -3 minutes"
"-PT-6H+3M" -- parses as "+6 hours and -3 minutes"

Signature

ago(periodString)

Usage Examples

select ago('P1D') AS oneDayAgo
FROM ignoreMe

This function is typically used in a predicate to filter on timestamps for recent data, for example to keep only rows from the last day:

SELECT * 
FROM tableName
WHERE tsInMillis > ago('P1D')
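
As a rough illustration of the arithmetic, the Python sketch below parses a simplified subset of the duration strings listed above (unsigned day, hour, minute, and second components only) and subtracts the result from a timestamp. The helper names duration_to_millis and ago, and the now_ms parameter, are hypothetical and exist only for this sketch; it is not Pinot's implementation.

```python
import re
import time
from decimal import Decimal
from typing import Optional

# Simplified pattern: unsigned components only, e.g. "P2DT3H4M" or "PT20.345S".
# Signed forms such as "-PT6H3M" are deliberately not handled in this sketch.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def duration_to_millis(period: str) -> int:
    """Convert a simple ISO-8601 duration string to milliseconds."""
    match = _DURATION_RE.match(period)
    parts = match.groupdict() if match else {}
    # Reject strings that do not match or that carry no component at all ("P", "PT").
    if not any(value is not None for value in parts.values()):
        raise ValueError(f"unsupported duration: {period!r}")
    days = int(parts["days"] or 0)
    hours = int(parts["hours"] or 0)
    minutes = int(parts["minutes"] or 0)
    # Decimal avoids floating-point rounding on fractional seconds.
    seconds = Decimal(parts["seconds"] or "0")
    return (
        days * 86_400_000
        + hours * 3_600_000
        + minutes * 60_000
        + int(seconds * 1000)
    )


def ago(period: str, now_ms: Optional[int] = None) -> int:
    """Return epoch millis lying `period` before `now_ms` (defaults to the current time)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - duration_to_millis(period)
```

With a fixed now_ms, ago("P1D", now_ms=...) returns a value exactly 86,400,000 milliseconds earlier, which is the threshold the predicate above compares tsInMillis against.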

ARG_MIN / ARG_MAX

This section contains reference documentation for the ARG_MIN and ARG_MAX functions.

These functions scan the given dataset to find the maximum (ARG_MAX) or minimum (ARG_MIN) values in the specified measuring columns. They then locate the rows where those extreme values occur and return the corresponding values from the projection column, linking the extreme measurements to related data in another column.

Signature

ARG_MIN (measuringCol1, measuringCol2, measuringCol3, projectionCol)
ARG_MAX (measuringCol1, measuringCol2, measuringCol3, projectionCol)

Usage Examples

Find the user with the maximum activity. If several users tie, break the tie with their last_activity_date; if there is still a tie, break it with user_id. Project user_id.

SELECT ARG_MAX(activity, last_activity_date, user_id, user_id)
FROM userEngagmentTable

More usefully, several such aggregation functions can be combined with GROUP BY:

SELECT user_region, ARG_MAX(activity, last_activity_date, user_id, user_id),
    ARG_MIN(user_satisfaction, user_id)
FROM userEngagmentTable
GROUP BY user_region

Note:

In cases where multiple rows share the same extreme values in the measuring columns, all such rows will be returned by the function.
If the goal is to project multiple different columns that correspond to the same set of measuring columns, you can achieve this by invoking the function multiple times, each time specifying a different projection column.
This implementation does not work with an AS clause (e.g. SELECT argmin(longCol, doubleCol) AS argmin won't work).
Putting an argmin/argmax column inside an ORDER BY clause (e.g. SELECT intCol, argmin(longCol, doubleCol) FROM table GROUP BY intCol ORDER BY argmin(longCol, doubleCol)) is not supported, because ordering multi-column, multi-row argmin/argmax results is not semantically meaningful.
Projecting an MV bytes column currently doesn't work due to an issue.

For more detailed examples, see: https://github.com/apache/pinot/pull/10636
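
The tie-breaking behaviour described in this section can be sketched in Python as a lexicographic comparison over the measuring columns. This is an illustration under assumptions, not Pinot's implementation: the helpers arg_max and arg_min and the sample rows are hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def _measure(row: Row, measuring_cols: Sequence[str]) -> Tuple[Any, ...]:
    # Comparing tuples gives the tie-breaking order: first column, then second, ...
    return tuple(row[col] for col in measuring_cols)


def arg_max(rows: List[Row], measuring_cols: Sequence[str], projection_col: str) -> List[Any]:
    """Project `projection_col` from every row holding the maximum measuring tuple."""
    if not rows:
        return []
    best = max(_measure(row, measuring_cols) for row in rows)
    return [row[projection_col] for row in rows if _measure(row, measuring_cols) == best]


def arg_min(rows: List[Row], measuring_cols: Sequence[str], projection_col: str) -> List[Any]:
    """Project `projection_col` from every row holding the minimum measuring tuple."""
    if not rows:
        return []
    best = min(_measure(row, measuring_cols) for row in rows)
    return [row[projection_col] for row in rows if _measure(row, measuring_cols) == best]


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"user_id": 1, "activity": 10, "last_activity_date": "2023-05-01"},
    {"user_id": 2, "activity": 10, "last_activity_date": "2023-06-01"},
    {"user_id": 3, "activity": 7, "last_activity_date": "2023-07-01"},
]

top_user = arg_max(sample_rows, ["activity", "last_activity_date", "user_id"], "user_id")
tied_users = arg_max(sample_rows, ["activity"], "user_id")
```

Here top_user contains only user 2, because the tie on activity is broken by last_activity_date, while tied_users contains users 1 and 2, since with a single measuring column every row sharing the extreme value is returned.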

arrayConcatDouble

This section contains reference documentation for the arrayConcatDouble function.

Concatenates two arrays of doubles.

Signature

arrayConcatDouble('colName1', 'colName2')

Usage Examples

This example assumes the multiValueTable columns mvCol1 and mvCol2 are both of type DOUBLE with singleValueField in the table schema set to false.

select mvCol1, 
       arrayConcatDouble(mvCol1, mvCol2) AS concatDoubles
from multiValueTable
WHERE arraylength(mvCol1) >= 2
limit 5

arrayConcatFloat

This section contains reference documentation for the arrayConcatFloat function.

Concatenates two arrays of floats.

Signature

arrayConcatFloat('colName1', 'colName2')

Usage Examples

This example assumes the multiValueTable columns mvCol1 and mvCol2 are both of type FLOAT with singleValueField in the table schema set to false.

select mvCol1, 
       arrayConcatFloat(mvCol1, mvCol2) AS concatFloats
from multiValueTable
WHERE arraylength(mvCol1) >= 2
limit 5

arrayConcatInt

This section contains reference documentation for the arrayConcatInt function.

Concatenates two arrays of ints.

Signature

arrayConcatInt('colName1', 'colName2')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivWheelsOffs, 
       arrayConcatInt(DivWheelsOffs, DivWheelsOns) AS concatIds
from airlineStats 
WHERE arraylength(DivWheelsOffs) >= 2
limit 5

arrayConcatLong

This section contains reference documentation for the arrayConcatLong function.

Concatenates two arrays of longs.

Signature

arrayConcatLong('colName1', 'colName2')

Usage Examples

This example assumes the multiValueTable columns mvCol1 and mvCol2 are both of type LONG with singleValueField in the table schema set to false.

select mvCol1, 
       arrayConcatLong(mvCol1, mvCol2) AS concatLongs
from multiValueTable
WHERE arraylength(mvCol1) >= 2
limit 5

arrayContainsInt

This section contains reference documentation for the arrayContainsInt function.

Checks if int value exists in array.

Signature

arrayContainsInt('colName', valueToFind)

Usage Examples

arrayContainsString

This section contains reference documentation for the arrayContainsString function.

Checks if string value exists in array.

Signature

arrayContainsString('colName', valueToFind)

Usage Examples

arrayDistinctInt

This section contains reference documentation for the arrayDistinctInt function.

Returns unique values in an array of ints.

Signature

arrayDistinctInt('colName')

Usage Examples

arrayDistinctString

This section contains reference documentation for the arrayDistinctString function.

Returns unique values in an array of strings.

Signature

arrayDistinctString('colName')

Usage Examples

arrayIndexOfInt

This section contains reference documentation for the arrayIndexOfInt function.

Finds the last index of the given value in the array starting at the given index.

Signature

arrayIndexOfInt('colName', valueToFind)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivAirportIDs, 
       arrayIndexOfInt(DivAirportIDs, 14683) AS index
from airlineStats 
WHERE arraylength(DivAirportIDs) >= 2
limit 5

arrayIndexOfString

This section contains reference documentation for the arrayIndexOfString function.

Finds the last index of the given value in the array starting at the given index.

Signature

arrayIndexOfString('colName', valueToFind)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivTailNums, 
       arrayIndexOfString(DivTailNums, 'N7713A') AS index
from airlineStats 
WHERE arraylength(DivTailNums) >= 2
limit 5

ARRAYLENGTH

This section contains reference documentation for the ARRAYLENGTH function.

Returns the length of a multi-value column

Signature

ARRAYLENGTH('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select ARRAYLENGTH(RandomAirports) AS length, count(*) 
from airlineStats 
GROUP BY length
ORDER BY count(*) DESC
LIMIT 5

The count(*) values will increase each time we execute the query as data is constantly being ingested by the Hybrid Quick Start.

arrayRemoveInt

This section contains reference documentation for the arrayRemoveInt function.

Removes value from array of ints.

Signature

arrayRemoveInt('colName', value)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivAirportIDs, 
       arrayRemoveInt(DivAirportIDs, 12892) AS value
from airlineStats 
WHERE arraylength(DivAirportIDs) >= 2
AND arrayContainsInt(DivAirportIDs, 12892) = 1
limit 5

arrayRemoveString

This section contains reference documentation for the arrayRemoveString function.

Removes value from array of strings.

Signature

arrayRemoveString('colName', value)

Usage Examples

These examples are based on the Hybrid Quick Start.

select RandomAirports, 
       arrayRemoveString(RandomAirports, 'SEA') AS value
from airlineStats 
WHERE arraylength(RandomAirports) BETWEEN 2 AND 4
limit 5

arrayReverseInt

This section contains reference documentation for the arrayReverseInt function.

Reverses array of ints.

Signature

arrayReverseInt('colName')

Usage Examples

arrayReverseString

This section contains reference documentation for the arrayReverseString function.

Reverses array of strings.

Signature

arrayReverseString('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select FlightNum, 
       arrayReverseString(RandomAirports) AS reversedAirports, 
       RandomAirports
from airlineStats 
WHERE arraylength(RandomAirports) BETWEEN 2 AND 4
limit 5

arraySliceInt

This section contains reference documentation for the arraySliceInt function.

Returns the values in the array between the start and end positions.

Signature

arraySliceInt('colName', start, end)

Usage Examples

These examples are based on the Hybrid Quick Start.

select FlightNum, 
       arraySliceInt(DivAirportIDs, 0, 1) AS airports, 
       DivAirportIDs
from airlineStats 
WHERE arraylength(DivAirportIDs) >= 2
limit 5

arraySliceString

This section contains reference documentation for the arraySliceString function.

Returns the values in the array between the start and end positions.

Signature

arraySliceString('colName', start, end)

Usage Examples

arrayUnionInt

This section contains reference documentation for the arrayUnionInt function.

Create a union of two arrays of ints.

Signature

arrayUnionInt('colName1', 'colName2')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivWheelsOffs, 
       DivWheelsOns,
       arrayUnionInt(DivWheelsOffs, DivWheelsOns) AS unionIds
from airlineStats 
WHERE arraylength(DivWheelsOffs) >= 2
limit 5

DISTINCTCOUNTRAWTHETASKETCH

This section contains reference documentation for the DISTINCTCOUNTRAWTHETASKETCH function.

The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch Class and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating to perform distinct counting as well as evaluating set operations.

Signature

DISTINCTCOUNTRAWTHETASKETCH(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded

thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches.
- Currently, the only supported parameter is nominalEntries (defaults to 4096).
predicates (optional): Individual predicates of the form lhs <op> rhs, applied to the rows selected by the WHERE clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfy each predicate are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates created by combining individual predicates with AND/OR are supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, and SET_INTERSECT, where SET_DIFF requires two arguments and SET_UNION/SET_INTERSECT allow more than two arguments.

Usage Examples

These examples are based on the Batch Quick Start.

select distinctCountRawThetaSketch(teamID) AS value
from baseballStats

select distinctCountRawThetaSketch(teamID, 'nominalEntries=10') AS value
from baseballStats

We can also provide predicates and a post aggregation expression to compute more complicated cardinalities:

select distinctCountRawThetaSketch(
  yearID, 
  'nominalEntries=4096', 
  'teamID = ''SFN'' AND numberOfGames=28 AND homeRuns=1',
  'teamID = ''CHN'' AND numberOfGames=28 AND homeRuns=1',
  'SET_INTERSECT($1, $2)'
) AS value
from baseballStats
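
To make the predicate and post-aggregation steps concrete, the Python sketch below mimics them with exact Python sets standing in for the approximate theta sketches. The helper names and sample rows are hypothetical. The real function returns a hex-encoded sketch rather than a set, so this only illustrates which distinct values the resulting sketch would represent.

```python
from typing import Any, Callable, Dict, List, Sequence, Set

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def build_sets(rows: Sequence[Row], column: str, predicates: Sequence[Predicate]) -> List[Set[Any]]:
    """One set per predicate: the distinct `column` values of the rows matching it."""
    return [{row[column] for row in rows if predicate(row)} for predicate in predicates]


def set_union(*sets: Set[Any]) -> Set[Any]:
    return set().union(*sets)


def set_intersect(first: Set[Any], *others: Set[Any]) -> Set[Any]:
    return first.intersection(*others)


def set_diff(left: Set[Any], right: Set[Any]) -> Set[Any]:
    # Like SET_DIFF, this takes exactly two arguments.
    return left - right


# Hypothetical rows, for illustration only.
sample_rows = [
    {"yearID": 1990, "teamID": "SFN"},
    {"yearID": 1991, "teamID": "SFN"},
    {"yearID": 1991, "teamID": "CHN"},
    {"yearID": 1992, "teamID": "CHN"},
    {"yearID": 1991, "teamID": "SFN"},
]

# $1 and $2 in the post-aggregation expression refer to these sets in order.
first, second = build_sets(
    sample_rows,
    "yearID",
    [lambda row: row["teamID"] == "SFN", lambda row: row["teamID"] == "CHN"],
)
common_years = set_intersect(first, second)
```

In this sample, common_years holds the single year in which both teams have matching rows, which corresponds to what SET_INTERSECT($1, $2) would estimate.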

DATETIMECONVERT

This section contains reference documentation for the DATETIMECONVERT function.

Converts the value from a column that contains an epoch timestamp into another time unit and buckets based on the given time granularity.

Signature

DATETIMECONVERT(columnName, inputFormat, outputFormat, outputGranularity)

inputFormat and outputFormat are defined using the following structure:

<time size>:<time unit>:<time format>:<pattern>

where:

time size - size of the time unit, e.g. 1, 10
time unit - DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS
time format
- EPOCH
- SIMPLE_DATE_FORMAT
pattern - defined in case of SIMPLE_DATE_FORMAT, e.g. yyyy-MM-dd. A specific timezone can be passed using tz(timezone). The timezone can be given in long or short string format, e.g. Asia/Kolkata or PDT.

granularity is specified in the format <time size>:<time unit>.

Usage Examples

These examples are based on the Batch JSON Quick Start.

created_at_timestamp from milliseconds since epoch to days since epoch, bucketed to 1 day granularity:

select id, 
       created_at_timestamp, 
       cast(created_at_timestamp AS long) AS timeInMs,
       DATETIMECONVERT(
         created_at_timestamp, 
         '1:MILLISECONDS:EPOCH', 
         '1:DAYS:EPOCH', 
         '1:DAYS'
       ) AS convertedTime
from githubEvents
WHERE id = 7044874134

created_at_timestamp bucketed to 15 minutes granularity:

select id, 
       created_at_timestamp, 
       cast(created_at_timestamp AS long) AS timeInMs,
       DATETIMECONVERT(
         created_at_timestamp, 
         '1:MILLISECONDS:EPOCH', 
         '1:MILLISECONDS:EPOCH', 
         '15:MINUTES'
       ) AS convertedTime
from githubEvents
WHERE id = 7044874134
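
The two epoch-to-epoch conversions above come down to integer arithmetic: normalise the input to milliseconds, round down to the bucket size, then express the result in the output unit. The Python sketch below illustrates this under assumptions. The helper convert_epoch and the sample timestamp are hypothetical, sub-millisecond units are left out, and Pinot's own implementation may differ in detail.

```python
# Milliseconds per supported unit (MICROSECONDS and NANOSECONDS omitted in this sketch).
UNIT_MILLIS = {
    "MILLISECONDS": 1,
    "SECONDS": 1_000,
    "MINUTES": 60_000,
    "HOURS": 3_600_000,
    "DAYS": 86_400_000,
}


def _epoch_spec_millis(spec: str) -> int:
    """Milliseconds represented by one unit of a '<size>:<unit>:EPOCH' spec."""
    size, unit, kind = spec.split(":")
    if kind != "EPOCH":
        raise ValueError(f"only EPOCH formats are handled here: {spec!r}")
    return int(size) * UNIT_MILLIS[unit]


def _granularity_millis(spec: str) -> int:
    """Bucket width in milliseconds for a '<size>:<unit>' granularity."""
    size, unit = spec.split(":")
    return int(size) * UNIT_MILLIS[unit]


def convert_epoch(value: int, input_format: str, output_format: str, granularity: str) -> int:
    millis = value * _epoch_spec_millis(input_format)       # normalise to milliseconds
    bucket = _granularity_millis(granularity)
    bucketed = millis // bucket * bucket                     # round down to the bucket start
    return bucketed // _epoch_spec_millis(output_format)    # express in the output unit


# Arbitrary sample timestamp: 2018-01-01 11:00:02 UTC in epoch milliseconds.
sample_ms = 1_514_804_402_000
days_since_epoch = convert_epoch(sample_ms, "1:MILLISECONDS:EPOCH", "1:DAYS:EPOCH", "1:DAYS")
quarter_hour_ms = convert_epoch(
    sample_ms, "1:MILLISECONDS:EPOCH", "1:MILLISECONDS:EPOCH", "15:MINUTES"
)
```

For the sample value, days_since_epoch is the day number of 2018-01-01 and quarter_hour_ms is the start of the 11:00 quarter-hour bucket, two seconds before the input.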

created_at_timestamp to format yyyy-MM-dd, bucketed to 1 day granularity:

select id, 
       created_at_timestamp, 
       cast(created_at_timestamp AS long) AS timeInMs,
       DATETIMECONVERT(
         created_at_timestamp, 
         '1:MILLISECONDS:EPOCH', 
         '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', 
         '1:DAYS'
       ) AS convertedTime
from githubEvents
WHERE id = 7044874134

created_at_timestamp to format yyyy-MM-dd HH:mm, in timezone Pacific/Kiritimati:

select id, 
       created_at_timestamp, 
       cast(created_at_timestamp AS long) AS timeInMs,
       DATETIMECONVERT(
         created_at_timestamp, 
         '1:MILLISECONDS:EPOCH', 
         '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm tz(Pacific/Kiritimati)', 
         '1:MILLISECONDS'
       ) AS convertedTime
from githubEvents
WHERE id = 7044874134
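
The formatting conversions above can be approximated in Python as follows. This sketch rests on several assumptions: the helper format_epoch_millis is hypothetical, only the pattern letters used in these examples are mapped, the default timezone is taken to be UTC, and a fixed UTC+14 offset stands in for Pacific/Kiritimati to avoid depending on a timezone database.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Assumed mapping from the pattern letters used above to strftime codes.
_PATTERN_TO_STRFTIME = [
    ("yyyy", "%Y"),
    ("MM", "%m"),
    ("dd", "%d"),
    ("HH", "%H"),
    ("mm", "%M"),
]

# Fixed-offset stand-in for Pacific/Kiritimati (UTC+14); an assumption of this sketch.
KIRITIMATI = timezone(timedelta(hours=14))


def format_epoch_millis(millis: int, pattern: str, tz: tzinfo = timezone.utc) -> str:
    """Format epoch milliseconds using a small subset of SIMPLE_DATE_FORMAT letters."""
    fmt = pattern
    for token, code in _PATTERN_TO_STRFTIME:
        fmt = fmt.replace(token, code)
    return datetime.fromtimestamp(millis / 1000, tz=tz).strftime(fmt)


# Arbitrary sample timestamp: 2018-01-01 11:00:02 UTC in epoch milliseconds.
sample_ms = 1_514_804_402_000
utc_day = format_epoch_millis(sample_ms, "yyyy-MM-dd")
kiritimati_minute = format_epoch_millis(sample_ms, "yyyy-MM-dd HH:mm", tz=KIRITIMATI)
```

Because the stand-in timezone is 14 hours ahead of UTC, the same instant falls on the following calendar day there, which is why the timezone passed through tz(...) can change the formatted date.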

created_at_timestamp to format yyyy-MM-dd HH:mm, in timezone Pacific/Kiritimati and bucketed to 1 day granularity:

select id, 
       created_at_timestamp, 
       cast(created_at_timestamp AS long) AS timeInMs,
       DATETIMECONVERT(
         created_at_timestamp, 
         '1:MILLISECONDS:EPOCH', 
         '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm tz(Pacific/Kiritimati)', 
         '1:DAYS'
       ) AS convertedTime
from githubEvents
WHERE id = 7044874134

FUNNELCOUNT

This section contains reference documentation for the FUNNELCOUNT function.

Funnel analytics aggregation function.

Returns an array of distinct correlated counts, one for each funnel step.

Signature

FUNNEL_COUNT (
STEPS ( predicate1, predicate2 ... ),
CORRELATED_BY ( correlation_column ),
SETTINGS ( setting1, setting2 ... ) )

Usage Examples

Many datasets are time series in nature, tracking events of an entity over time. An example of such a dataset could be a user analytics activity log from a commerce web application.

Example

Funnel

We want to analyse the following checkout funnel:

/cart/add
/checkout/start
/checkout/confirmation

Counts

We want to answer the following questions about the above funnel:

How many users entered the top of the funnel?
How many of these users proceeded to the second step?
How many users reached the bottom of the funnel after completing all steps?

Query

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATED_BY(user_id)
  ) AS counts
from user_log

Notes
Notice that although user U1 added to cart twice, this still counts as one conversion in the first step, because we report unique counts rather than total events. Also notice that although U2's events were logged out of order, the user is still counted as converted.

Equivalence

The above query is equivalent to the following Presto SQL query:

select 
   ARRAY[
     count_if(steps[1]),
     count_if(steps[1] and steps[2]),
     count_if(steps[1] and steps[2] and steps[3])
   ] as counts
 from (
   select 
     ARRAY[
       bool_or(url = '/cart/add'),
       bool_or(url = '/checkout/start'),
       bool_or(url = '/checkout/confirmation')
     ] as steps
   from user_log
   group by user_id
 )
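
The same counting logic can be sketched in Python: record per user which steps were ever matched, then count the users who matched every step up to each depth. The helper funnel_count and the event rows below are hypothetical and are not the original user_log data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def funnel_count(events: Iterable[Tuple[str, str]], steps: Sequence[str]) -> List[int]:
    """Count distinct users that matched steps 1..k, for each k (event order is ignored)."""
    reached: Dict[str, List[bool]] = defaultdict(lambda: [False] * len(steps))
    for user_id, url in events:
        for index, step_url in enumerate(steps):
            if url == step_url:
                reached[user_id][index] = True  # plays the role of bool_or per user
    return [
        sum(all(flags[:depth]) for flags in reached.values())  # plays the role of count_if
        for depth in range(1, len(steps) + 1)
    ]


# Hypothetical (user_id, url) events, for illustration only.
sample_log = [
    ("U1", "/cart/add"),
    ("U1", "/cart/add"),               # repeated event, still one unique user
    ("U1", "/checkout/start"),
    ("U2", "/checkout/start"),         # logged out of order
    ("U2", "/cart/add"),
    ("U2", "/checkout/confirmation"),
    ("U3", "/cart/add"),
]

counts = funnel_count(
    sample_log,
    ["/cart/add", "/checkout/start", "/checkout/confirmation"],
)
```

For these sample events, counts is [3, 2, 1]: three users added to cart, two of them started checkout, and one reached the confirmation page.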

Settings

For a large dataset, we could use, for example, the theta_sketch strategy, or go further and partition the data by user_id to leverage a partitioned strategy. It is also important to filter in the WHERE clause so that only the necessary rows are aggregated.

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATED_BY(user_id),
    SETTINGS('theta_sketch', 'nominalEntries=4096')
  ) AS counts
from user_log 
where url in ('/cart/add', '/checkout/start', '/checkout/confirmation')

Another Example

We now want to learn how many users check out after a text search, as opposed to other entry points such as browsing a product category listing. We therefore analyse the following funnel:

/product/search
/cart/add
/checkout/start
/checkout/confirmation

Query

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/product/search',
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATED_BY(user_id)
  ) AS counts
from user_log

Notes
Notice that U1 is not counted in this funnel, as the user did not perform any product search. Both U2 and U3 entered the top of the funnel and performed the second step, but only U2 converted to the bottom of the funnel.