
Functions

This page contains reference documentation for functions in Apache Pinot.

ABS

This section contains reference documentation for the abs function.

Returns the absolute value of a number.

Signature

ABS(col1)

Usage Examples

select ABS(-12.1) AS value
from ignoreMe

value

12.1

select ABS(12.1) AS value
from ignoreMe

value

12.1

ADD

This section contains reference documentation for the ADD function.

Returns the sum of at least two values.

Signature

ADD(col1, col2, col3...)

Usage Examples

ago

This section contains reference documentation for the ago function.

Return time as epoch millis before the given period in ISO-8601 duration format (PnDTnHnMn.nS with days considered to be exactly 24 hours).

Examples:

"PT20.345S" -- parses as "20.345 seconds"
"PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
"PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
"P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
"P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
"PT-6H3M" -- parses as "-6 hours and +3 minutes"
"-PT6H3M" -- parses as "-6 hours and -3 minutes"
"-PT-6H+3M" -- parses as "+6 hours and -3 minutes"

Signature

ago(PeriodString)

Usage Examples

This function is typically used in a predicate to filter on timestamps for recent data, for example to keep only rows from the last day.
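As a rough illustration of how a period string turns into a cutoff timestamp, the following Python sketch parses a simplified subset of the duration format and subtracts it from a reference time. The names `period_to_millis` and `ago_millis` and the regular expression are assumptions made for this example; they are not Pinot's implementation.

```python
import re
import time

# Simplified pattern for PnDTnHnMn.nS; each component may carry its own sign.
_DURATION = re.compile(
    r"^(?P<sign>[+-]?)P"
    r"(?:(?P<days>[+-]?\d+)D)?"
    r"(?:T(?:(?P<hours>[+-]?\d+)H)?"
    r"(?:(?P<minutes>[+-]?\d+)M)?"
    r"(?:(?P<seconds>[+-]?\d+(?:\.\d+)?)S)?)?$"
)


def period_to_millis(period):
    """Convert a duration string such as 'P2DT3H4M' to milliseconds."""
    match = _DURATION.match(period)
    if match is None or period.lstrip("+-") in ("P", "PT"):
        raise ValueError(f"unsupported period: {period!r}")
    parts = match.groupdict()
    seconds = (
        int(parts["days"] or 0) * 86400      # a day is exactly 24 hours
        + int(parts["hours"] or 0) * 3600
        + int(parts["minutes"] or 0) * 60
        + float(parts["seconds"] or 0)
    )
    total = round(seconds * 1000)
    # A leading minus sign negates the whole duration.
    return -total if parts["sign"] == "-" else total


def ago_millis(period, now_millis=None):
    """Epoch millis that lies `period` before `now_millis` (default: now)."""
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    return now_millis - period_to_millis(period)


# Cutoff for "rows from the last day", relative to a fixed reference time.
print(ago_millis("P1D", now_millis=1639351800000))  # 1639265400000
```

A predicate would then compare a timestamp column against this cutoff, keeping rows whose timestamp is greater than the returned value.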

EXPR_MIN / EXPR_MAX

This section contains reference documentation for the EXPR_MIN and EXPR_MAX functions.

This function scans the given dataset to identify the maximum and minimum values in the specified measuring columns. Once these extreme values (the maxima and minima) are found, the function locates the corresponding entries in the projection column. These entries are associated with the rows where the extreme values were found in the measuring columns. The function then returns these projection column values, providing a way to link the extreme measurements with their corresponding data in another part of the dataset.
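The behaviour described above can be sketched in Python. This is an illustration only: the helper names, the row layout, and the sample data are hypothetical, and the sketch assumes that several measuring columns are compared in order, with later columns acting as tie-breakers.

```python
def _extreme_rows(rows, projection_col, measuring_cols, pick):
    """Return projection values for every row that holds the extreme measurement."""
    if not rows:
        return []

    def key(row):
        # Tuples compare left to right, so later columns only break ties.
        return tuple(row[col] for col in measuring_cols)

    extreme = pick(key(row) for row in rows)
    return [row[projection_col] for row in rows if key(row) == extreme]


def expr_max(rows, projection_col, *measuring_cols):
    return _extreme_rows(rows, projection_col, measuring_cols, max)


def expr_min(rows, projection_col, *measuring_cols):
    return _extreme_rows(rows, projection_col, measuring_cols, min)


# Hypothetical sample data.
rows = [
    {"user_id": "u1", "activity": 10, "last_activity_date": 20240101},
    {"user_id": "u2", "activity": 12, "last_activity_date": 20240103},
    {"user_id": "u3", "activity": 12, "last_activity_date": 20240105},
]

print(expr_max(rows, "user_id", "activity"))                        # ['u2', 'u3']
print(expr_max(rows, "user_id", "activity", "last_activity_date"))  # ['u3']
print(expr_min(rows, "user_id", "activity"))                        # ['u1']
```

With a single measuring column the two tied users are both returned; adding a second measuring column resolves the tie.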

Prerequisite

This function has to be used with the following configuration on the broker:

Signature

EXPR_MIN (projectionCol, measuringCol1, measuringCol2, measuringCol3)
EXPR_MAX (projectionCol, measuringCol1, measuringCol2, measuringCol3)

Usage Examples

Find the user with the maximum activity. If several users tie, break the tie with last_activity_date; if they still tie, break it with user_id. Project user_id.

More usefully, several of these aggregation functions can be combined with GROUP BY.

Note:

If multiple rows share the same extreme values in the measuring columns, the function returns all of those rows.
To project several different columns for the same set of measuring columns, invoke the function multiple times, each time with a different projection column.
This implementation does not work with an AS clause (e.g. SELECT exprmin(longCol, doubleCol) AS exprmin won't work).
Putting an exprmin/exprmax column inside an ORDER BY clause (e.g. SELECT intCol, exprmin(longCol, doubleCol) FROM table GROUP BY intCol ORDER BY exprmin(longCol, doubleCol)) is not supported, because ordering multi-column, multi-row exprmin/exprmax results is not semantically meaningful.
Projecting an MV bytes column currently does not work due to an issue.

arrayConcatDouble

This section contains reference documentation for the arrayConcatDouble function.

Concatenates two arrays of doubles.

Signature

arrayConcatDouble('colName1', 'colName2')

Usage Examples

This example assumes the multiValueTable columns mvCol1 and mvCol2 are both of type DOUBLE with singleValueField in the table schema set to false.

arrayConcatFloat

This section contains reference documentation for the arrayConcatFloat function.

Concatenates two arrays of floats.

Signature

arrayConcatFloat('colName1', 'colName2')

Usage Examples

This example assumes the multiValueTable columns mvCol1 and mvCol2 are both of type FLOAT with singleValueField in the table schema set to false.

select mvCol1, 
       arrayConcatFloat(mvCol1, mvCol2) AS concatFloats
from multiValueTable
WHERE arraylength(mvCol1) >= 2
limit 5

arrayConcatInt

This section contains reference documentation for the arrayConcatInt function.

Concatenates two arrays of ints.

Signature

arrayConcatInt('colName1', 'colName2')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivWheelsOffs, 
       arrayConcatInt(DivWheelsOffs, DivWheelsOns) AS concatIds
from airlineStats 
WHERE arraylength(DivWheelsOffs) >= 2
limit 5

DivWheelsOffs | concatIds
1453,1731     | 1453,1731,1415,1623
1908,1758     | 1908,1758,1339,2310
1453,1731     | 1453,1731,1415,1623
1908,1758     | 1908,1758,1339,2310

arrayConcatLong

This section contains reference documentation for the arrayConcatLong function.

Concatenates two arrays of longs.

Signature

arrayConcatLong('colName1', 'colName2')

Usage Examples

This example assumes the multiValueTable columns mvCol1 and mvCol2 are both of type LONG with singleValueField in the table schema set to false.

select mvCol1, 
       arrayConcatLong(mvCol1, mvCol2) AS concatLongs
from multiValueTable
WHERE arraylength(mvCol1) >= 2
limit 5

arrayConcatString

This section contains reference documentation for the arrayConcatString function.

Concatenates two arrays of strings.

Signature

arrayConcatString('colName1', 'colName2')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivTailNums, 
       arrayConcatString(DivTailNums, DivTailNums) AS concatIds
from airlineStats 
WHERE arraylength(DivTailNums) >= 2
limit 5

DivTailNums   | concatIds
N7713A,N7713A | N7713A,N7713A,N7713A,N7713A
N344AA,N344AA | N344AA,N344AA,N344AA,N344AA
N344AA,N344AA | N344AA,N344AA,N344AA,N344AA
N7713A,N7713A | N7713A,N7713A,N7713A,N7713A

arrayContainsInt

This section contains reference documentation for the arrayContainsInt function.

Checks if int value exists in array.

Signature

arrayContainsInt('colName', valueToFind)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivAirportIDs, 
       arrayContainsInt(DivAirportIDs, 14683) AS containsValue
from airlineStats 
WHERE arraylength(DivAirportIDs) >= 2
limit 5

DivAirportIDs | containsValue
13891,12892   | false
14683,14683   | true
12339,12339   | false
13487,13930   | false
13029,11292   | false

arrayContainsString

This section contains reference documentation for the arrayContainsString function.

Checks if string value exists in array.

Signature

arrayContainsString('colName', valueToFind)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivTailNums, 
       arrayContainsString(DivTailNums, 'N7713A') AS index
from airlineStats 
WHERE arraylength(DivTailNums) >= 2
limit 5

DivTailNums   | index
N7713A,N7713A | true
N344AA,N344AA | false
N7713A,N7713A | true

arrayDistinctInt

This section contains reference documentation for the arrayDistinctInt function.

Returns unique values in an array of ints.

Signature

arrayDistinctInt('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivAirportIDs, 
       arrayDistinctInt(DivAirportIDs) AS unique
from airlineStats 
WHERE arraylength(DivAirportIDs) >= 2
limit 5

DivAirportIDs

unique

15016,11066

10620,14869

13891,12892

12264,10397

11066,12892

arrayDistinctString

This section contains reference documentation for the arrayDistinctString function.

Returns unique values in an array of strings.

Signature

arrayDistinctString('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivTailNums, 
       arrayDistinctString(DivTailNums) AS unique
from airlineStats 
WHERE arraylength(DivTailNums) >= 2
limit 5

DivTailNums   | unique
N7713A,N7713A | N7713A
N344AA,N344AA | N344AA
N344AA,N344AA | N344AA
N7713A,N7713A | N7713A

arrayIndexOfInt

This section contains reference documentation for the arrayIndexOfInt function.

Finds the last index of the given value in the array starting at the given index.

Signature

arrayIndexOfInt('colName', valueToFind)

Usage Examples

arrayIndexOfString

This section contains reference documentation for the arrayIndexOfString function.

Finds the last index of the given value in the array starting at the given index.

Signature

arrayIndexOfString('colName', valueToFind)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivTailNums, 
       arrayIndexOfString(DivTailNums, 'N7713A') AS index
from airlineStats 
WHERE arraylength(DivTailNums) >= 2
limit 5

DivTailNums

index

N7713A,N7713A

N344AA,N344AA

-1

N7713A,N7713A

ARRAYLENGTH

This section contains reference documentation for the ARRAYLENGTH function.

Returns the length of a multi-value column

Signature

ARRAYLENGTH('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select ARRAYLENGTH(RandomAirports) AS length, count(*) 
from airlineStats 
GROUP BY length
ORDER BY count(*) DESC
LIMIT 5

length

count(*)

5382

267

223

166

160

The count(*) values will increase each time we execute the query as data is constantly being ingested by the Hybrid Quick Start.

arrayRemoveInt

This section contains reference documentation for the arrayRemoveInt function.

Removes value from array of ints.

Signature

arrayRemoveInt('colName', value)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivAirportIDs, 
       arrayRemoveInt(DivAirportIDs, 12892) AS value
from airlineStats 
WHERE arraylength(DivAirportIDs) >= 2
AND arrayContainsInt(DivAirportIDs, 12892) = 1
limit 5

DivAirportIDs | value
13891,12892   | 13891
13198,12892   | 13198
11066,12892   | 11066
13198,12892   | 13198
13891,12892   | 13891

arrayRemoveString

This section contains reference documentation for the arrayRemoveString function.

Removes value from array of strings.

Signature

arrayRemoveString('colName', value)

Usage Examples

arrayReverseInt

This section contains reference documentation for the arrayReverseInt function.

Reverses array of ints.

Signature

arrayReverseInt('colName')

Usage Examples

arrayReverseString

This section contains reference documentation for the arrayReverseString function.

Reverses array of strings.

Signature

arrayReverseString('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select FlightNum, 
       arrayReverseString(RandomAirports) AS reversedAirports, 
       RandomAirports
from airlineStats 
WHERE arraylength(RandomAirports) BETWEEN 2 AND 4
limit 5

FlightNum | reversedAirports | RandomAirports
1206      | PSC,SEA          | SEA,PSC
5300      | PSC,SEA          | SEA,PSC
3359      | MSY,PHX,PSC,SEA  | SEA,PSC,PHX,MSY
1023      | PHX,PSC,SEA      | SEA,PSC,PHX
963       | MSY,PHX,PSC,SEA  | SEA,PSC,PHX,MSY

arraySliceInt

This section contains reference documentation for the arraySliceInt function.

Returns the values in the array between the start and end positions.

Signature

arraySliceInt('colName', start, end)

Usage Examples

arraySliceString

This section contains reference documentation for the arraySliceString function.

Returns the values in the array between the start and end positions.
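A minimal Python sketch of the slicing behaviour, assuming `start` is inclusive and `end` is exclusive. That assumption matches the example query in this section, where `arraySliceString(RandomAirports, 0, 2)` returns the first two elements. The helper name `array_slice` is hypothetical.

```python
def array_slice(values, start, end):
    """Return the elements from index `start` (inclusive) to `end` (exclusive)."""
    return values[start:end]


airports = ["SEA", "PSC", "PHX", "MSY"]
print(array_slice(airports, 0, 2))  # ['SEA', 'PSC']
print(array_slice(airports, 1, 3))  # ['PSC', 'PHX']
```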

Signature

arraySliceString('colName', start, end)

Usage Examples

These examples are based on the Hybrid Quick Start.

select FlightNum, 
       arraySliceString(RandomAirports, 0, 2) AS airports, 
       RandomAirports
from airlineStats 
WHERE arraylength(RandomAirports) BETWEEN 2 AND 4
limit 5

FlightNum | airports | RandomAirports
671       | SEA,PSC  | SEA,PSC,PHX,MSY
1767      | SEA,PSC  | SEA,PSC,PHX
2522      | SEA,PSC  |
424       | SEA,PSC  | SEA,PSC,PHX,MSY
3162      | SEA,PSC  | SEA,PSC,PHX,MSY

arraySortString

This section contains reference documentation for the arraySortString function.

Sorts array of strings.

Signature

arraySortString('colName')

Usage Examples

These examples are based on the Hybrid Quick Start.

select FlightNum, 
       arraySortString(RandomAirports) AS sortedAirports, 
       RandomAirports
from airlineStats 
WHERE arraylength(RandomAirports) BETWEEN 2 AND 4
limit 5

FlightNum | sortedAirports  | RandomAirports
3846      | PSC,SEA         | SEA,PSC
3635      | MSY,PHX,PSC,SEA | SEA,PSC,PHX,MSY
429       | MSY,PHX,PSC,SEA | SEA,PSC,PHX,MSY
1206      | PSC,SEA         | SEA,PSC
5300      | PSC,SEA         | SEA,PSC

arrayUnionInt

This section contains reference documentation for the arrayUnionInt function.

Create a union of two arrays of ints.

Signature

arrayUnionInt('colName1', 'colName2')

Usage Examples

These examples are based on the Hybrid Quick Start.

select DivWheelsOffs, 
       DivWheelsOns,
       arrayUnionInt(DivWheelsOffs, DivWheelsOns) AS unionIds
from airlineStats 
WHERE arraylength(DivWheelsOffs) >= 2
limit 5

DivWheelsOffs | DivWheelsOns | unionIds
1453,1731     | 1415,1623    | 1453,1731,1415,1623
1908,1758     | 1339,2310    | 1908,1758,1339,2310
1453,1731     | 1415,1623    | 1453,1731,1415,1623
1908,1758     | 1339,2310    | 1908,1758,1339,2310

arrayUnionString

This section contains reference documentation for the arrayUnionString function.

Create a union of two arrays of strings.

Signature

arrayUnionString('colName1', 'colName2')

Usage Examples

AVGMV

This section contains reference documentation for the AVGMV function.

Get the avg of values in a group

Signature

AVGMV(colName)

Usage Examples

These examples are based on the Hybrid Quick Start.

select AVGMV(DivLongestGTimes) AS value
from airlineStats 
where arraylength(DivLongestGTimes) > 1

value

18.465753424657535

Base64

This section contains reference documentation for base64 encode and decode functions.

toBase64 returns Base64 encoded string of input binary data (bytes type).
fromBase64 returns binary data (represented as a Hex string) from Base64-encoded string.
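The round trip can be illustrated with Python's standard `base64` module. This is a sketch of the encoding itself rather than of Pinot's code, and the helper names `to_base64` and `from_base64_hex` are hypothetical.

```python
import base64


def to_base64(data: bytes) -> str:
    """Encode binary data as a Base64 string."""
    return base64.b64encode(data).decode("ascii")


def from_base64_hex(encoded: str) -> str:
    """Decode a Base64 string and show the resulting bytes as hex."""
    return base64.b64decode(encoded).hex()


print(to_base64(b"hello"))          # aGVsbG8=
print(from_base64_hex("aGVsbG8="))  # 68656c6c6f
```

Note that the encoder takes bytes, not a string, which mirrors the input type restriction on toBase64.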

Signature

toBase64(bytesCol)
fromBase64(stringCol)

Usage Examples

Note that the following query will throw compilation error as string is not a valid input type for toBase64.

caseWhen

This section contains reference documentation for the caseWhen function.

Signature

caseWhen(booleanExpr1, valueIfExpr1True, booleanExpr2, valueIfExpr2True)
caseWhen(booleanExpr1, valueIfExpr1True, booleanExpr2, valueIfExpr2True, ... ,valueIfFalse)
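The two signatures can be read as a list of condition/value pairs with an optional trailing fallback. The Python sketch below models that reading; the function name `case_when` is hypothetical, and returning `None` when no condition holds and no fallback is given is an assumption, not documented behaviour.

```python
def case_when(*args):
    """Return the value paired with the first true condition.

    An odd number of arguments means the last one is the fallback value.
    """
    if len(args) % 2:
        pairs, fallback = args[:-1], args[-1]
    else:
        pairs, fallback = args, None  # assumption: no fallback yields None
    for index in range(0, len(pairs), 2):
        condition, value = pairs[index], pairs[index + 1]
        if condition:
            return value
    return fallback


score = 72
print(case_when(score >= 90, "high", score >= 50, "medium", "low"))  # medium
```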

Usage Examples

The usage examples are based on extracting fields from the following JSON documents:

ceil

This section contains reference documentation for the CEIL function.

Rounds the value up to the nearest integer.

Signature

CEIL(col1)

Usage Examples

select CEIL(12.1) AS value
from ignoreMe

value

select CEIL(-12.1) AS value
from ignoreMe

value

-12

CHR

This section contains reference documentation for the CHR function.

Returns the character corresponding to the Unicode codepoint.

Signature

CHR(codepoint)

Usage Examples

SELECT CHR(65) AS value
FROM ignoreMe

value

codepoint

This section contains reference documentation for the CODEPOINT function.

Returns the Unicode codepoint of the first character of the string.
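CHR and CODEPOINT are inverse operations on a single character. The Python sketch below illustrates that relationship with the built-in `chr` and `ord`; the wrapper names `chr_of` and `codepoint_of` are hypothetical.

```python
def chr_of(codepoint: int) -> str:
    """Character for a Unicode codepoint."""
    return chr(codepoint)


def codepoint_of(text: str) -> int:
    """Unicode codepoint of the first character of a non-empty string."""
    return ord(text[0])


print(chr_of(65))              # A
print(codepoint_of("Apache"))  # 65
```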

Signature

CODEPOINT(col)

Usage Examples

concat

This section contains reference documentation for the concat function.

Concatenates two input strings using the separator.

Signature

CONCAT(col1, col2, separator)

Usage Examples

count

This section contains reference documentation for the count function.

Get the count of rows in a group

Signature

COUNT(colName)

Usage Examples

These examples are based on the Batch Quick Start.

select count(*) AS value
from baseballStats

value

97889

COVAR_POP

This section contains reference documentation for the COVAR_POP function.

Returns the population covariance between two numerical columns.

Signatures

COVAR_POP(col1, col2) -> double

Usage Examples

COVAR_SAMP

This section contains reference documentation for the COVAR_SAMP function.

Returns the sample covariance between two numerical columns.

COVAR_SAMP(col1, col2) = COVAR_POP(col1, col2) * besselCorrection
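The relationship above can be checked numerically. In the Python sketch below, Bessel's correction is taken to be the usual factor n / (n - 1) for n rows; the function names and sample values are made up for illustration.

```python
def covar_pop(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covar_samp(xs, ys):
    """Sample covariance: population covariance times n / (n - 1)."""
    n = len(xs)
    bessel_correction = n / (n - 1)
    return covar_pop(xs, ys) * bessel_correction


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covar_pop(xs, ys))   # 2.5
print(covar_samp(xs, ys))  # approximately 3.3333
```

The correction matters most for small groups; as n grows the two values converge.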

Signatures

COVAR_SAMP(col1, col2) -> double

Usage Examples

These examples are based on the Batch Quick Start.

SELECT COVAR_SAMP(numberOfGames, AtBatting) AS covariance 
FROM baseballStats

covariance

8270.973200974102

day

This section contains reference documentation for the day function.

Returns the day of the month from the given epoch millis in UTC or specified timezone. The value ranges from 1 to 31.

Signature

day(tsInMillis)
day(tsInMillis, timeZoneId)
dayOfMonth(tsInMillis)
dayOfMonth(tsInMillis, timeZoneId)

Usage Examples

select day(1639351800000) AS day
FROM ignoreMe

day

select day(1639351800000, 'CET') AS day
FROM ignoreMe

day

select dayOfMonth(1639351800000) AS day
FROM ignoreMe

day

select dayOfMonth(1639351800000, 'CET') AS day
FROM ignoreMe

day

dayOfWeek

This section contains reference documentation for the dayOfWeek function.

Returns the day of the week from the given epoch millis in UTC or specified timezone. The value ranges from 1 (Monday) to 7 (Sunday).
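To show how the timezone argument can shift the result, the Python sketch below extracts the day of month, day of week, and day of year from the epoch millis used in the examples. It is an illustration with the standard library, not Pinot's code. `CET_WINTER` is a fixed UTC+1 offset standing in for the 'CET' zone in December, and `date_parts` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC+1 offset approximates 'CET' for a December timestamp.
CET_WINTER = timezone(timedelta(hours=1))


def date_parts(ts_millis, tz=UTC):
    """Day of month, ISO day of week (1=Monday..7=Sunday), and day of year."""
    moment = datetime.fromtimestamp(ts_millis / 1000, tz=tz)
    return {
        "day": moment.day,
        "dayOfWeek": moment.isoweekday(),
        "dayOfYear": moment.timetuple().tm_yday,
    }


# 1639351800000 is 2021-12-12 23:30 UTC, which is already 2021-12-13 00:30 at UTC+1.
print(date_parts(1639351800000))
# {'day': 12, 'dayOfWeek': 7, 'dayOfYear': 346}
print(date_parts(1639351800000, CET_WINTER))
# {'day': 13, 'dayOfWeek': 1, 'dayOfYear': 347}
```

Because the timestamp falls half an hour before midnight UTC, moving one hour east rolls every date part over to the next day.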

Signature

dayOfWeek(tsInMillis)
dayOfWeek(tsInMillis, timeZoneId)
dow(tsInMillis)
dow(tsInMillis, timeZoneId)

Usage Examples

select dayOfWeek(1639351800000) AS dayOfWeek
FROM ignoreMe

dayOfWeek

select dayOfWeek(1639351800000, 'CET') AS dayOfWeek
FROM ignoreMe

dayOfWeek

select dow(1639351800000) AS dayOfWeek
FROM ignoreMe

dayOfWeek

select dow(1639351800000, 'CET') AS dayOfWeek
FROM ignoreMe

dayOfWeek

dayOfYear

This section contains reference documentation for the dayOfYear function.

Returns the day of the year from the given epoch millis in UTC or specified timezone. The value ranges from 1 to 366.

Signature

dayOfYear(tsInMillis)
dayOfYear(tsInMillis, timeZoneId)
doy(tsInMillis)
doy(tsInMillis, timeZoneId)

Usage Examples

select dayOfYear(1639351800000) AS dayOfYear
FROM ignoreMe

dayOfYear

346

select dayOfYear(1639351800000, 'CET') AS dayOfYear
FROM ignoreMe

dayOfYear

347

select doy(1639351800000) AS dayOfYear
FROM ignoreMe

dayOfYear

346

select doy(1639351800000, 'CET') AS dayOfYear
FROM ignoreMe

dayOfYear

347

DISTINCT

This section contains reference documentation for the DISTINCT function.

Returns the distinct row values in a group

Signature

DISTINCT(colName)

Usage Examples

These examples are based on the Batch Quick Start.

select DISTINCT league AS value
from baseballStats

value

select DISTINCT(league) AS value
from baseballStats

value

DISTINCTAVG

This section contains reference documentation for the DISTINCTAVG function.

Returns the average of distinct row values in a group
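The difference from a plain average is that duplicates are collapsed before averaging. A minimal Python sketch, with a hypothetical helper name and made-up values:

```python
def distinct_avg(values):
    """Average over the distinct values only."""
    distinct = set(values)
    return sum(distinct) / len(distinct)


runs = [10, 10, 20, 30]
print(sum(runs) / len(runs))  # 17.5, plain average counts 10 twice
print(distinct_avg(runs))     # 20.0, each value counted once
```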

Signature

DISTINCTAVG(colName) or avg(distinct col)

Usage Examples

These examples are based on the Batch Quick Start.

SELECT DISTINCTAVG(runs) AS VALUE
FROM baseballStats

VALUE

83.36526946107784

SELECT AVG(DISTINCT AtBatting) AS VALUE
FROM baseballStats

VALUE

349.1158798283262

DISTINCTAVGMV

This section contains reference documentation for the DISTINCTAVGMV function.

Returns the average of distinct row values in a group

Signature

DISTINCTAVGMV(colName)

Usage Examples

These examples are based on the Hybrid Quick Start.

SELECT DISTINCTAVGMV(DivLongestGTimes) AS VALUE
FROM airlineStats
WHERE arraylength(DivLongestGTimes) > 1

VALUE

32.4

DISTINCTCOUNT

This section contains reference documentation for the DISTINCTCOUNT function.

Returns the count of distinct row values in a group

Signature

DISTINCTCOUNT(colName)

Usage Examples

These examples are based on the Batch Quick Start.

select DISTINCTCOUNT(league) AS value
from baseballStats

value

select DISTINCTCOUNT(teamID) AS value
from baseballStats

value

149

DISTINCTCOUNTBITMAP

This section contains reference documentation for the DISTINCTCOUNTBITMAP function.

Returns the count of distinct row values in a group. This function is accurate for INT column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. For accurate distinct counting on all column types, see DISTINCTCOUNT.
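The effect of hash collisions on a hash-based distinct count can be sketched in Python. The sketch assumes a Java-style 32-bit string hash, which is an assumption about the hashing used and not something this page states; the helper names are hypothetical.

```python
def java_string_hash(text):
    """32-bit signed hash in the style of Java's String.hashCode (ASCII input)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def distinct_count_by_hash(values):
    """Count distinct hash codes instead of distinct values."""
    return len({java_string_hash(value) for value in values})


values = ["Aa", "BB", "C"]
print(len(set(values)))                # 3, exact distinct count
print(distinct_count_by_hash(values))  # 2, "Aa" and "BB" share hash 2112
```

Two different strings that hash to the same code are counted once, so a hash-based count can only undercount, never overcount.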

Signature

DISTINCTCOUNTBITMAP(colName)

Usage Examples

These examples are based on the Batch Quick Start.

select DISTINCTCOUNTBITMAP(league) AS value
from baseballStats

value

select DISTINCTCOUNTBITMAP(teamID) AS value
from baseballStats

value

148

DISTINCTCOUNTBITMAPMV

This section contains reference documentation for the DISTINCTCOUNTBITMAPMV function.

Returns the count of distinct row values in a group. This function is accurate for an INT or dictionary encoded column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions.

Signature

DISTINCTCOUNTBITMAPMV(colName)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DISTINCTCOUNTBITMAPMV(DivLongestGTimes) AS value
from airlineStats 
where arraylength(DivLongestGTimes) > 1

value

select DISTINCTCOUNTBITMAPMV(DivTailNums) AS value
from airlineStats 
where arraylength(DivTailNums) > 1

value

DISTINCTCOUNTHLL

This section contains reference documentation for the DISTINCTCOUNTHLL function.

Returns an approximate distinct count using HyperLogLog. It also takes an optional second argument to configure the log2m for the HyperLogLog. For accurate distinct counting, see DISTINCTCOUNT.
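As a rule of thumb from the HyperLogLog literature, a sketch with m = 2^log2m registers has a relative standard error of roughly 1.04 / sqrt(m). The Python snippet below evaluates that formula for a few values of log2m. It is a back-of-the-envelope illustration with hypothetical helper names, and it says nothing about which default Pinot uses.

```python
import math


def hll_registers(log2m):
    """Number of registers for a given log2m."""
    return 1 << log2m


def hll_relative_error(log2m):
    """Approximate relative standard error of a HyperLogLog estimate."""
    return 1.04 / math.sqrt(hll_registers(log2m))


for log2m in (8, 12, 16):
    print(log2m, hll_registers(log2m), round(hll_relative_error(log2m), 5))
# 8 256 0.065
# 12 4096 0.01625
# 16 65536 0.00406
```

A larger log2m trades memory for accuracy, since each increment doubles the number of registers.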

Signature

DISTINCTCOUNTHLL(colName, log2m)

Usage Examples

These examples are based on the Batch Quick Start.

select DISTINCTCOUNTHLL(teamID) AS value
from baseballStats

value

158

select DISTINCTCOUNTHLL(teamID, 12) AS value
from baseballStats

value

149

DISTINCTCOUNTHLLMV

This section contains reference documentation for the DISTINCTCOUNTHLLMV function.

Returns an approximate distinct count using HyperLogLog in a group.

Signature

DISTINCTCOUNTHLLMV(colName)

Usage Examples

DISTINCTCOUNTRAWHLLMV

This section contains reference documentation for the DISTINCTCOUNTRAWHLLMV function.

Returns HLL response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.

Signature

DISTINCTCOUNTRAWHLLMV(colName, log2m)

Usage Examples

These examples are based on the Hybrid Quick Start.

select DISTINCTCOUNTRAWHLLMV(DivAirports) AS value
from airlineStats 
where arraylength(DivAirports) > 1

value

00000008000000ac00000000000000000000000500000020000000000030000202108000040000010000000300010400000000000000000000000463000000000000000000010001041000200000002000000000000000000a00000000028001000000010800000000010000001008000000804000000000020000040000880000000000000000000000000000000000000000000000800000000800020004000000840000000002000000000000000000001400

select DISTINCTCOUNTRAWHLLMV(DivAirports, 1) AS value
from airlineStats 
where arraylength(DivAirports) > 1

value

0000000100000004000000e4

DIV

This section contains reference documentation for the DIV function.

Quotient of two values

Signature

DIV(col1, col2)

Usage Examples

These examples are based on the Batch Quick Start.

select homeRuns, numberOfGames, DIV(homeRuns, numberOfGames) AS total
from baseballStats 
WHERE teamID = 'ML1' 
AND yearID = 1956 
AND playerName = 'Henry Louis'

homeRuns | numberOfGames | total
26       | 153           | 0.16993464052287582

millisecond

This section contains reference documentation for the millisecond function.

Returns the millisecond of the second from the given epoch millis in UTC or specified timezone. The value ranges from 0 to 999.

Signature

millisecond(tsInMillis)
millisecond(tsInMillis, timeZoneId)

Usage Examples

select millisecond(1639351800000) AS millisecond
FROM ignoreMe

millisecond

select millisecond(1639351800000, 'America/St_Johns') AS millisecond
FROM ignoreMe

millisecond

FunnelMaxStep

The FunnelMaxStep function in Pinot is designed to track user progress through a predefined series of steps or stages in a funnel, such as user interactions on a website from page views to purchases. This function is particularly useful for analyzing how far users progress through a conversion process within a specified time window.

Syntax

FunnelMaxStep(
    timestampExpression, 
    windowSize, 
    numberSteps, stepExpression
    [, stepExpression[, stepExpression, ...]]
    [, mode [, mode, ... ]]
)

Return

This function returns an Integer: the highest funnel step reached within the window.

Arguments

timestampExpression:
- Type: Expression in TIMESTAMP or LONG
- Description: This is an expression that evaluates to the timestamp of each event. It's used to determine the order of events for a particular user or session. The timestamp is crucial for evaluating whether subsequent actions fall within the specified window.
windowSize:
- Type: LONG
- Description: Specifies the size of the time window in which the sequence of funnel steps must occur. The window is defined in milliseconds. This parameter sets the maximum allowed time between the first and the last step in the funnel for them to be considered as part of the same user journey.
numberSteps:
- Type: Integer
- Description: Defines the total number of distinct steps in the funnel. This count should match the number of stepExpression parameters provided.
stepExpression:
- Type: Boolean Expression
- Description: These are expressions that define each step in the funnel. Typically, these are conditions that evaluate whether a specific event type or action has occurred. Multiple step expressions are separated by commas, with each expression corresponding to a step in the funnel sequence.
mode (optional):
- Type: String
- Description: Defines additional modes or options that alter how the funnel analysis is calculated. Common modes might include settings to handle overlapping events, reset the window upon each step, or other custom behaviors specific to the needs of the funnel analysis. If unspecified, the default behavior as defined by Pinot is used.
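How these arguments interact can be sketched in Python. This is a simplified model of the default behaviour and not Pinot's implementation: events are plain dictionaries, step expressions are Python predicates, and an event exactly `windowSize` after the first step is treated as still inside the window, which is an assumption. All names are hypothetical.

```python
def funnel_max_step(events, window_size, steps):
    """Return how many funnel steps were completed, in order, inside one window."""
    ordered = sorted(events, key=lambda event: event["ts"])
    best = 0
    for index, first in enumerate(ordered):
        if not steps[0](first):
            continue  # a window can only open on an event matching step 1
        reached = 1
        for later in ordered[index + 1:]:
            if later["ts"] - first["ts"] > window_size:
                break  # outside the window opened by `first`
            if reached < len(steps) and steps[reached](later):
                reached += 1
        best = max(best, reached)
    return best


def is_type(name):
    """Build a step predicate that matches events of one type."""
    return lambda event: event["type"] == name


steps = [is_type("view"), is_type("search"), is_type("cart"), is_type("checkout")]
events = [
    {"ts": 0, "type": "view"},
    {"ts": 1_000, "type": "search"},
    {"ts": 5_000, "type": "cart"},
    {"ts": 20_000, "type": "checkout"},
]

print(funnel_max_step(events, 10_000, steps))  # 3, checkout falls outside the window
print(funnel_max_step(events, 30_000, steps))  # 4
```

Widening the window from 10 to 30 seconds lets the late checkout count, which is the effect windowSize has on the returned step.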

Supported Optional Modes

STRICT_DEDUPLICATION

The STRICT_DEDUPLICATION mode ensures that repeating occurrences of the same event condition within a funnel sequence disrupt further processing of the funnel for that user session. This mode is crucial when it's important to identify and measure unique, non-repeated actions in a sequence, ensuring each step of the funnel represents a distinct action.

Practical Impact

Event Sequence Interruption: When an event that satisfies a current step condition occurs repeatedly without progression to the next step, strict_deduplication interrupts and essentially ends the analysis of the funnel for that sequence. This prevents the funnel from incorrectly advancing if the same action is merely repeated instead of moving through the intended steps.
Enhanced Accuracy in Funnel Progression: This mode is useful for scenarios where the continuity and progression of distinct steps are critical for accurate conversion analysis. It avoids the misinterpretation of user engagement where repeated similar actions might otherwise suggest a false progression through the funnel.

Example

For instance, if a funnel is designed to track user progression from a homepage visit, to a search, to adding an item to a cart, and then to checkout, the strict_deduplication mode would stop processing the funnel sequence if the user performs multiple searches without proceeding to add an item to the cart. This ensures that only a linear, non-repetitive progression through these steps is considered as valid funnel movement.
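The example can be modelled with a small Python sketch. It reduces events to their type names, ignores the time window, and follows one reading of the description: a repeat of the most recently completed step stops the funnel. The function name and event labels are hypothetical.

```python
def max_step_strict_deduplication(event_types, funnel):
    """Max step reached when a repeated step event ends the funnel."""
    reached = 0
    for event in event_types:
        if reached == len(funnel):
            break
        if event == funnel[reached]:
            reached += 1  # expected next step
        elif reached > 0 and event == funnel[reached - 1]:
            break  # same step repeated without progressing
    return reached


funnel = ["home", "search", "cart", "checkout"]
print(max_step_strict_deduplication(
    ["home", "search", "search", "cart", "checkout"], funnel))  # 2
print(max_step_strict_deduplication(
    ["home", "search", "cart", "checkout"], funnel))            # 4
```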

This mode helps maintain the integrity of each step in the user's journey, ensuring that the data reflects true user behavior without overcounting repetitive actions that do not lead to actual progression.

STRICT_ORDER

The strict_order mode enforces a stringent sequence order for events within a funnel. This mode ensures that the progression through the steps follows the exact specified order without any intervening events that are not part of the defined sequence.

Behavior of `strict_order`

Sequence Adherence: The strict_order mode requires that the events occur in the exact order specified without any other types of events intervening. If an event occurs that is not the next expected step in the defined sequence, the analysis of the funnel for that user session is halted.
Early Termination: In the presence of an out-of-sequence event, the analysis stops, and the maximum event level is determined as the last correct step in the sequence before the interruption. For instance, in a specified sequence of A -> B -> C, if the sequence is A -> B -> D, then the funnel analysis terminates after B because D is not the expected next step (C).

Practical Impact

Enhanced Precision in Path Analysis: This mode is particularly valuable when the precise order of actions is critical for the analysis, such as in strict process flows where each step must be followed in a specific order to be considered successful.
Avoids Misinterpretation: It prevents the misinterpretation of funnel progress where intervening or unordered events could suggest a misleading path through the funnel.

Example

Consider a scenario where a funnel is set up to track user progression through the following steps: logging in (A), searching for products (B), adding a product to the cart (C), and completing a purchase (D). Using the strict_order mode, if the sequence goes A -> B -> E -> C, the analysis will terminate after B because E (an unexpected event like viewing account details) intervenes before C, the expected next step. Therefore, the maximum step reached is reported as 2, representing the successful completion of steps A and B only.
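The scenario above can be reproduced with a short Python sketch. It models events as single-letter labels and ignores the time window; the function name is hypothetical and the logic is a simplified reading of the mode, not Pinot's implementation.

```python
def max_step_strict_order(event_types, funnel):
    """Max step reached when any out-of-sequence event ends the funnel."""
    reached = 0
    for event in event_types:
        if reached == len(funnel):
            break
        if event == funnel[reached]:
            reached += 1  # expected next step
        elif reached > 0:
            break  # an intervening event after the funnel has started
    return reached


funnel = ["A", "B", "C", "D"]
print(max_step_strict_order(["A", "B", "E", "C"], funnel))  # 2
print(max_step_strict_order(["A", "B", "C", "D"], funnel))  # 4
```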

This mode is crucial for scenarios requiring strict compliance to process steps, ensuring that only users who follow the exact intended sequence are counted in the funnel analysis.

STRICT_INCREASE

The strict_increase mode is designed to ensure that the sequence of events being analyzed has strictly increasing timestamps. This mode is crucial for accurately tracking and analyzing user behavior in scenarios where the chronological order of events directly impacts the interpretation of user actions within a funnel.

Behavior of `strict_increase`

Timestamp Order: This mode requires that each subsequent event in the funnel must have a timestamp greater than the previous event. It ensures that the user's actions are not only in the correct sequence but also follow a temporal progression without any backtracking or simultaneous actions.
Analysis Integrity: If any event in the sequence does not adhere to the strictly increasing order by timestamp, the analysis for that sequence either stops at that point or ignores the out-of-order event, depending on how critical the temporal sequence is to the funnel's logic.

Practical Impact

Temporal Validation: This mode is particularly useful in scenarios where the timing of events is crucial, such as in sessions where actions must follow one another in real-time to be considered valid. It validates the sequence not just by the type of event, but also by ensuring that these events are progressively happening over time.
Avoiding Data Errors: It helps in avoiding potential data errors or anomalies where timestamps might not have been recorded correctly, or events may appear out of order due to system errors or delays in logging events.

Example

Consider a funnel designed to analyze a user's journey from visiting a website to making a purchase, defined by the following steps: page visit (A), item addition (B), checkout initiation (C), and payment completion (D). Using the strict_increase mode, the funnel will only consider sequences where each action occurs later than the previous. If a user's sequence is A (t1) -> B (t2) -> A (t3) -> C (t4) with t3 being less than or equal to t2, then the analysis will ignore the second occurrence of A or terminate, depending on the specific implementation and requirements of the analysis.

This mode helps ensure that the funnel analysis reflects true, linear progress through the intended actions, with each step occurring in a timely, sequential manner.

KEEP_ALL

The KEEP_ALL mode is designed to ensure that all events in the data set are considered in the analysis, even if they do not match any of the specified step conditions in the funnel sequence. This mode is particularly useful for comprehensive data analysis where the context of non-matching events may still provide valuable insights about user behavior or system performance.

Behavior of `KEEP_ALL`

Inclusive Analysis: In the KEEP_ALL mode, the funnel function includes every event within the specified time window in the analysis, regardless of whether these events correspond to the predefined steps in the funnel. This allows for a more holistic view of the user's actions during the session.
Context Retention: By including all events, this mode helps retain the full context of a user's session, capturing activities that may not be directly related to the funnel but could influence or explain the user's behavior and decisions at other points.

Practical Impact

Enhanced Insight: This mode is invaluable for scenarios where understanding the entirety of user interactions is crucial, such as in complex user journeys where additional actions between the main funnel steps might influence the outcomes or indicate other patterns of interest.
Data Completeness: It prevents data loss from filtering out non-matching events, which can be important when analyzing sessions for comprehensive patterns, troubleshooting issues, or performing detailed user journey analysis.

Example

Consider a scenario where a funnel is set up to track user progress through steps like logging in, searching for a product, and making a purchase. With KEEP_ALL mode enabled, if a user performs additional actions such as updating profile information or viewing terms and conditions, these events are also included in the analysis. This comprehensive inclusion allows analysts to see a fuller picture of what the user did during their session, not just the actions that directly relate to the funnel. This can reveal if other activities are detracting from the main conversion goals, or if they are part of a broader user engagement that doesn't neatly fit into the primary funnel steps.

This mode helps to ensure that no potential insights are lost by excluding events, making it a powerful option for detailed analysis and understanding of user interactions beyond the strict confines of the predefined funnel steps.
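The difference can be sketched in Python. The sketch assumes that keep_all matters mainly in combination with strict_order: without it, events that match no step are dropped before the order check, and with it they stay in the sequence and can interrupt it. The helper names and the use of -1 for non-matching events are illustrative assumptions, not Pinot's implementation.

```python
def max_step_strict_order(steps, num_steps):
    """Deepest step reached when any out-of-sequence event ends the analysis.

    steps: step index per event in time order; -1 marks an event that
    matches none of the funnel steps (an illustrative convention).
    """
    max_step = 0
    for step in steps:
        if step != max_step:
            break  # unexpected event: stop here
        max_step += 1
        if max_step == num_steps:
            break
    return max_step


def funnel(steps, num_steps, keep_all=False):
    if not keep_all:
        # Assumed default: events that match no step are filtered out first.
        steps = [s for s in steps if s != -1]
    return max_step_strict_order(steps, num_steps)


# login (0), update profile (matches no step), search (1), purchase (2)
session = [0, -1, 1, 2]
print(funnel(session, 3))                 # 3: non-matching event dropped
print(funnel(session, 3, keep_all=True))  # 1: non-matching event kept
```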

Examples

Data Set

event_name      ts          user_id
screen_viewed   1718112402
screen_clicked  1718112403
purchased       1718112404
screen_viewed   1718112405
screen_clicked  1718112406
purchased       1718112407
screen_viewed   1718112405
screen_clicked  1718112406
purchased       1718112407
screen_viewed   1718112404
screen_clicked  1718112405
cart_viewed     1718112406
purchased       1718112407
screen_viewed   1717939609
screen_clicked  1718112405
purchased       1718112405

Queries

Query funnels

SELECT user_id,
  funnelMaxStep(
    ts,
    '1000000',
    4,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'cart_viewed',
    event_name = 'purchased'
  ) as steps
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

steps

Query with strict_order

SELECT user_id,
  funnelMaxStep(
    ts,
    '100000',
    3,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'purchased',
    'strict_order'
  ) as steps
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

steps

Query with strict_order and keep_all

SELECT user_id,
  funnelMaxStep(
    ts,
    '100000',
    3,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'purchased',
    'strict_order',
    'keep_all'
  ) as steps
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

steps

Query with longer window

SELECT user_id,
  funnelMaxStep(
    ts,
    '1000000',
    3,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'purchased',
    'strict_order'
  ) as steps
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

steps
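The effect of the window size can be checked by hand against the data set. The sketch below assumes that the last three rows (timestamps 1717939609, 1718112405, 1718112405) belong to one user. It only compares the span between the first and last event with the two window sizes used in the strict_order queries above.

```python
# Timestamps taken from the last three rows of the data set; that they
# belong to a single user is an assumption made for illustration.
first_ts = 1717939609   # screen_viewed
last_ts = 1718112405    # purchased

span = last_ts - first_ts
print(span)  # 172796

for window in (100_000, 1_000_000):
    # The funnel can only be completed if the span fits inside the window.
    print(window, span <= window)
```

Under this assumption the span of 172796 exceeds the window of 100000 but fits within 1000000, which is why the longer window can report more steps for such a user.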

FunnelCompleteCount

The FunnelCompleteCount function in Pinot is designed to track user progress through a predefined series of steps or stages in a funnel, such as user interactions on a website from page views to purchases. This function is particularly useful for analyzing how many times users progress through the whole conversion process within a specified time window.

Syntax

FunnelCompleteCount(
    timestampExpression, 
    windowSize, 
    numberSteps, stepExpression
    [, stepExpression[, stepExpression, ...]]
    [, mode [, mode, ... ]]
)

Return

This function returns the number of times the whole funnel was completed.
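A minimal Python sketch of this kind of count follows. It assumes a simple greedy rule: a pass starts at a step-0 event and must reach the last step no more than the window size after it started. The function name `complete_count` and the rule itself are illustrative assumptions and may differ from Pinot's windowing logic.

```python
def complete_count(events, window, num_steps):
    """Count how many times the whole funnel is completed.

    events: (timestamp, step_index) tuples sorted by timestamp.
    A pass starts at a step-0 event and must reach the last step no more
    than `window` time units after it started (a simplifying assumption).
    """
    rounds = 0
    next_step = 0
    start_ts = None
    for ts, step in events:
        if next_step > 0 and ts - start_ts > window:
            next_step = 0  # window expired: abandon the current pass
        if step == next_step:
            if step == 0:
                start_ts = ts
            next_step += 1
        if next_step == num_steps:
            rounds += 1
            next_step = 0
    return rounds


# First six rows of the example data set, assumed to belong to one user:
# screen_viewed=0, screen_clicked=1, purchased=2.
events = [
    (1718112402, 0), (1718112403, 1), (1718112404, 2),
    (1718112405, 0), (1718112406, 1), (1718112407, 2),
]
print(complete_count(events, 1_000_000, 3))  # 2
```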

Arguments

timestampExpression:
- Type: Expression in TIMESTAMP or LONG
- Description: This is an expression that evaluates to the timestamp of each event. It's used to determine the order of events for a particular user or session. The timestamp is crucial for evaluating whether subsequent actions fall within the specified window.
windowSize:
- Type: LONG
- Description: Specifies the size of the time window in which the sequence of funnel steps must occur. The window is defined in milliseconds. This parameter sets the maximum allowed time between the first and the last step in the funnel for them to be considered as part of the same user journey.
numberSteps:
- Type: Integer
- Description: Defines the total number of distinct steps in the funnel. This count should match the number of stepExpression parameters provided.
stepExpression:
- Type: Boolean Expression
- Description: These are expressions that define each step in the funnel. Typically, these are conditions that evaluate whether a specific event type or action has occurred. Multiple step expressions are separated by commas, with each expression corresponding to a step in the funnel sequence.
mode (optional):
- Type: String
- Description: One or more mode strings that alter how the funnel analysis is calculated. The supported modes are strict_deduplication, strict_order, strict_increase, and keep_all, each described below. If unspecified, the default behavior as defined by Pinot is used.

Optional Mode Supported

STRICT_DEDUPLICATION

Practical Impact

Event Sequence Interruption: When an event that satisfies a current step condition occurs repeatedly without progression to the next step, strict_deduplication interrupts and essentially ends the analysis of the funnel for that sequence. This prevents the funnel from incorrectly advancing if the same action is merely repeated instead of moving through the intended steps.
Enhanced Accuracy in Funnel Progression: This mode is useful for scenarios where the continuity and progression of distinct steps are critical for accurate conversion analysis. It avoids the misinterpretation of user engagement where repeated similar actions might otherwise suggest a false progression through the funnel.

Example
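A minimal Python sketch of one reading of strict_deduplication, expressed in terms of the deepest step reached: a repeat of the most recently matched step ends the analysis. The function name and the step encoding are assumptions for illustration, not Pinot's implementation.

```python
def max_step_strict_dedup(steps, num_steps):
    """Deepest step reached; a repeat of the most recently matched step
    ends the analysis (one reading of strict_deduplication).

    steps: step index per event, in time order.
    """
    max_step = 0
    for step in steps:
        if max_step > 0 and step == max_step - 1:
            return max_step  # same step repeated: stop
        if step == max_step:
            max_step += 1
        if max_step == num_steps:
            break
    return max_step


# A=0, B=1, C=2
print(max_step_strict_dedup([0, 1, 2], 3))     # 3: A -> B -> C
print(max_step_strict_dedup([0, 1, 1, 2], 3))  # 2: A -> B -> B -> C
```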

STRICT_ORDER

Behavior of `strict_order`

Sequence Adherence: The strict_order mode requires that the events occur in the exact order specified without any other types of events intervening. If an event occurs that is not the next expected step in the defined sequence, the analysis of the funnel for that user session is halted.
Early Termination: In the presence of an out-of-sequence event, the analysis stops, and the maximum event level is determined as the last correct step in the sequence before the interruption. For instance, in a specified sequence of A -> B -> C, if the sequence is A -> B -> D, then the funnel analysis terminates after B because D is not the expected next step (C).

Practical Impact

Enhanced Precision in Path Analysis: This mode is particularly valuable when the precise order of actions is critical for the analysis, such as in strict process flows where each step must be followed in a specific order to be considered successful.
Avoids Misinterpretation: It prevents the misinterpretation of funnel progress where intervening or unordered events could suggest a misleading path through the funnel.

Example
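The early-termination rule can be sketched in Python by contrasting the default behavior with strict_order. The sketch assumes a four-step funnel A -> B -> C -> D and an observed sequence A -> B -> D -> C; the function name `max_step` and the integer step encoding are illustrative assumptions.

```python
def max_step(steps, num_steps, strict_order=False):
    """Deepest funnel step reached.

    steps: step index per event in time order. Events that are not the
    expected next step are skipped by default and end the analysis
    under strict_order.
    """
    reached = 0
    for step in steps:
        if step == reached:
            reached += 1
            if reached == num_steps:
                break
        elif strict_order:
            break  # unexpected event: stop at the last correct step
    return reached


# A=0, B=1, C=2, D=3; observed order is A -> B -> D -> C.
sequence = [0, 1, 3, 2]
print(max_step(sequence, 4))                     # 3: D is skipped, C still counts
print(max_step(sequence, 4, strict_order=True))  # 2: stops at D, after B
```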

This mode is crucial for scenarios requiring strict compliance with process steps, ensuring that only users who follow the exact intended sequence are counted in the funnel analysis.

STRICT_INCREASE

Behavior of `strict_increase`

Timestamp Order: This mode requires that each subsequent event in the funnel must have a timestamp greater than the previous event. It ensures that the user's actions are not only in the correct sequence but also follow a temporal progression without any backtracking or simultaneous actions.
Analysis Integrity: If any event in the sequence does not adhere to the strictly increasing order by timestamp, the analysis for that sequence either stops at that point or ignores the out-of-order event, depending on how critical the temporal sequence is to the funnel's logic.

Practical Impact

Temporal Validation: This mode is particularly useful in scenarios where the timing of events is crucial, such as in sessions where actions must follow one another in real-time to be considered valid. It validates the sequence not just by the type of event, but also by ensuring that these events are progressively happening over time.
Avoiding Data Errors: It helps in avoiding potential data errors or anomalies where timestamps might not have been recorded correctly, or events may appear out of order due to system errors or delays in logging events.

Example

This mode helps ensure that the funnel analysis reflects true, linear progress through the intended actions, with each step occurring in a timely, sequential manner.

KEEP_ALL

Behavior of `KEEP_ALL`

Inclusive Analysis: In the KEEP_ALL mode, the funnel function includes every event within the specified time window in the analysis, regardless of whether these events correspond to the predefined steps in the funnel. This allows for a more holistic view of the user's actions during the session.
Context Retention: By including all events, this mode helps retain the full context of a user's session, capturing activities that may not be directly related to the funnel but could influence or explain the user's behavior and decisions at other points.

Practical Impact

Enhanced Insight: This mode is invaluable for scenarios where understanding the entirety of user interactions is crucial, such as in complex user journeys where additional actions between the main funnel steps might influence the outcomes or indicate other patterns of interest.
Data Completeness: It prevents data loss from filtering out non-matching events, which can be important when analyzing sessions for comprehensive patterns, troubleshooting issues, or performing detailed user journey analysis.

Example

Examples

Data Set

event_name      ts          user_id
screen_viewed   1718112402
screen_clicked  1718112403
purchased       1718112404
screen_viewed   1718112405
screen_clicked  1718112406
purchased       1718112407
screen_viewed   1718112405
screen_clicked  1718112406
purchased       1718112407
screen_viewed   1718112404
screen_clicked  1718112405
cart_viewed     1718112406
purchased       1718112407
screen_viewed   1717939609
screen_clicked  1718112405
purchased       1718112405

Queries

Query funnels

SELECT user_id,
  funnelCompleteCount(
    ts,
    '1000000',
    4,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'cart_viewed',
    event_name = 'purchased'
  ) as rounds
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

rounds

Query with strict_order

SELECT user_id,
  funnelCompleteCount(
    ts,
    '1000000',
    3,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'purchased',
    'strict_order'
  ) as rounds
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

rounds

Query with strict_order and keep_all

SELECT user_id,
  funnelCompleteCount(
    ts,
    '100000',
    3,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'purchased',
    'strict_order',
    'keep_all'
  ) as rounds
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

rounds

Query with longer window

SELECT user_id,
  funnelCompleteCount(
    ts,
    '1000000',
    3,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'purchased',
    'strict_order'
  ) as rounds
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

rounds

FunnelMatchStep

The FunnelMatchStep function in Pinot is designed to track user progress through a predefined series of steps or stages in a funnel, such as user interactions on a website from page views to purchases. This function is particularly useful for analyzing how far users progress through a conversion process within a specified time window.

Syntax

FunnelMatchStep(
    timestampExpression, 
    windowSize, 
    numberSteps, stepExpression
    [, stepExpression[, stepExpression, ...]]
    [, mode [, mode, ... ]]
)

Return

This function is similar to FunnelMaxStep, but instead of returning the maximum step reached, it returns an array of size numberSteps in which matched steps are marked as 1 and non-matching steps as 0.

E.g.

numberSteps = 3, maxStep = 0 -> [0, 0, 0]
numberSteps = 4, maxStep = 2 -> [1, 1, 0, 0]

Arguments

timestampExpression:
- Type: Expression in TIMESTAMP or LONG
- Description: This is an expression that evaluates to the timestamp of each event. It's used to determine the order of events for a particular user or session. The timestamp is crucial for evaluating whether subsequent actions fall within the specified window.
windowSize:
- Type: LONG
- Description: Specifies the size of the time window in which the sequence of funnel steps must occur. The window is defined in milliseconds. This parameter sets the maximum allowed time between the first and the last step in the funnel for them to be considered as part of the same user journey.
numberSteps:
- Type: Integer
- Description: Defines the total number of distinct steps in the funnel. This count should match the number of stepExpression parameters provided.
stepExpression:
- Type: Boolean Expression
- Description: These are expressions that define each step in the funnel. Typically, these are conditions that evaluate whether a specific event type or action has occurred. Multiple step expressions are separated by commas, with each expression corresponding to a step in the funnel sequence.
mode (optional):
- Type: String
- Description: One or more mode strings that alter how the funnel analysis is calculated. The supported modes are strict_deduplication, strict_order, strict_increase, and keep_all, each described below. If unspecified, the default behavior as defined by Pinot is used.

Optional Mode Supported

STRICT_DEDUPLICATION

Practical Impact

Event Sequence Interruption: When an event that satisfies a current step condition occurs repeatedly without progression to the next step, strict_deduplication interrupts and essentially ends the analysis of the funnel for that sequence. This prevents the funnel from incorrectly advancing if the same action is merely repeated instead of moving through the intended steps.
Enhanced Accuracy in Funnel Progression: This mode is useful for scenarios where the continuity and progression of distinct steps are critical for accurate conversion analysis. It avoids the misinterpretation of user engagement where repeated similar actions might otherwise suggest a false progression through the funnel.

Example

STRICT_ORDER

Behavior of `strict_order`

Sequence Adherence: The strict_order mode requires that the events occur in the exact order specified without any other types of events intervening. If an event occurs that is not the next expected step in the defined sequence, the analysis of the funnel for that user session is halted.
Early Termination: In the presence of an out-of-sequence event, the analysis stops, and the maximum event level is determined as the last correct step in the sequence before the interruption. For instance, in a specified sequence of A -> B -> C, if the sequence is A -> B -> D, then the funnel analysis terminates after B because D is not the expected next step (C).

Practical Impact

Enhanced Precision in Path Analysis: This mode is particularly valuable when the precise order of actions is critical for the analysis, such as in strict process flows where each step must be followed in a specific order to be considered successful.
Avoids Misinterpretation: It prevents the misinterpretation of funnel progress where intervening or unordered events could suggest a misleading path through the funnel.

Example

This mode is crucial for scenarios requiring strict compliance with process steps, ensuring that only users who follow the exact intended sequence are counted in the funnel analysis.

STRICT_INCREASE

Behavior of `strict_increase`

Timestamp Order: This mode requires that each subsequent event in the funnel must have a timestamp greater than the previous event. It ensures that the user's actions are not only in the correct sequence but also follow a temporal progression without any backtracking or simultaneous actions.
Analysis Integrity: If any event in the sequence does not adhere to the strictly increasing order by timestamp, the analysis for that sequence either stops at that point or ignores the out-of-order event, depending on how critical the temporal sequence is to the funnel's logic.

Practical Impact

Temporal Validation: This mode is particularly useful in scenarios where the timing of events is crucial, such as in sessions where actions must follow one another in real-time to be considered valid. It validates the sequence not just by the type of event, but also by ensuring that these events are progressively happening over time.
Avoiding Data Errors: It helps in avoiding potential data errors or anomalies where timestamps might not have been recorded correctly, or events may appear out of order due to system errors or delays in logging events.

Example

This mode helps ensure that the funnel analysis reflects true, linear progress through the intended actions, with each step occurring in a timely, sequential manner.

KEEP_ALL

Behavior of `KEEP_ALL`

Inclusive Analysis: In the KEEP_ALL mode, the funnel function includes every event within the specified time window in the analysis, regardless of whether these events correspond to the predefined steps in the funnel. This allows for a more holistic view of the user's actions during the session.
Context Retention: By including all events, this mode helps retain the full context of a user's session, capturing activities that may not be directly related to the funnel but could influence or explain the user's behavior and decisions at other points.

Practical Impact

Enhanced Insight: This mode is invaluable for scenarios where understanding the entirety of user interactions is crucial, such as in complex user journeys where additional actions between the main funnel steps might influence the outcomes or indicate other patterns of interest.
Data Completeness: It prevents data loss from filtering out non-matching events, which can be important when analyzing sessions for comprehensive patterns, troubleshooting issues, or performing detailed user journey analysis.

Example

Examples

Data Set

event_name      ts          user_id
screen_viewed   1718112402
screen_clicked  1718112403
purchased       1718112404
screen_viewed   1718112405
screen_clicked  1718112406
purchased       1718112407
screen_viewed   1718112405
screen_clicked  1718112406
purchased       1718112407
screen_viewed   1718112404
screen_clicked  1718112405
cart_viewed     1718112406
purchased       1718112407
screen_viewed   1717939609
screen_clicked  1718112405
purchased       1718112405

Queries

Query funnels

SELECT user_id,
  funnelMatchStep(
    ts,
    '1000000',
    4,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'cart_viewed',
    event_name = 'purchased'
  ) as matchedsteps
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id

Response

user_id

matchedSteps

[1, 1, 0, 0]

[1, 1, 1, 1]

[1, 1, 0, 0]

Query with funnel count analysis

The query below wraps the query above in a CTE, then uses sumArrayLong to show the funnel transitions for each step.

WITH funnelMatchSteps AS (
  SELECT user_id,
  funnelMatchStep(
    ts,
    '1000000',
    4,
    event_name = 'screen_viewed',
    event_name = 'screen_clicked',
    event_name = 'cart_viewed',
    event_name = 'purchased'
  ) as matchedsteps
  FROM clickstreamFunnel
  GROUP BY user_id
)

SELECT sumArrayLong(matchedsteps) as funnelCounts FROM funnelMatchSteps

Response

funnelCounts

[4, 4, 1, 1]
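The aggregation is an element-wise sum of the per-user arrays. A minimal Python sketch follows; the four input arrays are hypothetical, chosen only to be consistent with the funnelCounts result above.

```python
# Hypothetical per-user arrays, chosen to be consistent with the
# funnelCounts result above; the real rows depend on the data set.
matched = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

# Element-wise sum across users: one total per funnel step.
funnel_counts = [sum(column) for column in zip(*matched)]
print(funnel_counts)  # [4, 4, 1, 1]
```

Each position in the result is the number of users who matched that step, so the array reads as a funnel from the first step to the last.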

FUNNELCOUNT

This section contains reference documentation for the FUNNELCOUNT function.

Funnel analytics aggregation function.

Returns an array of distinct correlated counts for each funnel step.

Signature

FUNNEL_COUNT (
STEPS ( predicate1, predicate2 ... ),
CORRELATE_BY ( correlation_column ),
SETTINGS ( setting1, setting2 ... ) )

Parameters

STEPS:
- Arguments: predicates 1...n
- Description: (required) Individual predicates representing funnel steps, applied to the rows selected by the where clause. Distinct values from the correlation_column that satisfy these predicates are counted per step. For example, all filtered rows that match url = '/checkout' are unioned into a set. Each set is intersected with the sets resulting from the preceding steps, so each step retains only individuals present in the previous steps. Finally, unique counts are returned for each step in the funnel.
CORRELATE_BY:
- Arguments: correlation_column
- Description: (required) Column to use for funnel correlation. Distinct values from this column are counted per step during aggregation. Only dictionary-encoded columns are supported.
SETTINGS:
- Arguments: settings 1...n
- Description: (optional) Settings to select and configure a funnel counting strategy:
  - nominalEntries: theta-sketch strategy parameter (defaults to 4096). Can only be used in conjunction with the theta_sketch setting.
  - sorted: This strategy counts funnel steps per segment with zero memory footprint. The correlation column should be configured as the sort column for this strategy. Can only be used in conjunction with the partitioned setting.

Usage Examples

Many datasets are time series in nature, tracking events of an entity over time. An example of such a dataset could be a user analytics activity log from a commerce web application.

Example

user_id  event_time               url
         2021-10-01 09:01:00.000  /product/listing
         2021-10-01 09:17:00.000  /product/search
         2021-10-01 09:33:00.000  /product/details
         2021-10-01 09:47:00.000  /cart/add
         2021-10-01 10:02:00.000  /product/listing
         2021-10-01 10:05:00.000  /product/search
         2021-10-01 10:06:00.000  /product/search
         2021-10-01 10:15:00.000  /checkout/start
         2021-10-01 10:16:00.000  /cart/add
         2021-10-01 11:17:00.000  /product/details
         2021-10-01 11:18:00.000  /checkout/confirmation
         2021-10-01 11:21:00.000  /cart/add
         2021-10-01 11:33:00.000  /cart/add
         2021-10-01 11:46:00.000  /checkout/start
         2021-10-01 11:54:00.000  /checkout/confirmation

Funnel

We want to analyse the following checkout funnel:

/cart/add
/checkout/start
/checkout/confirmation

Counts

We want to answer the following questions about the above funnel:

How many users entered the top of the funnel?
How many of these users proceeded to the second step?
How many users reached the bottom of the funnel after completing all steps?

Query

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATE_BY(user_id)
  ) AS counts
from user_log

counts

3, 2, 2

Notes
Although user U1 added to cart twice, this still counts as one conversion in the first step, because we report unique counts rather than total events. Also, although U2's events were logged out of order, the user is still counted as converted.

Equivalence

The above query is equivalent to the following Presto SQL query:

select 
   ARRAY[
     count_if(steps[1]),
     count_if(steps[1] and steps[2]),
     count_if(steps[1] and steps[2] and steps[3])
   ] as counts
 from (
   select 
     ARRAY[
       bool_or(url = '/cart/add'),
       bool_or(url = '/checkout/start'),
       bool_or(url = '/checkout/confirmation')
     ] as steps
   from user_log
   group by user_id
 )
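The same set logic can be sketched in Python: build one set of users per step, intersect each step with the preceding ones, and count what remains. The rows and user ids below are hypothetical, chosen to be consistent with the notes above (U1 adds to cart twice, U2's events are out of order); this illustrates the described semantics and is not Pinot's implementation.

```python
from collections import defaultdict

# Hypothetical (user_id, url) rows; the user ids are illustrative and the
# rows are chosen to be consistent with the notes above.
rows = [
    ("U1", "/cart/add"), ("U1", "/cart/add"),
    ("U1", "/checkout/start"), ("U1", "/checkout/confirmation"),
    ("U2", "/checkout/start"), ("U2", "/cart/add"),
    ("U2", "/checkout/confirmation"),
    ("U3", "/cart/add"),
]
steps = ["/cart/add", "/checkout/start", "/checkout/confirmation"]

# One set of users per step (the union part of the description).
users_by_url = defaultdict(set)
for user_id, url in rows:
    users_by_url[url].add(user_id)

# Intersect each step with all preceding steps and count what is left.
counts = []
remaining = None
for url in steps:
    step_users = users_by_url[url]
    remaining = step_users if remaining is None else remaining & step_users
    counts.append(len(remaining))

print(counts)  # [3, 2, 2]
```

Because the sets ignore both repetition and event order, the duplicate /cart/add for U1 and the out-of-order events for U2 do not change the counts.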

Settings

For a large dataset, we could use, for example, the theta_sketch strategy or, going further, partition the data by user_id and use the partitioned strategy. It is also important to filter in the where clause so that only the necessary rows are aggregated.

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATE_BY(user_id),
    SETTINGS('theta_sketch', 'nominalEntries=4096')
  ) AS counts
from user_log 
where url in ('/cart/add', '/checkout/start', '/checkout/confirmation')

counts

3, 2, 2

Another Example

We now want to learn how many users check out after a text search, as opposed to other entry points such as browsing a product category listing. We then want to analyse the following funnel:

/product/search
/cart/add
/checkout/start
/checkout/confirmation

Query

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/product/search',
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATE_BY(user_id)
  ) AS counts
from user_log

counts

2, 2, 1, 1

Notes
Notice that U1 is not counted in this funnel, as the user did not perform any product search. Both U2 and U3 entered the top of the funnel and performed the second step, but only U2 converted to the bottom of the funnel.