DISTINCTCOUNTTHETASKETCH
This section contains reference documentation for the DISTINCTCOUNTTHETASKETCH function.
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot uses the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:4.2.0 both to perform distinct counting and to evaluate set operations.
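For some intuition on how a sketch can estimate cardinality, the following Python sketch implements a simplified K-Minimum-Values (KMV) estimator, the idea that theta sketches generalize. It is an illustration only, not the DataSketches implementation that Pinot uses. The names hash_to_unit and kmv_estimate, the choice of SHA-256, and the (k - 1) / theta estimator are assumptions made for this example.

```python
import hashlib
import heapq


def hash_to_unit(value):
    """Map a value to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k):
    """Estimate the number of distinct values from the k smallest hashes.

    For clarity this builds the set of all distinct hashes first; a real
    sketch would retain only about k hashes while streaming over the data.
    """
    distinct_hashes = {hash_to_unit(v) for v in values}
    smallest = heapq.nsmallest(k, distinct_hashes)
    if len(smallest) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return len(smallest)
    theta = smallest[-1]  # the k-th smallest hash value
    return (k - 1) / theta


# 50,000 rows containing 10,000 distinct values (hypothetical data).
rows = [i % 10_000 for i in range(50_000)]
print(kmv_estimate(rows, k=256))
```

The estimate depends only on the k smallest hash values, which is why the memory used by such a sketch is bounded by k rather than by the number of rows.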
Signature
distinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> Long
thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Parameters for constructing the intermediate theta sketches. Currently, the only supported parameter is nominalEntries (defaults to 4096).
predicates (optional): Individual predicates of the form lhs <op> rhs, which are applied to the rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn for rows that satisfy these predicates are unioned individually, giving one intermediate sketch per predicate. For example, all filtered rows that match country=USA are unioned into a single sketch. Complex predicates created by combining individual predicates with AND/OR are supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. The currently supported operations are SET_DIFF, SET_UNION, and SET_INTERSECT. SET_DIFF requires two arguments, while SET_UNION and SET_INTERSECT also accept more than two.
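The way predicates and the post-aggregation expression interact can be modelled with exact Python sets in place of sketches. This is a sketch of the semantics described above, not of Pinot's implementation: the function names, the rows, and the columns userId, country, and device are all hypothetical, and a real query returns an approximate count rather than an exact one.

```python
def set_diff(a, b):
    """Model of SET_DIFF: takes exactly two sets."""
    return a - b


def set_union(*sets):
    """Model of SET_UNION: takes two or more sets."""
    return set().union(*sets)


def set_intersect(first, *rest):
    """Model of SET_INTERSECT: takes two or more sets."""
    return first.intersection(*rest)


def distinct_count_model(rows, column, predicates, post_aggregation):
    """Build one set per predicate, then evaluate the set expression."""
    per_predicate = [
        {row[column] for row in rows if predicate(row)}
        for predicate in predicates
    ]
    return len(post_aggregation(*per_predicate))


# Hypothetical rows, assumed to be already filtered by the where clause.
rows = [
    {"userId": 1, "country": "USA", "device": "mobile"},
    {"userId": 2, "country": "USA", "device": "desktop"},
    {"userId": 3, "country": "CAN", "device": "mobile"},
    {"userId": 1, "country": "USA", "device": "desktop"},
]
predicates = [
    lambda r: r["country"] == "USA",    # plays the role of $1
    lambda r: r["device"] == "mobile",  # plays the role of $2
]

print(distinct_count_model(rows, "userId", predicates, set_intersect))  # 1
print(distinct_count_model(rows, "userId", predicates, set_union))      # 3
print(distinct_count_model(rows, "userId", predicates, set_diff))       # 1
```

Each predicate produces its own intermediate set, and the final expression combines those sets by position, which mirrors how $1 and $2 refer to the first and second predicate in the query form.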
Usage Examples
These examples are based on the Batch Quick Start.
select distinctCountThetaSketch(teamID) AS value
from baseballStats

value
149

select distinctCountThetaSketch(teamID, 'nominalEntries=10') AS value
from baseballStats

value
146
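In the two results above, the estimate moves from 149 to 146 when nominalEntries is lowered to 10. A commonly quoted rule of thumb for sketches of this family is that the relative standard error is roughly 1 / sqrt(k), where k is the number of retained entries. The Python snippet below applies that rule. It is an approximation assumed for illustration, not a guarantee stated on this page, and the function name is hypothetical.

```python
import math


def approx_relative_standard_error(nominal_entries):
    """Rule-of-thumb relative standard error, roughly 1 / sqrt(k).

    Assumption: this only applies once the number of distinct values
    exceeds the number of entries the sketch retains.
    """
    return 1.0 / math.sqrt(nominal_entries)


for k in (16, 1024, 4096):
    error = approx_relative_standard_error(k)
    print(f"nominalEntries={k}: roughly {error:.2%} relative standard error")
```

Larger nominalEntries values trade memory for accuracy, so very small values are mainly useful for demonstrating the effect.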
We can also provide predicates and a post-aggregation expression to compute more complicated cardinalities. For example, we can count the yearID values that appear in the results of both of the following queries:
select yearID
from baseballStats
where teamID = 'SFN' AND numberOfGames = 28 AND homeRuns = 1

yearID
1986
1985

select yearID
from baseballStats
where teamID = 'CHN' AND numberOfGames = 28 AND homeRuns = 1

yearID
1937
2003
1979
1900
1986
1978
2012
(The yearID 1986 is the only value common to both result sets.)
The following query computes the size of that intersection:
select distinctCountThetaSketch(
  yearID,
  'nominalEntries=4096',
  'teamID = ''SFN'' AND numberOfGames=28 AND homeRuns=1',
  'teamID = ''CHN'' AND numberOfGames=28 AND homeRuns=1',
  'SET_INTERSECT($1, $2)'
) AS value
from baseballStats

value
1
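As a cross-check of the result above, the intersection can be recomputed by hand from the two yearID result sets listed earlier. The Python snippet below uses exact sets, so it shows the value the sketch-based query is estimating rather than how Pinot computes it. The variable names are chosen for this example.

```python
# yearID values returned by the two filter queries shown earlier.
sfn_years = {1986, 1985}
chn_years = {1937, 2003, 1979, 1900, 1986, 1978, 2012}

# SET_INTERSECT($1, $2) corresponds to a plain set intersection here.
common_years = sfn_years & chn_years

print(common_years)       # {1986}
print(len(common_years))  # 1
```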