DISTINCTCOUNTRAWTHETASKETCH
This section contains reference documentation for the DISTINCTCOUNTRAWTHETASKETCH function.
The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot uses the Sketch class and its extensions from the org.apache.datasketches:datasketches-java:4.2.0 library to perform distinct counting and to evaluate set operations.
Signature
distinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded
thetaSketchColumn (required): Name of the column to aggregate on.
thetaSketchParams (required): Semicolon-separated parameter string for constructing the intermediate theta sketches. The supported parameters are:
  nominalEntries: The nominal entries used to create the sketch. (Default 4096)
  samplingProbability: Sets the upfront uniform sampling probability, p. (Default 1.0)
  accumulatorThreshold: How many sketches should be kept in memory before merging. (Default 2)
predicates (optional): Individual predicates of the form lhs <op> rhs, which are applied to the rows selected by the where clause. During intermediate sketch aggregation, the sketches from thetaSketchColumn that satisfy each predicate are unioned separately. For example, all filtered rows that match country=USA are unioned into a single sketch. Complex predicates built by combining individual predicates with AND/OR are supported.
postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. The currently supported operations are SET_DIFF, SET_UNION, and SET_INTERSECT. SET_DIFF requires two arguments, while SET_UNION and SET_INTERSECT accept two or more.
Usage Examples
These examples are based on the Batch Quick Start.
select distinctCountRawThetaSketch(teamID) AS value
from baseballStats

value
AgMDAAAKzJOVAAAAAACAPwDAATj...
select distinctCountRawThetaSketch(teamID, 'nominalEntries=10') AS value
from baseballStats

value
AwMDAAAKzJMQAAAAAACAP4vpfPBbbQsO5N1zYV2c...
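Because thetaSketchParams is a semicolon-separated string, several parameters can be passed in one argument. The query below is a minimal sketch of that form, assuming the Pinot version in use accepts the samplingProbability parameter listed in the signature section. The values 1024 and 0.5 are illustrative only and are not taken from the Batch Quick Start.

```sql
-- Illustrative values only: 1024 nominal entries and an upfront sampling probability of 0.5.
-- Assumes key=value pairs joined by ';', as described for thetaSketchParams.
select distinctCountRawThetaSketch(
  teamID,
  'nominalEntries=1024;samplingProbability=0.5'
) AS value
from baseballStats
```

In general, a larger nominalEntries value gives a more accurate sketch at the cost of a larger serialized result, while a samplingProbability below 1.0 trades accuracy for a smaller memory footprint.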
We can also provide predicates and a post aggregation expression to compute more complicated cardinalities:
select distinctCountRawThetaSketch(
  yearID,
  'nominalEntries=4096',
  'teamID = ''SFN'' AND numberOfGames=28 AND homeRuns=1',
  'teamID = ''CHN'' AND numberOfGames=28 AND homeRuns=1',
  'SET_INTERSECT($1, $2)'
) AS value
from baseballStats

value
AQMDAAA6zJN8QPYIsvHMNQ==
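The other set operations follow the same pattern. As a sketch that reuses the table, columns, and predicates from the query above, the following would return the raw sketch of the yearID values matched by the first predicate but not by the second. No expected output is shown because it depends on the loaded data.

```sql
-- $1 and $2 refer to the first and second predicate, in the order they are listed.
-- SET_DIFF($1, $2) keeps values present in the $1 sketch and absent from the $2 sketch.
select distinctCountRawThetaSketch(
  yearID,
  'nominalEntries=4096',
  'teamID = ''SFN'' AND numberOfGames=28 AND homeRuns=1',
  'teamID = ''CHN'' AND numberOfGames=28 AND homeRuns=1',
  'SET_DIFF($1, $2)'
) AS value
from baseballStats
```

Replacing SET_DIFF with SET_UNION would instead combine both sketches. Unlike SET_DIFF, SET_UNION and SET_INTERSECT can take further predicate references such as $3 when more predicates are supplied.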