DISTINCTCOUNTRAWTHETASKETCH
This section contains reference documentation for the DISTINCTCOUNTRAWTHETASKETCH function.
The Theta Sketch framework enables set operations over a stream of data and can also be used for cardinality estimation. Pinot uses the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:4.2.0 to perform distinct counting and to evaluate set operations.
Signature
distinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded
- thetaSketchColumn (required): Name of the column to aggregate on.
- thetaSketchParams (required): Semicolon-separated parameter string for constructing the intermediate theta sketches. The supported parameters are:
  - nominalEntries: The nominal entries used to create the sketch. (Default 4096)
  - samplingProbability: Sets the upfront uniform sampling probability, p. (Default 1.0)
  - accumulatorThreshold: How many sketches should be kept in memory before merging. (Default 2)
- predicates (optional): Individual predicates of the form lhs <op> rhs, applied to the rows selected by the where clause. During intermediate sketch aggregation, the sketches from thetaSketchColumn whose rows satisfy a given predicate are unioned into one sketch per predicate. For example, all filtered rows that match country=USA are unioned into a single sketch. Complex predicates built by combining individual predicates with AND/OR are also supported.
- postAggregationExpressionToEvaluate (required): The set operation to perform on the intermediate sketches produced for the predicates. The currently supported operations are SET_DIFF, SET_UNION, and SET_INTERSECT. SET_DIFF requires two arguments; SET_UNION and SET_INTERSECT accept two or more.
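As a conceptual model of how predicates and the post-aggregation expression fit together, the sketch below uses exact Python sets in place of theta sketches. It is not Pinot's implementation: real theta sketches are approximate, and the table rows, the userId and device columns, and the helper names are all hypothetical. Only the country=USA predicate and the three set operations come from the description above.

```python
# Conceptual model only: exact Python sets stand in for theta sketches.
# The rows and the userId/device columns are hypothetical illustration data.
rows = [
    {"userId": 1, "country": "USA", "device": "mobile"},
    {"userId": 1, "country": "USA", "device": "mobile"},  # duplicate visit
    {"userId": 2, "country": "USA", "device": "desktop"},
    {"userId": 3, "country": "CAN", "device": "mobile"},
    {"userId": 4, "country": "USA", "device": "mobile"},
    {"userId": 5, "country": "CAN", "device": "desktop"},
]

# One predicate per intermediate "sketch", in positional order.
predicates = [
    lambda r: r["country"] == "USA",    # first predicate
    lambda r: r["device"] == "mobile",  # second predicate
]


def build_intermediate_sets(rows, column, predicates):
    """Collect the distinct column values of the rows matching each predicate."""
    return [{r[column] for r in rows if p(r)} for p in predicates]


# Exact-set counterparts of the three supported operations.
SET_OPS = {
    "SET_UNION": lambda *s: set().union(*s),
    "SET_INTERSECT": lambda first, *rest: first.intersection(*rest),
    "SET_DIFF": lambda a, b: a - b,
}

sets = build_intermediate_sets(rows, "userId", predicates)
both = SET_OPS["SET_INTERSECT"](sets[0], sets[1])  # USA and mobile
only_usa = SET_OPS["SET_DIFF"](sets[0], sets[1])   # USA but not mobile
either = SET_OPS["SET_UNION"](*sets)               # USA or mobile

print(len(both), len(only_usa), len(either))  # prints: 2 1 4
```

The point of the model is the order of work: rows are first grouped per predicate, and the set expression is evaluated only once, on the per-predicate results.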
Usage Examples
These examples are based on the Batch Quick Start.
select distinctCountRawThetaSketch(teamID) AS value
from baseballStats

value
AgMDAAAKzJOVAAAAAACAPwDAATj...
select distinctCountRawThetaSketch(teamID, 'nominalEntries=10') AS value
from baseballStats

value
AwMDAAAKzJMQAAAAAACAP4vpfPBbbQsO5N1zYV2c...
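The parameter string passed in the query above is a semicolon-separated list of key=value pairs. The following is a minimal sketch of how such a string could be interpreted, using the defaults listed in the Signature section. It is not Pinot's parser: the function name parse_theta_params is hypothetical, and the handling of whitespace and unknown keys is an assumption.

```python
# Defaults as listed in the Signature section.
DEFAULTS = {
    "nominalEntries": 4096,
    "samplingProbability": 1.0,
    "accumulatorThreshold": 2,
}


def parse_theta_params(param_string):
    """Parse 'key=value;key=value' into a dict, filling in the defaults.

    Hypothetical helper: rejecting unknown keys and trimming whitespace
    are assumptions, not documented Pinot behaviour.
    """
    params = dict(DEFAULTS)
    for pair in param_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or key not in DEFAULTS:
            raise ValueError(f"unsupported parameter: {pair!r}")
        # Convert the value to the same type as the default (int or float).
        params[key] = type(DEFAULTS[key])(value.strip())
    return params


print(parse_theta_params("nominalEntries=10"))
print(parse_theta_params("nominalEntries=8192;samplingProbability=0.5"))
```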
We can also provide predicates and a post aggregation expression to compute more complicated cardinalities:
select distinctCountRawThetaSketch(
  yearID, 
  'nominalEntries=4096', 
  'teamID = ''SFN'' AND numberOfGames=28 AND homeRuns=1',
  'teamID = ''CHN'' AND numberOfGames=28 AND homeRuns=1',
  'SET_INTERSECT($1, $2)'
) AS value
from baseballStats

value
AQMDAAA6zJN8QPYIsvHMNQ==
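The raw value returned by the last query is a serialized sketch rather than a count. The sample values in this section look like Base64 text rather than hexadecimal, so the sketch below assumes Base64 and decodes the one complete sample value to inspect its header. The field names and byte layout are assumptions based on the DataSketches theta preamble format, not something this page specifies.

```python
import base64
import struct

# Complete sample value from the last query above (assumed to be Base64).
raw_value = "AQMDAAA6zJN8QPYIsvHMNQ=="

sketch_bytes = base64.b64decode(raw_value)

# Assumed layout of the first 8 bytes (little-endian):
# preamble longs, serial version, family ID, lgNomLongs, lgArrLongs,
# flags, and a 2-byte seed hash.
pre_longs, ser_ver, family_id, lg_nom, lg_arr, flags, seed_hash = struct.unpack_from(
    "<BBBBBBH", sketch_bytes, 0
)

header = {
    "preamble_longs": pre_longs & 0x3F,  # low 6 bits of the first byte
    "serial_version": ser_ver,
    "family_id": family_id,
    "flags": hex(flags),
    "seed_hash": hex(seed_hash),
}

print(len(sketch_bytes), "bytes")
print(header)
```

To obtain an estimate or to combine sketches outside Pinot, the decoded bytes would normally be handed to a DataSketches library rather than parsed by hand; the header inspection here is only a sanity check on the payload.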