arrow-left

All pages
gitbookPowered by GitBook
1 of 1

Loading...

DISTINCTCOUNTTHETASKETCH

This section contains reference documentation for the DISTINCTCOUNTTHETASKETCH function.

The Theta Sketcharrow-up-right framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch Classarrow-up-right and its extensions from the library org.apache.datasketches:datasketches-java:4.2.0 to perform distinct counting as well as evaluating set operations.

hashtag
Signature

distinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> Long

  • thetaSketchColumn (required): Name of the column to aggregate on.

  • thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches.

hashtag
Usage Examples

These examples are based on the .

value
value

We can also provide predicates and a post aggregation expression to compute more complicated cardinalities. For example, we could can find the intersection of the following queries:

yearID
yearID

(the yearId 1986 is the only one in common)

By running the following query:

value
Currently, the only supported parameter is nominalEntries (defaults to 4096).
  • predicates (optional)_: _ These are individual predicates of form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfies these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) of individual predicates is supported.

  • postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT , where DIFF requires two arguments and the UNION/INTERSECT allow more than two arguments.

  • 149

    146

    1986

    1985

    1937

    2003

    1979

    1900

    1986

    1978

    2012

    1

    Batch Quick Start
    select distinctCountThetaSketch(teamID) AS value
    from baseballStats 
    select distinctCountThetaSketch(teamID, 'nominalEntries=10') AS value
    from baseballStats
    select yearID
    from baseballStats
    where teamID = 'SFN' AND numberOfGames = 28 AND homeRuns = 1
    select yearID
    from baseballStats
    where teamID = 'CHN' AND numberOfGames = 28 AND homeRuns = 1
    select distinctCountThetaSketch(
      yearID, 
      'nominalEntries=4096', 
      'teamID = ''SFN'' AND numberOfGames=28 AND homeRuns=1',
      'teamID = ''CHN'' AND numberOfGames=28 AND homeRuns=1',
      'SET_INTERSECT($1, $2)'
    ) AS value
    from baseballStats