Query

Learn how to query Apache Pinot using SQL or explore data using the web-based Pinot query console.

Querying Pinot
Transformation Functions
Aggregation Functions
User-Defined Functions (UDFs)
Cardinality Estimation
Lookup UDF Join
Querying JSON data
Explain Plan
Grouping Algorithm

Filtering with IdSet

Learn how to write fast queries for looking up ids in a list of values.

A common use case is filtering on an id field with a list of values. This can be done with the IN clause, but this approach doesn't perform well with large lists of ids. In these cases, you can use an IdSet.
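For illustration, the two approaches look like this (a sketch, with a hypothetical id column and a placeholder for the encoded set):

-- IN clause: simple, but slows down as the list of ids grows
SELECT COUNT(*) FROM myTable WHERE id IN (1, 2, 3, 4, 5)

-- IdSet: generate the set once with ID_SET, then reuse the compact
-- base64-encoded set in IN_ID_SET
SELECT COUNT(*) FROM myTable WHERE IN_ID_SET(id, '<base64EncodedIdSet>') = 1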

Functions

ID_SET

ID_SET(columnName, 'sizeThresholdInBytes=8388608;expectedInsertions=5000000;fpp=0.03')

This function returns a base 64 encoded IdSet of the values for a single column. The IdSet implementation used depends on the column data type:

  • INT - RoaringBitmap unless sizeThresholdInBytes is exceeded, in which case a Bloom Filter is used.

  • LONG - Roaring64NavigableMap unless sizeThresholdInBytes is exceeded, in which case a Bloom Filter is used.

  • Other types - Bloom Filter

The following parameters are used to configure the Bloom Filter:

  • expectedInsertions - Number of expected insertions for the BloomFilter, must be positive

  • fpp - Desired false positive probability for the BloomFilter, must be positive and < 1.0

Note that when a Bloom Filter is used, the filter results are approximate - you can get false-positive results (for membership in the set), leading to potentially unexpected results.

IN_ID_SET

IN_ID_SET(columnName, base64EncodedIdSet)

This function returns 1 if a column contains a value specified in the IdSet and 0 if it does not.

IN_SUBQUERY

IN_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot broker.

IN_PARTITIONED_SUBQUERY

IN_PARTITIONED_SUBQUERY(columnName, subQuery)

This function generates an IdSet from a subquery and then filters ids based on that IdSet on a Pinot server.

This function works best when the data is partitioned by the id column and each server contains all the data for a partition. The generated IdSet for the subquery will be smaller as it will only contain the ids for the partitions served by the server. This will give better performance.

Note: the query passed to IN_SUBQUERY and IN_PARTITIONED_SUBQUERY can be run on any table - it isn't restricted to the table used in the parent query.

Examples

Create IdSet

You can create an IdSet of the values in the yearID column by running the following:

SELECT ID_SET(yearID)
FROM baseballStats
WHERE teamID = 'WS1'

ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc=

When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions:

SELECT ID_SET(playerName, 'expectedInsertions=10')
FROM baseballStats
WHERE teamID = 'WS1'

AwIBBQAAAAL/////////////////////

SELECT ID_SET(playerName, 'expectedInsertions=100')
FROM baseballStats
WHERE teamID = 'WS1'

AwIBBQAAAAz///////////////////////////////////////////////9///////f///9/////7///////////////+/////////////////////////////////////////////8=

We can also configure the fpp parameter:

SELECT ID_SET(playerName, 'expectedInsertions=100;fpp=0.01')
FROM baseballStats
WHERE teamID = 'WS1'

AwIBBwAAAA/////////////////////////////////////////////////////////////////////////////////////////////////////////9///////////////////////////////////////////////7//////8=

Filter by values in IdSet

We can use the IN_ID_SET function to filter a query based on an IdSet. To return rows for yearIDs in the IdSet, run the following:

SELECT yearID, count(*)
FROM baseballStats
WHERE IN_ID_SET(
  yearID,
  'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
  ) = 1
GROUP BY yearID

Filter by values not in IdSet

To return rows for yearIDs not in the IdSet, run the following:

SELECT yearID, count(*)
FROM baseballStats
WHERE IN_ID_SET(
  yearID,
  'ATowAAABAAAAAAA7ABAAAABtB24HbwdwB3EHcgdzB3QHdQd2B3cHeAd5B3oHewd8B30Hfgd/B4AHgQeCB4MHhAeFB4YHhweIB4kHigeLB4wHjQeOB48HkAeRB5IHkweUB5UHlgeXB5gHmQeaB5sHnAedB54HnwegB6EHogejB6QHpQemB6cHqAc='
  ) = 0
GROUP BY yearID

Filter on broker

To filter rows for yearIDs in the IdSet on a Pinot Broker, run the following query:

SELECT yearID, count(*)
FROM baseballStats
WHERE IN_SUBQUERY(
  yearID,
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 1
GROUP BY yearID

To filter rows for yearIDs not in the IdSet on a Pinot Broker, run the following query:

SELECT yearID, count(*)
FROM baseballStats
WHERE IN_SUBQUERY(
  yearID,
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 0
GROUP BY yearID

Filter on server

To filter rows for yearIDs in the IdSet on a Pinot Server, run the following query:

SELECT yearID, count(*)
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
  yearID,
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 1
GROUP BY yearID

To filter rows for yearIDs not in the IdSet on a Pinot Server, run the following query:

SELECT yearID, count(*)
FROM baseballStats
WHERE IN_PARTITIONED_SUBQUERY(
  yearID,
  'SELECT ID_SET(yearID) FROM baseballStats WHERE teamID = ''WS1'''
  ) = 0
GROUP BY yearID


Querying Pinot

Learn how to query Pinot using SQL

SQL Dialect

Pinot uses the Calcite SQL parser to parse queries and uses the MYSQL_ANSI dialect. You can see the grammar in the Calcite documentation.

Limitations

Pinot does not support joins or nested subqueries. We recommend using Presto for queries that span multiple tables. For more information, see Engineering Full SQL support for Pinot at Uber.

There is no DDL support. Tables can be created via the REST API.

Identifier vs Literal

In Pinot SQL:

  • Double quotes (") are used to force string identifiers, e.g. column names

  • Single quotes (') are used to enclose string literals. If the string literal also contains a single quote, escape it with a single quote, e.g. '''Pinot''' to match the string literal 'Pinot'

Misusing these might cause unexpected query results, e.g.:

  • WHERE a='b' means the predicate on the column a equals the string literal value 'b'

  • WHERE a="b" means the predicate on the column a equals the value of the column b

If your column names use reserved keywords (e.g. timestamp or date) or special characters, you will need to use double quotes when referring to them in queries, e.g.:

SELECT "date", "timestamp"
FROM myTable

Example Queries

Selection

//default to limit 10
SELECT *
FROM myTable

SELECT *
FROM myTable
LIMIT 100

Aggregation

SELECT COUNT(*), MAX(foo), SUM(bar)
FROM myTable

Grouping on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz
FROM myTable
GROUP BY bar, baz
LIMIT 50

Ordering on Aggregation

SELECT MIN(foo), MAX(foo), SUM(foo), AVG(foo), bar, baz
FROM myTable
GROUP BY bar, baz
ORDER BY bar, MAX(foo) DESC
LIMIT 50

Filtering

For performant filtering of ids in a list, see Filtering with IdSet.

SELECT COUNT(*)
FROM myTable
  WHERE foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Filtering with NULL predicate

SELECT COUNT(*)
FROM myTable
  WHERE foo IS NOT NULL
  AND foo = 'foo'
  AND bar BETWEEN 1 AND 20
  OR (baz < 42 AND quux IN ('hello', 'goodbye') AND quuux NOT IN (42, 69))

Selection (Projection)

SELECT *
FROM myTable
  WHERE quux < 5
  LIMIT 50

Ordering on Selection

SELECT foo, bar
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 100

Pagination on Selection

Results might not be consistent if the ORDER BY column has the same value in multiple rows.

SELECT foo, bar
FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100

Wild-card match (in WHERE clause only)

To count rows where the column airlineName starts with U:

SELECT COUNT(*)
FROM myTable
  WHERE REGEXP_LIKE(airlineName, '^U.*')
  GROUP BY airlineName LIMIT 10

Case-When Statement

Pinot supports the CASE-WHEN-ELSE statement.

Example 1:

SELECT
    CASE
      WHEN price > 30 THEN 3
      WHEN price > 20 THEN 2
      WHEN price > 10 THEN 1
      ELSE 0
    END AS price_category
FROM myTable

Example 2:

SELECT
  SUM(
    CASE
      WHEN price > 30 THEN 30
      WHEN price > 20 THEN 20
      WHEN price > 10 THEN 10
      ELSE 0
    END) AS total_cost
FROM myTable

UDF

Functions have to be implemented within Pinot. Injecting functions is not yet supported. The example below demonstrates the use of UDFs.

SELECT COUNT(*)
FROM myTable
GROUP BY DATETIMECONVERT(timeColumnName, '1:MILLISECONDS:EPOCH', '1:HOURS:EPOCH', '1:HOURS')

For more examples, see Transform Function in Aggregation Grouping.

BYTES column

Pinot supports queries on BYTES columns using HEX strings. The query response also uses HEX strings to represent bytes values.

e.g. the query below fetches all the rows for a given UID:

SELECT *
FROM myTable
WHERE UID = 'c8b3bce0b378fc5ce8067fc271a34892'


Aggregation Functions

Pinot provides support for aggregations using GROUP BY. You can use the following functions to get the aggregated value.
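For instance, several of the aggregations below can be combined in a single GROUP BY query (a sketch; playerScore is the example column name used in the table below):

SELECT teamID, COUNT(*), MIN(playerScore), MAX(playerScore), AVG(playerScore)
FROM baseballStats
GROUP BY teamID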

| Function | Description | Example | Default Value When No Record Selected |
|---|---|---|---|
| COUNT | Get the count of rows in a group | COUNT(*) | 0 |
| MIN | Get the minimum value in a group | MIN(playerScore) | Double.POSITIVE_INFINITY |
| MAX | Get the maximum value in a group | MAX(playerScore) | Double.NEGATIVE_INFINITY |
| SUM | Get the sum of values in a group | SUM(playerScore) | 0 |
| AVG | Get the average of the values in a group | AVG(playerScore) | Double.NEGATIVE_INFINITY |
| MODE | Get the most frequent value in a group. When multiple modes are present, it gives the minimum of all the modes. This behavior can be overridden to get the maximum or the average mode. | MODE(playerScore), MODE(playerScore, 'MIN'), MODE(playerScore, 'MAX'), MODE(playerScore, 'AVG') | Double.NEGATIVE_INFINITY |
| MINMAXRANGE | Returns the max - min value in a group | MINMAXRANGE(playerScore) | Double.NEGATIVE_INFINITY |
| PERCENTILE(column, N) | Returns the Nth percentile of the group, where N is a decimal number between 0 and 100 inclusive | PERCENTILE(playerScore, 50), PERCENTILE(playerScore, 99.9) | Double.NEGATIVE_INFINITY |
| PERCENTILEEST(column, N) | Returns the Nth percentile of the group using the Quantile Digest algorithm | PERCENTILEEST(playerScore, 50), PERCENTILEEST(playerScore, 99.9) | Long.MIN_VALUE |
| PERCENTILETDIGEST(column, N) | Returns the Nth percentile of the group using the T-digest algorithm | PERCENTILETDIGEST(playerScore, 50), PERCENTILETDIGEST(playerScore, 99.9) | Double.NaN |
| DISTINCTCOUNT | Returns the count of distinct row values in a group | DISTINCTCOUNT(playerName) | 0 |
| DISTINCTCOUNTBITMAP | Returns the count of distinct row values in a group. This function is accurate for INT columns, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. | DISTINCTCOUNTBITMAP(playerName) | 0 |
| DISTINCTCOUNTHLL | Returns an approximate distinct count using HyperLogLog. It also takes an optional second argument to configure the log2m for the HyperLogLog. | DISTINCTCOUNTHLL(playerName, 12) | 0 |
| DISTINCTCOUNTRAWHLL | Returns the HLL response serialized as a string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching. | DISTINCTCOUNTRAWHLL(playerName) | 0 |
| DISTINCTCOUNTTHETASKETCH | See Cardinality Estimation | | 0 |
| DISTINCTCOUNTRAWTHETASKETCH | See Cardinality Estimation | | 0 |
| SEGMENTPARTITIONEDDISTINCTCOUNT | Returns the count of distinct values of a column when the column is pre-partitioned for each segment, where there is no common value within different segments. This function calculates the exact count of distinct values within the segment, then simply sums up the results from different segments to get the final result. | SEGMENTPARTITIONEDDISTINCTCOUNT(playerName) | 0 |
| LASTWITHTIME(dataColumn, timeColumn, 'dataType') | Get the last value of dataColumn, where timeColumn is used to define the time of dataColumn and dataType specifies the type of dataColumn, which can be BOOLEAN, INT, LONG, FLOAT, DOUBLE, or STRING | LASTWITHTIME(playerScore, timestampColumn, 'BOOLEAN'), LASTWITHTIME(playerScore, timestampColumn, 'INT'), LASTWITHTIME(playerScore, timestampColumn, 'LONG'), LASTWITHTIME(playerScore, timestampColumn, 'FLOAT'), LASTWITHTIME(playerScore, timestampColumn, 'DOUBLE'), LASTWITHTIME(playerScore, timestampColumn, 'STRING') | INT: Int.MIN_VALUE, LONG: Long.MIN_VALUE, FLOAT: Float.NaN, DOUBLE: Double.NaN, STRING: "" |

Deprecated functions:

| Function | Description | Example |
|---|---|---|
| FASTHLL | FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format | FASTHLL(playerName) |

Multi-value column functions

The following aggregation functions can be used for multi-value columns:

| Function | Description |
|---|---|
| COUNTMV | Get the count of rows in a group |
| MINMV | Get the minimum value in a group |
| MAXMV | Get the maximum value in a group |
| SUMMV | Get the sum of values in a group |
| AVGMV | Get the avg of values in a group |
| MINMAXRANGEMV | Returns the max - min value in a group |
| PERCENTILEMV(column, N) | Returns the Nth percentile of the group, where N is a decimal number between 0 and 100 inclusive |
| PERCENTILEESTMV(column, N) | Returns the Nth percentile of the group using the Quantile Digest algorithm |
| PERCENTILETDIGESTMV(column, N) | Returns the Nth percentile of the group using the T-digest algorithm |
| DISTINCTCOUNTMV | Returns the count of distinct row values in a group |
| DISTINCTCOUNTBITMAPMV | Returns the count of distinct row values in a group. This function is accurate for INT or dictionary-encoded columns, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. |
| DISTINCTCOUNTHLLMV | Returns an approximate distinct count using HyperLogLog in a group |
| DISTINCTCOUNTRAWHLLMV | Returns the HLL response serialized as a string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching. |

Deprecated functions:

| Function | Description | Example |
|---|---|---|
| FASTHLLMV | Stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format | FASTHLLMV(playerNames) |

Lookup UDF Join

Lookup UDF is used to get dimension data via primary key from a dimension table, allowing a decoration join functionality. The lookup UDF can only be used with a dimension table in Pinot. The UDF signature is as below:

lookUp('dimTableName', 'dimColToLookUp', 'dimJoinKey1', factJoinKeyVal1, 'dimJoinKey2', factJoinKeyVal2 ... )

  • dimTableName: Name of the dimension table to perform the lookup on.

  • dimColToLookUp: The column name of the dimension table to be retrieved to decorate our result.

  • dimJoinKey: The column name on which we want to perform the lookup, i.e. the join column name for the dimension table.

  • factJoinKeyVal: The value of the dimension table join column for which we will retrieve the dimColToLookUp for the scope and invocation.

The return type of the UDF will be that of the dimColToLookUp column type. There can also be multiple primary keys and corresponding values.

Note: If the dimension table uses a composite primary key, i.e. multiple primary keys, ensure that the order of keys appearing in the lookup() UDF is the same as the order defined in "primaryKeyColumns" in the dimension table schema.
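A minimal sketch of a decoration join (the fact table orders, dimension table customers, and their columns here are hypothetical):

SELECT orderId,
       customerId,
       lookUp('customers', 'customerName', 'customerId', customerId) AS customerName
FROM orders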

Cardinality Estimation

Cardinality estimation is a classic problem. Pinot solves it in multiple ways, each of which has a trade-off between accuracy and latency.

Accurate Results

Functions:

  • DistinctCount(x) -> LONG

Returns the accurate count of all unique values in a column.

The underlying implementation uses an IntOpenHashSet from the library it.unimi.dsi:fastutil:8.2.3 to hold all the unique values.
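For example, exact distinct counts can be grouped like any other aggregation (a sketch against the baseballStats table used elsewhere on this page):

SELECT teamID, DISTINCTCOUNT(playerID)
FROM baseballStats
GROUP BY teamID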

Approximation Results

It usually takes a lot of resources and time to compute accurate results for unique counting on large datasets. In some circumstances, we can tolerate a certain error rate, in which case we can use approximation functions to tackle this problem.

HyperLogLog

HyperLogLog is an approximation algorithm for unique counting. It uses a fixed number of bits to estimate the cardinality of a given data set.

Pinot leverages the HyperLogLog class in the library com.clearspring.analytics:stream:2.7.0 as the data structure to hold intermediate results.

Functions:

  • DistinctCountHLL(x) -> LONG

For column types INT/LONG/FLOAT/DOUBLE/STRING, Pinot treats each value as an individual entry to add into a HyperLogLog object, then computes the approximation by calling the method cardinality().

For column type BYTES, Pinot treats each value as a serialized HyperLogLog object with pre-aggregated values inside. The bytes value is generated by org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog).

All deserialized HyperLogLog objects will be merged into one, and then the method cardinality() is called to get the approximated unique count.
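For instance, a pre-aggregated HLL destined for a BYTES column could be produced client-side roughly like this (a sketch; the values offered are illustrative and log2m of 12 is an arbitrary choice):

import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import org.apache.pinot.core.common.ObjectSerDeUtils;

public class HllSerializeExample {
  public static void main(String[] args) {
    // Build a pre-aggregated HyperLogLog (log2m = 12) and add some values
    HyperLogLog hll = new HyperLogLog(12);
    hll.offer("player-1");
    hll.offer("player-2");
    // Serialize into the BYTES format described above
    byte[] serialized = ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hll);
    System.out.println("Serialized HLL: " + serialized.length + " bytes");
  }
}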

Theta Sketches

The Theta Sketch framework enables set operations over a stream of data, and can also be used for cardinality estimation. Pinot leverages the Sketch class and its extensions from the library org.apache.datasketches:datasketches-java:1.2.0-incubating to perform distinct counting as well as evaluating set operations.

Functions:

  • DistinctCountThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> LONG

    • thetaSketchColumn (required): Name of the column to aggregate on.

    • thetaSketchParams (required): Parameters for constructing the intermediate theta-sketches. Currently, the only supported parameter is nominalEntries.

    • predicates (optional): These are individual predicates of form lhs <op> rhs which are applied on rows selected by the where clause. During intermediate sketch aggregation, sketches from the thetaSketchColumn that satisfy these predicates are unionized individually. For example, all filtered rows that match country=USA are unionized into a single sketch. Complex predicates that are created by combining (AND/OR) individual predicates are supported.

    • postAggregationExpressionToEvaluate (required): The set operation to perform on the individual intermediate sketches for each of the predicates. Currently supported operations are SET_DIFF, SET_UNION, SET_INTERSECT, where DIFF requires two arguments and UNION/INTERSECT allow more than two arguments.

In the example query below, the where clause is responsible for identifying the matching rows. Note, the where clause can be completely independent of the postAggregationExpression. Once matching rows are identified, each server unionizes all the sketches that match the individual predicates, i.e. country='USA', device='mobile' in this case. Once the broker receives the intermediate sketches for each of these individual predicates from all servers, it performs the final aggregation by evaluating the postAggregationExpression and returns the final cardinality of the resulting sketch.

select distinctCountThetaSketch(
  sketchCol,
  'nominalEntries=1024',
  'country'=''USA'' AND 'state'=''CA'', 'device'=''mobile'', 'SET_INTERSECT($1, $2)'
)
from table
where country = 'USA' or device = 'mobile'

  • DistinctCountRawThetaSketch(<thetaSketchColumn>, <thetaSketchParams>, predicate1, predicate2..., postAggregationExpressionToEvaluate) -> HexEncoded Serialized Sketch Bytes

This is the same as the previous function, except it returns the byte serialized sketch instead of the cardinality of the sketch. Since Pinot returns responses as JSON strings, bytes are returned as hex encoded strings. The hex encoded string can be deserialized into a sketch by using the library org.apache.commons.codec.binary as Hex.decodeHex(stringValue.toCharArray()).
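For reference, the returned hex string could be decoded back into a sketch client-side roughly as follows (a sketch using the libraries named above; hexString stands in for the query response value):

import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;
import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.Sketch;

public class RawThetaSketchExample {
  public static void main(String[] args) throws DecoderException {
    String hexString = args[0]; // hex string from DistinctCountRawThetaSketch
    // Decode the hex-encoded bytes and wrap them back into a theta Sketch
    byte[] bytes = Hex.decodeHex(hexString.toCharArray());
    Sketch sketch = Sketch.wrap(Memory.wrap(bytes));
    System.out.println("Estimated cardinality: " + sketch.getEstimate());
  }
}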

Explain Plan

Query execution within Pinot is modeled as a sequence of operators that are executed in a pipelined manner to produce the final result. The output of the EXPLAIN PLAN statement can be used to see how queries are being run or to further optimize queries.

Introduction

EXPLAIN PLAN FOR SELECT playerID, playerName FROM baseballStats

+---------------------------------------------|------------|---------|
| Operator                                    | Operator_Id|Parent_Id|
+---------------------------------------------|------------|---------|
|BROKER_REDUCE(limit:10)                      | 0          | -1      |
|COMBINE_SELECT                               | 1          | 0       |
|SELECT(selectList:playerID, playerName)      | 2          | 1       |
|TRANSFORM_PASSTHROUGH(playerID, playerName)  | 3          | 2       |
|PROJECT(playerName, playerID)                | 4          | 3       |
|FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)      | 5          | 4       |
+---------------------------------------------|------------|---------|

In the EXPLAIN PLAN output above, the Operator column describes the operator that Pinot will run, whereas the Operator_Id and Parent_Id columns show the parent-child relationship between operators. This parent-child relationship shows the order in which operators execute. For example, FILTER_MATCH_ENTIRE_SEGMENT will execute before and pass its output to PROJECT. Similarly, PROJECT will execute before and pass its output to the TRANSFORM_PASSTHROUGH operator, and so on. Although the EXPLAIN PLAN query produces tabular output, in this document we show a tree representation of the EXPLAIN PLAN output so that the parent-child relationship between operators is easy to see.

BROKER_REDUCE(limit:10)
└── COMBINE_SELECT
    └── SELECT(selectList:playerID, playerName)
        └── TRANSFORM_PASSTHROUGH(playerID, playerName)
            └── PROJECT(playerName, playerID)
                └── FILTER_MATCH_ENTIRE_SEGMENT(docs:97889)



EXPLAIN PLAN output should only be used for informational purposes because it is likely to change from version to version as Pinot is further developed and enhanced. Pinot uses a "Scatter Gather" approach to query evaluation (see Pinot Architecture for more details). At the Broker, an incoming query is split into several server-level queries for each backend server to evaluate. At each Server, the query is further split into segment-level queries that are evaluated against each segment on the server. The results of segment queries are combined and sent to the Broker. The Broker in turn combines the results from all the Servers and sends the final results back to the user. Note that because the EXPLAIN PLAN output currently extrapolates the overall query plan from a randomly picked Segment on a randomly picked Server, it is possible that all segments of all servers may not execute the query in exactly the same way. However, in general, the EXPLAIN PLAN output is representative of overall query execution in Pinot.

Reading the EXPLAIN PLAN output from bottom to top will show how data flows from a table to query results. In the example shown above, the FILTER_MATCH_ENTIRE_SEGMENT operator shows that all 97889 records of the segment matched the query. The PROJECT operator over the filter operator pulls only those columns that were referenced in the query. The TRANSFORM_PASSTHROUGH operator just passes the column data from the PROJECT operator to the SELECT operator. At SELECT, the query has been successfully evaluated against one segment. Results from different data segments are then combined (COMBINE_SELECT) and sent to the Broker. The Broker combines and reduces the results from different servers (BROKER_REDUCE) into a final result that is sent to the user.

The rest of this document illustrates the EXPLAIN PLAN output with examples and describes the operators that show up in the output of the EXPLAIN PLAN.

EXPLAIN PLAN ON INDEX ACCESS QUERY

EXPLAIN PLAN FOR
  SELECT playerID, playerName
    FROM baseballStats
   WHERE playerID = 'aardsda01' AND playerName = 'David Allan'

BROKER_REDUCE(limit:10)
└── COMBINE_SELECT
    └── SELECT(selectList:playerID, playerName)
        └── TRANSFORM_PASSTHROUGH(playerID, playerName)
            └── PROJECT(playerName, playerID)
                └── FILTER_AND
                    ├── FILTER_INVERTED_INDEX(indexLookUp:inverted_index,operator:EQ,predicate:playerID = 'aardsda01')
                    └── FILTER_FULL_SCAN(operator:EQ,predicate:playerName = 'David Allan')

The EXPLAIN PLAN output above shows that Pinot used an inverted index to evaluate the predicate "playerID = 'aardsda01'" (FILTER_INVERTED_INDEX). The result was then fully scanned (FILTER_FULL_SCAN) to evaluate the second predicate "playerName = 'David Allan'". Note that the two predicates are combined using AND in the query; hence, only the data that satisfied the first predicate needs to be scanned for evaluating the second predicate. However, if the predicates were combined using OR, the query would run very slowly because the entire "playerName" column would need to be scanned from top to bottom to look for values satisfying the second predicate. To improve query efficiency in such cases, one should consider indexing the "playerName" column as well.

EXPLAIN PLAN ON GROUP BY QUERY

EXPLAIN PLAN FOR
  SELECT playerID, count(*)
    FROM baseballStats
   WHERE playerID != 'aardsda01'
   GROUP BY playerID

BROKER_REDUCE(limit:10)
└── COMBINE_GROUPBY_ORDERBY
    └── AGGREGATE_GROUPBY_ORDERBY(groupKeys:playerID, aggregations:count(*))
        └── TRANSFORM_PASSTHROUGH(playerID)
            └── PROJECT(playerID)
                └── FILTER_INVERTED_INDEX(indexLookUp:inverted_index,operator:NOT_EQ,predicate:playerID != 'aardsda01')

The EXPLAIN PLAN output above shows how GROUP BY queries are evaluated in Pinot. GROUP BY results are created on the server (AGGREGATE_GROUPBY_ORDERBY) for each segment on the server. The server then combines segment-level GROUP BY results (COMBINE_GROUPBY_ORDERBY) and sends the combined result to the Broker. The Broker combines the GROUP BY results from all the servers to produce the final result, which is sent to the user. Note that the COMBINE_SELECT operator from the previous query was not used here; instead, a different COMBINE_GROUPBY_ORDERBY operator was used. Depending upon the type of query, different combine operators such as COMBINE_DISTINCT and COMBINE_ORDERBY may be seen.

EXPLAIN PLAN OPERATORS

The root operator of the EXPLAIN PLAN output is BROKER_REDUCE. BROKER_REDUCE indicates that the Broker is processing and combining server results into the final result that is sent back to the user. BROKER_REDUCE has a COMBINE operator as its child. The Combine operator combines the results of query evaluation from each segment on the server and sends the combined result to the Broker. There are several combine operators (COMBINE_GROUPBY_ORDERBY, COMBINE_DISTINCT, COMBINE_AGGREGATE, etc.) that run depending upon the operations being performed by the query. Under the Combine operator, either a Select (SELECT, SELECT_ORDERBY, etc.) or an Aggregate (AGGREGATE, AGGREGATE_GROUPBY_ORDERBY, etc.) operator can appear. An Aggregate operator is present when the query performs aggregation (count(*), min, max, etc.); otherwise, a Select operator is present. If the query performs scalar transformations (addition, multiplication, concat, etc.), then a TRANSFORM operator will appear under the SELECT operator. Often a TRANSFORM_PASSTHROUGH operator is present instead of the TRANSFORM operator. TRANSFORM_PASSTHROUGH just passes results from operators that appear lower in the operator execution hierarchy to the SELECT operator. FILTER operators usually appear at the bottom of the operator hierarchy and show index use. For example, the presence of FILTER_FULL_SCAN indicates that an index was not used (and hence the query is likely to run relatively slowly). However, if the query used an index, one of the indexed filter operators (FILTER_SORTED_INDEX, FILTER_RANGE_INDEX, FILTER_INVERTED_INDEX, FILTER_JSON_INDEX, etc.) will show up.
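Although we don't reproduce its output here, running EXPLAIN PLAN on a query with a scalar transformation, such as the hypothetical one below, should surface a TRANSFORM operator in place of TRANSFORM_PASSTHROUGH:

EXPLAIN PLAN FOR
  SELECT CONCAT(playerID, playerName, '-')
    FROM baseballStats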

User-Defined Functions (UDFs)

Pinot currently supports two ways for you to implement your own functions:

  • Groovy Scripts

  • Scalar Functions

Groovy Scripts

Pinot allows you to run any function using Apache Groovy scripts. The syntax for executing a Groovy script within a query is as follows:

GROOVY('result value metadata json', 'groovy script', arg0, arg1, arg2...)

This function will execute the groovy script using the arguments provided and return the result that matches the provided result value metadata. The function requires the following arguments:

  • Result value metadata json - a JSON string representing result value metadata. Must contain non-null keys resultType and isSingleValue.

  • Groovy script to execute - the Groovy script string, which uses arg0, arg1, arg2 etc. to refer to the arguments provided within the script

  • arguments - Pinot columns/other transform functions that are arguments to the Groovy script

Examples

  • Add colA and colB and return a single-value INT

    groovy('{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB)

  • Find the max element in mvColumn array and return a single-value INT

    groovy('{"returnType":"INT","isSingleValue":true}', 'arg0.toList().max()', mvColumn)

  • Find all elements of the array mvColumn and return as a multi-value LONG column

    groovy('{"returnType":"LONG","isSingleValue":false}', 'arg0.findIndexValues{ it > 5 }', mvColumn)

  • Multiply length of array mvColumn with colB and return a single-value DOUBLE

    groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'arg0 * arg1', arraylength(mvColumn), colB)

  • Find all indexes in mvColumnA which have value "foo", and add values at those indexes in mvColumnB

    groovy('{"returnType":"DOUBLE","isSingleValue":true}', 'def x = 0; arg0.eachWithIndex{item, idx-> if (item == "foo") {x = x + arg1[idx] }}; return x', mvColumnA, mvColumnB)

  • Switch case which returns a FLOAT value depending on length of mvCol array

    groovy('{"returnType":"FLOAT", "isSingleValue":true}', 'def result; switch(arg0.length()) { case 10: result = 1.1; break; case 20: result = 1.2; break; default: result = 1.3;}; return result.floatValue()', mvCol)

  • Any Groovy script which takes no arguments

    groovy('{"returnType":"STRING","isSingleValue":true}', 'new Date().format( "yyyyMMdd" )')
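Putting it together, a GROOVY expression can appear anywhere a column expression is allowed; a minimal sketch, assuming hypothetical columns colA and colB:

SELECT GROOVY('{"returnType":"INT","isSingleValue":true}', 'arg0 + arg1', colA, colB) AS a_plus_b
FROM myTable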

⚠️ Note that Groovy scripts don't accept Built-In ScalarFunctions that are specific to Pinot queries. See the section below for more information.

⚠️ Disabling Groovy

Allowing executable Groovy in queries can be a security vulnerability. If you would like to disable Groovy in Pinot queries, you can set the following broker config:

pinot.broker.disable.query.groovy=true

If not set, Groovy in queries is enabled by default.

The above configuration applies across the entire Pinot cluster. If you want a table-level override to enable/disable Groovy queries, the following property can be set in the query table config:

{
  "tableName": "myTable",
  "tableType": "OFFLINE",

  "queryConfig" : {
    "disableGroovy": true
  }
}

Scalar Functions

Since the 0.5.0 release, Pinot supports custom functions that return a single output for multiple inputs. Examples of scalar functions can be found in StringFunctions and DateTimeFunctions.

Pinot automatically identifies and registers all the functions that have the @ScalarFunction annotation.

Only Java methods are supported.

Adding user defined scalar functions

You can add new scalar functions as follows:

  • Create a new java project. Make sure you keep the package name as org.apache.pinot.scalar.XXXX

  • In your java project include the dependency:

    <dependency>
      <groupId>org.apache.pinot</groupId>
      <artifactId>pinot-common</artifactId>
      <version>0.5.0</version>
    </dependency>

    include 'org.apache.pinot:pinot-common:0.5.0'

  • Annotate your methods with the @ScalarFunction annotation. Make sure the method is static and returns only a single value output. The input and output can have one of the following types:

    • Integer

    • Long

    • Double

    • String

    //Example Scalar function

    @ScalarFunction
    static String mySubStr(String input, Integer beginIndex) {
      return input.substring(beginIndex);
    }

  • Place the compiled JAR in the /plugins directory in Pinot. You will need to restart all Pinot instances if they are already running.

  • Now, you can use the function in a query as follows:

    SELECT mysubstr(playerName, 4)
    FROM baseballStats

⚠️ Note that the function name in SQL is the same as the function name in Java. The SQL function name is case-insensitive as well.

Grouping Algorithm

In this guide we will learn about the heuristics used for trimming results in Pinot's grouping algorithm (used when processing GROUP BY queries) to make sure that the server doesn't run out of memory.

Within segment

When grouping rows within a segment, Pinot keeps a maximum of <numGroupsLimit> groups per segment. This value is set to 100,000 by default and can be configured by the pinot.server.query.executor.num.groups.limit property.

If the number of groups of a segment reaches this value, the extra groups will be ignored and the results returned may not be completely accurate. The numGroupsLimitReached property will be set to true in the query response if the value is reached.
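Clients can check for this by inspecting the broker response; a trimmed-down sketch of the relevant fragment (all other response fields omitted):

{
  "numGroupsLimitReached": true,
  "exceptions": []
}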

Trimming tail groups

After the inner segment groups have been computed, the Pinot query engine optionally trims tail groups. Tail groups are ones that have a lower rank based on the ORDER BY clause used in the query.

This configuration is disabled by default, but can be enabled by configuring the pinot.server.query.executor.min.segment.group.trim.size property.

When segment group trim is enabled, the query engine will trim the tail groups and keep max(<minSegmentGroupTrimSize>, 5 * LIMIT) groups if it gets more groups. Pinot keeps at least 5 * LIMIT groups when trimming tail groups to ensure the accuracy of results.

This value can be overridden on a query by query basis by passing the following option:

SELECT *
FROM ...

OPTION(minSegmentGroupTrimSize=<minSegmentGroupTrimSize>)

Cross segments

Once grouping has been done within a segment, Pinot will merge segment results and trim tail groups, keeping max(<minServerGroupTrimSize>, 5 * LIMIT) groups if it gets more groups.

<minServerGroupTrimSize> is set to 5,000 by default and can be adjusted by configuring the pinot.server.query.executor.min.server.group.trim.size property. When setting the configuration to -1, the cross segments trim can be disabled.

This value can be overridden on a query by query basis by passing the following option:

SELECT *
FROM ...

OPTION(minServerGroupTrimSize=<minServerGroupTrimSize>)

When cross segments trim is enabled, the server will trim the tail groups before sending the results back to the broker. It will also trim the tail groups when the number of groups reaches the <trimThreshold>.

This configuration is set to 1,000,000 by default and can be adjusted by configuring the pinot.server.query.executor.groupby.trim.threshold property.

A higher threshold reduces the amount of trimming done, but consumes more heap memory. If the threshold is set to more than 1,000,000,000, the server will only trim the groups once before returning the results to the broker.

GROUP BY behavior

Pinot sets a default LIMIT of 10 if one isn't defined, and this applies to GROUP BY queries as well. Therefore, if no limit is specified, Pinot will return 10 groups.

Pinot will trim tail groups based on the ORDER BY clause to reduce the memory footprint and improve the query performance. It keeps at least 5 * LIMIT groups so that the results give a good enough approximation in most cases. The configurable min trim size can be used to increase the number of groups kept to improve accuracy, at the cost of a larger extra memory footprint.
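For example, the following two queries behave identically, each returning at most 10 groups (hypothetical columns):

SELECT colB, SUM(colA) FROM myTable GROUP BY colB

SELECT colB, SUM(colA) FROM myTable GROUP BY colB LIMIT 10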

HAVING behavior

If the query has a HAVING clause, it is applied on the merged GROUP BY results that already have the tail groups trimmed. If the HAVING clause is the opposite of the ORDER BY order, groups matching the condition might already be trimmed and not returned, e.g.:

SELECT SUM(colA)
FROM myTable
GROUP BY colB
ORDER BY SUM(colA) DESC
HAVING SUM(colA) < 100
LIMIT 10

Increase the min trim size to keep more groups in these cases.

Configuration Parameters

| Parameter | Description | Default | Query Override |
|---|---|---|---|
| pinot.server.query.executor.num.groups.limit | The maximum number of groups allowed per segment. | 100,000 | N/A |
| pinot.server.query.executor.min.segment.group.trim.size | The minimum number of groups to keep when trimming groups at the segment level. | -1 (trim disabled) | OPTION(minSegmentGroupTrimSize=<minSegmentGroupTrimSize>) |
| pinot.server.query.executor.min.server.group.trim.size | The minimum number of groups to keep when trimming groups at the server level. | 5,000 | OPTION(minServerGroupTrimSize=<minServerGroupTrimSize>) |
| pinot.server.query.executor.groupby.trim.threshold | The number of groups to trigger the server level trim. | 1,000,000 | N/A |
| pinot.server.query.executor.max.execution.threads | The maximum number of execution threads (parallelism of segment processing) used per query. | -1 (use all execution threads) | OPTION(maxExecutionThreads=<maxExecutionThreads>) |

Transformation Functions

This document contains the list of all the transformation functions supported by Pinot SQL.

Math Functions

| Function | Description |
|---|---|
| ADD(col1, col2, col3...) | Sum of at least two values |
| SUB(col1, col2) | Difference between two values |
| MULT(col1, col2, col3...) | Product of at least two values |
| DIV(col1, col2) | Quotient of two values |
| MOD(col1, col2) | Modulo of two values |
| ABS(col1) | Absolute of a value |
| CEIL(col1) | Rounded up to the nearest integer |
| FLOOR(col1) | Rounded down to the nearest integer |
| EXP(col1) | Euler's number (e) raised to the power of col |
| LN(col1) | Natural log of value, i.e. ln(col1) |
| SQRT(col1) | Square root of a value |

String Functions

Multiple string functions are supported out of the box from release-0.5.0.

| Function | Description |
|---|---|
| UPPER(col) | Convert string to upper case |
| LOWER(col) | Convert string to lower case |
| REVERSE(col) | Reverse the string |
| SUBSTR(col, startIndex, endIndex) | Gets substring of the input string from startIndex to endIndex. Index begins at 0. Set endIndex to -1 to calculate till the end of the string |
| CONCAT(col1, col2, separator) | Concatenate two input strings using the separator |
| TRIM(col) | Trim spaces from both sides of the string |
| LTRIM(col) | Trim spaces from the left side of the string |
| RTRIM(col) | Trim spaces from the right side of the string |
| LENGTH(col) | Calculate length of the string |
| STRPOS(col, find, N) | Find Nth instance of find string in input. Returns 0 if input string is empty. Returns -1 if the Nth instance is not found or input string is null. |
| STARTSWITH(col, prefix) | Returns true if column starts with prefix string |
| REPLACE(col, find, substitute) | Replace all instances of find with substitute in input |
| RPAD(col, size, pad) | String padded from the right side with pad to reach final size |
| LPAD(col, size, pad) | String padded from the left side with pad to reach final size |
| CODEPOINT(col) | The Unicode codepoint of the first character of the string |
| CHR(codepoint) | The character corresponding to the Unicode codepoint |
| regexpExtract(value, regexp) | Extracts values that match the provided regular expression |
| remove(input, search) | Removes all instances of search from string |

DateTime Functions

Date time functions allow you to perform transformations on columns that contain timestamps or dates.

| Function | Description |
|---|---|
| TIMECONVERT(col, fromUnit, toUnit) | Converts the value into another time unit. The column should be an epoch timestamp. |
| DATETIMECONVERT(columnName, inputFormat, outputFormat, outputGranularity) | Converts the value into another date time format, and buckets time based on the given time granularity. |
| DATETRUNC | Converts the value into a specified output granularity seconds since UTC epoch that is bucketed on a unit in a specified timezone. |
| ToEpoch<TIME_UNIT>(timeInMillis) | Convert epoch milliseconds to epoch <Time Unit>. |
| ToEpoch<TIME_UNIT>Rounded(timeInMillis, bucketSize) | Convert epoch milliseconds to epoch <Time Unit>, rounded to the nearest rounding bucket (bucket size is defined in <Time Unit>). |
| ToEpoch<TIME_UNIT>Bucket(timeInMillis, bucketSize) | Convert epoch milliseconds to epoch <Time Unit>, divided by bucket size (bucket size is defined in <Time Unit>). |
| FromEpoch<TIME_UNIT>(timeIn<TIME_UNIT>) | Convert epoch <Time Unit> to epoch milliseconds. |
| FromEpoch<TIME_UNIT>Bucket(timeIn<TIME_UNIT>, bucketSizeIn<TIME_UNIT>) | Convert epoch <Bucket Size><Time Unit> to epoch milliseconds. |
| ToDateTime(timeInMillis, pattern[, timezoneId]) | Convert epoch millis value to DateTime string represented by pattern. |
| FromDateTime(dateTimeString, pattern) | Convert DateTime string represented by pattern to epoch millis. |
| round(timeValue, bucketSize) | Round the given time value to nearest bucket start value. |
| now() | Return current time as epoch millis. |
| timezoneHour(timeZoneId) | Returns the hour of the time zone offset. |
| timezoneMinute(timeZoneId) | Returns the minute of the time zone offset. |
| year(tsInMillis) | Returns the year from the given epoch millis in UTC timezone. |
| year(tsInMillis, timeZoneId) | Returns the year from the given epoch millis and timezone id. |
| yearOfWeek(tsInMillis) | Returns the year of the ISO week from the given epoch millis in UTC timezone. Alias yow is also supported. |
| yearOfWeek(tsInMillis, timeZoneId) | Returns the year of the ISO week from the given epoch millis and timezone id. Alias yow is also supported. |
| quarter(tsInMillis) | Returns the quarter of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 4. |
| quarter(tsInMillis, timeZoneId) | Returns the quarter of the year from the given epoch millis and timezone id. The value ranges from 1 to 4. |
| month(tsInMillis) | Returns the month of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 12. |
| month(tsInMillis, timeZoneId) | Returns the month of the year from the given epoch millis and timezone id. The value ranges from 1 to 12. |
| week(tsInMillis) | Returns the ISO week of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 53. Alias weekOfYear is also supported. |
| week(tsInMillis, timeZoneId) | Returns the ISO week of the year from the given epoch millis and timezone id. The value ranges from 1 to 53. Alias weekOfYear is also supported. |
| dayOfYear(tsInMillis) | Returns the day of the year from the given epoch millis in UTC timezone. The value ranges from 1 to 366. Alias doy is also supported. |
| dayOfYear(tsInMillis, timeZoneId) | Returns the day of the year from the given epoch millis and timezone id. The value ranges from 1 to 366. Alias doy is also supported. |
| day(tsInMillis) | Returns the day of the month from the given epoch millis in UTC timezone. The value ranges from 1 to 31. Alias dayOfMonth is also supported. |
| day(tsInMillis, timeZoneId) | Returns the day of the month from the given epoch millis and timezone id. The value ranges from 1 to 31. Alias dayOfMonth is also supported. |
| dayOfWeek(tsInMillis) | Returns the day of the week from the given epoch millis in UTC timezone. The value ranges from 1 (Monday) to 7 (Sunday). Alias dow is also supported. |
| dayOfWeek(tsInMillis, timeZoneId) | Returns the day of the week from the given epoch millis and timezone id. The value ranges from 1 (Monday) to 7 (Sunday). Alias dow is also supported. |
| hour(tsInMillis) | Returns the hour of the day from the given epoch millis in UTC timezone. The value ranges from 0 to 23. |
| hour(tsInMillis, timeZoneId) | Returns the hour of the day from the given epoch millis and timezone id. The value ranges from 0 to 23. |
| minute(tsInMillis) | Returns the minute of the hour from the given epoch millis in UTC timezone. The value ranges from 0 to 59. |
| minute(tsInMillis, timeZoneId) | Returns the minute of the hour from the given epoch millis and timezone id. The value ranges from 0 to 59. |
| second(tsInMillis) | Returns the second of the minute from the given epoch millis in UTC timezone. The value ranges from 0 to 59. |
| second(tsInMillis, timeZoneId) | Returns the second of the minute from the given epoch millis and timezone id. The value ranges from 0 to 59. |
| millisecond(tsInMillis) | Returns the millisecond of the second from the given epoch millis in UTC timezone. The value ranges from 0 to 999. |
| millisecond(tsInMillis, timeZoneId) | Returns the millisecond of the second from the given epoch millis and timezone id. The value ranges from 0 to 999. |

JSON Functions

Transform Functions

These functions can only be used in Pinot SQL queries.

| Function | Description |
|---|---|
| JSONEXTRACTSCALAR(jsonField, 'jsonPath', 'resultsType', [defaultValue]) | Evaluates the 'jsonPath' on jsonField, returns the result as the type 'resultsType', uses optional defaultValue for null or parsing error. |
| JSONEXTRACTKEY(jsonField, 'jsonPath') | Extracts all matched JSON field keys based on 'jsonPath' into a STRING_ARRAY. |

Scalar Functions

These functions can be used for column transformation in table ingestion configs.

| Function | Description |
|---|---|
| TOJSONMAPSTR(map) | Convert map to JSON String. |
| JSONFORMAT(object) | Convert object to JSON String. |
| JSONPATH(jsonField, 'jsonPath') | Extracts the object value from jsonField based on 'jsonPath'; the result type is inferred based on the JSON value. Cannot be used in queries because the data type is not specified. |
| JSONPATHLONG(jsonField, 'jsonPath', [defaultValue]) | Extracts the Long value from jsonField based on 'jsonPath', uses optional defaultValue for null or parsing error. |
| JSONPATHDOUBLE(jsonField, 'jsonPath', [defaultValue]) | Extracts the Double value from jsonField based on 'jsonPath', uses optional defaultValue for null or parsing error. |
| JSONPATHSTRING(jsonField, 'jsonPath', [defaultValue]) | Extracts the String value from jsonField based on 'jsonPath', uses optional defaultValue for null or parsing error. |
| JSONPATHARRAY(jsonField, 'jsonPath') | Extracts an array from jsonField based on 'jsonPath'; the result type is inferred based on the JSON value. Cannot be used in queries because the data type is not specified. |
| JSONPATHARRAYDEFAULTEMPTY(jsonField, 'jsonPath') | Extracts an array from jsonField based on 'jsonPath'; the result type is inferred based on the JSON value. Returns an empty array for null or parsing error. Cannot be used in queries because the data type is not specified. |

Binary Functions

| Function | Description |
|---|---|
| SHA(bytesCol) | Return SHA-1 digest of binary column (bytes type) as hex string |
| SHA256(bytesCol) | Return SHA-256 digest of binary column (bytes type) as hex string |
| SHA512(bytesCol) | Return SHA-512 digest of binary column (bytes type) as hex string |
| MD5(bytesCol) | Return MD5 digest of binary column (bytes type) as hex string |

Multi-value Column Functions

All of the functions mentioned till now only support single-value columns. You can use the following functions to do operations on multi-value columns.

| Function | Description |
|---|---|
| ARRAYLENGTH | Returns the length of a multi-value column |
| MAP_VALUE(mapColumn, 'myKey', valueColumn) | Select the value for a key from a Map stored in Pinot. |
| VALUEIN | The transform function will filter the value from the multi-valued column with the given constant values. The VALUEIN transform function is especially useful when the same multi-valued column is both a filtering column and a grouping column. |

Advanced Queries

Geospatial Queries

Pinot supports Geospatial queries on columns containing text-based geographies. For more details on the queries and how to enable them, see Geospatial.

Text Queries

Pinot supports pattern matching on text-based columns. Only the columns mentioned as text columns in the table config can be queried using this method. For more details on how to enable pattern matching, see Text search support.
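As an illustration of a text query, a sketch assuming a text index exists on a hypothetical column textCol:

SELECT COUNT(*)
FROM myTable
WHERE TEXT_MATCH(textCol, '"machine learning"')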

Querying JSON data

To see how JSON data can be queried, assume that we have the following table:

Table myTable:
  id        INTEGER
  jsoncolumn    JSON

Table data:
101,{"name":{"first":"daffy"\,"last":"duck"}\,"score":101\,"data":["a"\,"b"\,"c"\,"d"]}
102,{"name":{"first":"donald"\,"last":"duck"}\,"score":102\,"data":["a"\,"b"\,"e"\,"f"]}
103,{"name":{"first":"mickey"\,"last":"mouse"}\,"score":103\,"data":["a"\,"b"\,"g"\,"h"]}
104,{"name":{"first":"minnie"\,"last":"mouse"}\,"score":104\,"data":["a"\,"b"\,"i"\,"j"]}
105,{"name":{"first":"goofy"\,"last":"dwag"}\,"score":104\,"data":["a"\,"b"\,"i"\,"j"]}
106,{"person":{"name":"daffy duck"\,"companies":[{"name":"n1"\,"title":"t1"}\,{"name":"n2"\,"title":"t2"}]}}
107,{"person":{"name":"scrooge mcduck"\,"companies":[{"name":"n1"\,"title":"t1"}\,{"name":"n2"\,"title":"t2"}]}}

We also assume that "jsoncolumn" has a Json Index on it. Note that the last two rows in the table have a different structure than the rest of the rows. In keeping with the JSON specification, a JSON column can contain any valid JSON data and doesn't need to adhere to a predefined schema. To pull out the entire JSON document for each row, we can run the query below:

SELECT id, jsoncolumn
  FROM myTable

| id | jsoncolumn |
|---|---|
| 101 | {"name":{"first":"daffy","last":"duck"},"score":101,"data":["a","b","c","d"]} |
| 102 | {"name":{"first":"donald","last":"duck"},"score":102,"data":["a","b","e","f"]} |
| 103 | {"name":{"first":"mickey","last":"mouse"},"score":103,"data":["a","b","g","h"]} |
| 104 | {"name":{"first":"minnie","last":"mouse"},"score":104,"data":["a","b","i","j"]} |
| 105 | {"name":{"first":"goofy","last":"dwag"},"score":104,"data":["a","b","i","j"]} |
| 106 | {"person":{"name":"daffy duck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}} |
| 107 | {"person":{"name":"scrooge mcduck","companies":[{"name":"n1","title":"t1"},{"name":"n2","title":"t2"}]}} |

To drill down and pull out specific keys within the JSON column, we simply append the JsonPath expression of those keys to the end of the column name.

SELECT id,
       json_extract_scalar(jsoncolumn, '$.name.last', 'STRING', 'null') last_name,
       json_extract_scalar(jsoncolumn, '$.name.first', 'STRING', 'null') first_name,
       json_extract_scalar(jsoncolumn, '$.data[1]', 'STRING', 'null') value
  FROM myTable

| id | last_name | first_name | value |
|---|---|---|---|
| 101 | duck | daffy | b |
| 102 | duck | donald | b |
| 103 | mouse | mickey | b |
| 104 | mouse | minnie | b |
| 105 | dwag | goofy | b |
| 106 | null | null | null |
| 107 | null | null | null |

Note that the value column is null for rows with id 106 and 107. This is because these rows have JSON documents that don't have a key with JsonPath $.data[1]. We can filter out these rows:

SELECT id,
       json_extract_scalar(jsoncolumn, '$.name.last', 'STRING', 'null') last_name,
       json_extract_scalar(jsoncolumn, '$.name.first', 'STRING', 'null') first_name,
       json_extract_scalar(jsoncolumn, '$.data[1]', 'STRING', 'null') value
  FROM myTable
 WHERE JSON_MATCH(jsoncolumn, '"$.data[1]" IS NOT NULL')

| id | last_name | first_name | value |
|---|---|---|---|
| 101 | duck | daffy | b |
| 102 | duck | donald | b |
| 103 | mouse | mickey | b |
| 104 | mouse | minnie | b |
| 105 | dwag | goofy | b |

Certain last names (duck and mouse, for example) repeat in the data above. We can get a count of each last name by running a GROUP BY query on a JsonPath expression:

  SELECT json_extract_scalar(jsoncolumn, '$.name.last', 'STRING', 'null') last_name,
         count(*)
    FROM myTable
   WHERE JSON_MATCH(jsoncolumn, '"$.data[1]" IS NOT NULL')
GROUP BY json_extract_scalar(jsoncolumn, '$.name.last', 'STRING', 'null')
ORDER BY 2 DESC

| jsoncolumn.name.last | count(*) |
|---|---|
| mouse | 2 |
| duck | 2 |
| dwag | 1 |

Also, there is numerical information (jsoncolumn.score) embedded within the JSON document. We can extract those numerical values from the JSON data into SQL and sum them up using the query below:

  SELECT json_extract_scalar(jsoncolumn, '$.name.last', 'STRING', 'null') last_name,
         sum(json_extract_scalar(jsoncolumn, '$.score', 'INT', 0)) total
    FROM myTable
   WHERE JSON_MATCH(jsoncolumn, '"$.name.last" IS NOT NULL')
GROUP BY json_extract_scalar(jsoncolumn, '$.name.last', 'STRING', 'null')

| jsoncolumn.name.last | sum(jsoncolumn.score) |
|---|---|
| mouse | 207 |
| duck | 203 |
| dwag | 104 |
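JSON_MATCH can also test nested keys for equality directly against the JSON index; a sketch based on the table above:

SELECT id
  FROM myTable
 WHERE JSON_MATCH(jsoncolumn, '"$.name.last" = ''duck''')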