Leaf

Describes the leaf operator in the multi-stage query engine.

The leaf operator is the operator that actually reads the data from the segments. Instead of being just a simple table scan, the leaf operator is a meta-operator that wraps the single-stage query engine and executes all the operators in the leaf stage of the query plan.

Implementation details

The leaf operator is not a relational operator itself but a meta-operator that is able to execute single-stage queries. When servers execute a leaf stage, they compile all operations in the stage but the send operator into the equivalent single-stage query and execute that using a slightly modified version of the single-stage engine.

As a result, leaf stage operators can use all the optimizations and indices that the single-stage engine can use but it also means that there may be slight differences when an operator is executed in a leaf stage compared to when it is executed in an intermediate stage. For example, operations pushed down to the leaf stage may use indexes (see how to know if indexes are used) or the semantics can be slightly different.

You can read Troubleshoot issues with the multi-stage query engine (v2) for more information on the differences between the leaf and intermediate stages, but the main ones are:

  • Null handling is different.

  • Some functions are only supported in multi-stage and some others only in single-stage.

  • Type coercion is different. While the single-stage engine always operates with generic types (ie uses doubles when mathematical operations are used), the multi-stage engine tries to keep the types (ie adding two integers will result in an integer).

Blocking nature

One of the slight differences between the leaf and the normal single-stage engine is that the leaf engine tries to be not blocking.

Hints

None

Stats

executionTimeMs

Type: Long

The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller that this value if the parallelism is larger than 1.

emittedRows

Type: Long

The number of groups emitted by the operator.

table

Type: String

The name of the table that is scanned. This is the name without the type suffix (so without _REALTIME or _OFFLINE). This is very useful to understand which table is being scanned by this leaf stage in case of complex queries.

numDocsScanned

Type: Long

Similar to the same stat in single-stage queries, this stat indicates the number of rows selected after the filter phase.

If it is very high, that means the selectivity for the query is low and lots of rows need to be processed after the filtering. Consider refining the filter to increase the selectivity of the query.

numEntriesScannedInFilter

Type: Long

Similar to the same stat in single-stage queries, this stat indicates the number of entries (aka scalar values) scanned in the filtering phase of query execution.

Can be larger than the total scanned doc count because of multiple filtering predicates or multi-value entries. Can also be smaller than the total scanned doc count if indexing is used for filtering.

This along with numEntriesScannedPostFilter indicates where most of the time is spent during table scan processing. If this value is high, enabling indexing for affected columns is a way to bring it down. Another option is to partition the data based on the dimension most heavily used in your filter queries.

numEntriesScannedPostFilter

Type: Long

Similar to the same stat in single-stage queries, this stat indicates the number of entries (aka scalar values) scanned after the filtering phase of query execution, ie. aggregation and/or group-by phases. This is equivalent to numDocScanned * number of projected columns.

This along with numEntriesScannedInFilter indicates where most of the time is spent during table scan processing. A high number for this means the selectivity is low (that is, Pinot needs to scan a lot of records to answer the query). If this is high, consider using star-tree index, given a regular index won't improve performance.

numSegmentsQueried

Type: Integer

Similar to the same stat in single-stage queries, this stat indicates the total number of segment queried for a query. May be less than the total number of segments if the broker applies optimizations.

The broker decides how many segments to query on each server, based on broker pruning logic. The server decides how many of these segments to actually look at, based on server pruning logic. After processing segments for a query, fewer may have the matching records.

In general, numSegmentsQueried >= numSegmentsProcessed >= numSegmentsMatched.

numSegmentsProcessed

Type: Integer

Similar to the same stat in single-stage queries, this stat indicates the number of segments processed with at least one document matched in the query response.

The more segments are processed, the more IO has to be done. This is why selective queries where numSegmentsProcessed is close to numSegmentsQueried can be optimized by changing the data distribution.

numSegmentsMatched

Type: Integer

Similar to the same stat in single-stage queries, this stat indicates the number of segment operators used to process segments. Indicates the effectiveness of the pruning logic.

totalDocs

Type: Long

Similar to the same stat in single-stage queries, this stat indicates the number of rows in the table.

numGroupsLimitReached

Type: Boolean

Similar to the same stat in single-stage queries and the same in aggregate operators, this stat indicates if the max group limit has been reached in a group by aggregation operator executed in the leaf stage.

If this boolean is set to true, the query result may not be accurate. The default value for numGroupsLimit is 100k, and should be sufficient for most use cases.

numResizes

Type: Integer

Number of result resizes for queries

resizeTimeMs

Type: Long

Time spent in resizing results for the output. Either because of LIMIT or maximum allowed group by keys or any other criteria.

threadCpuTimeNs

Type: Long

Aggregated thread cpu time in nanoseconds for query processing from servers. This metric is only available if Pinot is configured with pinot.server.instance.enableThreadCpuTimeMeasurement.

systemActivitiesCpuTimeNs

Type: Long

Aggregated system activities cpu time in nanoseconds for query processing (e.g. GC, OS paging etc.) This metric is only available if Pinot is configured with pinot.server.instance.enableThreadCpuTimeMeasurement.

numSegmentsPrunedByServer

Type: Integer

The number of segments pruned by the server, for any reason.

numSegmentsPrunedInvalid

Type: Integer

The number of segments pruned because they are invalid. Segments are invalid when the schema has changed and the segment has not been refreshed.

For example, if a column is added to the schema, the segment will be invalid for queries that use that column until it is refreshed.

numSegmentsPrunedByLimit

Type: Integer

The number of segments pruned because they are not needed for the query due to the limit clause.

Pinot keeps a count of the number of rows returned by each segment. Once it's guaranteed that no more segments need to be read to satisfy the limit clause without breaking semantics, the remaining segments are pruned.

For example, a query like SELECT col1 FROM table2 LIMIT 10 can be pruned for this reason while a query like SELECT col1 FROM table2 ORDER BY col1 DESC LIMIT 10 cannot because Pinot needs to read all segments to guarantee the larger values of col1 are returned.

numSegmentsPrunedByValue

Type: Integer

The number of segments pruned because they are not needed for the query due to a value clause, usually a where.

Pinot keeps the maximum and minimum values of each segment for each column. If the value clause is such that the segment cannot contain any rows that satisfy the clause, the segment is pruned.

numConsumingSegmentsProcessed

Type: Integer

Like numSegmentsProcessed but only for consuming segments.

numConsumingSegmentsMatched

Type: Integer

Like numSegmentsMatched but only for consuming segments.

operatorExecutionTimeMs

Type: Long

The time spent by the operator executing.

operatorExecStartTimeMs

Type: Long

The instant in time when the operator started executing.

Explain attributes

Given that the leaf operator is a meta-operator, it is not actually shown in the explain plan. But the leaf stage is the only operator that can execute table scans, so here we list the attributes that can be found in the explain plan for a table scan

table

Type: String array

Example: table=[[default, userGroups]]

The qualified name of the table that is scanned, which means it also contains the name of the database being used.

Tips and tricks

Try to push as much as possible to the leaf stage

Leaf stage operators can use all the optimizations and indices that the single-stage engine can use. This means that it is usually better to push down as much as possible to the leaf stage.

The engine is smart enough to push down filters and aggregations without breaking semantics, but sometimes there are subtle SQL semantics and what the domain expert writing the query wants to do.

Sometimes things the engine is too paranoid about null handling or the query includes an unnecessary limit clause that prevents the engine from pushing down the filter.

It is recommended to analyze your explain plan to be sure that the engine is able to push down as much logic as you expect.

Last updated