Filter

Describes the filter relation operator in the multi-stage query engine.

The filter operator is used to filter rows based on a condition.

This page describes the filter operator defined in the relational algebra used by multi-stage queries. This operator is generated by the multi-stage query engine when you use the where, having or sometimes on clauses.

Implementation details

Filter operations apply a predicate to each row and only keep the rows that satisfy the predicate.

It is important to notice that filter operators can only be optimized using indexes when they are executed in the leaf stage. The reason for that is that the intermediate stages don't have access to the actual segments. This is why the engine will try to push down the filter operation to the leaf stage whenever possible.

As explained in explain-plan-multiple-stages, the explain plan in the multi-stage query engine does not indicate whether indexes are used or not.

Blocking nature

The filter operator is a streaming operator. It emits the blocks of rows as soon as they are received from the upstream operator.

Hints

None

Stats

executionTimeMs

Type: Long

The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller that this value if the parallelism is larger than 1. This number is affected by the number of received rows and the complexity of the predicate.

emittedRows

Type: Long

The number of groups emitted by the operator. A large number of emitted rows may not be problematic, but indicates that the predicate is not very selective.

Explain attributes

The filter operator is represented in the explain plan as a LogicalFilter explain node.

condition

Type: Expression

The condition that is being applied to the rows. The expression may use indexed columns ($0, $1, etc), functions and literals. The indexed columns are always 0-based.

For example, the following explain plan:

LogicalFilter(condition=[>($5, 2)])
  LogicalTableScan(table=[[default, userAttributes]])

Is saying that the filter is applying the condition $5 > 2 which means that only the rows where the 6th column is greater than 2 will be emitted. In order to know which column is the 6th, you need to look at the schema of the table scanned.

Tips and tricks

How to know if indexes are used

As explained in explain-plan-multiple-stages, the explain plan in the multi-stage query engine does not directly indicate whether indexes are used or not.

Apache Pinot contributors are working on improving this, but it is not yet available. Meanwhile, we need an indirect approach to get that information.

First, we need to know on which stage the filter is being used. If the filter is being used in an intermediate stage, then the filter is not using indexes. In order to know the stage, you can extract stages as explained in understanding-stages.

But what about the leaf filters executed in the stage? Not all filters in the leaf stage can use indexes. The only way to know if the filter is using indexes is to use single-stage explain plan. In order to do so you need to transform the leaf stage into a single-stage query. This is a manual process that can be tedious but ends up not being so difficult once you get used to it.

See understanding-multi-stage-query for more information.

Last updated