Sort or limit

Describes the sort or limit relation operator in the multi-stage query engine.

The sort or limit operator is used to sort the input data, limit the number of rows emitted by the operator or both. This operator is generated by the multi-stage query engine when you use an order by, limit or offset operation in a query.

Implementation details

Blocking nature

Hints

None

Stats

executionTimeMs

Type: Long

The summation of time spent by all threads executing the operator. This means that the wall time spent in the operation may be smaller that this value if the parallelism is larger than 1.

emittedRows

Type: Long

The number of groups emitted by the operator.

Explain attributes

The sort or limit operator is represented in the explain plan as a LogicalSort explain node.

sort#

Type: Expression

The sort expressions used by the operator. There is one of these attributes per sort expression. The first one is sort0, the second one is sort1, and so on.

The value of this attribute is the expression used to sort the data and may contain indexed columns ($0, $1, etc) that represent the columns of the virtual row generated by the upstream.

For example, the following plan:

LogicalSort(sort0=[$0], sort1=[$2], dir0=[ASC], dir1=[ASC], fetch=[10])
  LogicalProject(userUUID=[$6], deviceOS=[$4], EXPR$2=[SUBSTRING($4, 0, 2)])
    LogicalTableScan(table=[[default, userAttributes]])

Is saying that the rows are sorted first by the column with index 0 and then by the column with index 2 in the virtual row generated by the upstream. That column is generated by a projection whose first column (index 0) is userUUID, the second (index 1) is deviceOS and third (index 2) is the result of the SUBSTRING($4, 0, 2) expression. As we know $4 in this project is deviceOS, we can infer that the third column is the first two characters of the deviceOS column.

dir#

Type: ASC or DESC

The direction of the sort. There is one of these attributes per sort expression.

fetch

Type: Long

The number of rows to emit. This is the equivalent to LIMIT in SQL. Remember that the limit can be applied without sorting, in which case the order on which the rows are emitted is undefined.

offset

Type: Long

The number of rows to skip before emitting the rows. This is the equivalent to OFFSET in SQL.

Tips and tricks

Limit and offset can prevent filter pushdown

In SQL, usually limit and offset are used in the last stage of the query. But when being used in the middle of the query (like in a subquery or a CTE), it can prevent filter pushdown optimization.

For example, imagine the following query:

select 
a.* 
from userAttributes as a
join userGroups as g
on a.userUUID = g.userUUID
where a.deviceOS = 'windows'

This query may generate the plan:

LogicalProject(daysSinceFirstTrip=[$0], deviceOS=[$1], totalTrips=[$2], userUUID=[$3])
  LogicalJoin(condition=[=($3, $4)], joinType=[inner])
    PinotLogicalExchange(distribution=[hash[3]])
      LogicalProject(daysSinceFirstTrip=[$3], deviceOS=[$4], totalTrips=[$5], userUUID=[$6])
        LogicalFilter(condition=[=($4, _UTF-8'windows')])
          LogicalTableScan(table=[[default, userAttributes]])
    PinotLogicalExchange(distribution=[hash[0]])
      LogicalProject(userUUID=[$4])
        LogicalTableScan(table=[[default, userGroups]])

We can see that the filter deviceOS = 'windows' is pushed down to the leaf stage. This reduce the amount of data that needs to be scanned and can improve the query performance, specially if there is an inverted index in the deviceOS column.

But if we modify the query to add a limit to the userAttributes table scan:

select 
a.* 
from (select * from userAttributes limit 10) as a
join userGroups as g
on a.userUUID = g.userUUID
where a.deviceOS = 'windows'

The generated plan will be:

LogicalProject(daysSinceFirstTrip=[$0], deviceOS=[$1], totalTrips=[$2], userUUID=[$3])
  LogicalJoin(condition=[=($3, $4)], joinType=[inner])
    PinotLogicalExchange(distribution=[hash[3]])
      LogicalFilter(condition=[=($1, _UTF-8'windows')])
        LogicalSort(offset=[0], fetch=[10])
          PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
            LogicalSort(fetch=[10])
              LogicalProject(daysSinceFirstTrip=[$3], deviceOS=[$4], totalTrips=[$5], userUUID=[$6])
                LogicalTableScan(table=[[default, userAttributes]])
    PinotLogicalExchange(distribution=[hash[0]])
      LogicalProject(userUUID=[$4])
        LogicalTableScan(table=[[default, userGroups]])

Here we can see that the filter deviceOS = 'windows' is not pushed down leaf stage, which means that the engine will need to scan all the data in the userAttributes table and then apply the filter.

The reason why the filter is not pushed down is that the limit operation must be applied before the filter in order to not break the semantics, which in this case are saying that we want 10 rows of the userAttributes table without considering their deviceOS value.

In cases where you actually want to apply the filter before the limit, you can specify the where clause in the subquery. For example:

select 
a.* 
from (select * from userAttributes where deviceOS = 'windows' limit 10) as a
join userGroups as g
on a.userUUID = g.userUUID

Which will produce the following plan:

LogicalProject(daysSinceFirstTrip=[$0], deviceOS=[$1], totalTrips=[$2], userUUID=[$3])
  LogicalJoin(condition=[=($3, $4)], joinType=[inner])
    PinotLogicalExchange(distribution=[hash[3]])
      LogicalSort(offset=[0], fetch=[10])
        PinotLogicalSortExchange(distribution=[hash], collation=[[]], isSortOnSender=[false], isSortOnReceiver=[false])
          LogicalSort(fetch=[10])
            LogicalProject(daysSinceFirstTrip=[$3], deviceOS=[$4], totalTrips=[$5], userUUID=[$6])
              LogicalFilter(condition=[=($4, _UTF-8'windows')])
                LogicalTableScan(table=[[default, userAttributes]])
    PinotLogicalExchange(distribution=[hash[0]])
      LogicalProject(userUUID=[$4])
        LogicalTableScan(table=[[default, userGroups]])

As you can see, the filter is pushed down to leaf stage, which will reduce the amount of data

Do not abuse offset pagination

Although OFFSET and LIMIT are a very simple way to paginate results, they can be very inefficient. It is almost always better to paginate using a WHERE clause that uses a range of values instead of using OFFSET.

The reason is that in order to apply an OFFSET the engine must generate these rows and then discard them. Instead, if you use a WHERE clause with a range of values, the engine can apply different techniques like indexes or pruning to avoid reading the rows that are not needed.

This is not a Pinot specific issue, but a general one. See for example Paging Through Results (external link) or Pagination, You Are Probably Doing It Wrong (external link).

Last updated