Forward Index

The values for every column are stored in a forward index, of which there are three types:

Dictionary-encoded forward index: Builds a dictionary mapping 0-indexed ids to each unique value in a column, and a forward index that contains the bit-compressed ids.
Sorted forward index: Builds a dictionary mapping each unique value to a pair of start and end document ids, and a forward index on top of the dictionary encoding.
Raw value forward index: Builds a forward index of the column's values.

To save segment storage space, the forward index can now be disabled while creating new tables.

Dictionary-encoded forward index with bit compression (default)

Each unique value from a column is assigned an id and a dictionary is built that maps the id to the value. The forward index stores bit-compressed ids instead of the values. If you have few unique values, dictionary-encoding can significantly improve space efficiency.
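The encoding described above can be sketched in a few lines of Python. This is a simplified illustration, not Pinot's implementation: the function name `dictionary_encode` is hypothetical, the dictionary is assumed to be sorted, and the bit width is computed as the number of bits needed for the largest id.

```python
def dictionary_encode(values):
    """Return (dictionary, ids, bits_per_id) for a column of values."""
    # Sorted unique values; the position in this list is the 0-indexed id.
    dictionary = sorted(set(values))
    id_of = {value: i for i, value in enumerate(dictionary)}
    # The forward index holds one dictionary id per document.
    ids = [id_of[value] for value in values]
    # Bits needed to represent the largest id (at least 1 bit).
    bits_per_id = max(1, (len(dictionary) - 1).bit_length()) if dictionary else 0
    return dictionary, ids, bits_per_id


col_a = [100, 100, 200, 100, 300, 200]
dictionary, ids, bits = dictionary_encode(col_a)
print(dictionary)  # [100, 200, 300]
print(ids)         # [0, 0, 1, 0, 2, 1]
print(bits)        # 2
```

With three unique values, each document needs only 2 bits in the forward index rather than a full integer.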

The diagram below shows the dictionary encoding for two columns with integer and string types. For colA, dictionary encoding saved a significant amount of space for duplicated values.

On the other hand, colB has no duplicated data. Dictionary encoding will not compress much data in this case where there are a lot of unique values in the column. For the string type, we pick the length of the longest value and use it as the length for the dictionary’s fixed-length value array. The padding overhead can be high if there are a large number of unique values for a column.
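The padding overhead for string dictionaries can be estimated with a small helper. This is a rough sketch under the assumption that every entry is padded to the byte length of the longest value; the function name `fixed_length_dictionary_bytes` is hypothetical, and UTF-8 byte length is assumed.

```python
def fixed_length_dictionary_bytes(unique_values):
    """Return (total_bytes, padding_bytes) for a fixed-length string dictionary."""
    encoded = [value.encode("utf-8") for value in unique_values]
    # Every entry is padded to the length of the longest value.
    width = max((len(b) for b in encoded), default=0)
    used = sum(len(b) for b in encoded)
    total = width * len(encoded)
    return total, total - used


total, padding = fixed_length_dictionary_bytes(["a", "bb", "a-much-longer-value"])
print(total, padding)  # 57 35
```

Here a single long value forces every entry to 19 bytes, so more than half of the dictionary is padding.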

Sorted forward index with run-length encoding

When a column is physically sorted, Pinot uses a sorted forward index with run-length encoding on top of the dictionary-encoding. Instead of saving dictionary ids for each document id, Pinot will store a pair of start and end document ids for each value.

(For simplicity, this diagram does not include the dictionary encoding layer.)

The sorted forward index offers both good compression and data locality. It can also be used as an inverted index.
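The run-length idea can be sketched as follows. As in the diagram, the dictionary layer is left out for simplicity. The function names are hypothetical, and the input is assumed to be already sorted.

```python
def build_sorted_index(sorted_values):
    """Map each unique value to an inclusive (start_doc_id, end_doc_id) pair."""
    ranges = {}
    for doc_id, value in enumerate(sorted_values):
        if value in ranges:
            start, _ = ranges[value]
            ranges[value] = (start, doc_id)  # extend the run
        else:
            ranges[value] = (doc_id, doc_id)  # start a new run
    return ranges


def docs_matching(ranges, value):
    """Inverted-index style lookup: all document ids holding `value`."""
    if value not in ranges:
        return range(0)
    start, end = ranges[value]
    return range(start, end + 1)


ranges = build_sorted_index([1, 1, 1, 2, 3, 3])
print(ranges)                          # {1: (0, 2), 2: (3, 3), 3: (4, 5)}
print(list(docs_matching(ranges, 3)))  # [4, 5]
```

Because each value occupies one contiguous run of documents, the same structure answers both "which value is in document N" and "which documents hold value V".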

Real-time tables

A sorted index can be configured for a table by setting it in the table config:
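As a hedged sketch, the relevant fragment of the table config is built here as a Python dictionary and printed as JSON. The column name `column_name` is a placeholder, and the placement of `sortedColumn` under `tableIndexConfig` should be checked against the table config reference for your Pinot version.

```python
import json

# Placeholder column name, used only for illustration.
table_config_fragment = {
    "tableIndexConfig": {
        "sortedColumn": ["column_name"],
    }
}

print(json.dumps(table_config_fragment, indent=2))
```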

Note: A Pinot table can have only one sorted column.

Real-time data ingestion will sort data by the sortedColumn when generating segments - you don't need to pre-sort the data.

When a segment is committed, Pinot will do a pass over the data in each column and create a sorted index for any other columns that contain sorted data, even if they aren't specified as the sortedColumn.

Offline tables

For offline data ingestion, Pinot will do a pass over the data in each column and create a sorted index for columns that contain sorted data.

This means that if you want a column to have a sorted index, you will need to sort the data by that column before ingesting it into Pinot.

If you are ingesting multiple segments you will need to make sure that data is sorted within each segment - you don't need to sort the data across segments.
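The per-segment requirement can be checked before ingestion with a sketch like the one below. The segment names and data are made up for illustration; each list stands for one column's values in document order.

```python
def is_sorted(values):
    """True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical segments for a single column.
segments = {
    "segment_0": [1, 2, 2, 5],
    "segment_1": [0, 3, 4],  # starts below segment_0's maximum, which is fine
    "segment_2": [7, 6, 9],  # not sorted within the segment
}

sorted_status = {name: is_sorted(vals) for name, vals in segments.items()}
print(sorted_status)
# {'segment_0': True, 'segment_1': True, 'segment_2': False}
```

Only `segment_2` would miss out on a sorted index here, since ordering across segments does not matter.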

Checking sort status

You can check the sorted status of a column in a segment by running the following:

Alternatively, for offline tables and for committed segments in real-time tables, you can retrieve the sorted status from the getServerMetadata endpoint. The following example is based on the :

Raw value forward index

The raw value forward index directly stores values instead of ids.

Without the dictionary, the dictionary lookup step can be skipped for each value fetch. The index can also take advantage of the good locality of the values, thus improving the performance of scanning a large number of values.

The raw value forward index works well for columns that have a large number of unique values where a dictionary does not provide much compression.
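A back-of-the-envelope size comparison illustrates why cardinality matters. This sketch assumes fixed-width values, ignores any chunk compression applied to raw values, and uses a hypothetical function name; it is an estimate, not how Pinot sizes segments.

```python
def estimate_bytes(num_docs, cardinality, value_bytes):
    """Rough storage estimate for one fixed-width column (cardinality >= 1)."""
    bits_per_id = max(1, (cardinality - 1).bit_length())
    dictionary_bytes = cardinality * value_bytes
    forward_bytes = (num_docs * bits_per_id + 7) // 8  # bit-packed ids
    return {
        "dictionary_encoded": dictionary_bytes + forward_bytes,
        "raw": num_docs * value_bytes,
    }


# Low cardinality: the dictionary-encoded form is far smaller.
print(estimate_bytes(1_000_000, 10, 8))
# {'dictionary_encoded': 500080, 'raw': 8000000}

# Every value unique: the dictionary only adds overhead.
print(estimate_bytes(1_000_000, 1_000_000, 8))
# {'dictionary_encoded': 10500000, 'raw': 8000000}
```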

As seen in the above diagram, using dictionary encoding will require a lot of random accesses of memory to do those dictionary look-ups. With a raw value forward index, we can scan values sequentially, which can result in improved query performance when applied appropriately.

A raw value forward index can be configured for a table in the table config, as shown below:
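One way to express this, sketched as a Python dictionary printed as JSON, is to list the column under `noDictionaryColumns`. The column name is a placeholder, and the exact key and its placement are assumptions to verify against the table config reference for your Pinot version.

```python
import json

# Placeholder column name, used only for illustration.
table_config_fragment = {
    "tableIndexConfig": {
        "noDictionaryColumns": ["column_name"],
    }
}

print(json.dumps(table_config_fragment, indent=2))
```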

Dictionary encoded vs raw value

When working out whether a column should use dictionary encoded or raw value encoding, the following comparison table may help:

Dictionary

Raw Value

Disabling the forward index

Traditionally the forward index has been a mandatory index for all columns in the on-disk segment file format.

However, some columns are only ever used as filters in the WHERE clause. In such cases the forward index is unnecessary, because other indexes and structures in the segment can provide the required SQL query functionality. The forward index then just takes up storage space that could be freed.

Thus, to provide users an option to save storage space, a knob to disable the forward index is now available.

The forward index on one or more column(s) in your Pinot table can be disabled, with the following limitations:

Only supported for immutable (offline) segments.
Inverted index must be enabled on the particular column(s)
Dictionary must be enabled on the particular column(s)

Sorted columns will allow the forward index to be disabled, but this operation will be treated as a no-op and the index (which acts as both a forward index and inverted index) will be created.
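The limitations above can be summarized as a small decision function. This is only a sketch of the rules as stated here: `ColumnSetup` and `disable_forward_index_outcome` are hypothetical names, not Pinot classes, and the order in which Pinot actually applies these checks is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ColumnSetup:
    """Hypothetical summary of a column's index setup (not a Pinot class)."""
    immutable_segment: bool
    has_inverted_index: bool
    has_dictionary: bool
    is_sorted: bool = False


def disable_forward_index_outcome(col: ColumnSetup) -> str:
    """Describe what happens when forward index disabling is requested."""
    if not col.immutable_segment:
        return "rejected: only immutable (offline) segments are supported"
    if col.is_sorted:
        # The sorted index doubles as forward and inverted index.
        return "no-op: sorted index is still created"
    if not col.has_dictionary:
        return "rejected: dictionary must be enabled"
    if not col.has_inverted_index:
        return "rejected: inverted index must be enabled"
    return "disabled"


print(disable_forward_index_outcome(ColumnSetup(True, True, True)))
# disabled
print(disable_forward_index_outcome(ColumnSetup(True, True, True, is_sorted=True)))
# no-op: sorted index is still created
```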

To disable the forward index for a given column, modify the fieldConfigList within the table config, as shown below:
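A hedged sketch of such an entry, built as a Python dictionary and printed as JSON, might look like this. `fieldConfigList`, `properties`, and `forwardIndexDisabled` come from the text; the other keys (`name`, `encodingType`, `indexTypes`) and their values are assumptions to check against the table config reference. The entry keeps the dictionary and inverted index enabled, as the limitations require.

```python
import json

# columnA is the example column used in this section.
table_config_fragment = {
    "fieldConfigList": [
        {
            "name": "columnA",
            "encodingType": "DICTIONARY",    # assumed key: dictionary stays enabled
            "indexTypes": ["INVERTED"],      # assumed key: inverted index stays enabled
            "properties": {"forwardIndexDisabled": "true"},
        }
    ]
}

print(json.dumps(table_config_fragment, indent=2))
```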

A table reload operation must be performed for the above config to take effect. Enabling / disabling other indexes on the column can be done via the usual options.

The forward index can also be regenerated for a column where it is disabled by removing the property forwardIndexDisabled from the fieldConfigList properties bucket and reloading the segment.
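Removing the property can be sketched as a small transformation of the table config. The function name is hypothetical and the config shape (a `name` key per entry, a `properties` map) is an assumption. Uploading the updated config and triggering the reload are separate steps that are not shown.

```python
import copy


def without_forward_index_disabled(table_config, column):
    """Return a copy of the config with forwardIndexDisabled removed for `column`."""
    updated = copy.deepcopy(table_config)
    for field in updated.get("fieldConfigList", []):
        if field.get("name") == column:
            # Drop the property if present; leave other properties untouched.
            (field.get("properties") or {}).pop("forwardIndexDisabled", None)
    return updated


config = {
    "fieldConfigList": [
        {"name": "columnA", "properties": {"forwardIndexDisabled": "true"}},
        {"name": "columnB", "properties": {"forwardIndexDisabled": "true"}},
    ]
}
updated = without_forward_index_disabled(config, "columnA")
print(updated["fieldConfigList"][0]["properties"])  # {}
print(updated["fieldConfigList"][1]["properties"])  # {'forwardIndexDisabled': 'true'}
```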

Warning:

For multi-value (MV) columns the following invariants cannot be maintained after regenerating the forward index for a forward index disabled column:

Examples of queries which will fail after disabling the forward index for an example column, columnA, can be found below:

Select

Forward index disabled columns cannot be present in the SELECT clause even if filters are added on it.

Group By Order By

Forward index disabled columns cannot be present in the GROUP BY and ORDER BY clauses. They also cannot be part of the HAVING clause.

Aggregation Queries

A subset of the aggregation functions, such as MIN, MAX, DISTINCTCOUNT, and DISTINCTCOUNTHLL, still works when the forward index is disabled. Some other aggregation functions will not work, such as the below:

Distinct

Forward index disabled columns cannot be present in the SELECT DISTINCT clause.

Range Queries

To run queries on single-value columns where the filter clause contains operators such as >, <, >=, or <=, a version 2 range index must be present. Without the range index such queries will fail, as shown below:
