Null Value Support

Need for special NULL value handling

By default, Pinot transforms null values coming from the data source to a default value determined by the type of the corresponding column (or as specified in the schema). Eg: for INT column, the default will be 0 and for STRING column, the default is "null". This transformation is necessary to ensure all the indices can be built correctly during segment creation. However, we're now unable to keep track of the null values in the Pinot table and hence cannot support queries such as:

select count(*) from my_table where column IS NOT NULL

There is a workaround by matching with default values in the filter predicate. However, this is error prone since oftentimes it's difficult to distinguish valid values from the default null values. Therefore, we added first class NULL value support in Pinot for overcoming this limitation. As of today, the latest version supports NULL filter predicates only. Generic support for NULL handling in query execution is in progress (eg: within aggregation functions such as count or sum).

High Level Architecture

To turn on NULL handling, simply enable the boolean flag in the table index config called as nullHandlingEnabled (please see tableIndexConfig section). Please note - this will cause Pinot to use additional memory and disk space per segment. The details are as follows:

Ingestion Phase

During data ingestion (either realtime/offline) eachGenericRow object derived from the original data source record keeps track of all the column names containing null values. This is done as part of the NullValueTransformer. For each such column, the segment creation logic updates a NULL value vector (implemented by a roaring bitmap) with the corresponding document ID. Effectively, at the end of the segment creation process we get a per column NULL value vector which can give us the set of document IDs containing null values for that column. This per column vector is then exposed through the DataSource interface for use in query execution.

Query Phase

During Query execution, if the query includes a IS NULL or IS NOT NULL predicate as shown above, we fetch the NULL value vector for the corresponding column within FilterPlanNode and retrieve the corresponding bitmap which represents all document IDs containing NULL values for that column. This bitmap is then used to create a BitmapBasedFilterOperatorwhich does the actual filtering operation.

Last updated