LogoLogo
release-1.3.0
release-1.3.0
  • Introduction
  • Basics
    • Concepts
      • Pinot storage model
      • Architecture
      • Components
        • Cluster
          • Tenant
          • Server
          • Controller
          • Broker
          • Minion
        • Table
          • Segment
            • Deep Store
            • Segment threshold
            • Segment retention
          • Schema
          • Time boundary
        • Pinot Data Explorer
    • Getting Started
      • Running Pinot locally
      • Running Pinot in Docker
      • Quick Start Examples
      • Running in Kubernetes
      • Running on public clouds
        • Running on Azure
        • Running on GCP
        • Running on AWS
      • Create and update a table configuration
      • Batch import example
      • Stream ingestion example
      • HDFS as Deep Storage
      • Troubleshooting Pinot
      • Frequently Asked Questions (FAQs)
        • General
        • Pinot On Kubernetes FAQ
        • Ingestion FAQ
        • Query FAQ
        • Operations FAQ
    • Import Data
      • From Query Console
      • Batch Ingestion
        • Spark
        • Flink
        • Hadoop
        • Backfill Data
        • Dimension table
      • Stream ingestion
        • Ingest streaming data from Apache Kafka
        • Ingest streaming data from Amazon Kinesis
        • Ingest streaming data from Apache Pulsar
        • Configure indexes
      • Stream ingestion with Upsert
      • Segment compaction on upserts
      • Stream ingestion with Dedup
      • Stream ingestion with CLP
      • File Systems
        • Amazon S3
        • Azure Data Lake Storage
        • HDFS
        • Google Cloud Storage
      • Input formats
        • Complex Type (Array, Map) Handling
        • Complex Type Examples
        • Ingest records with dynamic schemas
      • Reload a table segment
      • Upload a table segment
    • Indexing
      • Bloom filter
      • Dictionary index
      • Forward index
      • FST index
      • Geospatial
      • Inverted index
      • JSON index
      • Native text index
      • Range index
      • Star-tree index
      • Text search support
      • Timestamp index
      • Vector index
    • Release notes
      • 1.3.0
      • 1.2.0
      • 1.1.0
      • 1.0.0
      • 0.12.1
      • 0.12.0
      • 0.11.0
      • 0.10.0
      • 0.9.3
      • 0.9.2
      • 0.9.1
      • 0.9.0
      • 0.8.0
      • 0.7.1
      • 0.6.0
      • 0.5.0
      • 0.4.0
      • 0.3.0
      • 0.2.0
      • 0.1.0
    • Recipes
      • Connect to Streamlit
      • Connect to Dash
      • Visualize data with Redash
      • GitHub Events Stream
  • For Users
    • Query
      • Querying Pinot
      • Query Syntax
        • Aggregation Functions
        • Array Functions
        • Cardinality Estimation
        • Explain Plan (Single-Stage)
        • Filtering with IdSet
        • Funnel Analysis
        • GapFill Function For Time-Series Dataset
        • Grouping Algorithm
        • Hash Functions
        • JOINs
        • Lookup UDF Join
        • Querying JSON data
        • Transformation Functions
        • URL Functions
        • Window Functions
      • Query Options
      • Query Quotas
      • Query using Cursors
      • Multi-stage query
        • Understanding Stages
        • Stats
        • Optimizing joins
        • Join strategies
          • Random + broadcast join strategy
          • Query time partition join strategy
          • Colocated join strategy
          • Lookup join strategy
        • Hints
        • Operator Types
          • Aggregate
          • Filter
          • Join
          • Intersect
          • Leaf
          • Literal
          • Mailbox receive
          • Mailbox send
          • Minus
          • Sort or limit
          • Transform
          • Union
          • Window
        • Stage-Level Spooling
      • User-Defined Functions (UDFs)
      • Explain plan
    • APIs
      • Broker Query API
        • Query Response Format
      • Controller Admin API
      • Controller API Reference
    • External Clients
      • JDBC
      • Java
      • Python
      • Golang
    • Tutorials
      • Use OSS as Deep Storage for Pinot
      • Ingest Parquet Files from S3 Using Spark
      • Creating Pinot Segments
      • Use S3 as Deep Storage for Pinot
      • Use S3 and Pinot in Docker
      • Batch Data Ingestion In Practice
      • Schema Evolution
  • For Developers
    • Basics
      • Extending Pinot
        • Writing Custom Aggregation Function
        • Segment Fetchers
      • Contribution Guidelines
      • Code Setup
      • Code Modules and Organization
      • Dependency Management
      • Update documentation
    • Advanced
      • Data Ingestion Overview
      • Ingestion Aggregations
      • Ingestion Transformations
      • Null value support
      • Use the multi-stage query engine (v2)
      • Advanced Pinot Setup
    • Plugins
      • Write Custom Plugins
        • Input Format Plugin
        • Filesystem Plugin
        • Batch Segment Fetcher Plugin
        • Stream Ingestion Plugin
    • Design Documents
      • Segment Writer API
  • For Operators
    • Deployment and Monitoring
      • Set up cluster
      • Server Startup Status Checkers
      • Set up table
      • Set up ingestion
      • Decoupling Controller from the Data Path
      • Segment Assignment
      • Instance Assignment
      • Rebalance
        • Rebalance Servers
        • Rebalance Brokers
        • Rebalance Tenant
      • Separating data storage by age
        • Using multiple tenants
        • Using multiple directories
      • Pinot managed Offline flows
      • Minion merge rollup task
      • Consistent Push and Rollback
      • Access Control
      • Monitoring
      • Tuning
        • Tuning Default MMAP Advice
        • Real-time
        • Routing
        • Query Routing using Adaptive Server Selection
        • Query Scheduling
      • Upgrading Pinot with confidence
      • Managing Logs
      • OOM Protection Using Automatic Query Killing
      • Pause ingestion based on resource utilization
    • Command-Line Interface (CLI)
    • Configuration Recommendation Engine
    • Tutorials
      • Authentication
        • Basic auth access control
        • ZkBasicAuthAccessControl
      • Configuring TLS/SSL
      • Build Docker Images
      • Running Pinot in Production
      • Kubernetes Deployment
      • Amazon EKS (Kafka)
      • Amazon MSK (Kafka)
      • Monitor Pinot using Prometheus and Grafana
      • Performance Optimization Configurations
      • Segment Operations Throttling
  • Configuration Reference
    • Cluster
    • Controller
    • Broker
    • Server
    • Table
    • Ingestion
    • Schema
    • Ingestion Job Spec
    • Monitoring Metrics
    • Functions
      • ABS
      • ADD
      • ago
      • EXPR_MIN / EXPR_MAX
      • ARRAY_AGG
      • arrayConcatDouble
      • arrayConcatFloat
      • arrayConcatInt
      • arrayConcatLong
      • arrayConcatString
      • arrayContainsInt
      • arrayContainsString
      • arrayDistinctInt
      • arrayDistinctString
      • arrayIndexOfInt
      • arrayIndexOfString
      • ARRAYLENGTH
      • arrayRemoveInt
      • arrayRemoveString
      • arrayReverseInt
      • arrayReverseString
      • arraySliceInt
      • arraySliceString
      • arraySortInt
      • arraySortString
      • arrayUnionInt
      • arrayUnionString
      • AVGMV
      • Base64
      • caseWhen
      • ceil
      • CHR
      • codepoint
      • concat
      • count
      • COUNTMV
      • COVAR_POP
      • COVAR_SAMP
      • day
      • dayOfWeek
      • dayOfYear
      • DISTINCT
      • DISTINCTAVG
      • DISTINCTAVGMV
      • DISTINCTCOUNT
      • DISTINCTCOUNTBITMAP
      • DISTINCTCOUNTBITMAPMV
      • DISTINCTCOUNTHLL
      • DISTINCTCOUNTSMARTHLL
      • DISTINCTCOUNTHLLPLUS
      • DISTINCTCOUNTHLLMV
      • DISTINCTCOUNTMV
      • DISTINCTCOUNTRAWHLL
      • DISTINCTCOUNTRAWHLLMV
      • DISTINCTCOUNTRAWTHETASKETCH
      • DISTINCTCOUNTTHETASKETCH
      • DISTINCTCOUNTULL
      • DISTINCTSUM
      • DISTINCTSUMMV
      • DIV
      • DATETIMECONVERT
      • DATETRUNC
      • exp
      • FIRSTWITHTIME
      • FLOOR
      • FrequentLongsSketch
      • FrequentStringsSketch
      • FromDateTime
      • FromEpoch
      • FromEpochBucket
      • FUNNELCOUNT
      • FunnelCompleteCount
      • FunnelMaxStep
      • FunnelMatchStep
      • Histogram
      • hour
      • isSubnetOf
      • JSONFORMAT
      • JSONPATH
      • JSONPATHARRAY
      • JSONPATHARRAYDEFAULTEMPTY
      • JSONPATHDOUBLE
      • JSONPATHLONG
      • JSONPATHSTRING
      • jsonextractkey
      • jsonextractscalar
      • LAG
      • LASTWITHTIME
      • LEAD
      • length
      • ln
      • lower
      • lpad
      • ltrim
      • max
      • MAXMV
      • MD5
      • millisecond
      • min
      • minmaxrange
      • MINMAXRANGEMV
      • MINMV
      • minute
      • MOD
      • mode
      • month
      • mult
      • now
      • percentile
      • percentileest
      • percentileestmv
      • percentilemv
      • percentiletdigest
      • percentiletdigestmv
      • percentilekll
      • percentilerawkll
      • percentilekllmv
      • percentilerawkllmv
      • quarter
      • regexpExtract
      • regexpReplace
      • remove
      • replace
      • reverse
      • round
      • roundDecimal
      • ROW_NUMBER
      • rpad
      • rtrim
      • second
      • SEGMENTPARTITIONEDDISTINCTCOUNT
      • sha
      • sha256
      • sha512
      • sqrt
      • startswith
      • ST_AsBinary
      • ST_AsText
      • ST_Contains
      • ST_Distance
      • ST_GeogFromText
      • ST_GeogFromWKB
      • ST_GeometryType
      • ST_GeomFromText
      • ST_GeomFromWKB
      • STPOINT
      • ST_Polygon
      • strpos
      • ST_Union
      • SUB
      • substr
      • sum
      • summv
      • TIMECONVERT
      • timezoneHour
      • timezoneMinute
      • ToDateTime
      • ToEpoch
      • ToEpochBucket
      • ToEpochRounded
      • TOJSONMAPSTR
      • toGeometry
      • toSphericalGeography
      • trim
      • upper
      • Url
      • UTF8
      • VALUEIN
      • week
      • year
      • Extract
      • yearOfWeek
      • FIRST_VALUE
      • LAST_VALUE
      • ST_GeomFromGeoJSON
      • ST_GeogFromGeoJSON
      • ST_AsGeoJSON
    • Plugin Reference
      • Stream Ingestion Connectors
      • VAR_POP
      • VAR_SAMP
      • STDDEV_POP
      • STDDEV_SAMP
    • Dynamic Environment
  • Reference
    • Single-stage query engine (v1)
    • Multi-stage query engine (v2)
    • Troubleshooting
      • Troubleshoot issues with the multi-stage query engine (v2)
      • Troubleshoot issues with ZooKeeper znodes
  • RESOURCES
    • Community
    • Team
    • Blogs
    • Presentations
    • Videos
  • Integrations
    • Tableau
    • Trino
    • ThirdEye
    • Superset
    • Presto
    • Spark-Pinot Connector
  • Contributing
    • Contribute Pinot documentation
    • Style guide
Powered by GitBook
On this page
  • How Pinot stores null values
  • Disabled null handling
  • Store nulls at ingestion time
  • Column based null storing
  • Table based null storing
  • Null handling at query time
  • Basic null support
  • Advanced null handling support
  • Examples queries
  • Appendix: Workarounds to handle null values without storing nulls

Was this helpful?

Export as PDF
  1. For Developers
  2. Advanced

Null value support

PreviousIngestion TransformationsNextUse the multi-stage query engine (v2)

Was this helpful?

For historical reasons, null support is disabled in Apache Pinot by default. This is expected to be changed in future versions.

For historical reasons, null support is disabled by default in Apache Pinot. When null support is disabled, all columns are treated as not null. Predicates like IS NOT NULL evaluates to true, and IS NULL evaluates to false. Aggregation functions like COUNT, SUM, AVG, MODE, etc. treat all columns as not null.

For example, the predicate in the query below matches all records.

select count(*) from my_table where column IS NOT NULL

To handle null values in your data, you must:

  1. Indicate Pinot to store null values in your data before ingesting the data. See .

  2. Use one of the . By default Pinot will use a where only IS NULL and IS NOT NULL predicates are supported, but the can be enabled.

The following table summarizes the behavior of null handling support in Pinot:

disabled (default)
basic (enabled at ingestion time)
advanced (enabled at query time)

IS NULL

always false

depends on data

depends on data

IS NOT NULL

always true

depends on data

depends on data

Transformation functions

use default value

use default value

null aware

Null aware aggregations

use default value

use default value

null aware

How Pinot stores null values

Pinot always stores column values in a . Forward index never stores null values but have to store a value for each row. Therefore independent of the null handling configuration, Pinot always stores a default value for nulls rows in the forward index. The default value used in a column can be specified in the configuration by setting the defaultNullValue field spec. The defaultNullValue depends on the type of data.

Remember that in the JSON used as table configuration, defaultNullValue must always be a String. If the column type is not String, Pinot will convert that value to the column type automatically.

Disabled null handling

By default, Pinot does not store null values at all. This means that by default whenever a null value is ingested, Pinot stores the default null value (defined above) instead.

In order to store null values the table has to be configured to do so as explained below.

Store nulls at ingestion time

When null storing is enabled, Pinot creates a new index called the null index or null vector index. This index stores the document IDs of the rows that have null values for the column.

Although null storing can be enabled after data has been ingested, data ingested before this mode is enabled will not store the null index and therefore it will be treated as not null.

Null support is configured per table. You can configure one table to store nulls, and configure another table to not store nulls. There are two ways to define null storing support in Pinot:

Remember that Column based null storing has priority over Table based null storing. In case both modes are enabled, Column based null storing will be used.

Column based null storing

We recommend configuring column based null storing, which lets you specify null handling per column and supports null handling in the multi-stage query engine.

To enable column based null handling:

  1. Then specify which columns are not nullable using the notNull field spec, which defaults to false.

{
  "schemaName": "my_table",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    {
      "name": "notNullColumn",
      "dataType": "STRING",
      "notNull": true
    },
    {
      "name": "explicitNullableColumn",
      "dataType": "STRING",
      "notNull": false
    },
    {
      "name": "implicitNullableColumn",
      "dataType": "STRING"
    }
  ]
}

Table based null storing

This is the only way to enable null storing in Pinot before 1.1.0, but it is deprecated since then. Table based null storing is more expensive in terms of disk space and query performance than column based null storing. Also, it is not possible to support null handling in multi-stage query engine using table based null storing.

When table based null storing is enabled, all columns will be considered nullable. To enable this mode you need to:

As an example:

{
  "tableIndexConfig": {
    "nullHandlingEnabled": true
  }
}

Null handling at query time

The multi-stage query engine requires column based null storing. Tables with table based null storing are considered not nullable.

If you are converting from null support for the single-stage query engine, you can modify your schema to set enableColumnBasedNullHandling. There is no need to change your table config to remove or set nullHandlingEnabled to false. In fact we recommend to keep it as true to make it clear that the table may contain nulls. Also, when converting:

  • No reingestion is needed.

  • If the columns are changed from nullable to not nullable and there is a value that was previously null, the default value will be used instead.

Basic null support

In this mode, Pinot is able to handle simple predicates like IS NULL or IS NOT NULL. Other transformation functions (like CASE, COALESCE, +, etc.) and aggregations functions (like COUNT, SUM, AVG, etc.) will use the default value specified in the schema for null values.

For example, in the following table:

rowId
col1

0

null

1

1

2

2

3

2

4

null

If the default value for col1 is 1, the following query:

select $docId as rowId, col1 from my_table where col1 IS NOT NULL

Will return the following result:

rowId
col1

1

1

2

2

3

2

While

select $docId as rowId, col1 + 1 as result from my_table

While return the following:

rowId
col1

0

2

1

2

2

3

3

3

4

2

And queries like

select $docId as rowId, col1 from my_table where col1 = 1

Will return

rowId
col1

0

null

1

1

4

null

Also

select count(col1)  as count, mode(col1) as mode from my_table
count
mode

5

1

Given that neither count or mode function will ignore null values as expected but read instead the default value (in this case 1) stored in the forward index.

Advanced null handling support

Advanced null handling has two requirements:

The later can be done in one of the following ways:

  • Set enableNullHandling=true at the beginning of the query.

  • If using JDBC, set the connection option enableNullHandling=true (either in the URL or as a property).

Alternatively, if you want to enable advanced null handling for all queries by default, the broker configuration pinot.broker.query.enable.null.handling can be set to true. Individual queries can override this to false using the enableNullHandling query option if required.

Even though they have similar names, the nullHandlingEnabled table configuration and the enableNullHandling query option are different. Remember that the nullHandlingEnabled table configuration modifies how segments are stored and the enableNullHandling query option modifies how queries are executed.

When the enableNullHandling option is set to true, the Pinot query engine uses a different execution path that interprets nulls in a standard SQL way. This means that IS NULL and IS NOT NULL predicates will evaluate to true or false according to whether a null is detected (like in basic null support mode) but also aggregation functions like COUNT, SUM, AVG, MODE, etc. will deal with null values as expected (usually ignoring null values).

In this mode, some indexes may not be usable, and queries may be significantly more expensive. Performance degradation impacts all the columns in the table, including columns in the query that do not contain null values. This degradation happens even when table uses column based null storing.

Examples queries

Select Query

Filter Query

Aggregate Query

Aggregate Filter Query

Group By Query

Order By Query

Transform Query

Appendix: Workarounds to handle null values without storing nulls

If you're not able to generate the null index for your use case, you may filter for null values using a default value specified in your schema or a specific value included in your query.

The following example queries work when the null value is not used in a dataset. Unexpected values may be returned if the specified null value is a valid value in the dataset.

Filter for default null value(s) specified in your schema

  1. Ingest the data.

  2. To filter out the specified default null value, for example, you could write a query like the following:

    select count(*) from my_table where column <> 'default_null_value'

Filter for a specific value in your query

Filter for a specific value in your query that will not be included in the dataset. For example, to calculate the average age, use -1 to indicate the value of Age is null.

  • Rewrite the following query:

    select avg(Age) from my_table
  • To cover null values as follows:

    select avg(Age) from my_table WHERE Age <> -1

, where each column in a table is configured as nullable or not nullable. We recommend enabling null storing support by column. This is the only way to support null handling in the .

, where all columns in the table are considered nullable. This is how null values were handled before Pinot 1.1.0 and now deprecated.

Set to true in the schema configuration before ingesting data.

Enable the nullHandlingEnabled configuration in

Disable in the schema.

Remember nullHandlingEnabled table configuration enables table based null handling while enableNullHandling is the query option that enables advanced null handling at query time. See for more information.

To enable basic null handling by at query time, enable Pinot to . Advanced null handling support can be optionally enabled.

The basic null support is automatically enabled when null values are stored on a segment (see ).

Segments must store null values (see ).

The query must enable null handling by setting the enableNullHandling to true.

Specify a default null value (defaultNullValue) in your for dimension fields, (dimensionFieldSpecs), metric fields (metricFieldSpecs), and date time fields (dateTimeFieldSpecs).

tableIndexConfig.nullHandlingEnabled
schema
forward index
schema
Store nulls at ingestion time
null handling modes at query time
basic support mode
advanced null handling support
multi-stage query engine
Column based null storing
Table based null storing
advanced null handling support
store nulls at ingestion time
storing nulls at ingestion time
storing nulls at ingestion time
enableColumnBasedNullHandling
enableColumnBasedNullHandling
query option