
Dimension table

Batch ingestion of data into Apache Pinot using dimension tables.

Dimension tables are a special kind of offline table designed for join-like enrichment of fact data at query time. They are used together with the lookup UDF (single-stage engine) or the lookup join strategy (multi-stage engine) to decorate query results with reference data.

When to use dimension tables

Use a dimension table when you need to enrich a large fact table with attributes from a small, relatively static reference dataset at query time. Common examples include:

  • Looking up a human-readable team name from a team ID.

  • Enriching clickstream events with product catalog attributes.

  • Decorating transaction records with customer or store metadata.

If any of the following apply, a regular offline or real-time table is a better fit:

  • The reference data is large (hundreds of millions of rows or multiple gigabytes).

  • The data changes frequently and requires real-time ingestion.

  • You need time-based partitioning, retention policies, or a hybrid table setup.

  • You need to query the reference data with complex aggregations independently.

How dimension tables work

When a table is marked as a dimension table, Pinot replicates all of its segments to every server in the tenant. On each server the data is loaded into an in-memory hash map keyed by the table's primary key, which enables constant-time lookups during query execution.

Because the data is fully replicated and held in memory, dimension tables must be small enough to fit comfortably in each server's heap. They are not intended for large datasets.

Memory loading modes

Pinot supports two loading modes controlled by the disablePreload setting in dimensionTableConfig:

| Mode | disablePreload | Memory usage | Lookup speed | Description |
| --- | --- | --- | --- | --- |
| Fast lookup (default) | false | Higher | Faster | All rows are fully materialized into an in-memory hash map (Object[] -> Object[]). Every column value is stored in the map for constant-time retrieval. |
| Memory-optimized | true | Lower | Slightly slower | Only the primary key and a segment/docId reference are stored in the hash map. Column values are read from the segment on each lookup. This trades lookup speed for lower heap usage. |

Choose the memory-optimized mode when the dimension table is relatively large and you want to reduce heap pressure, at the cost of slightly slower lookups.
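
For example, memory-optimized mode is enabled with a small dimensionTableConfig block in the table config (a fragment, not a complete table config):

```json
{
  "dimensionTableConfig": {
    "disablePreload": true
  }
}
```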

Size limits and memory considerations

  • Cluster-level maximum size: The controller configuration property controller.dimTable.maxSize sets the maximum storage quota allowed for any single dimension table. The default is 200 MB. Table creation fails if the requested quota.storage exceeds this limit.

  • Heap impact: In fast-lookup mode, the entire table is materialized in Java heap on every server. A table that is 100 MB on disk may consume significantly more memory after deserialization. Monitor server heap usage when adding or growing dimension tables.

  • Replication overhead: Because every server in the tenant holds a full copy, adding a dimension table multiplies its memory footprint by the number of servers.


Configuration

Table configuration

Mark a table as a dimension table by setting the following properties in the table config:

| Property | Required | Description |
| --- | --- | --- |
| isDimTable | Yes | Set to true to designate the table as a dimension table. |
| ingestionConfig.batchIngestionConfig.segmentIngestionType | Yes | Must be set to REFRESH. Dimension tables use segment replacement rather than append semantics so that the in-memory hash map is rebuilt with the latest data. |
| quota.storage | Recommended | Storage quota for the table. Must not exceed the cluster-level controller.dimTable.maxSize (default 200 MB). |
| dimensionTableConfig.disablePreload | No | Set to true to use memory-optimized mode (store only primary key and segment reference instead of full rows). Defaults to false (fast lookup). |
| dimensionTableConfig.errorOnDuplicatePrimaryKey | No | Set to true to fail segment loading if duplicate primary keys are detected across segments. Defaults to false (last-loaded segment wins). |

Schema configuration

Dimension table schemas use dimensionFieldSpecs instead of metricFieldSpecs. A primaryKeyColumns array is required -- it defines the key used for lookups.

Example table configuration
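
A minimal sketch of a dimension table config; the table name dimBaseballTeams and the quota value are illustrative:

```json
{
  "tableName": "dimBaseballTeams",
  "tableType": "OFFLINE",
  "isDimTable": true,
  "segmentsConfig": {
    "schemaName": "dimBaseballTeams",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "REFRESH"
    }
  },
  "quota": {
    "storage": "100M"
  },
  "dimensionTableConfig": {
    "disablePreload": false,
    "errorOnDuplicatePrimaryKey": false
  },
  "metadata": {}
}
```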

Example schema configuration
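
A matching schema sketch with illustrative column names; note the dimensionFieldSpecs and the required primaryKeyColumns array:

```json
{
  "schemaName": "dimBaseballTeams",
  "dimensionFieldSpecs": [
    { "name": "teamID", "dataType": "STRING" },
    { "name": "teamName", "dataType": "STRING" }
  ],
  "primaryKeyColumns": ["teamID"]
}
```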

Querying with the LOOKUP function

The primary way to use a dimension table is through the LOOKUP UDF in the single-stage query engine. This function performs a primary-key lookup against the dimension table and returns a column value.

Syntax
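
The function takes the dimension table name, the column to retrieve, and one or more join-key pairs (square brackets denote optional repetition; the argument names match the parameter descriptions below):

```sql
LOOKUP('dimTable', 'dimColToLookUp',
       'dimJoinKey1', factJoinKey1
       [, 'dimJoinKey2', factJoinKey2, ...])
```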

  • dimTable -- name of the dimension table (string literal).

  • dimColToLookUp -- column to retrieve from the dimension table (string literal).

  • dimJoinKey / factJoinKey -- pairs of join keys: the dimension table column name (string literal) and the corresponding fact table column expression.

Single-key lookup
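
A sketch, assuming a fact table baseballStats with a teamID column and a dimension table dimBaseballTeams whose primary key is teamID:

```sql
SELECT playerName,
       teamID,
       LOOKUP('dimBaseballTeams', 'teamName', 'teamID', teamID) AS teamName
FROM baseballStats
LIMIT 10
```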

Composite-key lookup

When the dimension table has a composite primary key, provide multiple key pairs in the same order as primaryKeyColumns in the schema:
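
For illustration, assuming a hypothetical dimension table dimStoreProduct with primaryKeyColumns ["storeId", "productId"] and a fact table salesEvents:

```sql
SELECT orderId,
       LOOKUP('dimStoreProduct', 'price',
              'storeId', storeId,
              'productId', productId) AS price
FROM salesEvents
```

The key pairs must appear in the same order as primaryKeyColumns in the schema.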

Multi-stage engine

In the multi-stage query engine (v2), use a standard JOIN with the lookup join strategy hint instead of the LOOKUP UDF:
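
A sketch using the same illustrative table names as above; the hint placement assumes the multi-stage engine's joinOptions hint syntax:

```sql
SELECT /*+ joinOptions(join_strategy = 'lookup') */
       f.playerName,
       d.teamName
FROM baseballStats AS f
JOIN dimBaseballTeams AS d
  ON f.teamID = d.teamID
```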

For details, see lookup join strategy.

Refresh and update strategies

Because dimension tables use segmentIngestionType: REFRESH, uploading a new segment replaces the existing segment and triggers a full reload of the in-memory hash map on every server. There is no incremental update mechanism.

Typical refresh patterns:

  • Scheduled batch job: Run a periodic ingestion job (e.g., daily or hourly) that rebuilds the segment from the source of truth and uploads it to Pinot.

  • On-demand refresh: Trigger a segment upload through the Pinot REST API whenever the reference data changes.


During a refresh, the old hash map remains active for lookups until the new one is fully loaded. There is no query downtime during a refresh, but there is a brief period where the old data is served.

Handling duplicate primary keys

When multiple segments contain the same primary key, the default behavior is last-loaded-segment-wins (segments are ordered by creation time). Set errorOnDuplicatePrimaryKey: true in dimensionTableConfig to fail fast if duplicates are detected. With REFRESH ingestion, there is typically only one segment, so duplicates across segments are uncommon.

Performance best practices

  • Keep tables small. Dimension tables are loaded entirely into memory on every server. Target thousands to low hundreds of thousands of rows.

  • Use narrow schemas. Include only the columns needed for lookups to reduce memory consumption.

  • Choose the right loading mode. Use fast lookup (default) for the best query performance. Switch to memory-optimized mode (disablePreload: true) only if heap usage is a concern.

  • Set a storage quota. Always configure quota.storage to prevent accidentally uploading oversized data.

  • Minimize refresh frequency. Each refresh triggers a full reload of the hash map. Avoid refreshing more often than necessary.

  • Monitor server heap. After adding a dimension table, check server JVM heap metrics to confirm adequate headroom.

Limitations

  • Offline only. Dimension tables must be offline tables. They cannot be real-time or hybrid tables.

  • Full replication. All segments are replicated to every server in the tenant, so memory usage scales with the number of servers.

  • No incremental updates. The entire segment must be replaced on each refresh; row-level updates are not supported.

  • Primary key required. The schema must define primaryKeyColumns. Lookups without a primary key are not supported.

  • Single-stage LOOKUP UDF limitations. Dimension table column references in the LOOKUP function must be string literals, not column identifiers, because they reference a table that is not part of the query's FROM clause.

  • No time-based partitioning or retention. Dimension tables do not support segment retention policies or time-based partitioning.
