Pauseless Consumption
Operate and monitor pauseless consumption for real-time tables in Apache Pinot.
Overview
In the standard Apache Pinot real-time ingestion architecture, data consumption pauses while the previous segment is being built and uploaded to deep store. Depending on segment size and cluster load, this build-and-upload phase can take several minutes, creating a gap during which the most recent data is not available for queries.
Pauseless consumption, introduced in Pinot 1.4.0, eliminates this gap by allowing servers to continue ingesting data into a new segment while the previous segment is still being built and uploaded. This parallel processing approach significantly reduces the latency between data ingestion and query availability.
Pauseless consumption is different from the Pause ingestion based on resource utilization feature, which pauses and resumes ingestion based on disk usage thresholds.
How it works
Standard commit protocol
In the standard commit protocol, the server follows this sequence:
Consume data into a mutable segment until a threshold is reached.
Build the segment (convert from mutable to immutable format).
Send
COMMIT_STARTto the controller.Upload the segment to deep store.
Send
COMMIT_END_METADATAto the controller.Begin consuming into a new segment.
During steps 2 through 5, no new data is consumed, causing a delay in data freshness.
Pauseless commit protocol
With pauseless consumption enabled, the sequence changes to:
Consume data into a mutable segment until a threshold is reached.
Send
COMMIT_STARTto the controller.The controller creates ZK metadata for a new consuming segment and updates the ideal state so that the committing segment is marked
ONLINEand the new segment is markedCONSUMING.The server immediately begins consuming into the new segment.
In parallel, the server builds and uploads the previous segment.
Send
COMMIT_END_METADATAto the controller to finalize.
Because the new consuming segment is created at COMMIT_START (before the build and upload), the server can ingest data continuously without pausing.
For non-committing replicas, the controller responds with KEEP instead of HOLD after COMMIT_START, allowing those replicas to continue ingestion as well.
Prerequisites
Before enabling pauseless consumption, ensure the following:
Peer segment download scheme is configured. Pauseless consumption requires
peerSegmentDownloadSchemeto be set in the table configuration. This is enforced by validation and is necessary because segments may be markedONLINEin the ideal state before they are uploaded to deep store. Slow replicas rely on peer download to fetch the segment from the committing server.Pinot version 1.4.0 or later. Pauseless consumption is not available in earlier versions.
Enabling pauseless consumption
To enable pauseless consumption on a REALTIME table, set pauselessConsumptionEnabled to true in the streamIngestionConfig section of the table configuration:
You must set peerSegmentDownloadScheme (for example, "http" or "https") in segmentsConfig. Table creation or update will fail validation if pauseless consumption is enabled without this setting.
Configuration options
Segment flush thresholds
Pauseless consumption supports the same segment flush thresholds as standard real-time ingestion:
realtime.segment.flush.threshold.rows
Number of rows after which the segment is committed.
realtime.segment.flush.threshold.time
Time duration after which the segment is committed (for example, 1h, 30m).
realtime.segment.flush.threshold.segment.size
Target segment size. When set, Pinot automatically tunes the row threshold based on previous segment sizes.
Size-based threshold
Pinot 1.4.0 also added support for a size-based flush threshold that works with pauseless consumption. When realtime.segment.flush.threshold.segment.size is configured, the FlushThresholdUpdater updates statistics from committing segments and computes the threshold for new consuming segments. This allows pauseless tables to automatically tune their flush thresholds based on actual segment sizes.
Disaster recovery
Pauseless consumption introduces new failure scenarios because segments can be marked ONLINE in the ideal state before they have been built and uploaded. If the committing server crashes during the build or upload phase, and no other replica has the segment, the segment enters an ERROR state with no data on disk.
Recoverable failures
Failures where at least one server still has the segment data on disk are handled automatically by the RealtimeSegmentValidationManager. This covers scenarios such as upload failures and incomplete commit protocol executions. The validation manager identifies which step of the commit protocol failed and takes corrective action.
Non-recoverable failures (reingestion)
When no server has the segment on disk, Pinot can recover by re-ingesting the data from the upstream source (for example, Kafka). The reingestion flow works as follows:
The
RealtimeSegmentValidationManagerdetects anERRORsegment with no completed replica.The controller selects an alive server from the ideal state for that segment.
The controller calls
POST /reingestSegmenton the selected server with the table name and segment name.The server re-consumes data from the stream for the offset range of the failed segment, builds the segment, and uploads it.
The controller copies the segment to deep store, updates ZK metadata to
DONE, and resets the segment toONLINE.
Disaster recovery modes
For tables with dedup or upsert enabled, reingestion may conflict with data consistency requirements. Dedup tables require strictly ordered ingestion, and re-ingesting a past segment can violate this constraint. To handle this, Pinot provides a DisasterRecoveryMode setting in streamIngestionConfig:
DEFAULT
Skips the disaster recovery (reingestion) job for dedup and upsert tables, prioritizing data consistency over availability. Failed segments must be resolved manually.
ALWAYS
Always runs the disaster recovery job, even for dedup and upsert tables. This prioritizes availability over strict dedup/upsert metadata correctness. Use this when fast recovery from data loss is more important than temporary dedup inconsistency.
When using ALWAYS mode with dedup tables, dedup metadata may be temporarily inconsistent until a full metadata rebuild is performed after recovery.
Observability
Metrics
The following metrics have been added for monitoring pauseless consumption:
pauselessConsumptionEnabled
Gauge (Server, Controller)
Indicates whether pauseless consumption is enabled. Useful for verifying the feature is active.
segmentBuildFailures
Counter
Tracks segment build failures during the commit process. In standard ingestion, build failures halt consumption and are immediately visible. In pauseless mode, ingestion continues even after a build failure, so this metric is essential for detecting issues.
reingestionFailures
Counter
Tracks failures during the reingestion (disaster recovery) process. Applicable only to pauseless tables.
segmentsInErrorState
Gauge
Number of segments currently in ERROR state. Helps identify segments that need recovery.
segmentsMissingDownloadUrl
Gauge
Number of segments without a download URL in ZK metadata. Indicates COMMIT_END_METADATA failures where the segment was built but the metadata was not finalized.
Key things to monitor
Segment build failures: Since ingestion continues after a build failure in pauseless mode, monitor
segmentBuildFailuresto detect problems that would otherwise go unnoticed.Error segments: Watch
segmentsInErrorStateto identify segments that require manual intervention or reingestion.Reingestion failures: Monitor
reingestionFailuresto ensure disaster recovery is functioning correctly.Missing download URLs: A high count of
segmentsMissingDownloadUrlmay indicate systemic issues with the commit protocol.
Compatibility with upsert and dedup tables
Pauseless consumption is compatible with upsert and dedup tables, with the following considerations:
Upsert tables: Pauseless consumption works with upsert tables. During the build and upload phase, the consuming segment and the committing segment can coexist, and upsert metadata is maintained correctly.
Dedup tables: Pauseless consumption works with dedup tables, but reingestion is disabled by default (
DEFAULTdisaster recovery mode) because dedup requires in-order consumption. Re-ingesting a segment out of order could produce incorrect dedup results. SetdisasterRecoveryModetoALWAYSif you need automatic disaster recovery at the cost of temporary dedup inconsistency.Partial upsert tables: Consumption during build is also supported for partial upsert tables.
For dedup tables with pauseless consumption, if a disaster occurs and reingestion is disabled (the default), you may need to perform a manual bulk delete of affected segments and restart ingestion. This is operationally heavy but preserves dedup correctness.
Rebalance considerations
When rebalancing pauseless tables, exercise caution with the following settings:
downtime=true: If a segment has not yet been uploaded to deep store and its only replica is moved, the data is permanently lost. Avoid usingdowntime=trueon pauseless tables unless you understand the risk.minAvailableReplicas=0: Similar todowntime=true, setting this to0can result in data loss for segments that exist only in memory on a single server.Replication factor of 1: With a single replica, any server failure before upload completion causes irrecoverable data loss. Consider using a replication factor of at least 2 for pauseless tables.
Pinot 1.4.0 includes pre-checks that warn operators about these risky configurations during rebalance operations.
Troubleshooting
Segments stuck in COMMITTING state
If a segment remains in COMMITTING state for an extended period, the server may have failed during the build or upload phase. The RealtimeSegmentValidationManager periodic task will detect and attempt to resolve this automatically.
Segments in ERROR state
Segments in ERROR state with no completed replica on any server indicate a non-recoverable failure. If disasterRecoveryMode is set to DEFAULT for a dedup or upsert table, these segments will not be automatically recovered. Options include:
Set
disasterRecoveryModetoALWAYSto enable automatic reingestion (with the caveat of temporary metadata inconsistency for dedup/upsert tables).Manually delete the affected segments and allow Pinot to re-create them.
Slow replicas and segment download
Non-committing replicas rely on downloading the segment from the committing server or deep store. Pauseless consumption implements a waited download mechanism that repeatedly checks for the download URL in ZK metadata and attempts peer download. If downloads are consistently slow, consider:
Increasing the download timeout.
Ensuring
peerSegmentDownloadSchemeis configured correctly.Verifying network connectivity between servers.
Rollback procedure
To disable pauseless consumption and revert to standard ingestion:
Update the table configuration to set
pauselessConsumptionEnabledtofalse:
Apply the updated table configuration using the Pinot controller API:
Wait for currently committing segments to complete their commit cycle. New segments will follow the standard commit protocol.
Verify that the
pauselessConsumptionEnabledmetric reports0on all servers hosting the table.
Rollback does not require a server restart. The change takes effect for new segment commits after the table configuration is updated.
References
Last updated
Was this helpful?

