RefreshSegmentTask
The RefreshSegmentTask is a Minion task that automatically reprocesses existing segments when the table configuration or schema changes. It ensures that segments stay up to date with the latest index definitions, column additions, and compatible data type changes without requiring a manual reload.
Overview
When you update a table's schema (for example, adding a new column or changing a data type) or modify the table config (for example, adding a new index), existing segments may become stale. The RefreshSegmentTask detects these out-of-date segments and rebuilds them using the current table config and schema so that all segments benefit from the latest configuration.
Key Features
Automatic staleness detection: Compares each segment's last-refresh timestamp against the table config and schema modification times to determine which segments need reprocessing
New column handling: Adds columns that exist in the schema but are missing from the segment
Index management: Adds or removes indexes based on updated table config (for example, inverted index, range index)
Compatible data type changes: Refreshes segments when a column's data type changes to a compatible type
Inverted index creation: Enables inverted index creation during segment generation, which is disabled by default during initial ingestion
Concurrency control: Limits the number of concurrent refresh tasks per table to avoid overwhelming the cluster
Configuration
Table Configuration
To enable RefreshSegmentTask on a table, add it to the table's task configuration:
{
  "task": {
    "taskTypeConfigsMap": {
      "RefreshSegmentTask": {
        "tableMaxNumTasks": "10"
      }
    }
  }
}
Configuration Parameters
tableMaxNumTasks: Maximum number of concurrent refresh tasks per table per scheduling run. Default: 20.
schedule: Cron expression for automatic task scheduling. Default: none (manual only).
How It Works
Task Generation: The RefreshSegmentTaskGenerator identifies segments eligible for refresh by checking:
Whether the table config or schema has been modified since the segment was last refreshed
Whether a task is already running for the segment
Whether the maximum number of tasks per table has been reached
Task Execution: The RefreshSegmentTaskExecutor:
Downloads the segment to process
Loads the current table config and schema from the controller
Determines if the segment needs preprocessing (new indexes, column changes)
Identifies columns that need refresh (new columns, data type changes)
Rebuilds the segment from scratch using the updated schema and table config
Uploads the refreshed segment and updates the segment's ZK metadata with the refresh timestamp
Staleness Check: A segment is considered stale when:
The table config modification time is newer than the segment's last refresh timestamp
The schema modification time is newer than the segment's last refresh timestamp
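The generation and staleness rules above can be sketched in a few lines of Python. This is an illustrative model only, not the actual RefreshSegmentTaskGenerator code: the SegmentInfo class, its field names, and the select_segments helper are hypothetical, and timestamps are assumed to be epoch milliseconds.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    """Hypothetical stand-in for the segment metadata the generator inspects."""
    name: str
    last_refresh_ms: int            # assumed: epoch millis of the last refresh
    has_running_task: bool = False  # assumed: a refresh task is already in flight


def is_stale(segment, table_config_mtime_ms, schema_mtime_ms):
    """Stale if the table config or the schema changed after the last refresh."""
    return (table_config_mtime_ms > segment.last_refresh_ms
            or schema_mtime_ms > segment.last_refresh_ms)


def select_segments(segments, table_config_mtime_ms, schema_mtime_ms, table_max_num_tasks):
    """Pick stale segments with no running task, up to the per-table task limit."""
    selected = []
    for segment in segments:
        if len(selected) >= table_max_num_tasks:
            break  # per-table cap reached
        if segment.has_running_task:
            continue  # a task is already running for this segment
        if is_stale(segment, table_config_mtime_ms, schema_mtime_ms):
            selected.append(segment.name)
    return selected


segments = [
    SegmentInfo("seg_0", last_refresh_ms=1_000),
    SegmentInfo("seg_1", last_refresh_ms=5_000),
    SegmentInfo("seg_2", last_refresh_ms=1_000, has_running_task=True),
    SegmentInfo("seg_3", last_refresh_ms=2_000),
]
picked = select_segments(segments, table_config_mtime_ms=3_000,
                         schema_mtime_ms=1_500, table_max_num_tasks=10)
print(picked)  # ['seg_0', 'seg_3']
```

In this sketch, seg_1 is skipped because it was refreshed after both modification times, and seg_2 is skipped because it already has a running task.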
Example Usage
Adding a New Index
After you add an inverted index to your table config, the RefreshSegmentTask detects that the table config has changed and rebuilds all existing segments with the new inverted index.
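As a rough illustration of the config change that triggers the refresh, the sketch below adds a column to invertedIndexColumns under tableIndexConfig. The table name myTable_OFFLINE, the column userId, and the add_inverted_index helper are assumptions made for this example. Submitting the updated config to the controller is left out.

```python
import json

# Hypothetical table config fragment; only the fields relevant here are shown.
table_config = {
    "tableName": "myTable_OFFLINE",
    "tableIndexConfig": {"invertedIndexColumns": []},
    "task": {
        "taskTypeConfigsMap": {
            "RefreshSegmentTask": {"tableMaxNumTasks": "10"}
        }
    },
}


def add_inverted_index(config, column):
    """Add `column` to invertedIndexColumns if it is not already listed."""
    index_config = config.setdefault("tableIndexConfig", {})
    columns = index_config.setdefault("invertedIndexColumns", [])
    if column not in columns:
        columns.append(column)
    return config


updated = add_inverted_index(table_config, "userId")
print(json.dumps(updated["tableIndexConfig"], indent=2))
```

Once a config like this is saved, the table config modification time is newer than the segments' last refresh timestamps, so the next task run treats them as stale.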
Adding a New Column
After adding a new column to the schema, existing segments will not contain this column. The RefreshSegmentTask will rebuild these segments to include the new column with its default value.
Scheduling
Manual Scheduling
Use the Controller REST API to trigger RefreshSegmentTask manually.
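A minimal sketch of such a request follows. It assumes the controller's general task scheduling endpoint (POST /tasks/schedule with taskType and tableName query parameters), a controller at http://localhost:9000, and a table named myTable_OFFLINE. Verify these against your own deployment. The request is only built and printed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(controller_url, table_name_with_type,
                           task_type="RefreshSegmentTask"):
    """Build (but do not send) a POST request that schedules the task for one table."""
    query = urlencode({"taskType": task_type, "tableName": table_name_with_type})
    return Request(f"{controller_url}/tasks/schedule?{query}", method="POST")


# Assumed controller address and table name, for illustration only.
request = build_schedule_request("http://localhost:9000", "myTable_OFFLINE")
print(request.get_method(), request.full_url)

# To actually send it against a running controller:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode())
```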
Automatic Scheduling
Configure automatic scheduling by setting a cron expression in the schedule parameter of the task config.
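The sketch below shows one way the task section might look with a schedule set. The cron value "0 0 2 * * ?" (02:00 every day) assumes Quartz-style syntax with a leading seconds field, and the with_schedule helper is hypothetical.

```python
import json


def with_schedule(task_config, cron_expression):
    """Return a copy of the task config with the schedule parameter set."""
    updated = dict(task_config)
    updated["schedule"] = cron_expression
    return updated


# Assumed Quartz-style cron: second minute hour day-of-month month day-of-week.
refresh_config = with_schedule({"tableMaxNumTasks": "10"}, "0 0 2 * * ?")
task_section = {"task": {"taskTypeConfigsMap": {"RefreshSegmentTask": refresh_config}}}
print(json.dumps(task_section, indent=2))
```

Choosing an off-peak time for the schedule helps keep segment rebuilds from competing with query traffic.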
Important Considerations
Table types: Supports both OFFLINE and REALTIME tables. For REALTIME tables, only completed (non-consuming) segments are processed.
Segment rebuild: The task rebuilds segments from scratch using the existing data and the updated schema/config. This means it re-reads all records from the segment and creates a new segment file.
Creation time preservation: The original segment creation time and time intervals are preserved during refresh.
CRC-based deduplication: If the refreshed segment has the same CRC as the original, only the ZK metadata is updated (no segment upload occurs).
Resource usage: Rebuilding segments is CPU and I/O intensive. Use tableMaxNumTasks to control concurrency and schedule during off-peak hours.
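The CRC-based deduplication rule above amounts to a simple decision, sketched here. The function name and the step labels are hypothetical and do not correspond to actual executor methods.

```python
def plan_refresh_commit(original_crc, refreshed_crc):
    """Return the commit steps for a refreshed segment, based on its CRC."""
    if refreshed_crc == original_crc:
        # Rebuilt segment is identical: skip the upload, only record the refresh.
        return ["update_zk_metadata"]
    return ["upload_segment", "update_zk_metadata"]


print(plan_refresh_commit("12345", "12345"))
print(plan_refresh_commit("12345", "67890"))
```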
Monitoring
RefreshSegmentTask generates standard Minion metrics for monitoring:
Task execution time and success/failure rates
Number of tasks in progress and queued
Segment processing statistics (incomplete rows, dropped rows, sanitized rows)
Use the Pinot UI Task Manager to monitor RefreshSegmentTask execution and troubleshoot issues.

