Minion merge rollup task

The Minion merge rollup task lets you merge small segments into larger ones. This improves query performance and reduces disk usage by aggregating data at a coarser granularity, which reduces the amount of data processed during query execution.
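To illustrate what aggregating at a coarser granularity means, here is a minimal sketch in plain Python. It does not use any Pinot API: the record layout (epoch milliseconds, one dimension, one metric) and the function name rollup_daily are assumptions made for illustration, and summing is just one possible aggregation.

```python
from collections import defaultdict
from datetime import datetime, timezone

HOUR_MS = 3_600_000


def rollup_daily(records):
    """Sum the metric per (UTC day, dimension) for (epoch_ms, dimension, value) records."""
    totals = defaultdict(int)
    for ts_ms, dim, value in records:
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
        totals[(day, dim)] += value
    return sorted((day, dim, total) for (day, dim), total in totals.items())


# 48 hourly records (hypothetical data) collapse into 2 daily records.
base_ms = int(datetime(2021, 11, 1, tzinfo=timezone.utc).timestamp() * 1000)
hourly = [(base_ms + h * HOUR_MS, "US", 1) for h in range(48)]
daily = rollup_daily(hourly)
print(daily)
```

Queries that only need daily totals then scan 2 rows instead of 48.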

This task is supported for the following use cases:

OFFLINE tables, APPEND only
REALTIME tables, without upsert or dedup

Task overview

The Minion merge rollup task merges all segments within K time buckets (default 1), working from the oldest records to the newest. After processing, the segments are time aligned by bucket.

For example, if the table has hourly records starting at 11-01-2021T13:56:00 and is configured to use a bucket time period of 1 day, the merge rollup task merges the records for the window [11-01-2021, 11-02-2021) in the first run, [11-02-2021, 11-03-2021) in the next run, [11-03-2021, 11-04-2021) in the run after that, and so on.
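The window sequence in this example can be reproduced with a short sketch. It assumes that windows are aligned to whole multiples of the bucket period counted from the Unix epoch in UTC; the function name bucket_windows is hypothetical and not part of Pinot.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_windows(first_record, bucket, runs):
    """Return `runs` consecutive [start, end) windows aligned to the bucket period."""
    # Round the first record time down to the start of its bucket.
    aligned = EPOCH + ((first_record - EPOCH) // bucket) * bucket
    return [(aligned + i * bucket, aligned + (i + 1) * bucket) for i in range(runs)]


first = datetime(2021, 11, 1, 13, 56, tzinfo=timezone.utc)
windows = bucket_windows(first, timedelta(days=1), runs=3)
for start, end in windows:
    print(f"[{start:%m-%d-%Y}, {end:%m-%d-%Y})")
```

Although the first record arrives at 13:56, the first window still starts at midnight, which is what keeps the merged segments time aligned.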

Multi-level merge is supported, so you can apply different levels of compression to different time ranges. For example, you can retain hourly records for the most recent 24 hours, roll up data from 1 week ago to 1 day ago to daily granularity, and roll up data older than a week to monthly granularity.
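As a rough illustration of that tiering, the sketch below maps the age of a record to the granularity it would end up in. The function name target_granularity and the exact boundary handling are assumptions; in Pinot the levels are expressed through the task configuration rather than code like this.

```python
from datetime import timedelta


def target_granularity(age):
    """Pick the granularity for data of the given age, following the example tiers."""
    if age < timedelta(days=1):
        return "hourly"   # most recent 24 hours: keep hourly records
    if age <= timedelta(weeks=1):
        return "daily"    # 1 day to 1 week old: daily rollup
    return "monthly"      # older than a week: monthly rollup


for age in (timedelta(hours=3), timedelta(days=3), timedelta(days=30)):
    print(age, "->", target_granularity(age))
```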

This feature uses the following metadata in Zookeeper:

CustomMap of SegmentZKMetadata: Keeps the mapping of { "MergeRollupTask.mergeLevel" : {mergeLevel} }. Indicates that the segment is the result of a merge rollup task. Used to skip time buckets that have all merged segments to avoid reprocessing.
MergeRollupTaskMetadata: Stored in the path MINION_TASK_METADATA/MergeRollupTask/{tableNameWithType}. This metadata keeps the mapping from mergeLevel to waterMarkMs. Used to determine when to schedule the next merge rollup task run. The watermark is the start time of the buckets currently being processed. All data before the watermark is merged and time aligned.

This feature uses the pinot-minions and the Helix Task Executor framework, which consists of 2 parts:

MergeRollupTaskGenerator: The minion task scheduler, which schedules tasks of type MergeRollupTask. This task is scheduled by the controller periodic task, PinotTaskManager. For each mergeLevel from the highest to the lowest granularity (hourly -> daily -> monthly):
- Time buckets calculation: Starting from the watermark, calculate up to K time buckets that have unmerged segments, on a best-effort basis. Advance the watermark if necessary.
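The time bucket calculation can be sketched as follows. This is a simplified model under stated assumptions, not the actual MergeRollupTaskGenerator logic: the Segment class, the select_buckets function, and the rule that a bucket qualifies when at least one overlapping segment is unmerged are all hypothetical.

```python
from dataclasses import dataclass

DAY_MS = 86_400_000


@dataclass(frozen=True)
class Segment:
    name: str
    start_ms: int
    end_ms: int    # inclusive end time
    merged: bool   # True if the segment was already produced by this merge level


def select_buckets(segments, watermark_ms, bucket_ms, k=1):
    """Return (new_watermark_ms, buckets); buckets is a list of (bucket_start_ms, segments)."""
    if not segments:
        return watermark_ms, []
    latest_ms = max(s.end_ms for s in segments)
    selected = []
    new_watermark = watermark_ms
    start = watermark_ms
    while start <= latest_ms and len(selected) < k:
        end = start + bucket_ms
        overlapping = [s for s in segments if s.start_ms < end and s.end_ms >= start]
        if any(not s.merged for s in overlapping):
            selected.append((start, overlapping))
        elif not selected:
            # Bucket is empty or fully merged: advance the watermark past it.
            new_watermark = end
        start = end
    return new_watermark, selected


segments = [
    Segment("a", 0, DAY_MS - 1, merged=True),
    Segment("b", DAY_MS, 2 * DAY_MS - 1, merged=False),
    Segment("c", DAY_MS + 10, 2 * DAY_MS - 1, merged=False),
    Segment("d", 2 * DAY_MS, 3 * DAY_MS - 1, merged=False),
]
watermark, buckets = select_buckets(segments, 0, DAY_MS, k=1)
print(watermark, [(start, [s.name for s in segs]) for start, segs in buckets])
```

In this example the first bucket holds only an already merged segment, so the watermark advances by one day and the second bucket is selected for merging.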

Configure the Minion merge rollup task

Start a pinot-minion.
Set up your OFFLINE table. Add "MergeRollupTask" in the task configs, like this:
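The original snippet for this step is not reproduced here. As a hedged sketch, the fragment below builds a table config section in Python and prints it as JSON. The table name myTable and the merge level name 1day are placeholders, and the key names (taskTypeConfigsMap, mergeType, bucketTimePeriod, bufferTimePeriod) reflect the commonly documented Pinot task config layout; verify them against the Pinot version you run.

```python
import json

# Assumed layout of the task section of an OFFLINE table config.
table_config_fragment = {
    "tableName": "myTable",  # placeholder table name
    "tableType": "OFFLINE",
    "task": {
        "taskTypeConfigsMap": {
            "MergeRollupTask": {
                # "1day" is the merge level name; each key is prefixed with it.
                "1day.mergeType": "concat",
                "1day.bucketTimePeriod": "1d",
                "1day.bufferTimePeriod": "1d",
            }
        }
    },
}

print(json.dumps(table_config_fragment, indent=2))
```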

Enable PinotTaskManager (disabled by default) by adding the controller.task properties to your controller configuration, and then restart the controller (required).
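The original property listing for this step is not reproduced here. The sketch below renders two controller.task properties as lines for a properties file. The property names and the 1h frequency are assumptions based on commonly documented Pinot controller settings, so confirm them against the controller configuration reference for your version.

```python
# Assumed controller.task properties; the frequency value is an example only.
controller_task_properties = {
    "controller.task.scheduler.enabled": "true",
    "controller.task.frequencyPeriod": "1h",
}

properties_text = "\n".join(f"{key}={value}" for key, value in controller_task_properties.items())
print(properties_text)
```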

(Optional) Add the following advanced configurations as needed:

For details about these advanced configurations, see the following table:

Property | Description | Default

Metrics

mergeRollupTaskDelayInNumBuckets.{tableNameWithType}.{mergeLevel}

This metric tracks the task delay as a number of time buckets. For example, if this number is 7 and the merge task is configured with "bucketTimePeriod = 1d", the task is 7 days behind. This is useful for detecting a merge task that is stuck in production.
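Converting the metric value into a wall-clock delay is a single multiplication, as this sketch shows. The helper names parse_period and delay_duration are hypothetical, and the parser only handles simple single-unit periods such as 30m, 12h, or 1d.

```python
from datetime import timedelta

UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_period(period):
    """Parse a single-unit period string such as '1d' or '12h' (assumed format)."""
    value, unit = int(period[:-1]), period[-1]
    return timedelta(**{UNITS[unit]: value})


def delay_duration(delay_in_buckets, bucket_time_period):
    """Translate a delay measured in buckets into elapsed time."""
    return delay_in_buckets * parse_period(bucket_time_period)


delay = delay_duration(7, "1d")
print(delay)
```

An alert could then compare this duration with whatever threshold you consider acceptable for the table.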

Original design doc:

Issue:
