
Upsert Compaction Task

The Upsert Compaction Task allows you to reclaim disk space occupied by older versions of your records.

This task is only supported for REALTIME tables with upsert enabled.

Task overview

The Upsert Compaction Task selects completed segments for compaction based on the provided task configuration and generates a replacement segment for each segment that meets the selection criteria.

However, if a completed segment only contains older records, then it is immediately deleted and no compaction task is generated.

The Upsert Compaction Task uses the Minion Task Framework, and therefore consists of Generator and Executor classes.

  • UpsertCompactionTaskGenerator: Invoked by the Pinot Controller according to the specified schedule. Its generateTasks method:

    • Retrieves segment metadata for the table’s completed segments.

    • Retrieves validDocIds from the servers hosting the completed segments.

    • Processes validDocIds to determine which segments to compact or delete.

    • Generates a task for every completed segment selected for compaction.

  • UpsertCompactionTaskExecutor: Invoked by a Pinot Minion.

    • Retrieves validDocIds for the segment specified in the task config.

    • Uses a CompactedRecordReader to generate a new segment with only the valid records.
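
The generator's decision for each completed segment (compact, delete, or leave alone) can be sketched as follows. This is a minimal illustration, not Pinot's implementation: SegmentStats, Action, and classify are hypothetical names, and the rule that a segment must exceed both thresholds to be compacted is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DELETE = "delete"    # segment holds only older records
    COMPACT = "compact"  # a compaction task would be generated
    SKIP = "skip"        # segment is left as-is


@dataclass
class SegmentStats:
    # Hypothetical container for illustration; not a Pinot class.
    name: str
    total_docs: int
    valid_docs: int  # number of validDocIds reported for the segment


def classify(seg: SegmentStats,
             threshold_percent: float = 0.0,
             threshold_count: int = 0) -> Action:
    """Decide what to do with one completed segment."""
    invalid = seg.total_docs - seg.valid_docs

    # A segment that contains only older records is deleted outright.
    if seg.total_docs > 0 and invalid == seg.total_docs:
        return Action.DELETE

    invalid_percent = 100.0 * invalid / seg.total_docs if seg.total_docs else 0.0

    # Assumption: both thresholds must be exceeded for the segment to qualify.
    if invalid_percent > threshold_percent and invalid > threshold_count:
        return Action.COMPACT
    return Action.SKIP
```

For example, under these assumptions a segment with 100 records of which 60 are still valid would be compacted with a 30 percent threshold, while one with 95 valid records would be skipped.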

Configuration

  1. Start a Pinot Minion.

  2. Set up your REALTIME table. Add "UpsertCompactionTask" to the task configs, using the properties described in the table below.

  3. Enable PinotTaskManager (disabled by default) by adding the controller.task properties to your controller conf, and then restart the controller (required).

| Property | Description | Default |
| --- | --- | --- |
| bufferTimePeriod | The minimum amount of time that must have elapsed since the segment was consuming before it can be selected for compaction. | 7d |
| invalidRecordsThresholdPercent | The maximum share of older records allowed in a completed segment, expressed as a percentage of the total number of records in the segment (i.e., old records / total records). Must be configured if invalidRecordsThresholdCount isn't configured. | 0 |
| invalidRecordsThresholdCount | The maximum number of older records allowed in a completed segment, expressed as a record count. Must be configured if invalidRecordsThresholdPercent isn't configured. | 0 |
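
As a rough illustration of how these properties fit together, the sketch below assembles an "UpsertCompactionTask" entry and enforces the rule that at least one of the two thresholds is set. The helper build_upsert_compaction_task_config is hypothetical, not part of Pinot. The "task" / "taskTypeConfigsMap" nesting and the use of string values are assumptions about the table config layout, so check them against your Pinot version before use.

```python
import json


def build_upsert_compaction_task_config(
    buffer_time_period="7d",
    invalid_records_threshold_percent=None,
    invalid_records_threshold_count=None,
):
    """Hypothetical helper (not a Pinot API) that assembles the task config."""
    # At least one of the two thresholds must be configured.
    if (invalid_records_threshold_percent is None
            and invalid_records_threshold_count is None):
        raise ValueError(
            "Configure invalidRecordsThresholdPercent or "
            "invalidRecordsThresholdCount"
        )

    task_config = {"bufferTimePeriod": buffer_time_period}
    if invalid_records_threshold_percent is not None:
        task_config["invalidRecordsThresholdPercent"] = str(
            invalid_records_threshold_percent
        )
    if invalid_records_threshold_count is not None:
        task_config["invalidRecordsThresholdCount"] = str(
            invalid_records_threshold_count
        )

    # Assumed nesting of the task section inside the table config.
    return {"task": {"taskTypeConfigsMap": {"UpsertCompactionTask": task_config}}}


if __name__ == "__main__":
    config = build_upsert_compaction_task_config(
        invalid_records_threshold_percent=30
    )
    print(json.dumps(config, indent=2))
```

Running the script prints a JSON fragment that could be merged into a REALTIME table config. Calling the helper with neither threshold raises a ValueError.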
