githubEdit

PurgeTask

The PurgeTask is a Minion task designed to remove or modify specific records from segments based on configurable criteria. It's essential for data retention management, compliance requirements (such as GDPR data deletion requests), and data quality maintenance.

Overview

PurgeTask processes segments by applying custom logic to identify records for removal or modification, then generates new segments with the changes applied. The task automatically handles segment download, processing, upload, and metadata updates.

Key Features

  • Data Retention Management: Remove old or obsolete data from segments

  • Compliance Support: Delete records to meet regulatory requirements (e.g., GDPR)

  • Data Quality: Remove invalid or corrupted records

  • In-Place Modification: Modify records without removing them

  • Smart Scheduling: Only processes segments when sufficient time has passed since last purge

  • Zero-Record Cleanup: Automatically deletes segments with no remaining records

  • Resource Management: Configurable limits on concurrent tasks

Configuration

Table Configuration

To enable PurgeTask on a table, add it to the table's task configuration:

Configuration Parameters

Parameter
Description
Default
Example

lastPurgeTimeThresholdPeriod

Minimum time between purge operations on the same segment (default: 14 days)

"14d"

"7d", "1h", "30d"

tableMaxNumTasks

Maximum number of concurrent purge tasks per table

Integer.MAX_VALUE

"3", "10"

Implementation Requirements

PurgeTask requires custom implementation of one or both of these interfaces:

RecordPurger Interface

Implement this interface to identify records that should be removed:

RecordModifier Interface

Implement this interface to modify records in-place:

How It Works

  1. Task Generation: The PurgeTaskGenerator identifies segments eligible for purging based on:

    • Time since last purge (must exceed lastPurgeTimeThresholdPeriod)

    • Segment availability and status

    • Configured task limits

  2. Task Execution: The PurgeTaskExecutor:

    • Downloads the segment to process

    • Applies purge and/or modification logic to each record

    • Creates a new segment if changes were made

    • Uploads the new segment and updates metadata

  3. Record Processing: For each record in the segment:

    • Calls RecordPurger.shouldPurge() if configured

    • Calls RecordModifier.modifyRecord() if configured

    • Builds the output segment with remaining/modified records

Example Usage

Basic Configuration

Custom Purge Logic

To implement custom purge logic, you need to create plugins that implement the RecordPurger and/or RecordModifier interfaces. These plugins follow Pinot's standard plugin architecture and must be packaged and deployed with your Pinot cluster.

For more information on developing Pinot plugins, see the Plugin Architecture documentation.

Scheduling

Manual Scheduling

Use the Controller REST API to manually trigger PurgeTask:

Automatic Scheduling

Configure automatic scheduling using cron expressions:

Important Considerations

  • Custom Logic Required: You must implement either RecordPurger or RecordModifier interfaces

  • Resource Intensive: Processing large segments requires sufficient Minion resources

  • Segment Replacement: Creates entirely new segments when changes are made

  • Metadata Preservation: Maintains original segment creation time and time intervals

  • Time-based Throttling: Built-in protection against excessive processing of the same segments

Monitoring

PurgeTask generates standard Minion metrics for monitoring:

  • Task execution time and success/failure rates

  • Number of tasks in progress and queued

  • Resource utilization during segment processing

Use the Pinot UI Task Manager to monitor PurgeTask execution and troubleshoot issues.

Last updated

Was this helpful?