Workload-Based Query Resource Isolation
Isolate query resources across workloads to prevent noisy-neighbor problems and enforce resource budgets.
This feature was introduced in Apache Pinot 1.4.0 (PR #15109).
Overview
In multi-tenant Pinot deployments, different query workloads often compete for shared resources such as CPU and memory on broker and server instances. A heavy ad-hoc analytics query can starve latency-sensitive production queries, causing a "noisy-neighbor" problem.
Workload-based query resource isolation addresses this by allowing administrators to:
Define named workloads with CPU and memory budgets.
Classify incoming queries into workloads using query options.
Enforce resource budgets so that one workload cannot consume more than its allocated share.
Reject or throttle queries that exceed their workload budget within a configurable time window.
Pinot provides two scheduler implementations for workload isolation:
BinaryWorkloadScheduler
binary_workload
Splits queries into two classes: primary (unbounded) and secondary (thread-limited). Simple and effective when you only need to protect production queries from ad-hoc traffic.
WorkloadScheduler
workload
Uses the WorkloadBudgetManager to enforce per-workload CPU and memory budgets. Supports an arbitrary number of named workloads via the workloadName query option.
Configuration
Prerequisites
Workload cost collection relies on thread-level CPU and memory sampling. Enable the following server and broker configs before using workload isolation:
# Accounting factory (required)
pinot.server.accounting.factory.name=org.apache.pinot.core.accounting.ResourceUsageAccountantFactory
# Thread sampling (required)
pinot.server.accounting.enable.thread.cpu.sampling=true
pinot.server.accounting.enable.thread.memory.sampling=true
pinot.server.accounting.enable.thread.sampling.mse=true
# Instance-level JVM measurement (required)
pinot.server.instance.enableThreadCpuTimeMeasurement=true
pinot.server.instance.enableThreadAllocatedBytesMeasurement=trueSelecting a Scheduler
Set the query scheduler on each server instance:
Workload Budget Configs (WorkloadScheduler)
These configs apply when using the workload scheduler. They are set on each server/broker instance under the accounting.workload.* prefix:
accounting.workload.enable.cost.collection
false
Enable workload cost collection for CPU and memory.
accounting.workload.enable.cost.enforcement
false
Enable enforcement of workload budgets. When enabled, queries exceeding their workload budget are rejected.
accounting.workload.enforcement.window.ms
60000 (1 minute)
Duration of the enforcement window in milliseconds. Budgets are reset at the end of each window.
accounting.workload.sleep.time.ms
100
Sleep interval in milliseconds for the accounting thread.
Secondary Workload Configs
For a simpler setup that only distinguishes between primary and secondary queries:
accounting.secondary.workload.name
defaultSecondary
Name assigned to the secondary workload budget.
accounting.secondary.workload.cpu.percentage
0.0
Fraction of total CPU capacity allocated to the secondary workload (e.g., 0.1 for 10%). The budget is computed as enforcementWindow * 1,000,000 * availableProcessors * percentage. Set to a value greater than 0 to activate.
BinaryWorkloadScheduler Configs
When using the binary_workload scheduler, these additional configs control secondary query behavior:
binarywlm.maxSecondaryRunnerThreads
5
Maximum number of runner threads dedicated to processing secondary queries concurrently.
binarywlm.maxPendingSecondaryQueries
20
Maximum number of secondary queries that can be queued. Queries beyond this limit receive an out-of-capacity error.
binarywlm.secondaryQueueQueryTimeout
40 (seconds)
Time in seconds before a queued secondary query expires and is removed from the queue.
binarywlm.queueWakeupMs
1
Interval in milliseconds at which the scheduler checks for schedulable secondary queries.
Defining Workloads via REST API
Named workload configs (with CPU/memory budgets) can be managed through the controller REST API:
A workload config defines resource budgets and propagation rules. Example:
Workload configs are stored in ZooKeeper under /CONFIGS/QUERYWORKLOAD and propagated automatically to brokers and servers via a QueryWorkloadRefreshMessage.
Propagation schemes control which instances receive a workload config:
TABLE: Config is pushed to all instances serving the specified tables.
TENANT: Config is pushed to all instances belonging to the specified tenants.
Cost splitting: The controller evenly divides a workload's total resource budget across all eligible instances. For example, if a workload has a CPU budget of 1000 ns and there are 4 servers, each server receives a budget of 250 ns.
Query Options
workloadName
Assigns a query to a named workload for budget enforcement. Use this with the workload scheduler.
When a query specifies a workloadName, the WorkloadScheduler checks the corresponding workload budget before admitting the query. If the budget is exhausted, the query is rejected with an out-of-capacity error.
Queries that do not specify a workloadName are treated as belonging to the default workload and are always admitted (no budget enforcement).
isSecondaryWorkload
Marks a query as belonging to the secondary workload. Use this with either the binary_workload or workload scheduler.
With the binary_workload scheduler, secondary queries are routed to a dedicated queue with limited runner threads and bounded parallelism. With the workload scheduler, the isSecondaryWorkload flag maps the query to the configured secondary workload name (default: defaultSecondary) for budget enforcement.
How It Works
BinaryWorkloadScheduler
The BinaryWorkloadScheduler classifies all queries into exactly two categories:
Primary queries (default): Executed immediately using unbounded resources. These are your production, latency-sensitive queries.
Secondary queries (tagged with
isSecondaryWorkload=true): Placed into a dedicatedSecondaryWorkloadQueueand executed with constrained resources:Limited number of concurrent runner threads (
binarywlm.maxSecondaryRunnerThreads).Limited number of worker threads per query and across all in-progress secondary queries.
A bounded pending queue (
binarywlm.maxPendingSecondaryQueries) with a timeout for stale queries.
This two-tier approach is straightforward and effective when you need to protect production traffic from exploratory or ad-hoc queries.
WorkloadScheduler
The WorkloadScheduler provides finer-grained control using the WorkloadBudgetManager:
Each workload has a CPU and memory budget for a configurable enforcement window (default: 60 seconds).
When a query arrives, the scheduler checks
canAdmitQuery(workloadName).If the workload's remaining CPU or memory budget is positive, the query is admitted.
If the budget is exhausted, the query is rejected and a
WORKLOAD_BUDGET_EXCEEDEDmetric is emitted.At the end of each enforcement window, all workload budgets are automatically reset.
Resource usage (CPU time, allocated bytes) is tracked at the thread level and charged against the workload budget during query execution.
Metrics and Monitoring
BinaryWorkloadScheduler Metrics
NUM_SECONDARY_QUERIES
Meter (per-table)
Number of secondary queries received.
NUM_SECONDARY_QUERIES_SCHEDULED
Meter (per-table)
Number of secondary queries that were dequeued and scheduled for execution.
SECONDARY_Q_WAIT_TIME_MS
Timer (per-table)
Time a secondary query spent waiting in the queue before being scheduled.
WorkloadScheduler Metrics
WORKLOAD_BUDGET_EXCEEDED
Meter (per-workload, per-table, global)
Number of queries rejected because the workload budget was exhausted.
Monitor these metrics to detect workload saturation and tune budgets accordingly.
Tuning Guidance
Choosing a Scheduler
Use
binary_workloadwhen you have a simple split between production and ad-hoc queries. It requires minimal configuration and provides strong isolation through thread limiting.Use
workloadwhen you need to isolate multiple distinct workloads with individual CPU and memory budgets.
Sizing Budgets
Start with the
accounting.workload.enable.cost.collection=trueflag andaccounting.workload.enable.cost.enforcement=falseto observe resource usage patterns before enforcing budgets.Monitor the
WORKLOAD_BUDGET_EXCEEDEDmetric. A high rate of rejections indicates the budget is too low or the enforcement window is too short.The enforcement window (
accounting.workload.enforcement.window.ms) controls the granularity of budget enforcement. Shorter windows provide tighter isolation but may cause more bursty rejections. The default of 60 seconds works well for most deployments.
BinaryWorkloadScheduler Tuning
Increase
binarywlm.maxSecondaryRunnerThreadsif secondary queries are timing out in the queue but server CPU is not saturated.Increase
binarywlm.maxPendingSecondaryQueriesif you see frequent out-of-capacity errors for secondary queries.Decrease
binarywlm.secondaryQueueQueryTimeoutif stale secondary queries are consuming queue capacity.
Secondary Workload CPU Budget
The
accounting.secondary.workload.cpu.percentageconfig allocates a percentage of total CPU to secondary queries. For example, setting it to0.1(10%) on a 16-core server with a 60-second window yields a budget of approximately96 billionCPU nanoseconds per window.Set this to
0.0(the default) to disable secondary workload budgeting entirely.
Last updated
Was this helpful?

