Monitoring

This section covers how to observe and troubleshoot an Apache Pinot cluster -- metrics collection, alerting, JVM diagnostics, and dashboard setup.

Why monitoring matters

Pinot clusters serve real-time analytics workloads where latency spikes, ingestion delays, and segment failures directly affect end users. Proactive monitoring lets you catch problems before they become incidents.

What Pinot exposes

Every Pinot component (controller, broker, server, minion) publishes metrics via Dropwizard Metrics in three forms:

| Metric type | What it measures | Example |
| --- | --- | --- |
| Gauge | Point-in-time value | Segment count, JVM heap usage, ingestion delay |
| Meter | Rate per unit of time | Queries per second, exceptions per second |
| Timer | Duration with percentiles | Query latency p50/p95/p99 |

Metrics are emitted at two scopes: global (one value per instance) and table-level (one value per table).

Metrics export paths

| Method | Best for | How it works |
| --- | --- | --- |
| JMX (default) | Development, ad-hoc inspection | Metrics published via `JmxReporterMetricsRegistryRegistrationListener`; view with JConsole or VisualVM |
| Prometheus via JMX Exporter | Production Kubernetes and bare-metal | Attach the JMX Exporter Java agent to each component; Prometheus scrapes the `/metrics` endpoint |
| Custom reporter | Datadog, InfluxDB, or other backends | Implement `MetricsRegistryRegistrationListener` and register via config |
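To make the Prometheus path concrete, the fragment below sketches attaching the JMX Exporter agent to a component's JVM options. The JAR path, port (8008), and config file location are illustrative placeholders, not fixed Pinot defaults -- adjust them for your deployment.

```shell
# Sketch only: the agent JAR path, port 8008, and pinot.yml location
# are assumptions for this example. Append to the component's JVM opts
# before starting it.
export JAVA_OPTS="$JAVA_OPTS \
  -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/pinot.yml"

# After startup, Prometheus-format metrics are served on the agent port:
curl http://localhost:8008/metrics
```

The same `-javaagent` flag works for controller, broker, server, and minion; only the port needs to differ when components share a host.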

Key metrics to watch

A concise summary of the most important metrics per component:

  • Broker: query rate (QUERIES), partial server responses, processing exceptions, query latency percentiles, heap usage

  • Server: real-time ingestion delay, consumption health per partition, segment download failures, documents scanned, heap and off-heap usage

  • Controller: segment availability percentage, segments in error state, ZooKeeper reconnects, stream data loss, missing consuming segments

  • Minion: task failure count, task queue time, task execution time

For the complete list of metrics, alert thresholds, and diagnosis patterns, see the Monitoring guide.

JVM diagnostics with Continuous JFR

For low-overhead, always-on JVM profiling, Pinot supports Continuous Java Flight Recorder (JFR). JFR captures CPU, memory, GC, thread, and lock events into .jfr files. Pinot provides cluster-level runtime control through ContinuousJfrStarter -- operators can toggle recording on/off or adjust settings without restarting processes.

Key configuration: set pinot.jfr.enabled=true in cluster config. Start with configuration=default for production safety; use configuration=profile only during active investigations.
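A minimal sketch of the cluster config described above. Only `pinot.jfr.enabled` is named explicitly in this section; the exact key carrying the `configuration=default` setting may differ, so treat the second line as an assumption and confirm it against the Continuous JFR runbook.

```properties
# Enable always-on JFR recording (cluster-level config).
pinot.jfr.enabled=true
# "default" is the low-overhead production profile; switch to "profile"
# only during active investigations. Exact key name assumed here.
configuration=default
```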

For the full runbook, see Continuous JFR.

Setting up Prometheus and Grafana

The recommended production monitoring stack is Prometheus for metrics collection and Grafana for dashboards. The setup involves:

  1. Attach the JMX Exporter Java agent to each Pinot component's JVM options

  2. Configure Prometheus scrape targets (or use Kubernetes pod annotations for auto-discovery)

  3. Import a Pinot dashboard into Grafana
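For step 2, a minimal static `prometheus.yml` scrape job might look like the sketch below. The host names and exporter port (8008) are placeholders for whatever your JMX Exporter agents actually listen on; Kubernetes deployments would typically use pod annotations or service discovery instead of static targets.

```yaml
# prometheus.yml -- minimal static scrape job; targets and the
# exporter port (8008) are illustrative placeholders.
scrape_configs:
  - job_name: "pinot"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "pinot-controller-0:8008"
          - "pinot-broker-0:8008"
          - "pinot-server-0:8008"
```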

For a complete Kubernetes walkthrough, see Monitor Pinot using Prometheus and Grafana.

Prerequisites

  • Pinot cluster deployed and running

  • For Prometheus: JMX Exporter agent JAR and Pinot-specific JMX config (pinot.yml)

  • For Grafana: a running Grafana instance with Prometheus configured as a data source

  • For JFR: JDK 11+ (JFR is included in OpenJDK since Java 11)
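As a quick sanity check that the exporter is emitting what you expect, the Prometheus text exposition format is simple enough to inspect programmatically. The sketch below parses a small sample of that format; the metric names in the sample are illustrative, not Pinot's actual metric names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-exposition lines into {metric: value}.

    Minimal sketch: skips HELP/TYPE comment lines and assumes no
    spaces inside label values.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative sample -- real Pinot metric names will differ.
sample = """\
# TYPE jvm_heap_used_bytes gauge
jvm_heap_used_bytes 5.2e8
pinot_broker_queries_total{table="myTable"} 1234
"""
m = parse_metrics(sample)
print(m["jvm_heap_used_bytes"])  # 520000000.0
```

Pointing the same logic at a live `/metrics` endpoint (e.g. via `curl`) is a fast way to confirm the agent config exposes the gauges, meters, and timers you plan to alert on.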

Child pages

| Page | Description |
| --- | --- |
| Monitoring | Critical metrics reference with alert thresholds and diagnosis patterns for every component |
| Continuous JFR | Runbook for always-on Java Flight Recorder profiling with dynamic cluster-level control |
| Monitor Pinot using Prometheus and Grafana | Step-by-step Kubernetes setup for Prometheus scraping and Grafana dashboards |

Next step

With monitoring in place, tune your cluster for optimal performance. Continue to Performance tuning.
