Realtime Ingestion Stopped
We sometimes observe that certain Kafka partitions stop ingesting because the segment commit fails.
Sample errors:
A partition usually stops ingesting through the following sequence:
The servers tell the controller to commit the segment.
The controller acknowledges and asks the lead server to commit.
The lead server fails to commit for one of many reasons (segment build time exceeding the controller lease, server OOM, etc.).
The other server replicas get permission to build, try to commit, and also fail.
The partition stops ingesting entirely.
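The failure sequence above can be sketched as follows. This is an illustrative model only; the function and parameter names are invented for this sketch and are not actual Pinot internals:

```python
def attempt_commit(build_time_s: float, lease_s: float) -> bool:
    # A commit fails if the segment build exceeds the controller's lease
    # (one of the failure reasons listed above; OOM etc. would look the same).
    return build_time_s <= lease_s

def run_commit_protocol(replicas, build_times_s, lease_s):
    # The controller grants the commit lease to one replica at a time.
    # If every replica fails to build and commit within the lease,
    # the partition stops ingesting.
    for replica, build_time in zip(replicas, build_times_s):
        if attempt_commit(build_time, lease_s):
            return f"{replica} committed"
    return "partition stopped"

# Both replicas take longer to build than the lease allows:
print(run_commit_protocol(["server-1", "server-2"], [30, 20], lease_s=10))
```

Under this model, raising effective commit throughput means either shortening segment build time or taking the controller lease out of the build path, which is what the mitigations below do.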
To mitigate, we suggest the steps below to ensure your setup is scalable and stable.
Ensure Pinot servers save segments directly to the deep store, keeping the controller out of the critical data path. Ref link: . This is the most critical fix, as it removes the controller as the bottleneck for data commit. Without it, for every commit the controller must:
receive the segment tarball from the Pinot server
uncompress it
extract the segment metadata
upload the segment tarball to the deep store
update the segment metadata in ZooKeeper
complete the protocol
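As a sketch, taking the controller out of the upload path typically means enabling split commit and pointing servers at the deep store. The property names and the S3 URI below are assumptions based on common Pinot deployments; verify them against the configuration reference for your Pinot version:

```
# Controller config (assumed property names): let servers upload segments themselves
controller.enable.split.commit=true

# Server config: enable split commit and point the server at the deep store
# (the bucket path is a placeholder for your own deep store location)
pinot.server.enable.split.commit=true
pinot.server.instance.segment.store.uri=s3://your-bucket/pinot/segments
```

With this in place, the server uploads the segment tarball to the deep store itself and the controller only handles the lightweight metadata steps.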
For large realtime segments, we suggest using DOWNLOAD for completionMode, so that the other server replicas won't waste CPU cycles building segments. Ref:
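A minimal sketch of the table-config fragment for this, assuming the standard segmentsConfig/completionConfig layout (check your Pinot version's table config reference for the exact shape):

```
{
  "segmentsConfig": {
    "completionConfig": {
      "completionMode": "DOWNLOAD"
    }
  }
}
```

With DOWNLOAD mode, non-committing replicas fetch the finished segment from the deep store instead of each rebuilding it locally.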
Limit concurrent realtime segment builds by configuring the Pinot servers:
This reduces each segment's build time, relieving the segment commit timeout situation as well as the concurrent pressure on the controller side. Ref:
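A sketch of the server-side setting, assuming the parallel-segment-build property name below (this key name is an assumption; confirm it in the server configuration reference for your Pinot version):

```
# Server config (assumed property name): cap how many realtime segments
# a single server builds concurrently, so each build finishes faster
pinot.server.instance.realtime.max.parallel.segment.builds=4
```

Fewer concurrent builds means each build contends less for CPU and I/O, so individual segments are more likely to finish within the controller's commit lease.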