
Kafka

This page describes how to connect Kafka to Pinot.

Kafka 2.x Plugin

Pinot provides stream plugin support for Kafka 2.x versions. Although the version used in this implementation is Kafka 2.0.0, it is possible to compile the plugin against a higher Kafka library version, e.g. 2.1.1.

How to build and release Pinot package with Kafka 2.x connector

    mvn clean package -DskipTests -Pbin-dist -Dkafka.version=2.0

How to use Kafka 2.x connector

  • Use Kafka Stream (High) Level Consumer

Below is a sample streamConfigs used to create a realtime table with the Kafka Stream (High) level consumer.

The Kafka 2.x HLC consumer uses org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory in the config stream.kafka.consumer.factory.class.name.

    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "highLevel",
      "stream.kafka.topic.name": "meetupRSVPEvents",
      "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.hlc.zk.connect.string": "localhost:2191/kafka",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "localhost:2191/kafka",
      "stream.kafka.hlc.bootstrap.server": "localhost:19092"
    }

  • Use Kafka Partition (Low) Level Consumer

Below is a sample table config used to create a realtime table with the Kafka Partition (Low) level consumer:

    {
      "tableName": "meetupRsvp",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "mtime",
        "timeType": "MILLISECONDS",
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "schemaName": "meetupRsvp",
        "replication": "1",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "LowLevel",
          "stream.kafka.topic.name": "meetupRSVPEvents",
          "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory",
          "stream.kafka.zk.broker.url": "localhost:2191/kafka",
          "stream.kafka.broker.list": "localhost:19092"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }

Please note:

  1. Config replicasPerPartition under segmentsConfig is required to specify table replication.

  2. Config stream.kafka.consumer.type should be specified as LowLevel to use the partition level consumer. (The use of simple instead of LowLevel is deprecated.)

  3. Configs stream.kafka.zk.broker.url and stream.kafka.broker.list are required under tableIndexConfig.streamConfigs to provide the Kafka-related information.

Upgrade from Kafka 0.9 connector to Kafka 2.x connector

  • Update the table config for both high level and low level consumers: change stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.

  • If using the Stream (High) level consumer: also add the config stream.kafka.hlc.bootstrap.server into tableIndexConfig.streamConfigs. This config should be the URI of the Kafka broker list, e.g. localhost:9092.

How to use this plugin with a higher Kafka version?

This connector is also suitable for Kafka library versions higher than 2.0.0. In pinot-connector-kafka-2.0/pom.xml, changing kafka.lib.version from 2.0.0 to 2.1.1 will make this connector work with Kafka 2.1.1.

Streaming

Pluggable Streams

Note

This section is a pre-read if you are planning to develop plug-ins for streams other than Kafka. Pinot supports Kafka out of the box.

Prior to commit ba9f2d, Pinot was only able to support consuming from a Kafka stream.

Pinot now enables its users to write plug-ins to consume from pub-sub streams other than Kafka. (Please refer to Issue #2583.)

Some of the streams for which plug-ins can be added are:

  • Amazon Kinesis

  • Azure Event Hubs

  • LogDevice

  • Google Pub/Sub

  • Pravega

  • Pulsar

You may encounter some limitations either in Pinot or in the stream system while developing plug-ins. Please feel free to get in touch with us when you start writing a stream plug-in, and we can help you out. We are open to receiving PRs in order to improve these abstractions if they do not work for a certain stream implementation.

Refer to Consuming and Indexing rows in Realtime for details on how Pinot consumes streaming data.

Write your stream

This page describes how to write your own stream to plug into Pinot. Two modes are available: high level and low level.

Requirements to support Stream Level (High Level) consumers

The stream should provide the following guarantees:

  • Exactly once delivery (unless restarting from a checkpoint) for each consumer of the stream.

  • (Optionally) support a mechanism to split events (in some arbitrary fashion) so that each event in the stream is delivered to exactly one host out of a set of hosts.

  • Provide ways to save a checkpoint for the data consumed so far. If the stream is partitioned, then this checkpoint is a vector of checkpoints for events consumed from individual partitions.

  • The checkpoints should be recorded only when Pinot makes a call to do so.

  • The consumer should be able to start consumption from one of:

    • latest available data

    • earliest available data

    • last saved checkpoint
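To make these guarantees concrete, below is a minimal Java sketch of a stream-level consumer shaped around them. Every name in it (HighLevelStreamConsumer, StartPosition, poll, commitCheckpoint) is a hypothetical stand-in invented for this illustration; none of it is a Pinot class or a real stream-client API.

    import java.util.List;

    // Hypothetical sketch of the guarantees listed above; not a real API.
    public class HighLevelStreamConsumer {

      // The three allowed starting points for consumption.
      public enum StartPosition { EARLIEST, LATEST, LAST_SAVED_CHECKPOINT }

      private String checkpoint;            // serialized vector of per-partition checkpoints
      private StartPosition startPosition;

      public void start(StartPosition position, String savedCheckpoint) {
        this.startPosition = position;
        // When resuming from a checkpoint, events after that checkpoint may be
        // redelivered; otherwise each event is delivered exactly once per consumer.
        this.checkpoint = (position == StartPosition.LAST_SAVED_CHECKPOINT) ? savedCheckpoint : null;
      }

      public List<byte[]> poll() {
        // Pull the next batch of events and advance the in-memory position,
        // but do NOT persist anything here.
        return List.of();
      }

      // Pinot decides when a checkpoint becomes durable: the consumer records
      // it only when this method is explicitly invoked.
      public String commitCheckpoint() {
        return checkpoint;
      }
    }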

Requirements to support Partition Level (Low Level) consumers

While consuming rows at a partition level, the stream should support the following properties:

  • Stream should provide a mechanism to get the current number of partitions.

  • Each event in a partition should have a unique offset that is not more than 64 bits long.

  • Refer to a partition as a number not exceeding 32 bits long.

  • Stream should provide the following mechanisms to get an offset for a given partition of the stream:

    • get the offset of the oldest event available (assuming events are aged out periodically) in the partition

    • get the offset of the most recent event published in the partition

    • (optionally) get the offset of an event that was published at a specified time

  • Stream should provide a mechanism to consume a set of events from a partition starting from a specified offset.

  • Pinot assumes that the offsets of incoming events are monotonically increasing; i.e., if Pinot consumes an event at offset o1, then the offset o2 of the following event should be such that o2 > o1.

In addition, we have an operational requirement that the number of partitions should not be reduced over time.
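Restated as code, these requirements amount to roughly the following hypothetical client interface. The type and method names (PartitionedStreamClient, fetch, offsetAtTime, and so on) are invented for illustration and do not correspond to Pinot's actual classes.

    import java.util.List;
    import java.util.Optional;

    // Hypothetical interface capturing the partition-level requirements above.
    interface PartitionedStreamClient {

      // Current number of partitions; operationally this should never shrink.
      int partitionCount();

      // Partitions are identified by a number that fits in 32 bits.
      long earliestOffset(int partition);   // oldest event still retained
      long latestOffset(int partition);     // most recently published event

      // Optional: offset of an event published at (or after) a given timestamp.
      Optional<Long> offsetAtTime(int partition, long epochMillis);

      // Consume a batch of events starting from a specified offset. Offsets are
      // unique per partition, fit in 64 bits, and must be monotonically
      // increasing: if event e1 has offset o1 and the next event e2 has offset
      // o2, then o2 > o1.
      List<byte[]> fetch(int partition, long startOffset, int maxEvents, long timeoutMillis);
    }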

Stream plug-in implementation

In order to add a new type of stream (say, Foo), implement the following classes:

  1. FooConsumerFactory extends StreamConsumerFactory

  2. FooPartitionLevelConsumer implements PartitionLevelConsumer

  3. FooStreamLevelConsumer implements StreamLevelConsumer

  4. FooMetadataProvider implements StreamMetadataProvider

  5. FooMessageDecoder implements StreamMessageDecoder

Depending on stream level or partition level, your implementation needs to include StreamLevelConsumer or PartitionLevelConsumer, as sketched below.
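As a structural sketch, a hypothetical Foo plug-in might be wired together roughly as follows. The stand-in types and method signatures here are simplified so the example is self-contained and compiles on its own; the real contracts are defined by the Pinot interfaces named in the list above and vary with the Pinot version you build against.

    import java.util.Map;

    // Simplified stand-in for Pinot's row abstraction, used only in this sketch.
    class GenericRowStub extends java.util.HashMap<String, Object> { }

    // Instantiated by Pinot via "stream.foo.consumer.factory.class.name".
    class FooConsumerFactory {
      private final Map<String, String> streamConfigs;

      FooConsumerFactory(Map<String, String> streamConfigs) {
        this.streamConfigs = streamConfigs;
      }

      // Which consumer gets used depends on "stream.foo.consumer.type".
      FooPartitionLevelConsumer createPartitionLevelConsumer(String clientId, int partition) {
        return new FooPartitionLevelConsumer(clientId, partition, streamConfigs);
      }

      FooStreamLevelConsumer createStreamLevelConsumer(String clientId, String groupId) {
        return new FooStreamLevelConsumer(clientId, groupId, streamConfigs);
      }

      FooMetadataProvider createMetadataProvider(String clientId) {
        return new FooMetadataProvider(clientId, streamConfigs);
      }
    }

    // Consumers: one of the two is required, depending on the consumer type.
    class FooPartitionLevelConsumer {
      FooPartitionLevelConsumer(String clientId, int partition, Map<String, String> cfg) { /* connect to Foo */ }
    }

    class FooStreamLevelConsumer {
      FooStreamLevelConsumer(String clientId, String groupId, Map<String, String> cfg) { /* connect to Foo */ }
    }

    // Answers partition-count and offset queries for the Foo stream.
    class FooMetadataProvider {
      FooMetadataProvider(String clientId, Map<String, String> cfg) { }
    }

    // Turns raw Foo payloads into rows Pinot can index; configured via
    // "stream.foo.decoder.class.name" and "stream.foo.decoder.prop.*".
    class FooMessageDecoder {
      void init(Map<String, String> decoderProps) { }
      GenericRowStub decode(byte[] payload) { return new GenericRowStub(); }
    }

In the table config, the factory and decoder classes are referenced by their fully qualified names (stream.foo.consumer.factory.class.name and stream.foo.decoder.class.name), as shown in the configuration example below.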

The properties for the stream implementation are to be set in the table configuration, inside the streamConfigs section.

Use the streamType property to define the stream type. For example, for the implementation of stream foo, set the property "streamType" : "foo".

The rest of the configuration properties for your stream should be set with the prefix "stream.foo". Be sure to use the same suffix for the following (see examples below):

  • topic

  • consumer type

  • stream consumer factory

  • offset

  • decoder class name

  • decoder properties

  • connection timeout

  • fetch timeout

All values should be strings. For example:

    "streamType" : "foo",
    "stream.foo.topic.name" : "SomeTopic",
    "stream.foo.consumer.type": "LowLevel",
    "stream.foo.consumer.factory.class.name": "fully.qualified.pkg.ConsumerFactoryClassName",
    "stream.foo.consumer.prop.auto.offset.reset": "largest",
    "stream.foo.decoder.class.name" : "fully.qualified.pkg.DecoderClassName",
    "stream.foo.decoder.prop.a.decoder.property" : "decoderPropValue",
    "stream.foo.connection.timeout.millis" : "10000", // default 30_000
    "stream.foo.fetch.timeout.millis" : "10000" // default 5_000

You can have additional properties that are specific to your stream. For example:

    "stream.foo.some.buffer.size" : "24g"

In addition to these properties, you can define thresholds for the consuming segments:

  • rows threshold

  • time threshold

The properties for the thresholds are as follows:

    "realtime.segment.flush.threshold.size" : "100000"
    "realtime.segment.flush.threshold.time" : "6h"

An example of this implementation can be found in the KafkaConsumerFactory, which is an implementation for the Kafka stream.

Creating Pinot Segments

Realtime segment generation

To consume in realtime, we simply need to create a table with the same name as the schema and point to the Kafka topic to consume from, using a table definition such as this one:

    {
      "tableName": "flights",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "7",
        "segmentPushFrequency": "daily",
        "segmentPushType": "APPEND",
        "replication": "1",
        "timeColumnName": "daysSinceEpoch",
        "timeType": "DAYS",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
      },
      "tableIndexConfig": {
        "invertedIndexColumns": [
          "flightNumber",
          "tags",
          "daysSinceEpoch"
        ],
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "highLevel",
          "stream.kafka.topic.name": "flights-realtime",
          "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.zk.broker.url": "localhost:2181",
          "stream.kafka.hlc.zk.connect.string": "localhost:2181"
        }
      },
      "tenants": {
        "broker": "brokerTenant",
        "server": "serverTenant"
      },
      "metadata": {
      }
    }

First, we’ll start a local instance of Kafka and start streaming data into it:

    bin/pinot-admin.sh StartKafka &
    bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2014.avro -kafkaTopic flights-realtime &

This will stream one event per second from the Avro file to the Kafka topic. Then, we’ll create a realtime table, which will start consuming from the Kafka topic:

    bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json

We can then query the table with the following query to see the events stream in:

    SELECT COUNT(*) FROM flights

Repeating the query multiple times should show the events slowly being streamed into the table.