Apache Kafka

This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.

Introduction

In this guide, you'll learn how to import data into Pinot using Apache Kafka for real-time stream ingestion. Pinot has out-of-the-box real-time ingestion support for Kafka.

Let's set up a demo Kafka cluster locally and create a sample topic, transcript-topic.

Start Kafka

Start the Kafka cluster on port 9876, using the same Zookeeper from the quick-start examples.

Using Docker:

    docker run \
        --network pinot-demo --name=kafka \
        -e KAFKA_ZOOKEEPER_CONNECT=pinot-quickstart:2123/kafka \
        -e KAFKA_BROKER_ID=0 \
        -e KAFKA_ADVERTISED_HOST_NAME=kafka \
        -d wurstmeister/kafka:latest

Using launcher scripts:

    bin/pinot-admin.sh StartKafka -zkAddress=localhost:2123/kafka -port 9876

Create a Kafka topic

Download the latest Kafka, then create a topic.

Using Docker:

    docker exec \
      -t kafka \
      /opt/kafka/bin/kafka-topics.sh \
      --zookeeper pinot-quickstart:2123/kafka \
      --partitions=1 --replication-factor=1 \
      --create --topic transcript-topic

Using launcher scripts:

    bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 1 --topic transcript-topic

Creating Schema Configuration

We will publish the data in the same format as mentioned in the Stream ingestion docs, so you can use the same schema mentioned under Create Schema Configuration.

Creating a table configuration

Next, we create the real-time table configuration for the transcript table, using the schema from the previous step.

For Kafka, we set streamType to kafka. Currently only the JSON format is supported, but you can easily write your own decoder by extending the StreamMessageDecoder interface. You can then make your decoder class available by putting the jar file in the plugins directory.
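
As a rough illustration of what such a decoder can look like, here is a minimal sketch of a custom decoder that parses CSV-style messages. It assumes the org.apache.pinot.spi.stream.StreamMessageDecoder interface with the init/decode methods found in recent Pinot releases; the class name, package, and the "columns" decoder property are hypothetical, and exact method signatures may differ between Pinot versions, so treat this as a starting point rather than a drop-in implementation.

    package com.example.pinot;  // hypothetical package for this sketch

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.Set;

    import org.apache.pinot.spi.data.readers.GenericRow;
    import org.apache.pinot.spi.stream.StreamMessageDecoder;

    // Sketch of a decoder that turns "studentID,firstName,score" CSV payloads into Pinot rows.
    // The interface shape (init + decode) is based on recent Pinot versions and may need
    // adjustment for the Pinot release you build against.
    public class SimpleCsvMessageDecoder implements StreamMessageDecoder<byte[]> {

      private String[] _columns;

      @Override
      public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName)
          throws Exception {
        // Column order is passed through a (hypothetical) decoder property in streamConfigs,
        // e.g. "stream.kafka.decoder.prop.columns": "studentID,firstName,score"
        _columns = props.getOrDefault("columns", "").split(",");
      }

      @Override
      public GenericRow decode(byte[] payload, GenericRow destination) {
        return decode(payload, 0, payload.length, destination);
      }

      @Override
      public GenericRow decode(byte[] payload, int offset, int length, GenericRow destination) {
        String message = new String(payload, offset, length, StandardCharsets.UTF_8);
        String[] values = message.split(",");
        for (int i = 0; i < _columns.length && i < values.length; i++) {
          destination.putValue(_columns[i], values[i]);
        }
        return destination;
      }
    }

Package the class into a jar, drop the jar into the plugins directory, and reference the class through stream.kafka.decoder.class.name in the table config.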

The lowLevel consumer reads data per partition, whereas the highLevel consumer uses the Kafka high-level consumer to read data from the whole stream; it has no control over which partition is read at a particular moment.

For Kafka versions below 2.X, use org.apache.pinot.plugin.stream.kafka09.KafkaConsumerFactory

For Kafka version 2.X and above, use org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory

You can set the offset to:

  • smallest to start the consumer from the earliest offset

  • largest to start the consumer from the latest offset

  • timestamp in format yyyy-MM-dd'T'HH:mm:ss.SSSZ to start the consumer from the offset after the timestamp

  • datetime duration or period to start the consumer from the offset after the period, e.g., '2d'

The resulting configuration should look as follows:

    /tmp/pinot-quick-start/transcript-table-realtime.json

    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.topic.name": "transcript-topic",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.broker.list": "localhost:9876",
          "realtime.segment.flush.threshold.time": "3600000",
          "realtime.segment.flush.threshold.rows": "50000",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }

Upgrade from Kafka 0.9 connector to Kafka 2.x connector

  • Update table config for both high level and low level consumer: update the config stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.

  • If using the Stream (High) level consumer, also add the config stream.kafka.hlc.bootstrap.server into tableIndexConfig.streamConfigs. This config should be the URI of the Kafka broker list, e.g. localhost:9092.

How to consume from a higher Kafka version?

This connector is also suitable for Kafka lib versions higher than 2.0.0. In the Kafka 2.0 connector pom.xml, changing kafka.lib.version from 2.0.0 to 2.1.1 will make this connector work with Kafka 2.1.1.

How to consume transactionally committed Kafka messages

The connector with Kafka lib 2.0+ supports Kafka transactions. Transaction support is controlled by the config kafka.isolation.level in the Kafka stream config, which can be read_committed or read_uncommitted (default). Setting it to read_committed will ingest only transactionally committed messages from the Kafka stream.
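
To see the isolation level in action, you can publish messages inside a Kafka transaction and confirm that Pinot only ingests them once the transaction commits. Below is a minimal sketch using the standard Kafka Java producer; the topic name and broker address reuse the values from this guide, the transactional.id is a placeholder, and error handling is omitted for brevity.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalTranscriptProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9876");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A transactional.id enables transactional writes from this producer.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transcript-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          producer.initTransactions();
          producer.beginTransaction();
          producer.send(new ProducerRecord<>("transcript-topic",
              "{\"studentID\":213,\"firstName\":\"Ada\",\"lastName\":\"Lovelace\",\"gender\":\"Female\","
                  + "\"subject\":\"Maths\",\"score\":4.0,\"timestamp\":1572854400000}"));
          // With kafka.isolation.level=read_committed, Pinot will only see this record
          // after commitTransaction() succeeds; an abortTransaction() would hide it entirely.
          producer.commitTransaction();
        }
      }
    }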

Upload schema and table

Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the real-time table is created, it will begin ingesting available records from the Kafka topic.

Using Docker:

    docker run \
        --network=pinot-demo \
        -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
        --name pinot-streaming-table-creation \
        apachepinot/pinot:latest AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
        -controllerHost pinot-quickstart \
        -controllerPort 9000 \
        -exec

Using launcher scripts:

    bin/pinot-admin.sh AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
        -exec

Add sample data to the Kafka topic

We will publish data in the following format to Kafka. Let us save the data in a file named transcript.json.

    transcript.json

    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
    {"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}

Push the sample JSON into the transcript-topic Kafka topic using the Kafka console producer. This will add the 12 records from transcript.json to the topic.

    bin/kafka-console-producer.sh \
        --broker-list localhost:9876 \
        --topic transcript-topic < transcript.json
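
If you prefer to push the records programmatically rather than through the console producer, a sketch with the Kafka Java client is shown below; it assumes the same local broker on port 9876 and reads transcript.json from the working directory.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TranscriptProducer {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9876");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Each line of transcript.json is one JSON record, matching KafkaJSONMessageDecoder.
          for (String line : Files.readAllLines(Paths.get("transcript.json"))) {
            producer.send(new ProducerRecord<>("transcript-topic", line));
          }
          producer.flush();
        }
      }
    }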

Ingesting streaming data

As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.

    SELECT * FROM transcript
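
You can also run the same query over HTTP against the broker's SQL endpoint instead of the Query Console. The sketch below uses the JDK HTTP client and assumes the quick-start broker is reachable at localhost:8099; adjust the host and port for your deployment.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class QueryTranscript {
      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // The broker exposes SQL queries at POST /query/sql with a JSON body.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8099/query/sql"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"sql\": \"SELECT * FROM transcript LIMIT 10\"}"))
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
      }
    }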

Some more Kafka ingestion configs

Use Kafka Partition (Low) Level Consumer with SSL

Here is an example config which uses SSL-based authentication to talk to Kafka and the schema registry (shown at the end of this section). Notice there are two sets of SSL options: the ones starting with ssl. are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.

Ingest transactionally committed messages only from Kafka

With Kafka consumer 2.0, you can ingest only transactionally committed messages by configuring kafka.isolation.level to read_committed, as in the example config at the end of this section.

Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.

Use Kafka Level Consumer with SASL_SSL

Here is an example config which uses SASL_SSL-based authentication to talk to Kafka and the schema registry (shown at the end of this section). Notice there are two sets of SSL options: some are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.

Note: Post release 0.10.0, we have started shading Kafka packages inside Pinot. If you are using our latest tagged docker images or master build, you should replace org.apache.kafka with shaded.org.apache.kafka in your table config.

Example config: Kafka partition (low) level consumer with SSL

      {
        "tableName": "transcript",
        "tableType": "REALTIME",
        "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
        },
        "tenants": {},
        "tableIndexConfig": {
          "loadMode": "MMAP",
          "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.consumer.type": "LowLevel",
            "stream.kafka.topic.name": "transcript-topic",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.zk.broker.url": "localhost:2191/kafka",
            "stream.kafka.broker.list": "localhost:9876",
            "schema.registry.url": "",
            "security.protocol": "SSL",
            "ssl.truststore.location": "",
            "ssl.keystore.location": "",
            "ssl.truststore.password": "",
            "ssl.keystore.password": "",
            "ssl.key.password": "",
            "stream.kafka.decoder.prop.schema.registry.rest.url": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.keystore.location": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.keystore.password": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.keystore.type": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.truststore.type": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.key.password": "",
            "stream.kafka.decoder.prop.schema.registry.ssl.protocol": "",
          }
        },
        "metadata": {
          "customConfigs": {}
        }
      }
Example config: ingest transactionally committed messages only (read_committed)

      {
        "tableName": "transcript",
        "tableType": "REALTIME",
        "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
        },
        "tenants": {},
        "tableIndexConfig": {
          "loadMode": "MMAP",
          "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.consumer.type": "LowLevel",
            "stream.kafka.topic.name": "transcript-topic",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.zk.broker.url": "localhost:2191/kafka",
            "stream.kafka.broker.list": "localhost:9876",
            "stream.kafka.isolation.level": "read_committed"
          }
        },
        "metadata": {
          "customConfigs": {}
        }
      }
    "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.topic.name": "mytopic",
            "stream.kafka.consumer.prop.auto.offset.reset": "largest",
            "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.broker.list": "kafka-broker-host:9092",
            "stream.kafka.schema.registry.url": "https://xxx",
            "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
            "stream.kafka.decoder.prop.schema.registry.rest.url": "https://xxx",
            "stream.kafka.decoder.prop.basic.auth.credentials.source": "USER_INFO",
            "stream.kafka.decoder.prop.schema.registry.basic.auth.user.info": "schema_registry_username:schema_registry_password",
            "sasl.mechanism": "PLAIN" ,
            "security.protocol": "SASL_SSL" ,
            "sasl.jaas.config":"org.apache.kafka.common.security.scram.ScramLoginModule required username=\"kafkausername\" password=\"kafkapassword\";",
            "realtime.segment.flush.threshold.rows": "0",
            "realtime.segment.flush.threshold.time": "24h",
            "realtime.segment.flush.autotune.initialRows": "3000000",
            "realtime.segment.flush.threshold.segment.size": "500M"
          },

    Amazon Kinesis

To ingest events from an Amazon Kinesis stream into Pinot, set the following configs into the table config:

    {
      "tableName": "kinesisTable",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kinesis",
          "stream.kinesis.topic.name": "<your kinesis stream name>",
          "region": "<your region>",
          "accessKey": "<your access key>",
          "secretKey": "<your secret key>",
          "shardIteratorType": "AFTER_SEQUENCE_NUMBER",
          "stream.kinesis.consumer.type": "lowlevel",
          "stream.kinesis.fetch.timeout.millis": "30000",
          "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
          "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
          "realtime.segment.flush.threshold.rows": "1000000",
          "realtime.segment.flush.threshold.time": "6h"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }

where the Kinesis specific properties are:

  • streamType - This should be set to "kinesis"

  • stream.kinesis.topic.name - Kinesis stream name

  • region - Kinesis region, e.g. us-west-1

  • accessKey - Kinesis access key

  • secretKey - Kinesis secret key

  • shardIteratorType - Set to LATEST to consume only new records, TRIM_HORIZON for the earliest sequence number, or AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER to start consumption from a particular sequence number

  • maxRecordsToFetch - ... Default is 20.

Kinesis supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for the credentials in the following order:

  • Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by the Java SDK)

  • Java System Properties - aws.accessKeyId and aws.secretKey

  • Web Identity Token credentials from the environment or container

  • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI

  • Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable

  • Instance profile credentials delivered through the Amazon EC2 metadata service

You can also specify the accessKey and secretKey using the properties shown above. However, this method is not secure and should be used only for POC setups. You can also specify other AWS fields, such as AWS_SESSION_TOKEN, as environment variables and config, and it will work.
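
To sanity-check the setup end to end, you can publish a test event to the stream with the AWS SDK for Java v2, which resolves credentials through the same DefaultCredentialsProviderChain described above. This is a minimal sketch; the stream name and region are placeholders for your own values.

    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.kinesis.KinesisClient;
    import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

    public class KinesisTestEvent {
      public static void main(String[] args) {
        // Credentials are picked up from env vars, system properties, profile files, etc.
        try (KinesisClient kinesis = KinesisClient.builder().region(Region.US_WEST_1).build()) {
          String json = "{\"studentID\":205,\"firstName\":\"Natalie\",\"lastName\":\"Jones\","
              + "\"gender\":\"Female\",\"subject\":\"Maths\",\"score\":3.8,\"timestamp\":1571900400000}";
          PutRecordRequest request = PutRecordRequest.builder()
              .streamName("<your kinesis stream name>")   // placeholder, matches the table config above
              .partitionKey("205")
              .data(SdkBytes.fromUtf8String(json))
              .build();
          kinesis.putRecord(request);
        }
      }
    }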

Limitations

1. ShardID is of the format "shardId-000000000001". We use the numeric part as the partitionId. Our partitionId variable is an integer, so if shardIds grow beyond Integer.MAX_VALUE, it will overflow.

2. Segment-size-based thresholds for segment completion will not work. They assume that partition "0" always exists; however, once shard 0 is split/merged, we will no longer have partition 0.
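
The following snippet illustrates the shardId-to-partitionId mapping behind limitation 1; it is not Pinot code, just a sketch of the numeric extraction and where the integer overflow would occur.

    public class ShardIdMapping {
      public static void main(String[] args) {
        String shardId = "shardId-000000000001";
        // The numeric suffix becomes the partitionId; parsing it as an int is what
        // caps the usable range at Integer.MAX_VALUE (2147483647).
        int partitionId = Integer.parseInt(shardId.substring("shardId-".length()));
        System.out.println(partitionId);  // prints 1
      }
    }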

    Apache Pulsar

Pinot supports consuming data from Apache Pulsar via the pinot-pulsar plugin. You need to enable this plugin so that the Pulsar-specific libraries are present in the classpath.

You can enable the Pulsar plugin with the following config at the time of Pinot setup: -Dplugins.include=pinot-pulsar

Note: The pinot-pulsar plugin is not part of the official 0.10.0 binary. You can download the plugin from our external repository and add it to the libs or plugins directory in Pinot.

Set up Pulsar table

A sample Pulsar stream config to ingest data should look as follows. You can use the streamConfigs section from this sample and make changes for your corresponding table.

    {
      "tableName": "pulsarTable",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "pulsar",
          "stream.pulsar.topic.name": "<your pulsar topic name>",
          "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650,pulsar://localhost:6651",
          "stream.pulsar.consumer.prop.auto.offset.reset" : "smallest",
          "stream.pulsar.consumer.type": "lowlevel",
          "stream.pulsar.fetch.timeout.millis": "30000",
          "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
          "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
          "realtime.segment.flush.threshold.rows": "1000000",
          "realtime.segment.flush.threshold.time": "6h"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }
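
To verify the table is consuming, you can publish a test message to the topic with the Pulsar Java client. This is a minimal sketch assuming a local broker at pulsar://localhost:6650 and the JSON record format used elsewhere in this guide.

    import java.nio.charset.StandardCharsets;

    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;

    public class PulsarTestMessage {
      public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
            .serviceUrl("pulsar://localhost:6650")
            .build()) {
          Producer<byte[]> producer = client.newProducer()
              .topic("<your pulsar topic name>")   // placeholder, matches the table config above
              .create();
          String json = "{\"studentID\":205,\"firstName\":\"Natalie\",\"lastName\":\"Jones\","
              + "\"gender\":\"Female\",\"subject\":\"Maths\",\"score\":3.8,\"timestamp\":1571900400000}";
          producer.send(json.getBytes(StandardCharsets.UTF_8));
          producer.close();
        }
      }
    }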

Pulsar configuration options

You can change the following Pulsar-specific configurations for your tables:

  • streamType - This should be set to "pulsar"

  • stream.pulsar.topic.name - Your Pulsar topic name

  • stream.pulsar.bootstrap.servers - Comma-separated broker list for Apache Pulsar

Authentication

The Pinot-Pulsar connector supports authentication using security tokens. You can generate a token by following the official Pulsar documentation. Once generated, you can add the following property to streamConfigs to add an auth token to each request:

    "stream.pulsar.authenticationToken": "your-auth-token"

TLS support

The Pinot-Pulsar connector also supports TLS for encrypted connections. You can follow the official Pulsar documentation to enable TLS on your Pulsar cluster. Once done, you can enable TLS in the Pulsar connector by providing the location of the trust certificate file generated in the previous step:

    "stream.pulsar.tlsTrustCertsFilePath": "/path/to/ca.cert.pem"

Also, make sure to change the broker URL from pulsar://localhost:6650 to pulsar+ssl://localhost:6650 so that secure connections are used.

For other table and stream configurations, you can head over to the Table configuration Reference.

Supported Pulsar versions

Pinot currently relies on Pulsar client version 2.7.2. Users should make sure the Pulsar broker is compatible with this client version.


    Stream ingestion

Apache Pinot lets users consume data from streams and push it directly into the database, in a process known as stream ingestion. Stream ingestion makes it possible to query data within seconds of publication.

Stream ingestion supports checkpointing to prevent data loss.

    Setting up Stream ingestion involves the following steps:

    1. Create schema configuration

    2. Create table configuration

    3. Upload table and schema spec

    Let's take a look at each of the steps in more detail.

Let us assume the data to be ingested is in the following format:

    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
    {"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
    {"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
    {"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
    {"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
    {"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}

Create Schema Configuration

A schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamp. For more details on schema configuration, see creating a schema.

For our sample data, the schema configuration looks like this:

    /tmp/pinot-quick-start/transcript-schema.json

    {
      "schemaName": "transcript",
      "dimensionFieldSpecs": [
        {
          "name": "studentID",
          "dataType": "INT"
        },
        {
          "name": "firstName",
          "dataType": "STRING"
        },
        {
          "name": "lastName",
          "dataType": "STRING"
        },
        {
          "name": "gender",
          "dataType": "STRING"
        },
        {
          "name": "subject",
          "dataType": "STRING"
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "score",
          "dataType": "FLOAT"
        }
      ],
      "dateTimeFieldSpecs": [{
        "name": "timestamp",
        "dataType": "LONG",
        "format" : "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
      }]
    }

Create Table Configuration

The next step is to create a table where all the ingested data will flow and can be queried. Unlike batch ingestion, the table configuration for real-time ingestion also triggers the data ingestion job. For a more detailed overview of tables, see the table reference.

The real-time table configuration consists of the following fields:

  • tableName - The name of the table where the data should flow

  • tableType - The internal type for the table. Should always be set to REALTIME for realtime ingestion

  • segmentsConfig -

  • tableIndexConfig - defines which column to use for indexing along with the type of index. For full configuration, see [Indexing Configs]. It has the following required fields -

    • loadMode - specifies how the segments should be loaded. Should be heap or mmap. Here's the difference between both the configs:

      • mmap: Segments are loaded onto memory-mapped files. This is the default mode.

      • heap: Segments are loaded into direct memory. Note, 'heap' here is a legacy misnomer, and it does not imply JVM heap. This mode should only be used when we want faster performance than memory-mapped files, and are also sure that we will never run into OOM.

    • streamConfig - specifies the data source along with the necessary configs to start consuming the real-time data. The streamConfig can be thought of as the equivalent of the job spec for batch ingestion. The following options are supported:

      • streamType - The streaming platform from which to consume the data. Supported value: kafka

      • stream.[streamType].consumer.type - Whether to use per partition low-level consumer or high-level stream consumer. Supported values: lowLevel (consume data from each partition with offset management) or highLevel (consume data without control over the partitions)

      • stream.[streamType].topic.name - The datasource (e.g. topic, data stream) from which to consume the data. String

      • stream.[streamType].decoder.class.name - Name of the class to be used for parsing the data. The class should implement the org.apache.pinot.spi.stream.StreamMessageDecoder interface. String. Available options:

        • org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder

        • org.apache.pinot.plugin.inputformat.avro.KafkaAvroMessageDecoder

        • org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder

        • org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder

      • stream.[streamType].consumer.factory.class.name - Name of the factory class to be used to provide the appropriate implementation of low level and high level consumer, as well as the metadata. String. Available options:

        • org.apache.pinot.plugin.stream.kafka09.KafkaConsumerFactory

        • org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory

        • org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory

        • org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory

      • stream.[streamType].consumer.prop.auto.offset.reset - Determines the offset from which to start the ingestion. Supported values: smallest, largest, or a timestamp in milliseconds

      • topic.consumption.rate.limit - Determines the upper bound for the consumption rate for the whole topic. Having a consumption rate limiter is beneficial in case the stream message rate has a bursty pattern which leads to long GC pauses on the Pinot servers. The rate limiter can also be considered as a safeguard against excessive ingestion of realtime tables. Double. The value should be greater than zero.

The following flush threshold settings are also supported:

  • realtime.segment.flush.threshold.time - Time threshold that will keep the realtime segment open before we complete the segment. Note that this time should be smaller than the Kafka retention period configured for the corresponding topic.

  • realtime.segment.flush.threshold.rows - Row count flush threshold for realtime segments. This behaves in a similar way for HLC and LLC. For HLC, since there is only one consumer per server, this size is used as the size of the consumption buffer and determines after how many rows we flush to disk. For example, if this threshold is set to two million rows, then a high level consumer would have a buffer size of two million. If this value is set to 0, then the consumers adjust the number of rows consumed by a partition such that the size of the completed segment is the desired size (unless threshold.time is reached first).

  • realtime.segment.flush.threshold.segment.size - The desired size of a completed realtime segment. This config is used only if realtime.segment.flush.threshold.rows is set to 0.

You can also specify additional configs for the consumer directly in the streamConfigs.

For our sample data and schema, the table config will look like this:

    /tmp/pinot-quick-start/transcript-table-realtime.json

    {
      "tableName": "transcript",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "schemaName": "transcript",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.topic.name": "transcript-topic",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.broker.list": "localhost:9876",
          "realtime.segment.flush.threshold.time": "3600000",
          "realtime.segment.flush.threshold.rows": "50000",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }

Upload schema and table config

Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.

Using Docker:

    docker run \
        --network=pinot-demo \
        -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
        --name pinot-streaming-table-creation \
        apachepinot/pinot:latest AddTable \
        -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
        -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
        -controllerHost pinot-quickstart \
        -controllerPort 9000 \
        -exec

Using launcher scripts:

    bin/pinot-admin.sh AddTable \
        -schemaFile /path/to/transcript-schema.json \
        -tableConfigFile /path/to/transcript-table-realtime.json \
        -exec

Custom Ingestion Support

We are working on support for other ingestion platforms, but you can also write your own ingestion plugin if your platform is not supported out of the box. For a walkthrough, see Stream Ingestion Plugin.

Pause Stream Ingestion

There are some scenarios in which you may want to pause the realtime ingestion while your table remains available for queries. For example, if there is a problem with the stream ingestion, you still want queries to be executed on the already ingested data while you troubleshoot the issue. For these scenarios, you can first issue a Pause request to a Controller host. After troubleshooting the stream is done, you can issue another request to the Controller to resume consumption.

    $ curl -X POST {controllerHost}/tables/{tableName}/pauseConsumption
    $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption

When a Pause request is issued, the Controller instructs the realtime servers hosting your table to commit their consuming segments immediately. However, the commit process may take some time to complete. Please note that Pause and Resume requests are async: an OK response means that the instructions for pausing or resuming have been successfully sent to the realtime server. If you want to know whether consumption actually stopped or resumed, you can issue a pause status request.

    $ curl -X POST {controllerHost}/tables/{tableName}/pauseStatus
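
Because pause and resume are asynchronous, an automation script typically polls the pause status endpoint after pausing until consumption has actually stopped. Here is a minimal sketch with the JDK HTTP client; the controller address is a placeholder, and the check on the response body is deliberately simplistic since the exact status payload may vary by Pinot version.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PauseAndWait {
      public static void main(String[] args) throws Exception {
        String controller = "http://localhost:9000";   // placeholder controller host
        String table = "transcript";
        HttpClient client = HttpClient.newHttpClient();

        // 1. Ask the controller to pause consumption (async operation).
        client.send(post(controller + "/tables/" + table + "/pauseConsumption"),
            HttpResponse.BodyHandlers.ofString());

        // 2. Poll the pause status until the consuming segments have been committed.
        while (true) {
          HttpResponse<String> status = client.send(
              post(controller + "/tables/" + table + "/pauseStatus"),
              HttpResponse.BodyHandlers.ofString());
          System.out.println(status.body());
          // Simplistic check: stop polling once no consuming segments are reported.
          if (!status.body().contains("CONSUMING")) {
            break;
          }
          Thread.sleep(5_000);
        }
      }

      private static HttpRequest post(String url) {
        return HttpRequest.newBuilder().uri(URI.create(url))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
      }
    }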

    It's worth noting that consuming segments on realtime servers are stored in volatile memory, and their resources are allocated when the consuming segments are first created. These resources cannot be altered if consumption parameters are changed midway through consumption. It may therefore take hours before these changes take effect. Furthermore, if the parameters are changed in an incompatible way (for example, changing the underlying stream with a completely new set of offsets, or changing the stream endpoint from which to consume messages, etc.), it will result in the table getting into an error state.

The pause and resume feature comes to the rescue here. When a Pause request is issued by the operator, consuming segments are committed without starting new mutable ones. Instead, new mutable segments are started only when the Resume request is issued. This mechanism provides the operators as well as developers with more flexibility. It also enables Pinot to be more resilient to the operational and functional constraints imposed by underlying streams.

There is another feature called "Force Commit" which utilizes the primitives of the pause and resume feature. When the operator issues a force commit request, the current mutable segments will be committed and new ones started right away. Operators can now use this feature for all compatible table config parameter changes to take effect immediately.

    $ curl -X POST {controllerHost}/tables/{tableName}/forceCommit

For incompatible parameter changes, an option is added to the resume request to handle the case of a completely new set of offsets. Operators can now follow a three-step process: first, issue a Pause request; second, change the consumption parameters; finally, issue the Resume request with the appropriate option. These steps will preserve the old data and allow the new data to be consumed immediately. Throughout the operation, queries will continue to be served.

    $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=smallest
    $ curl -X POST {controllerHost}/tables/{tableName}/resumeConsumption?resumeFrom=largest
