To ingest events from an Amazon Kinesis stream into Pinot, set the following configs in the table config. The Kinesis-specific properties are:
streamType - This should be set to "kinesis"
stream.kinesis.topic.name - Kinesis stream name
region - Kinesis region, e.g. us-west-1
accessKey - Kinesis access key
secretKey - Kinesis secret key
shardIteratorType - Set to LATEST to consume only new records, TRIM_HORIZON to start from the earliest sequence number, or AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER to start consumption from a particular sequence number
maxRecordsToFetch - Maximum number of records to fetch in a single request. Default is 20.
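For reference, a minimal streamConfigs sketch for a Kinesis table might look like the following. The stream name, region, and flush thresholds are placeholders to adjust for your setup; accessKey and secretKey are omitted here on the assumption that credentials come from the default AWS credential provider chain described below.

```json
"streamConfigs": {
  "streamType": "kinesis",
  "stream.kinesis.topic.name": "my-kinesis-stream",
  "region": "us-west-2",
  "shardIteratorType": "LATEST",
  "maxRecordsToFetch": "20",
  "stream.kinesis.consumer.type": "lowlevel",
  "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
  "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "6h"
}
```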
Kinesis supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for credentials in the following order:
Environment variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all AWS SDKs and the CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by the Java SDK)
Java system properties - aws.accessKeyId and aws.secretKey
Web Identity Token credentials from the environment or container
Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable
Instance profile credentials delivered through the Amazon EC2 metadata service
You can also specify the accessKey and secretKey directly in the stream properties. However, this method is not secure and should be used only for POC setups. Other AWS fields, such as AWS_SESSION_TOKEN, can also be supplied as environment variables or in the config, and they will be picked up.
ShardID is of the format "shardId-000000000001". We use the numeric part as the partition ID. Since the partition ID is stored as an integer, shard IDs that grow beyond Integer.MAX_VALUE will overflow.
Segment-size-based thresholds for segment completion will not work, because they assume that partition "0" always exists. Once shard 0 is split or merged, there is no longer a partition 0.
Apache Pinot lets users consume data from streams and push it directly into the database, in a process known as stream ingestion. Stream Ingestion makes it possible to query data within seconds of publication.
Stream ingestion supports checkpointing to prevent data loss.
Setting up Stream ingestion involves the following steps:
Create schema configuration
Create table configuration
Upload table and schema spec
Let's take a look at each of the steps in more detail.
Let us assume the data to be ingested is in the following format:
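For illustration, assume each record is a JSON document like the one below. The field names here are hypothetical and simply match the transcript example used in the Kafka guide later in this document.

```json
{
  "studentID": 205,
  "firstName": "Natalie",
  "lastName": "Jones",
  "gender": "Female",
  "subject": "Maths",
  "score": 3.8,
  "timestampInEpoch": 1571900400000
}
```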
The schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamp. For more details on schema configuration, see creating a schema.
For our sample data, the schema configuration looks like this:
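A sketch of the schema, assuming the hypothetical fields shown above (several dimensions, one metric, and an epoch-millisecond timestamp):

```json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "INT"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```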
The next step is to create a table where all the ingested data will flow and can be queried. Unlike batch ingestion, table configuration for real-time ingestion also triggers the data ingestion job. For a more detailed overview of tables, see the table reference.
The real-time table configuration consists of the following fields:
tableName - The name of the table where the data should flow
tableType - The internal type for the table. Should always be set to REALTIME for realtime ingestion
segmentsConfig - contains segment-level configuration such as the time column, replication, and retention
tableIndexConfig - defines which column to use for indexing along with the type of index. For full configuration, see Indexing Configs. It has the following required fields -
loadMode - specifies how the segments should be loaded. Should be heap or mmap. Here's the difference between the two modes:
mmap: Segments are loaded onto memory-mapped files. This is the default mode.
heap: Segments are loaded into direct memory. Note, 'heap' here is a legacy misnomer, and it does not imply JVM heap. This mode should only be used when we want faster performance than memory-mapped files, and are also sure that we will never run into OOM.
streamConfig - specifies the data source along with the necessary configs to start consuming the real-time data. The streamConfig can be thought of as the equivalent to the job spec for batch ingestion. The following options are supported:
streamType - The streaming platform from which to consume the data, e.g. kafka
stream.[streamType].consumer.type - Whether to use per-partition low-level consumer or high-level stream consumer. lowLevel - consume data from each partition with offset management. highLevel - consume data without control over the partitions
stream.[streamType].topic.name - The datasource (e.g. topic, data stream) from which to consume the data. String
stream.[streamType].decoder.class.name - Name of the class to be used for parsing the data. The class should implement the org.apache.pinot.spi.stream.StreamMessageDecoder interface. String. Available options:
org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder
org.apache.pinot.plugin.inputformat.avro.KafkaAvroMessageDecoder
org.apache.pinot.plugin.inputformat.avro.SimpleAvroMessageDecoder
org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder
stream.[streamType].consumer.factory.class.name - Name of the factory class to be used to provide the appropriate implementation of low-level and high-level consumer as well as the metadata. String. Available options:
org.apache.pinot.plugin.stream.kafka09.KafkaConsumerFactory
org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory
org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory
org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory
stream.[streamType].consumer.prop.auto.offset.reset - Determines the offset from which to start the ingestion. Allowed values: smallest, largest, or a timestamp in milliseconds
topic.consumption.rate.limit - Determines the upper bound for the consumption rate for the whole topic. Having a consumption rate limiter is beneficial in case the stream message rate has a bursty pattern which leads to long GC pauses on the Pinot servers. The rate limiter can also be considered a safeguard against excessive ingestion into realtime tables. Double; the value should be greater than zero
The following flush threshold settings are also supported:
realtime.segment.flush.threshold.time - Time threshold for which the realtime segment will be kept open before we complete the segment. Note that this time should be smaller than the Kafka retention period configured for the corresponding topic.
realtime.segment.flush.threshold.rows - Row count flush threshold for realtime segments. This behaves in a similar way for HLC and LLC. For HLC, since there is only one consumer per server, this size is used as the size of the consumption buffer and determines after how many rows we flush to disk. For example, if this threshold is set to two million rows, then a high-level consumer would have a buffer size of two million. If this value is set to 0, then the consumers adjust the number of rows consumed by a partition such that the size of the completed segment is the desired size (unless threshold.time is reached first).
realtime.segment.flush.threshold.segment.size - The desired size of a completed realtime segment. This config is used only if realtime.segment.flush.threshold.rows is set to 0.
You can also specify additional configs for the consumer by prefixing the key with stream.[streamType], where streamType is the name of the streaming platform, e.g. kafka.
For our sample data and schema, the table config will look like this:
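A sketch of a real-time table config for the hypothetical transcript data, assuming a local Kafka broker at localhost:9876 and the JSON decoder; stream.kafka.broker.list is the broker list needed by the low-level Kafka consumer, and the flush thresholds are illustrative:

```json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.threshold.segment.size": "50M"
    }
  },
  "metadata": {}
}
```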
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.
We are working on support for other ingestion platforms, but you can also write your own ingestion plugin if it is not supported out of the box. For a walkthrough, see Stream Ingestion Plugin.
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.
In this guide, you'll learn how to import data into Pinot using Apache Kafka for real-time stream ingestion. Pinot has out-of-the-box real-time ingestion support for Kafka.
Let's set up a demo Kafka cluster locally and create a sample topic, transcript-topic.
Start Kafka
Start the Kafka cluster on port 9876, using the same Zookeeper from the quick-start examples.
Create a Kafka topic
Download the latest Kafka release and create a topic named transcript-topic, as shown in the commands below.
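A sketch of the commands, assuming you run the first one from the Pinot distribution directory and the second from a downloaded Kafka distribution; the Zookeeper address passed to StartKafka is an assumption, so point it at the Zookeeper started by your quick-start:

```bash
# Start a local Kafka broker on port 9876, reusing the quick-start Zookeeper
bin/pinot-admin.sh StartKafka -zkAddress=localhost:2123/kafka -port 9876

# From the Kafka distribution: create the sample topic with a single partition
bin/kafka-topics.sh --create --bootstrap-server localhost:9876 \
  --replication-factor 1 --partitions 1 --topic transcript-topic
```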
We will publish the data in the same format as mentioned in the Stream ingestion docs. So you can use the same schema mentioned under Create Schema Configuration.
Next, we create the real-time table configuration for the transcript table described in the schema from the previous step.
For Kafka, we use streamType as kafka. Currently, only the JSON format is supported out of the box, but you can easily write your own decoder by extending the StreamMessageDecoder interface. You can then access your decoder class by putting the jar file in the plugins directory.
The lowLevel consumer reads data per partition, whereas the highLevel consumer utilises the Kafka high-level consumer to read data from the whole stream. It doesn't have control over which partition to read at a particular moment.
For Kafka versions below 2.X, use org.apache.pinot.plugin.stream.kafka09.KafkaConsumerFactory. For Kafka version 2.X and above, use org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory.
You can set the offset to:
smallest - to start the consumer from the earliest offset
largest - to start the consumer from the latest offset
timestamp in milliseconds - to start the consumer from the offset after the timestamp
The resulting configuration should look as follows -
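A sketch of the streamConfigs section for the transcript table under these assumptions (broker port, offset reset, and flush thresholds are illustrative; the rest of the table config follows the sample shown earlier in this document):

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "transcript-topic",
  "stream.kafka.broker.list": "localhost:9876",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.segment.size": "100M"
}
```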
Update the table config for both high-level and low-level consumers: change stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.
If using the stream (high) level consumer, also add the config stream.kafka.hlc.bootstrap.server to tableIndexConfig.streamConfigs. This config should be the URI of the Kafka broker list, e.g. localhost:9092.
This connector also works with Kafka library versions higher than 2.0.0. In the Kafka 2.0 connector pom.xml, changing kafka.lib.version from 2.0.0 to 2.1.1 will make this connector work with Kafka 2.1.1.
The connector with Kafka lib 2.0+ supports Kafka transactions. Transaction support is controlled by the config kafka.isolation.level in the Kafka stream config, which can be read_committed or read_uncommitted (default). Setting it to read_committed will ingest only transactionally committed messages from the Kafka stream.
Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the real-time table is created, it will begin ingesting available records from the Kafka topic.
We will publish data in the following format to Kafka. Let us save the data in a file named transcript.json.
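A hypothetical transcript.json, with one JSON record per line so it can be piped straight into the console producer (only a couple of the records are shown here):

```json
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestampInEpoch":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestampInEpoch":1571900400000}
```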
Push the sample JSON into the transcript-topic Kafka topic using the Kafka console producer. This will add the 12 records described in the transcript.json file to the topic.
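A sketch of the producer command, assuming the Kafka distribution directory and the broker port used above:

```bash
# Stream the contents of transcript.json into the transcript-topic topic
bin/kafka-console-producer.sh --broker-list localhost:9876 \
  --topic transcript-topic < transcript.json
```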
As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.
Here is an example config which uses SSL-based authentication to talk with Kafka and the schema registry. Notice there are two sets of SSL options: the ones starting with ssl. are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
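A minimal sketch under these assumptions: the registry URL, keystore/truststore paths, and passwords are placeholders, and the schema-registry SSL property names are assumed to follow the stream.kafka.decoder.prop.schema.registry. prefix described above.

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "mytopic",
  "stream.kafka.broker.list": "kafka:9093",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "https://schema-registry.example.com",
  "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "/path/to/registry.truststore.jks",
  "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "changeit",
  "security.protocol": "SSL",
  "ssl.truststore.location": "/path/to/kafka.truststore.jks",
  "ssl.truststore.password": "changeit",
  "ssl.keystore.location": "/path/to/kafka.keystore.jks",
  "ssl.keystore.password": "changeit"
}
```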
With the Kafka consumer 2.0, you can ingest transactionally committed messages only by configuring kafka.isolation.level to read_committed. For example,
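A sketch of the relevant streamConfigs; following the stream.[streamType] prefix convention described earlier, the key is written here as stream.kafka.isolation.level, which is an assumption worth verifying against your connector version.

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "transcript-topic",
  "stream.kafka.broker.list": "localhost:9876",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "stream.kafka.isolation.level": "read_committed"
}
```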
Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.
Here is an example config which uses SASL_SSL-based authentication to talk with Kafka and the schema registry. Notice there are again two sets of options: some are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
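Relative to the SSL example above, only the consumer-side security properties change. A sketch of the fragment to merge into streamConfigs, with placeholder credentials (the schema-registry properties stay the same):

```json
"security.protocol": "SASL_SSL",
"sasl.mechanism": "PLAIN",
"sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"alice\" password=\"secret\";",
"ssl.truststore.location": "/path/to/kafka.truststore.jks",
"ssl.truststore.password": "changeit"
```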
Pinot supports consuming data from Apache Pulsar via the pinot-pulsar plugin. You need to enable this plugin so that Pulsar-specific libraries are present in the classpath.
You can enable the Pulsar plugin with the following config at the time of Pinot setup: -Dplugins.include=pinot-pulsar
The pinot-pulsar plugin is not part of the official 0.10.0 binary. You can download the plugin from our external repository and add it to the libs or plugins directory in Pinot.
A sample Pulsar stream config to ingest data should look as follows. You can use the streamConfigs section from this sample and make changes for your corresponding table.
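A sketch of such a streamConfigs section, assuming a local Pulsar broker and the JSON decoder; the topic name, offset reset, and flush thresholds are placeholders:

```json
"streamConfigs": {
  "streamType": "pulsar",
  "stream.pulsar.topic.name": "transcript-topic",
  "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
  "stream.pulsar.consumer.type": "lowlevel",
  "stream.pulsar.consumer.prop.auto.offset.reset": "smallest",
  "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
  "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "6h"
}
```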
You can change the following Pulsar-specific configurations for your tables:
streamType - This should be set to "pulsar"
stream.pulsar.topic.name - Your Pulsar topic name
stream.pulsar.bootstrap.servers - Comma-separated broker list for Apache Pulsar
The Pinot-Pulsar connector supports authentication using security tokens. You can generate a token by following the official Pulsar documentation. Once generated, you can add the following property to streamConfigs to add the auth token to each request:
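Assuming the property name used by the pinot-pulsar plugin is stream.pulsar.authenticationToken (worth verifying for your Pinot version), the addition would look like:

```json
"stream.pulsar.authenticationToken": "your-auth-token"
```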
The Pinot-Pulsar connector also supports TLS for encrypted connections. You can follow the official Pulsar documentation to enable TLS on your Pulsar cluster. Once done, you can enable TLS in the Pulsar connector by providing the location of the trust certificate file generated in the previous step. Also, make sure to change the broker URL from pulsar://localhost:6650 to pulsar+ssl://localhost:6650 so that secure connections are used.
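Assuming the trust-certificate property is stream.pulsar.tlsTrustCertsFilePath (again, verify against your Pinot version), the streamConfigs additions would look like:

```json
"stream.pulsar.bootstrap.servers": "pulsar+ssl://localhost:6650",
"stream.pulsar.tlsTrustCertsFilePath": "/path/to/ca.cert.pem"
```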
For other table and stream configurations, head over to the Table Configuration Reference.
Pinot currently relies on Pulsar client version 2.7.2. Users should make sure the Pulsar broker is compatible with this client version.