Apache Pinot Docs
Apache Kafka
This guide shows you how to ingest a stream of records from an Apache Kafka topic into a Pinot table.

Introduction

In this guide, you'll learn how to import data into Pinot using Apache Kafka for real-time stream ingestion. Pinot has out-of-the-box real-time ingestion support for Kafka.
Let's set up a demo Kafka cluster locally and create a sample topic, transcript-topic.
Docker
Using launcher scripts
Start Kafka
```shell
docker run \
  --network pinot-demo --name=kafka \
  -e KAFKA_ZOOKEEPER_CONNECT=pinot-quickstart:2123/kafka \
  -e KAFKA_BROKER_ID=0 \
  -e KAFKA_ADVERTISED_HOST_NAME=kafka \
  -d wurstmeister/kafka:latest
```
Create a Kafka Topic
```shell
docker exec \
  -t kafka \
  /opt/kafka/bin/kafka-topics.sh \
  --zookeeper pinot-quickstart:2123/kafka \
  --partitions=1 --replication-factor=1 \
  --create --topic transcript-topic
```
Start Kafka
Start Kafka cluster on port 9876 using the same Zookeeper from the quick-start examples.
```shell
bin/pinot-admin.sh StartKafka -zkAddress=localhost:2123/kafka -port 9876
```
Create a Kafka topic
Download the latest Kafka release, then create a topic:
```shell
bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 1 --topic transcript-topic
```

Creating Schema Configuration

We will publish the data in the same format as described in the Stream ingestion docs, so you can reuse the schema mentioned under Create Schema Configuration.

Creating a table configuration

Next, define the real-time table configuration for the transcript table described by the schema from the previous step.
For Kafka, set streamType to kafka. Currently, only the JSON format is supported out of the box, but you can write your own decoder by implementing the StreamMessageDecoder interface. You can then make your decoder class available by placing its JAR file in the plugins directory.
The lowLevel consumer reads data per partition, whereas the highLevel consumer uses Kafka's high-level consumer to read data from the whole stream and has no control over which partition is read at a given moment.
For Kafka versions below 2.X, use org.apache.pinot.plugin.stream.kafka09.KafkaConsumerFactory
For Kafka version 2.X and above, use org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory
You can set the consumer offset to:
    smallest to start the consumer from the earliest offset
    largest to start the consumer from the latest offset
    a timestamp in milliseconds to start the consumer from the first offset after that timestamp
The resulting configuration should look as follows:
/tmp/pinot-quick-start/transcript-table-realtime.json
```json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}
```

Upgrade from Kafka 0.9 connector to Kafka 2.x connector

    Update the table config for both high-level and low-level consumers: change stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.
    If using the stream (high) level consumer, also add the config stream.kafka.hlc.bootstrap.server to tableIndexConfig.streamConfigs. This config should be the URI of the Kafka broker list, e.g. localhost:9092.
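Applied to tableIndexConfig.streamConfigs, the upgrade amounts to the following fragment (shown out of context; the second property is only required for the high-level consumer):

```json
{
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory",
  "stream.kafka.hlc.bootstrap.server": "localhost:9092"
}
```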

How to consume from a higher Kafka version?

This connector also works with Kafka lib versions higher than 2.0.0. For example, changing kafka.lib.version from 2.0.0 to 2.1.1 in the Kafka 2.0 connector's pom.xml makes the connector work with Kafka 2.1.1.
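As a sketch, the version bump described above is a one-line change in the connector's pom.xml (the property name is as stated above; the surrounding properties block is assumed):

```xml
<properties>
  <!-- bump from 2.0.0 to the Kafka client version you need -->
  <kafka.lib.version>2.1.1</kafka.lib.version>
</properties>
```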

Upload schema and table

Now that we have our table and schema configurations, let's upload them to the Pinot cluster. As soon as the real-time table is created, it will begin ingesting available records from the Kafka topic.
Docker
Launcher Script
```shell
docker run \
  --network=pinot-demo \
  -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
  --name pinot-streaming-table-creation \
  apachepinot/pinot:latest AddTable \
  -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
  -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
  -controllerHost pinot-quickstart \
  -controllerPort 9000 \
  -exec
```
```shell
bin/pinot-admin.sh AddTable \
  -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
  -tableConfigFile /tmp/pinot-quick-start/transcript-table-realtime.json \
  -exec
```

Add sample data to the Kafka topic

We will publish data in the following format to Kafka. Let us save the data in a file named transcript.json.
transcript.json
```json
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
{"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}
```
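Each line of transcript.json must be a self-contained JSON object whose fields match the transcript schema. A quick sanity check before producing can catch malformed records early; this is a hypothetical helper (not part of Pinot), with the expected field set taken from the records above:

```python
import json

# Fields every record should carry, per the transcript schema above.
EXPECTED_FIELDS = {"studentID", "firstName", "lastName", "gender",
                   "subject", "score", "timestamp"}

def validate_records(lines):
    """Parse newline-delimited JSON and verify each record's fields."""
    records = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # raises a ValueError on malformed JSON
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {i} is missing fields: {missing}")
        records.append(record)
    return records

# Two of the sample records from transcript.json:
sample = [
    '{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}',
    '{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}',
]
print(len(validate_records(sample)))  # 2
```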
Push the sample JSON into the transcript-topic Kafka topic using the Kafka console producer. This adds the 12 records from transcript.json to the topic.
```shell
bin/kafka-console-producer.sh \
  --broker-list localhost:9876 \
  --topic transcript-topic < transcript.json
```

Ingesting streaming data

As soon as data flows into the stream, the Pinot table will consume it and it will be ready for querying. Head over to the Query Console to check out the real-time data.
```sql
SELECT * FROM transcript
```
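Beyond the Query Console, you can also run the same query programmatically over HTTP. The sketch below is an assumption-laden example, not part of this guide's setup: it posts to the Pinot broker's /query/sql endpoint and assumes a broker listening on localhost:8099 (a common default; your quick-start port may differ). Only the payload helper runs without a live cluster.

```python
import json
import urllib.request

def build_query_payload(sql):
    """Build the JSON body expected by Pinot's /query/sql endpoint."""
    return json.dumps({"sql": sql}).encode("utf-8")

def query_pinot(sql, broker="http://localhost:8099"):
    """POST a SQL query to the Pinot broker and return the parsed response."""
    req = urllib.request.Request(
        broker + "/query/sql",
        data=build_query_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running broker):
# rows = query_pinot("SELECT * FROM transcript LIMIT 10")["resultTable"]["rows"]
```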

Some more Kafka ingestion configs

Use Kafka Partition(Low) Level Consumer with SSL

Here is an example config that uses SSL-based authentication to talk to both Kafka and the schema registry. Notice there are two sets of SSL options: the ones starting with ssl. are for the Kafka consumer, and the ones starting with stream.kafka.decoder.prop.schema.registry. are for the SchemaRegistryClient used by KafkaConfluentSchemaRegistryAvroMessageDecoder.
```json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "LowLevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "localhost:2191/kafka",
      "stream.kafka.broker.list": "localhost:9876",
      "schema.registry.url": "",
      "security.protocol": "",
      "ssl.truststore.location": "",
      "ssl.keystore.location": "",
      "ssl.truststore.password": "",
      "ssl.keystore.password": "",
      "ssl.key.password": "",
      "stream.kafka.decoder.prop.schema.registry.rest.url": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.truststore.location": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.keystore.location": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.truststore.password": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.keystore.password": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.keystore.type": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.truststore.type": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.key.password": "",
      "stream.kafka.decoder.prop.schema.registry.ssl.protocol": ""
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}
```

Ingest transactionally committed messages only from Kafka

With Kafka consumer 2.0, you can ingest only transactionally committed messages by setting stream.kafka.isolation.level to read_committed. For example:
```json
{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "LowLevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "localhost:2191/kafka",
      "stream.kafka.broker.list": "localhost:9876",
      "stream.kafka.isolation.level": "read_committed"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}
```
Note that the default value of this config is read_uncommitted, which reads all messages. Also, this config is supported for the low-level consumer only.