
Ingest Data

How to turn on the water valve

There are two ways to get data ingested into Pinot:

Batch
Streaming

Batch

Segment Fetchers

When Pinot segment files are created in external systems (Hadoop/Spark/etc.), there are several ways to push that data to the Pinot Controller and Server:

  1. Push the segment to a shared NFS and let Pinot pull the segment files from that NFS location.

  2. Push the segment to a web server and let Pinot pull the segment files from the web server via an HTTP/HTTPS link.

  3. Push the segment to HDFS and let Pinot pull the segment files from HDFS using an HDFS location URI.

  4. Push the segment to another system and implement your own segment fetcher to pull data from it.

The first two options are supported out of the box with the Pinot package. As long as your remote jobs send the Pinot Controller the corresponding URI to the files, it will pick up the files and allocate them to the proper Pinot Servers and Brokers. To enable Pinot support for HDFS, you will need to provide Pinot with the Hadoop configuration and proper Hadoop dependencies.


Implement your own segment fetcher for other systems

You can also implement your own segment fetcher for other file systems and load it into the Pinot system with an external jar. All you need to do is implement a class that extends the SegmentFetcher interface and provide configs to the Pinot Controller and Server as follows:

pinot.controller.segment.fetcher.<protocol>.class=<class path to your implementation>

or

pinot.server.segment.fetcher.<protocol>.class=<class path to your implementation>

You can also provide other configs to your fetcher under the config root pinot.server.segment.fetcher.<protocol>.
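To illustrate how such keys are interpreted, here is a small sketch that maps each pinot.server.segment.fetcher.<protocol>.class property to its protocol. The HdfsSegmentFetcher class name below is hypothetical, for illustration only:

```python
# Sketch: extract protocol -> fetcher class mappings from properties of the
# documented form pinot.server.segment.fetcher.<protocol>.class.
# The class name below is hypothetical, not a real Pinot class.
props = {
    "pinot.server.segment.fetcher.hdfs.class": "com.example.HdfsSegmentFetcher",
    "pinot.server.segment.fetcher.hdfs.some.option": "value",  # extra fetcher config
}

def fetcher_classes(props: dict) -> dict:
    """Map protocol -> fetcher class from properties of the documented form."""
    prefix = "pinot.server.segment.fetcher."
    suffix = ".class"
    out = {}
    for key, value in props.items():
        if key.startswith(prefix) and key.endswith(suffix):
            protocol = key[len(prefix):-len(suffix)]
            out[protocol] = value
    return out

print(fetcher_classes(props))  # {'hdfs': 'com.example.HdfsSegmentFetcher'}
```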


Creating Pinot Segments

Realtime segment generation

To consume in realtime, we simply need to create a table with the same name as the schema and point to the Kafka topic to consume from, using a table definition such as this one:

{
  "tableName": "flights",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "7",
    "segmentPushFrequency": "daily",
    "segmentPushType": "APPEND",
    "replication": "1",
    "timeColumnName": "daysSinceEpoch",
    "timeType": "DAYS",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [
      "flightNumber",
      "tags",
      "daysSinceEpoch"
    ],
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "highLevel",
      "stream.kafka.topic.name": "flights-realtime",
      "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.zk.broker.url": "localhost:2181",
      "stream.kafka.hlc.zk.connect.string": "localhost:2181"
    }
  },
  "tenants": {
    "broker": "brokerTenant",
    "server": "serverTenant"
  },
  "metadata": {
  }
}
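Before submitting such a table definition, it can be useful to sanity-check it with a short script. A sketch, using a trimmed copy of the definition above (the check itself is illustrative, not a Pinot API):

```python
import json

# Trimmed copy of the realtime table definition shown above (illustrative).
table_definition = """
{
  "tableName": "flights",
  "tableType": "REALTIME",
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "highLevel",
      "stream.kafka.topic.name": "flights-realtime"
    }
  }
}
"""

def check_realtime_table(config_json: str) -> dict:
    """Parse a table definition and verify the fields a realtime table needs."""
    config = json.loads(config_json)
    assert config["tableType"] == "REALTIME", "realtime ingestion needs tableType REALTIME"
    stream = config["tableIndexConfig"]["streamConfigs"]
    # Every stream property shares the "stream.<type>." prefix.
    prefix = "stream." + stream["streamType"] + "."
    for key in stream:
        if key != "streamType":
            assert key.startswith(prefix), "unexpected stream property: " + key
    return stream

stream = check_realtime_table(table_definition)
print(stream["stream.kafka.topic.name"])  # flights-realtime
```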

First, we’ll start a local instance of Kafka and start streaming data into it:

bin/pinot-admin.sh StartKafka &
bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2014.avro -kafkaTopic flights-realtime &

This will stream one event per second from the Avro file to the Kafka topic. Then, we’ll create a realtime table, which will start consuming from the Kafka topic:

bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json

We can then query the table with the following query to see the events stream in:

SELECT COUNT(*) FROM flights

Repeating the query multiple times should show the events slowly being streamed into the table.

Streaming

Pluggable Streams

Note

This section is a pre-read if you are planning to develop plug-ins for streams other than Kafka. Pinot supports Kafka out of the box.

Prior to commit ba9f2d, Pinot was only able to consume from Kafka streams.

Pinot now enables its users to write plug-ins to consume from pub-sub streams other than Kafka (see Issue #2583).

Some of the streams for which plug-ins can be added are:

  • Pravega

  • Pulsar

  • Amazon Kinesis

  • Azure Event Hubs

  • LogDevice

You may encounter some limitations either in Pinot or in the stream system while developing plug-ins. Please feel free to get in touch with us when you start writing a stream plug-in, and we can help you out. We are open to receiving PRs in order to improve these abstractions if they do not work for a certain stream implementation.

Refer to Consuming and Indexing rows in Realtime for details on how Pinot consumes streaming data.

Creating Pinot Segments

Creating Pinot segments

Pinot segments can be created offline on Hadoop, or via command line from data files. The Controller REST endpoint can then be used to add the segment to the table to which it belongs. Pinot segments can also be created by ingesting data from realtime resources (such as Kafka).

    Creating segments using Hadoop

    Offline Pinot workflow

    To create Pinot segments on Hadoop, a workflow can be created to complete the following steps:

    1. Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory

    2. Create segments

    3. Upload segments to the Pinot cluster

    Step one can be done using your favorite tool (such as Pig, Hive or Spark). Pinot provides two MapReduce jobs to do steps two and three.

    Configuring the job

    Create a job properties configuration file, such as the one below:

    # === Index segment creation job config ===

    # path.to.input: Input directory containing Avro files
    path.to.input=/user/pinot/input/data

    # path.to.output: Output directory containing Pinot segments
    path.to.output=/user/pinot/output

    # path.to.schema: Schema file for the table, stored locally
    path.to.schema=flights-schema.json

    # segment.table.name: Name of the table for which to generate segments
    segment.table.name=flights

    # === Segment tar push job config ===

    # push.to.hosts: Comma separated list of controllers host names to which to push
    push.to.hosts=controller_host_0,controller_host_1

    # push.to.port: The port on which the controller runs
    push.to.port=8888

    Executing the job

    The Pinot Hadoop module contains a job that you can incorporate into your workflow to generate Pinot segments.

    mvn clean install -DskipTests -Pbuild-shaded-jar
    hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties

    You can then use the SegmentTarPush job to push segments via the controller REST API.

    hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties

    Creating Pinot segments outside of Hadoop

    Here is how you can create Pinot segments from standard formats like CSV/JSON/AVRO.

    1. Follow the steps described in the section on Compiling the code to build Pinot. Locate pinot-admin.sh in pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh.

    2. Create a top level directory containing all the CSV/JSON/AVRO files that need to be converted into segments.

    3. The file name extensions are expected to be the same as the format name (i.e .csv, .json or .avro), and are case insensitive. Note that the converter expects the .csv extension even if the data is delimited using tabs or spaces instead.

    4. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.

    5. Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc. A detailed description of this follows below.

    Run the pinot-admin command to generate the segments. The command can be invoked as follows. Options within “[ ]” are optional. For -format, the default value is AVRO.

    bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]

    To configure various parameters for CSV, a config file in JSON format can be provided. This file is optional, as are each of its parameters. When not provided, the default values used for these parameters are described below:

    1. fileFormat: Specify one of the following. Default is EXCEL.

      1. EXCEL

      2. MYSQL

      3. RFC4180

      4. TDF

    2. header: If the input CSV file does not contain a header, it can be specified using this field. Note that if this is specified, the input file is expected to not contain the header row, or else it will result in a parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.

    3. delimiter: Use this to specify a delimiter character. The default value is “,”.

    4. multiValueDelimiter: Use this to specify a delimiter character for each value in multi-valued columns. The default value is “;”.
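As a sketch of how the two delimiters interact (here assuming a tab delimiter and the ";" multi-value delimiter; the row contents are made up for illustration), a multi-valued column arrives as a single CSV field that is then split on the multi-value delimiter:

```python
import csv
import io

# Hypothetical row: id, multi-valued tags column, price.
raw = "1024\tred;blue\t9.99\n"

# The "delimiter" option separates fields.
reader = csv.reader(io.StringIO(raw), delimiter="\t")
row = next(reader)

# The "multiValueDelimiter" option separates values within one field.
tags = row[1].split(";")
print(tags)  # ['red', 'blue']
```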

    Below is a sample config file:

    {
      "fileFormat": "EXCEL",
      "header": "col1,col2,col3,col4",
      "delimiter": "\t",
      "multiValueDelimiter": ","
    }

    Sample Schema:

    {
      "schemaName": "flights",
      "dimensionFieldSpecs": [
        {
          "name": "flightNumber",
          "dataType": "LONG"
        },
        {
          "name": "tags",
          "dataType": "STRING",
          "singleValueField": false
        }
      ],
      "metricFieldSpecs": [
        {
          "name": "price",
          "dataType": "DOUBLE"
        }
      ],
      "timeFieldSpec": {
        "incomingGranularitySpec": {
          "name": "daysSinceEpoch",
          "dataType": "INT",
          "timeType": "DAYS"
        }
      }
    }

    Pushing offline segments to Pinot

    You can use curl to push a segment to pinot:

    curl -X POST -F segment=@<segment-tar-file-path> http://controllerHost:controllerPort/segments

    Alternatively you can use the pinot-admin.sh utility to upload one or more segments:

    pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh UploadSegment -controllerHost <hostname> -controllerPort <port> -segmentDir <segmentDirectoryPath>

    The command uploads all the segments found in segmentDirectoryPath. The segments could be either tar-compressed (in which case it is a file under segmentDirectoryPath) or uncompressed (in which case it is a directory under segmentDirectoryPath).
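The rule above can be sketched as a toy directory scan (not Pinot code): a candidate segment is either a tar-compressed file or an uncompressed directory under segmentDirectoryPath:

```python
import pathlib
import tempfile

def find_segments(segment_dir: str) -> list:
    """Candidate segments: tar-compressed files or uncompressed directories."""
    found = []
    for entry in pathlib.Path(segment_dir).iterdir():
        if entry.is_dir() or entry.name.endswith(".tar.gz"):
            found.append(entry.name)
    return sorted(found)

# Toy layout: one compressed segment, one uncompressed, one unrelated file.
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "flights_0.tar.gz").touch()
    (pathlib.Path(d) / "flights_1").mkdir()
    (pathlib.Path(d) / "notes.txt").touch()
    print(find_segments(d))  # ['flights_0.tar.gz', 'flights_1']
```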

    HDFS

    HDFS segment fetcher configs

    In your Pinot controller/server configuration, you will need to provide the following configs:

    pinot.controller.segment.fetcher.hdfs.hadoop.conf.path=<file path to hadoop conf folder>

    or

    pinot.server.segment.fetcher.hdfs.hadoop.conf.path=<file path to hadoop conf folder>

    This path should point to the local folder containing core-site.xml and hdfs-site.xml files from your Hadoop installation.

    pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

    or

    pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
    pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

    These two configs should be the corresponding Kerberos configuration if your Hadoop installation is secured with Kerberos. Please check the Hadoop Kerberos guide on how to generate Kerberos security identification.

    You will also need to provide proper Hadoop dependency jars from your Hadoop installation to your Pinot startup scripts.

    Push HDFS segment to Pinot Controller

    To push HDFS segment files to the Pinot Controller, you just need to ensure you have the proper Hadoop configuration mentioned in the previous section. Then your remote segment creation/push job can send the HDFS path of your newly created segment files to the Pinot Controller and let it download the files.

    For example, the following curl request to the Controller will notify it to download segment files to the proper table:

    curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file.


    Write your stream

    This page describes how to write your own streams to plug into Pinot. Two modes are available: high level and low level.

    Requirements to support Stream Level (High Level) consumers

    The stream should provide the following guarantees:

    • Exactly once delivery (unless restarting from a checkpoint) for each consumer of the stream.

    • (Optionally) support a mechanism to split events (in some arbitrary fashion) so that each event in the stream is delivered to exactly one host out of a set of hosts.

    • Provide ways to save a checkpoint for the data consumed so far. If the stream is partitioned, then this checkpoint is a vector of checkpoints for events consumed from individual partitions.

    • The checkpoints should be recorded only when Pinot makes a call to do so.

    • The consumer should be able to start consumption from one of:

      • latest available data

      • earliest available data

      • last saved checkpoint
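These start positions can be sketched with a toy in-memory stream (names here are illustrative, not Pinot APIs):

```python
# Toy in-memory stream illustrating the possible start positions above.
events = ["e0", "e1", "e2", "e3"]

def start_index(mode, checkpoint=None):
    if mode == "earliest":
        return 0                  # replay everything still retained
    if mode == "latest":
        return len(events)        # only events published from now on
    if mode == "checkpoint":
        return checkpoint         # resume from the last saved checkpoint
    raise ValueError(mode)

print(events[start_index("checkpoint", 2):])  # ['e2', 'e3']
```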

    Requirements to support Partition Level (Low Level) consumers

    While consuming rows at a partition level, the stream should support the following properties:

    • Stream should provide a mechanism to get the current number of partitions.

    • Each event in a partition should have a unique offset that is not more than 64 bits long.

    • Refer to a partition as a number not exceeding 32 bits long.

    • Stream should provide the following mechanisms to get an offset for a given partition of the stream:

      • get the offset of the oldest event available (assuming events are aged out periodically) in the partition.

      • get the offset of the most recent event published in the partition.

      • (optionally) get the offset of an event that was published at a specified time.

    • Stream should provide a mechanism to consume a set of events from a partition starting from a specified offset.

    • Pinot assumes that the offsets of incoming events are monotonically increasing; i.e., if Pinot consumes an event at offset o1, then the offset o2 of the following event should be such that o2 > o1.

    In addition, we have an operational requirement that the number of partitions should not be reduced over time.
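The offset and partition constraints above can be captured in a small check (a sketch, not Pinot code):

```python
# Sketch of the offset contract for a partition level consumer: partition ids
# fit in 32 bits, offsets fit in 64 bits and must increase monotonically.
def validate_batch(partition, offsets, last_offset):
    assert 0 <= partition < 2**32, "partition id must fit in 32 bits"
    for o in offsets:
        assert 0 <= o < 2**64, "offset must fit in 64 bits"
        assert o > last_offset, "offsets must be monotonically increasing"
        last_offset = o
    return last_offset

last = validate_batch(partition=3, offsets=[10, 11, 15], last_offset=9)
print(last)  # 15
```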

    Stream plug-in implementation

    In order to add a new type of stream (say, Foo), implement the following classes:

      1. FooConsumerFactory extends StreamConsumerFactory

      2. FooPartitionLevelConsumer implements PartitionLevelConsumer

      3. FooStreamLevelConsumer implements StreamLevelConsumer

      4. FooMetadataProvider implements StreamMetadataProvider

      5. FooMessageDecoder implements StreamMessageDecoder

    Depending on stream level or partition level, your implementation needs to include StreamLevelConsumer or PartitionLevelConsumer.

    The properties for the stream implementation are to be set in the table configuration, inside the streamConfigs section.

    Use the streamType property to define the stream type. For example, for the implementation of stream foo, set the property "streamType" : "foo".

    The rest of the configuration properties for your stream should be set with the prefix "stream.foo". Be sure to use the same suffix for (see examples below):

    • topic

    • consumer type

    • stream consumer factory

    • offset

    • decoder class name

    • decoder properties

    • connection timeout

    • fetch timeout

    All values should be strings. For example:

    "streamType" : "foo",
    "stream.foo.topic.name" : "SomeTopic",
    "stream.foo.consumer.type": "LowLevel",
    "stream.foo.consumer.factory.class.name": "fully.qualified.pkg.ConsumerFactoryClassName",
    "stream.foo.consumer.prop.auto.offset.reset": "largest",
    "stream.foo.decoder.class.name" : "fully.qualified.pkg.DecoderClassName",
    "stream.foo.decoder.prop.a.decoder.property" : "decoderPropValue",
    "stream.foo.connection.timeout.millis" : "10000", // default 30_000
    "stream.foo.fetch.timeout.millis" : "10000" // default 5_000

    You can have additional properties that are specific to your stream. For example:

    "stream.foo.some.buffer.size" : "24g"

    In addition to these properties, you can define thresholds for the consuming segments:

    • rows threshold

    • time threshold

    The properties for the thresholds are as follows:

    "realtime.segment.flush.threshold.size" : "100000"
    "realtime.segment.flush.threshold.time" : "6h"
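A sketch of how the rows and time thresholds interact (assuming, as is common, that a consuming segment is completed when either limit is reached first):

```python
# Assumed semantics: flush when the row count OR the elapsed time limit is hit.
ROWS_THRESHOLD = 100000          # realtime.segment.flush.threshold.size
TIME_THRESHOLD_HOURS = 6         # realtime.segment.flush.threshold.time

def should_flush(num_rows, hours_consuming):
    return num_rows >= ROWS_THRESHOLD or hours_consuming >= TIME_THRESHOLD_HOURS

print(should_flush(100000, 1.0))   # True: row threshold reached
print(should_flush(50000, 6.5))    # True: time threshold reached
print(should_flush(50000, 1.0))    # False: neither threshold reached
```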

    An example of this implementation can be found in the KafkaConsumerFactory, which is an implementation for the Kafka stream.

    Kafka

    This page describes how to connect Kafka to Pinot

    Kafka 2.x Plugin

    Pinot provides stream plugin support for Kafka 2.x versions. Although the version used in this implementation is Kafka 2.0.0, it is possible to compile it with a higher Kafka lib version, e.g. 2.1.1.

    How to build and release Pinot package with Kafka 2.x connector

    mvn clean package -DskipTests -Pbin-dist -Dkafka.version=2.0

    How to use Kafka 2.x connector

    • Use Kafka Stream(High) Level Consumer

    Below is a sample streamConfigs used to create a realtime table with Kafka Stream(High) level consumer:

    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "highLevel",
      "stream.kafka.topic.name": "meetupRSVPEvents",
      "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.hlc.zk.connect.string": "localhost:2191/kafka",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory",
      "stream.kafka.zk.broker.url": "localhost:2191/kafka",
      "stream.kafka.hlc.bootstrap.server": "localhost:19092"
    }

    Kafka 2.x HLC consumer uses org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory in config stream.kafka.consumer.factory.class.name.

    • Use Kafka Partition(Low) Level Consumer

    Below is a sample table config used to create a realtime table with Kafka Partition(Low) level consumer:

    {
      "tableName": "meetupRsvp",
      "tableType": "REALTIME",
      "segmentsConfig": {
        "timeColumnName": "mtime",
        "timeType": "MILLISECONDS",
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "schemaName": "meetupRsvp",
        "replication": "1",
        "replicasPerPartition": "1"
      },
      "tenants": {},
      "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
          "streamType": "kafka",
          "stream.kafka.consumer.type": "LowLevel",
          "stream.kafka.topic.name": "meetupRSVPEvents",
          "stream.kafka.decoder.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory",
          "stream.kafka.zk.broker.url": "localhost:2191/kafka",
          "stream.kafka.broker.list": "localhost:19092"
        }
      },
      "metadata": {
        "customConfigs": {}
      }
    }

    Please note:

    1. Config replicasPerPartition under segmentsConfig is required to specify table replication.

    2. Config stream.kafka.consumer.type should be specified as LowLevel to use the partition level consumer. (The use of simple instead of LowLevel is deprecated.)

    3. Configs stream.kafka.zk.broker.url and stream.kafka.broker.list are required under tableIndexConfig.streamConfigs to provide Kafka related information.

    Upgrade from Kafka 0.9 connector to Kafka 2.x connector

    • Update table config for both high level and low level consumer: Update config: stream.kafka.consumer.factory.class.name from org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory to org.apache.pinot.core.realtime.impl.kafka2.KafkaConsumerFactory.

    • If using Stream(High) level consumer: Please also add config stream.kafka.hlc.bootstrap.server into tableIndexConfig.streamConfigs. This config should be the URI of the Kafka broker list, e.g. localhost:9092.

    How to use this plugin with higher Kafka version?

    This connector is also suitable for Kafka lib versions higher than 2.0.0. In pinot-connector-kafka-2.0/pom.xml, changing kafka.lib.version from 2.0.0 to 2.1.1 will make this connector work with Kafka 2.1.1.
