Pinot allows users to easily create and use new plugins to support custom needs. The developer needs to implement a few standard interfaces, specific to the type of plugin, in Java.
Once the code is ready, you can put the complete JAR file in Pinot's /plugins directory. All the classes in the plugins directory are loaded at Pinot's startup.
We also encourage users to shade commonly used dependencies such as guava and jackson.
You can also implement your own segment fetchers for other file systems and load them into the Pinot system with an external JAR.
All you need to do is implement a class that extends the SegmentFetcher interface and provide the config to the Pinot Controller and Server, as sketched below.
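For illustration, here is a minimal sketch of that config, assuming a hypothetical protocol foo and a hypothetical fetcher class com.example.FooSegmentFetcher; the exact key name for the class property may vary slightly between Pinot versions.

```
# Pinot Controller config (hypothetical protocol "foo" and class name)
pinot.controller.segment.fetcher.foo.class=com.example.FooSegmentFetcher

# Pinot Server config
pinot.server.segment.fetcher.foo.class=com.example.FooSegmentFetcher
```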
You can also provide other configs to your fetcher under the config root pinot.server.segment.fetcher.<protocol>.
Pinot enables its users to write a PinotFS abstraction layer to store data in a data layer of their choice for real-time and offline segments.
Some examples of storage backends (other than local storage) currently supported are:
If the above two filesystems do not meet your needs, you can extend the current PinotFS to customize it for your needs.
In order to add a new type of storage backend such as Amazon S3, you need to implement the PinotFS class. The interface expects common methods related to accessing the file system, such as mkdir, list, etc.
Once the class is ready, you can compile it and put it in the /plugins directory of Pinot.
You can set the configuration in real-time or batch tables by using the base scheme of the URI (scheme://host:port/path) as the suffix. For example, for the path hdfs://user/yarn/path/to/dir, the base scheme is hdfs, and all the config keys should have hdfs in them, such as pinot.controller.storage.factory.hdfs.
The example here uses the existing org.apache.pinot.filesystem.HadoopPinotFS to store real-time segments in an HDFS filesystem. In the Pinot controller config, add the following new configs:
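A hedged sketch of what these controller configs could look like. The directory values are placeholders; the controller.data.dir key and the hadoop.conf.path key are assumptions based on the HadoopPinotFS plugin and may differ across Pinot versions:

```
controller.data.dir=hdfs://path/to/segment/store
pinot.controller.local.temp.dir=/tmp/pinot/controller
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf
```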
These properties for the filesystem implementation are to be set in your controller and server configurations.
In your controller and server configs, set the FS class you would like to support: pinot.controller.storage.factory.class.${YOUR_URI_SCHEME} should be set to the full path of the FS class you would like to include. You also need to configure pinot.controller.local.temp.dir for the local dir on the controller machine.
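For example, a sketch for a hypothetical FS class com.example.MyPinotFS registered under the URI scheme myfs (the class name, scheme, and directory are placeholders):

```
pinot.controller.storage.factory.class.myfs=com.example.MyPinotFS
pinot.server.storage.factory.class.myfs=com.example.MyPinotFS
pinot.controller.local.temp.dir=/tmp/pinot/controller
```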
For filesystem specific configs, you can pass in the following with either the pinot.controller prefix or the pinot.server prefix.
All the following configs need to be prefixed with storage.factory.
e.g. AzurePinotFS requires the following configs according to your environment: adl.accountId, adl.authEndpoint, adl.clientId, adl.clientSecret.
Sample Controller Config
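The following is a hedged sketch assembled from the adl configs listed above; the AzurePinotFS class path mirrors the HadoopPinotFS path mentioned earlier, and all values are placeholders that may differ across Pinot versions and environments:

```
pinot.controller.storage.factory.class.adl=org.apache.pinot.filesystem.AzurePinotFS
pinot.controller.storage.factory.adl.accountId=your-account-id
pinot.controller.storage.factory.adl.authEndpoint=your-auth-endpoint
pinot.controller.storage.factory.adl.clientId=your-client-id
pinot.controller.storage.factory.adl.clientSecret=your-client-secret
```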
Sample Server Config
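And the equivalent sketch for the server, using the pinot.server prefix (same assumptions as above):

```
pinot.server.storage.factory.class.adl=org.apache.pinot.filesystem.AzurePinotFS
pinot.server.storage.factory.adl.accountId=your-account-id
pinot.server.storage.factory.adl.authEndpoint=your-auth-endpoint
pinot.server.storage.factory.adl.clientId=your-client-id
pinot.server.storage.factory.adl.clientSecret=your-client-secret
```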
You can find the parameters in your account as follows: https://stackoverflow.com/questions/56349040/what-is-clientid-authtokenendpoint-clientkey-for-accessing-azure-data-lake
Also make sure to set the following config with the value adl
To see how to upload segments to different storage systems, check Batch Segment Fetcher Plugin.
Pinot supports multiple input formats out of the box for batch ingestion. For real-time ingestion, currently only JSON is supported. However, due to Pinot's pluggable architecture, you can easily use any format by implementing the standard interfaces.
All the batch input formats supported by Pinot utilise RecordReader to deserialize the data. You can also implement the RecordReader and RecordExtractor interfaces to add support for your own file formats.
To index a file into a Pinot segment, simply implement the interface and plug it into the index engine, SegmentCreationDriverImpl. We use a two-pass algorithm to index the file into a Pinot segment, hence the rewind() method is required for the record reader.
GenericRow is the record abstraction which the index engine can read and index with. It is a map from column name (String) to column value (Object). For a multi-valued column, the value should be an object array (Object[]).
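For illustration, a minimal sketch of populating a GenericRow; the column names are hypothetical, and note that the import path and method name (putValue vs. the older putField) have changed across Pinot versions:

```java
import org.apache.pinot.spi.data.readers.GenericRow;  // package path varies by Pinot version (e.g. org.apache.pinot.core.data in older releases)

public class GenericRowExample {
  public static void main(String[] args) {
    GenericRow row = new GenericRow();
    // Single-valued column: column name -> single Object value
    row.putValue("studentID", 205);
    // Multi-valued column: column name -> Object[]
    row.putValue("subjects", new Object[]{"math", "physics"});
    System.out.println(row);
  }
}
```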
There are several contracts for record readers that developers should follow when implementing their own record readers:
The output GenericRow should follow the table schema provided, in the sense that:
All the columns in the schema should be preserved (if a column does not exist in the original record, put a default value instead)
Columns not in the schema should not be included
Values for the column should follow the field spec from the schema (data type, single-valued/multi-valued)
For the time column (refer to TimeFieldSpec), the record reader should be able to read both incoming and outgoing time (we allow conversion from incoming time, the time value in the original data, to outgoing time, the time value stored in Pinot, during index creation).
If incoming and outgoing time column name are the same, use incoming time field spec
If incoming and outgoing time column name are different, put both of them as time field spec
We keep both incoming and outgoing time column to handle cases where the input file contains time values that are already converted
Pinot uses decoders to parse data available in real-time streams. Decoders are responsible for converting binary data in the streams to a GenericRow object.
You can write your own decoder by implementing the StreamMessageDecoder interface. You can also use the RecordExtractor from the batch input formats to extract fields to GenericRow from the parsed object.
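As a sketch, a hypothetical decoder that reads UTF-8 JSON payloads into a GenericRow; the init() signature shown matches recent Pinot releases (older releases pass a Schema instead of a set of field names), and the class and property handling are illustrative assumptions rather than the shipped JSON decoder:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.stream.StreamMessageDecoder;

// Hypothetical JSON decoder, for illustration only.
public class SimpleJsonMessageDecoder implements StreamMessageDecoder<byte[]> {
  private final ObjectMapper _objectMapper = new ObjectMapper();
  private Set<String> _fieldsToRead;

  @Override
  public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName) {
    _fieldsToRead = fieldsToRead;
  }

  @Override
  public GenericRow decode(byte[] payload, GenericRow destination) {
    return decode(payload, 0, payload.length, destination);
  }

  @Override
  public GenericRow decode(byte[] payload, int offset, int length, GenericRow destination) {
    try {
      String json = new String(payload, offset, length, StandardCharsets.UTF_8);
      JsonNode node = _objectMapper.readTree(json);
      for (String field : _fieldsToRead) {
        JsonNode value = node.get(field);
        destination.putValue(field, (value == null || value.isNull()) ? null : value.asText());
      }
      return destination;
    } catch (Exception e) {
      // Returning null signals that this message could not be decoded
      return null;
    }
  }
}
```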
This page describes how to write your own stream ingestion plugin for Pinot.
You can write custom stream ingestion plugins to add support for your own streaming platforms such as Pravega, Kinesis etc.
Stream Ingestion Plugins can be of two types -
High Level - Consume data without control over the partitions
Low Level - Consume data from each partition with offset management
The stream should provide the following guarantees:
Exactly once delivery (unless restarting from a checkpoint) for each consumer of the stream.
(Optionally) support a mechanism to split events (in some arbitrary fashion) so that each event in the stream is delivered to exactly one host out of a set of hosts.
Provide ways to save a checkpoint for the data consumed so far. If the stream is partitioned, then this checkpoint is a vector of checkpoints for events consumed from individual partitions.
The checkpoints should be recorded only when Pinot makes a call to do so.
The consumer should be able to start consumption from one of:
latest available data
earliest available data
last saved checkpoint
While consuming rows at a partition level, the stream should support the following properties:
Stream should provide a mechanism to get the current number of partitions.
Each event in a partition should have a unique offset that is not more than 64 bits long.
Refer to a partition as a number not exceeding 32 bits long.
Stream should provide the following mechanisms to get an offset for a given partition of the stream:
get the offset of the oldest event available (assuming events are aged out periodically) in the partition.
get the offset of the most recent event published in the partition
(optionally) get the offset of an event that was published at a specified time
Stream should provide a mechanism to consume a set of events from a partition starting from a specified offset.
Pinot assumes that the offsets of incoming events are monotonically increasing; i.e., if Pinot consumes an event at offset o1, then the offset o2 of the following event should be such that o2 > o1.
In addition, we have an operational requirement that the number of partitions should not be reduced over time.
In order to add a new type of stream (say, Foo), implement the following classes:
FooConsumerFactory extends StreamConsumerFactory
FooPartitionLevelConsumer implements PartitionLevelConsumer
FooStreamLevelConsumer implements StreamLevelConsumer
FooMetadataProvider implements StreamMetadataProvider
FooMessageDecoder implements StreamMessageDecoder
Depending on stream level or partition level, your implementation needs to include StreamLevelConsumer or PartitionLevelConsumer.
The properties for the stream implementation are to be set in the table configuration, inside the streamConfigs section.

Use the streamType property to define the stream type. For example, for the implementation of stream foo, set the property "streamType" : "foo".
The rest of the configuration properties for your stream should be set with the prefix "stream.foo". Be sure to use the same suffix for the following (see examples below):
topic
consumer type
stream consumer factory
offset
decoder class name
decoder properties
connection timeout
fetch timeout
All values should be strings. For example:
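For example, a sketch of a streamConfigs section for a hypothetical stream type foo; the class names and values are placeholders, and the exact key suffixes are assumptions modelled on the Kafka plugin's conventions:

```json
"streamConfigs": {
  "streamType": "foo",
  "stream.foo.topic.name": "someTopic",
  "stream.foo.consumer.type": "lowlevel",
  "stream.foo.consumer.factory.class.name": "com.example.FooConsumerFactory",
  "stream.foo.consumer.prop.auto.offset.reset": "largest",
  "stream.foo.decoder.class.name": "com.example.FooMessageDecoder",
  "stream.foo.decoder.prop.format": "json",
  "stream.foo.connection.timeout.millis": "10000",
  "stream.foo.fetch.timeout.millis": "10000"
}
```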
You can have additional properties that are specific to your stream. For example:
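For instance, a made-up property purely for illustration:

```json
"stream.foo.some.buffer.size": "24g"
```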
In addition to these properties, you can define thresholds for the consuming segments:
rows threshold
time threshold
The properties for the thresholds are as follows:
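A hedged example of the threshold properties inside streamConfigs; note that the rows threshold key is named realtime.segment.flush.threshold.size in some older releases, and the values here are placeholders:

```json
"realtime.segment.flush.threshold.rows": "100000",
"realtime.segment.flush.threshold.time": "6h"
```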
An example of this implementation can be found in the KafkaConsumerFactory, which is an implementation for the kafka stream.
Starting from the 0.3.X release, Pinot supports a plug-and-play architecture. This means that Pinot can be easily extended to support new tools, such as streaming services and storage systems.
Plugins are collected in folders, based on their purpose. The following types of plugins are supported.
Input Format is a set of plugins with the goal of reading data from files during data ingestion. It can be split into two additional types: record encoders (for batch jobs) and decoders (for stream ingestion). Currently supported record encoder formats are avro, orc and parquet, while csv, json and thrift decoders are supported for streaming.
File System is a set of plugins devoted to storage purposes. Currently supported file systems are: adls, gcs and hdfs.
Stream Ingestion is a set of plugins for ingesting data from streams. Currently supported streaming services are kafka 0.9 and kafka 2.0.
Batch Ingestion is a set of plugins for ingesting data from batches. Currently supported ingestion systems are spark, hadoop and standalone jobs.
Plugins can be developed with no restrictions. There are some standards that have to be followed, though: the plugin has to implement the corresponding standard interfaces.