Amazon S3

This guide shows you how to import data from files stored in Amazon S3.

Enable the Amazon S3 file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.
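For example, if you rely on -Dplugins.include, a startup flag along these lines pulls in several plugins at once (the exact plugin list is illustrative):

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3,pinot-json,pinot-avro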

You can configure the S3 file system using the following options:

  • region: The AWS data center region in which the bucket is located.

  • accessKey: (Optional) AWS access key required for authentication. This should be used only for testing purposes, as these keys are not stored as secrets.

  • secretKey: (Optional) AWS secret key required for authentication. This should be used only for testing purposes, as these keys are not stored as secrets.

  • endpoint: (Optional) Override endpoint for the S3 client.

  • disableAcl: If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.

  • serverSideEncryption: (Optional) The server-side encryption algorithm used when storing objects in Amazon S3 (currently supports aws:kms); set to null to disable SSE.

  • ssekmsKeyId: (Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.

  • ssekmsEncryptionContext: (Optional) Specifies the AWS KMS encryption context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.

Each of these properties should be prefixed by pinot.[node].storage.factory.s3. where node is either controller or server depending on the configuration. For example:

pinot.controller.storage.factory.s3.region=ap-southeast-1
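As an illustrative sketch, enabling SSE-KMS on the controller combines the prefix above with the encryption options (the key ID below is a placeholder, not a real key):

pinot.controller.storage.factory.s3.serverSideEncryption=aws:kms
pinot.controller.storage.factory.s3.ssekmsKeyId=<your-kms-key-id>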

The S3 file system supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for credentials in the following order:

  • Environment variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (recommended, since they are recognized by all AWS SDKs and the CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by the Java SDK)

  • Java system properties - aws.accessKeyId and aws.secretKey

  • Web Identity Token credentials from the environment or container

  • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI

  • Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable

  • Instance profile credentials delivered through the Amazon EC2 metadata service

You can also specify the accessKey and secretKey using the properties above. However, this method is not secure and should be used only for POC setups.
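For a quick POC setup, a minimal sketch of passing static credentials through the controller properties (the values shown are placeholders):

pinot.controller.storage.factory.s3.accessKey=<your-access-key>
pinot.controller.storage.factory.s3.secretKey=<your-secret-key>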

Examples

Job spec

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://pinot-bucket/pinot-ingestion/batch-input/'
outputDirURI: 's3://pinot-bucket/pinot-ingestion/batch-output/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: s3
      className: org.apache.pinot.plugin.filesystem.S3PinotFS
      configs:
        region: 'ap-southeast-1'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=s3://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=ap-southeast-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=ap-southeast-1
pinot.server.storage.factory.s3.httpclient.maxConnections=50
pinot.server.storage.factory.s3.httpclient.socketTimeout=30s
pinot.server.storage.factory.s3.httpclient.connectionTimeout=2s
pinot.server.storage.factory.s3.httpclient.connectionTimeToLive=0s
pinot.server.storage.factory.s3.httpclient.connectionAcquisitionTimeout=10s
pinot.server.segment.fetcher.protocols=file,http,s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Minion config

pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.minion.storage.factory.s3.region=ap-southeast-1
pinot.minion.segment.fetcher.protocols=file,http,s3
pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Google Cloud Storage

This guide shows you how to import data from files stored in Google Cloud Storage (GCS) on Google Cloud Platform (GCP).

Enable the Google Cloud Storage file system backend using the pinot-gcs plugin. In the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-gcs

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.


The GCS file system provides the following options:

  • projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.

  • gcpKey - Location of the JSON file containing GCP keys. Refer to Creating and managing service account keys to download the keys.

Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:

pinot.controller.storage.factory.class.gs.projectId=test-project
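A minimal sketch that also points Pinot at a downloaded service-account key file, assuming the same prefix as the example above (the path is a placeholder):

pinot.controller.storage.factory.class.gs.gcpKey=/path/to/service-account-key.json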

Examples

Job spec

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'gs://my-bucket/path/to/input/directory/'
outputDirURI: 'gs://my-bucket/path/to/output/directory/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: gs
      className: org.apache.pinot.plugin.filesystem.GcsPinotFS
      configs:
        projectId: 'my-project'
        gcpKey: 'path-to-gcp json key file'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=gs://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.controller.storage.factory.gs.projectId=my-project
pinot.controller.storage.factory.gs.gcpKey=path/to/gcp/key.json
pinot.controller.segment.fetcher.protocols=file,http,gs
pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.server.storage.factory.gs.projectId=my-project
pinot.server.storage.factory.gs.gcpKey=path/to/gcp/key.json
pinot.server.segment.fetcher.protocols=file,http,gs
pinot.server.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Minion config

pinot.minion.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.minion.storage.factory.gs.projectId=my-project
pinot.minion.storage.factory.gs.gcpKey=path/to/gcp/key.json
pinot.minion.segment.fetcher.protocols=file,http,gs
pinot.minion.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher


HDFS

This guide shows you how to import data from HDFS.

Enable the Hadoop distributed file system (HDFS) backend using the pinot-hdfs plugin. In the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.

The HDFS implementation provides the following options:

  • hadoop.conf.path: Absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml and core-site.xml.

  • hadoop.write.checksum: Create a checksum while pushing an object. Default is false.

  • hadoop.kerberos.principle

  • hadoop.kerberos.keytab

Each of these properties should be prefixed by pinot.[node].storage.factory.class.hdfs. where node is either controller or server depending on the configuration.

The Kerberos configs should be used only if your Hadoop installation is secured with Kerberos. Refer to the Hadoop in secure mode documentation for information on how to secure Hadoop using Kerberos.
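For example, the controller can be pointed at the Hadoop configuration directory like this (the same line appears in the fuller controller config below):

pinot.controller.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/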

You must provide the proper Hadoop dependency jars from your Hadoop installation to your Pinot startup scripts, for example:

export HADOOP_HOME=/local/hadoop/
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"

Push HDFS segment to Pinot Controller

To push HDFS segment files to the Pinot controller, send the HDFS path of your newly created segment files to the controller; the controller will download the files.

This example curl request tells the controller to download the segment files to the proper table:

curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file.

Examples

Job spec

Standalone Job:

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input/directory/'
outputDirURI: 'hdfs:///path/to/output/directory/'
includeFileNamePattern: 'glob:**/*.csv'
overwriteOutput: true
pinotFSSpecs:
    - scheme: hdfs
      className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
      configs:
        hadoop.conf.path: 'path/to/conf/directory/'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Hadoop Job:

executionFrameworkSpec:
    name: 'hadoop'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
    extraConfigs:
      stagingDir: 'hdfs:///path/to/staging/directory/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input/directory/'
outputDirURI: 'hdfs:///path/to/output/directory/'
includeFileNamePattern: 'glob:**/*.csv'
overwriteOutput: true
pinotFSSpecs:
    - scheme: hdfs
      className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
      configs:
        hadoop.conf.path: '/etc/hadoop/conf/'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=hdfs://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

Minion config

storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory
segment.fetcher.protocols=file,http,hdfs
segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

File Systems

This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.

FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).

Pinot uses distributed file systems for the following purposes:

  • Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to the DFS.

  • Controller: When a segment is uploaded to the controller, the controller saves it in the configured DFS.

  • Server: When a server is notified of a new segment, the server copies the segment from the remote DFS to its local node using the DFS abstraction.

Supported file systems

Pinot lets you choose a distributed file system provider. The following file systems are supported by Pinot:

  • Amazon S3

  • Google Cloud Storage

  • HDFS

  • Azure Data Lake Storage

Enabling a file system

To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-plugin-to-include-1,pinot-plugin-to-include-2

You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.

#CONTROLLER

pinot.controller.storage.factory.class.[scheme]=className of the Pinot file system
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

#SERVER

pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
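For instance, instantiated for the s3 scheme (using the S3PinotFS class shown in the Amazon S3 section above), the controller side becomes:

pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher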

You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:

pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

Azure Data Lake Storage

This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2).

Enable the Azure Data Lake Storage file system backend using the pinot-adls plugin. In the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-adls

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.

Azure Blob Storage provides the following options:

  • accountName: Name of the Azure account under which the storage is created.

  • accessKey: Access key required for authentication.

  • fileSystemName: Name of the file system to use, for example, the container name (similar to the bucket name in S3).

  • enableChecksum: Enable MD5 checksum for verification. Default is false.

Each of these properties should be prefixed by pinot.[node].storage.factory.class.adl2. where node is either controller or server depending on the config, like this:

pinot.controller.storage.factory.class.adl2.accountName=test-user

Examples

Job spec

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'adl2://path/to/input/directory/'
outputDirURI: 'adl2://path/to/output/directory/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: adl2
      className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
      configs:
        accountName: 'my-account'
        accessKey: 'foo-bar-1234'
        fileSystemName: 'fs-name'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=adl2://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=my-account
pinot.controller.storage.factory.adl2.accessKey=foo-bar-1234
pinot.controller.storage.factory.adl2.fileSystemName=fs-name
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=my-account
pinot.server.storage.factory.adl2.accessKey=foo-bar-1234
pinot.server.storage.factory.adl2.fileSystemName=fs-name
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Minion config

storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
storage.factory.adl2.accountName=my-account
storage.factory.adl2.fileSystemName=fs-name
storage.factory.adl2.accessKey=foo-bar-1234
segment.fetcher.protocols=file,http,adl2
segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
