This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2).
You can enable Azure Data Lake Storage using the `pinot-adls` plugin. In the controller or server, add the plugin to the startup config.
By default, Pinot loads all plugins, so you can just drop this plugin into the plugins directory. If you specify `-Dplugins.include`, you must list every plugin you want to use, e.g. `pinot-json`, `pinot-avro`, `pinot-kafka-2.0`.
Azure Blob Storage provides the following options:
- `accountName`: Name of the Azure account under which the storage is created.
- `accessKey`: Access key required for authentication.
- `fileSystemName`: Name of the file system to use, i.e. the container name (a container name is similar to a bucket name in S3).
- `enableChecksum`: Enable MD5 checksum for verification. Default is `false`.
Each of these properties should be prefixed by `pinot.[node].storage.factory.adl2.`, where `node` is either `controller` or `server`, depending on the config.
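As a minimal sketch of how these options fit together, a controller config might look like the following. The account name, access key, and file system name are placeholders. The class name `org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS` is an assumption about what the `pinot-adls` plugin ships, so verify it against your Pinot version. Note that the key `storage.factory.class.adl2` names the implementation class, while the options themselves sit under `storage.factory.adl2.`.

```properties
# Assumed implementation class provided by the pinot-adls plugin
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS

# Placeholder values; replace with your own account details
pinot.controller.storage.factory.adl2.accountName=my-account
pinot.controller.storage.factory.adl2.accessKey=my-access-key
pinot.controller.storage.factory.adl2.fileSystemName=my-container
pinot.controller.storage.factory.adl2.enableChecksum=false
```

For a server, the same keys apply with `controller` replaced by `server`.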
This guide shows you how to import data from HDFS.
You can enable Hadoop DFS using the `pinot-hdfs` plugin. In the controller or server, add the plugin to the startup config.
By default, Pinot loads all plugins, so you can just drop this plugin into the plugins directory. If you specify `-Dplugins.include`, you must list every plugin you want to use, e.g. `pinot-json`, `pinot-avro`, `pinot-kafka-2.0`.
The HDFS implementation provides the following options:
- `hadoop.conf.path`: Absolute path of the directory containing Hadoop XML configuration files such as `hdfs-site.xml` and `core-site.xml`.
- `hadoop.write.checksum`: Create a checksum while pushing an object. Default is `false`.
- `hadoop.kerberos.principle`
- `hadoop.kerberos.keytab`
Each of these properties should be prefixed by `pinot.[node].storage.factory.hdfs.`, where `node` is either `controller` or `server`, depending on the config.
The `kerberos` configs should be used only if your Hadoop installation is secured with Kerberos. See the Hadoop Kerberos guide for how to generate Kerberos security identification.
You also need to provide the required Hadoop dependency JARs from your Hadoop installation to your Pinot startup scripts.
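A hedged sketch of a controller config using these options is shown below. The paths and the Kerberos principal are placeholders, and the class name `org.apache.pinot.plugin.filesystem.HadoopPinotFS` is an assumption about what the `pinot-hdfs` plugin ships, so check it against your Pinot version. The two Kerberos lines apply only to a Kerberos-secured cluster.

```properties
# Assumed implementation class provided by the pinot-hdfs plugin
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS

# Directory holding hdfs-site.xml and core-site.xml (placeholder path)
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf
pinot.controller.storage.factory.hdfs.hadoop.write.checksum=false

# Only for Kerberos-secured Hadoop installations (placeholder values)
pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle=pinot@EXAMPLE.COM
pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab=/path/to/pinot.keytab
```

For a server, the same keys apply with `controller` replaced by `server`.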
To push HDFS segment files to the Pinot controller, make sure the Hadoop configuration described above is in place. Your remote segment creation/push job can then send the HDFS path of the newly created segment files to the Pinot controller and let it download the files.
For example, the following curl requests to the controller notify it to download segment files to the proper table:
Standalone Job:
Hadoop Job:
This guide shows you how to import data from GCP (Google Cloud Platform).
You can enable Google Cloud Storage using the `pinot-gcs` plugin. In the controller or server, add the plugin to the startup config.
By default, Pinot loads all plugins, so you can just drop this plugin into the plugins directory. If you specify `-Dplugins.include`, you must list every plugin you want to use, e.g. `pinot-json`, `pinot-avro`, `pinot-kafka-2.0`.
The GCP file system provides the following options:
- `projectId`: The name of the Google Cloud Platform project under which you have created your storage bucket.
- `gcpKey`: Location of the JSON file containing the GCP keys. Refer to Creating and managing service account keys to download the keys.
Each of these properties should be prefixed by `pinot.[node].storage.factory.gs.`, where `node` is either `controller` or `server`, depending on the config.
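As an illustration, a controller config using these options might look like the sketch below. The project name and key path are placeholders, and the class name `org.apache.pinot.plugin.filesystem.GcsPinotFS` is an assumption about what the `pinot-gcs` plugin ships, so confirm it for your Pinot version.

```properties
# Assumed implementation class provided by the pinot-gcs plugin
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS

# Placeholder values; replace with your project and key file location
pinot.controller.storage.factory.gs.projectId=my-project
pinot.controller.storage.factory.gs.gcpKey=/path/to/gcp/key.json
```

For a server, the same keys apply with `controller` replaced by `server`.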
This section contains a collection of short guides that show you how to import data from a Pinot-supported file system.
FileSystem is an abstraction provided by Pinot to access data in distributed file systems (DFS).
Pinot uses distributed file systems for the following purposes:
- Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to the DFS.
- Controller: When a segment is uploaded to the controller, the controller saves it in the configured DFS.
- Server: When a server is notified of a new segment, it copies the segment from the remote DFS to its local node using the DFS abstraction.
Pinot lets you choose a distributed file system provider. The guides in this section cover Amazon S3, Google Cloud Storage, HDFS, and Azure Data Lake Storage Gen2.
To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory with `-Dplugins.dir` and list the required plugins with `-Dplugins.include`.
You can then change the file system in the `controller` and `server` configs.
`scheme` refers to the prefix used in the URI of the file system, e.g. for the URI `s3://bucket/path/to/file`, the scheme is `s3`.
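The general shape of the controller and server settings can be sketched as follows. Here `[scheme]` and the class name are placeholders to fill in for your chosen file system. The `segment.fetcher` keys and the `PinotFSSegmentFetcher` class reflect our understanding of how Pinot wires a fetcher to a file system scheme, and they should be treated as assumptions to verify against your Pinot version.

```properties
# CONTROLLER
# [scheme] is a placeholder for the URI scheme, e.g. s3
pinot.controller.storage.factory.class.[scheme]=<class name of the Pinot file system>
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

# SERVER
pinot.server.storage.factory.class.[scheme]=<class name of the Pinot file system>
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```

Listing the scheme under `segment.fetcher.protocols` is what lets a node download segments through that file system in addition to local files and HTTP.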
You can also change the file system during ingestion by specifying it in the ingestion job spec.
You can enable the Amazon S3 file system backend by including the `pinot-s3` plugin.
By default, Pinot loads all plugins, so you can just drop this plugin into the plugins directory. If you specify `-Dplugins.include`, you must list every plugin you want to use, e.g. `pinot-json`, `pinot-avro`, `pinot-kafka-2.0`.
You can also configure the S3 file system using the following options:

| Configuration | Description |
| --- | --- |
| `region` | The AWS data center region in which the bucket is located. |
| `accessKey` | (Optional) AWS access key required for authentication. Use this only for testing, because the key is not stored as a secret. |
| `secretKey` | (Optional) AWS secret key required for authentication. Use this only for testing, because the key is not stored as a secret. |
| `endpoint` | (Optional) Override endpoint for the S3 client. |
| `disableAcl` | If this is set to `false`, the bucket owner is granted full access to the objects created by Pinot. Default value is `true`. |
| `serverSideEncryption` | (Optional) The server-side encryption algorithm used when storing this object in Amazon S3 (currently supports `aws:kms`). Set to null to disable SSE. |
| `ssekmsKeyId` | (Optional, but required when `serverSideEncryption=aws:kms`) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4. |
| `ssekmsEncryptionContext` | (Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs. |

Each of these properties should be prefixed by `pinot.[node].storage.factory.s3.`, where `node` is either `controller` or `server`, depending on the config.
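For instance, a controller config that enables KMS-based server-side encryption might look like the sketch below. The region and key ID are placeholders, and the class name `org.apache.pinot.plugin.filesystem.S3PinotFS` is an assumption about what the `pinot-s3` plugin ships, so confirm it for your Pinot version. No access or secret key is set here, which leaves credential lookup to the default provider chain.

```properties
# Assumed implementation class provided by the pinot-s3 plugin
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS

# Placeholder region; use the region where your bucket lives
pinot.controller.storage.factory.s3.region=us-west-2

# Grant the bucket owner full access to objects created by Pinot
pinot.controller.storage.factory.s3.disableAcl=false

# ssekmsKeyId is required once serverSideEncryption is aws:kms (placeholder key ID)
pinot.controller.storage.factory.s3.serverSideEncryption=aws:kms
pinot.controller.storage.factory.s3.ssekmsKeyId=my-kms-key-id
```

For a server, the same keys apply with `controller` replaced by `server`.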
The S3 file system supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for credentials in the following order:
1. Environment variables: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (recommended, since they are recognized by all the AWS SDKs and the CLI except for .NET), or `AWS_ACCESS_KEY` and `AWS_SECRET_KEY` (only recognized by the Java SDK).
2. Java system properties: `aws.accessKeyId` and `aws.secretKey`.
3. Web Identity Token credentials from the environment or container.
4. Credential profiles file at the default location (`~/.aws/credentials`), shared by all AWS SDKs and the AWS CLI.
5. Credentials delivered through the Amazon EC2 container service, if the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable is set and the security manager has permission to access the variable.
6. Instance profile credentials delivered through the Amazon EC2 metadata service.
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.