Amazon S3

This guide shows you how to import data from files stored in Amazon S3.

Enable the Amazon S3 file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.
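For example, if you rely on -Dplugins.include, a startup flag along these lines pulls in several plugins at once (the exact plugin list is illustrative):

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3,pinot-json,pinot-avro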

You can configure the S3 file system using the following options:

  • region: The AWS data center region in which the bucket is located.

  • accessKey: (Optional) AWS access key required for authentication. This should be used only for testing purposes, as these keys are not stored as secrets.

  • secretKey: (Optional) AWS secret key required for authentication. This should be used only for testing purposes, as these keys are not stored as secrets.

  • endpoint: (Optional) Override endpoint for the S3 client.

  • disableAcl: If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.

  • serverSideEncryption: (Optional) The server-side encryption algorithm used when storing objects in Amazon S3 (currently supports aws:kms); set to null to disable SSE.

  • ssekmsKeyId: (Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.

  • ssekmsEncryptionContext: (Optional) Specifies the AWS KMS encryption context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.

Each of these properties should be prefixed by pinot.[node].storage.factory.s3. where node is either controller or server depending on the configuration. For example:

pinot.controller.storage.factory.s3.region=ap-southeast-1
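As an illustrative sketch, enabling SSE-KMS on the controller combines the prefix above with the encryption options (the key ID below is a placeholder, not a real key):

pinot.controller.storage.factory.s3.serverSideEncryption=aws:kms
pinot.controller.storage.factory.s3.ssekmsKeyId=<your-kms-key-id>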

The S3 file system supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for credentials in the following order:

  • Environment variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (recommended, since they are recognized by all AWS SDKs and the CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by the Java SDK)

  • Java system properties - aws.accessKeyId and aws.secretKey

  • Web Identity Token credentials from the environment or container

  • Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI

  • Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable

  • Instance profile credentials delivered through the Amazon EC2 metadata service

You can also specify the accessKey and secretKey using the properties above. However, this method is not secure and should be used only for POC setups.
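For a quick POC setup, a minimal sketch of passing static credentials through the controller properties (the values shown are placeholders):

pinot.controller.storage.factory.s3.accessKey=<your-access-key>
pinot.controller.storage.factory.s3.secretKey=<your-secret-key>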

Examples

Job spec

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://pinot-bucket/pinot-ingestion/batch-input/'
outputDirURI: 's3://pinot-bucket/pinot-ingestion/batch-output/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: s3
      className: org.apache.pinot.plugin.filesystem.S3PinotFS
      configs:
        region: 'ap-southeast-1'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=s3://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=ap-southeast-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=ap-southeast-1
pinot.server.storage.factory.s3.httpclient.maxConnections=50
pinot.server.storage.factory.s3.httpclient.socketTimeout=30s
pinot.server.storage.factory.s3.httpclient.connectionTimeout=2s
pinot.server.storage.factory.s3.httpclient.connectionTimeToLive=0s
pinot.server.storage.factory.s3.httpclient.connectionAcquisitionTimeout=10s
pinot.server.segment.fetcher.protocols=file,http,s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Minion config

pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.minion.storage.factory.s3.region=ap-southeast-1
pinot.minion.segment.fetcher.protocols=file,http,s3
pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Google Cloud Storage

This guide shows you how to import data from files stored in Google Cloud Storage (GCS) on Google Cloud Platform (GCP).

Enable the Google Cloud Storage file system backend using the pinot-gcs plugin. In the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-gcs

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.


The GCS file system provides the following options:

  • projectId - The name of the Google Cloud Platform project under which you have created your storage bucket.

  • gcpKey - Location of the JSON file containing GCP keys. Refer to Creating and managing service account keys to download the keys.

Each of these properties should be prefixed by pinot.[node].storage.factory.class.gs. where node is either controller or server depending on the configuration, like this:

pinot.controller.storage.factory.class.gs.projectId=test-project
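A minimal sketch that also points Pinot at a downloaded service-account key file, assuming the same prefix as the example above (the path is a placeholder):

pinot.controller.storage.factory.class.gs.gcpKey=/path/to/service-account-key.json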

Examples

Job spec

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'gs://my-bucket/path/to/input/directory/'
outputDirURI: 'gs://my-bucket/path/to/output/directory/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: gs
      className: org.apache.pinot.plugin.filesystem.GcsPinotFS
      configs:
        projectId: 'my-project'
        gcpKey: 'path-to-gcp json key file'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=gs://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.controller.storage.factory.gs.projectId=my-project
pinot.controller.storage.factory.gs.gcpKey=path/to/gcp/key.json
pinot.controller.segment.fetcher.protocols=file,http,gs
pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.server.storage.factory.gs.projectId=my-project
pinot.server.storage.factory.gs.gcpKey=path/to/gcp/key.json
pinot.server.segment.fetcher.protocols=file,http,gs
pinot.server.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Minion config

pinot.minion.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.minion.storage.factory.gs.projectId=my-project
pinot.minion.storage.factory.gs.gcpKey=path/to/gcp/key.json
pinot.minion.segment.fetcher.protocols=file,http,gs
pinot.minion.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher


HDFS

This guide shows you how to import data from HDFS.

Enable the Hadoop distributed file system (HDFS) backend using the pinot-hdfs plugin. In the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.

The HDFS implementation provides the following options:

  • hadoop.conf.path: Absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml and core-site.xml.

  • hadoop.write.checksum: Create a checksum while pushing an object. Default is false.

  • hadoop.kerberos.principle

  • hadoop.kerberos.keytab

Each of these properties should be prefixed by pinot.[node].storage.factory.class.hdfs. where node is either controller or server depending on the configuration.

The Kerberos configs should be used only if your Hadoop installation is secured with Kerberos. Refer to the Hadoop in secure mode documentation for information on how to secure Hadoop using Kerberos.
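For example, the controller can be pointed at the Hadoop configuration directory like this (the same line appears in the fuller controller config below):

pinot.controller.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/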

You must provide the proper Hadoop dependency jars from your Hadoop installation to your Pinot startup scripts, for example:

export HADOOP_HOME=/local/hadoop/
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"

Push HDFS segment to Pinot Controller

To push HDFS segment files to the Pinot controller, send the HDFS path of your newly created segment files to the controller; the controller will download the files.

This example curl request tells the controller to download the segment files to the proper table:

curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file.

Examples

Job spec

Standalone Job:

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input/directory/'
outputDirURI: 'hdfs:///path/to/output/directory/'
includeFileNamePattern: 'glob:**/*.csv'
overwriteOutput: true
pinotFSSpecs:
    - scheme: hdfs
      className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
      configs:
        hadoop.conf.path: 'path/to/conf/directory/'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Hadoop Job:

executionFrameworkSpec:
    name: 'hadoop'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
    extraConfigs:
      stagingDir: 'hdfs:///path/to/staging/directory/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input/directory/'
outputDirURI: 'hdfs:///path/to/output/directory/'
includeFileNamePattern: 'glob:**/*.csv'
overwriteOutput: true
pinotFSSpecs:
    - scheme: hdfs
      className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
      configs:
        hadoop.conf.path: '/etc/hadoop/conf/'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=hdfs://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

Minion config

storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory
segment.fetcher.protocols=file,http,hdfs
segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

File Systems

This section contains a collection of short guides to show you how to import data from a Pinot-supported file system.

FileSystem is an abstraction provided by Pinot to access data stored in distributed file systems (DFS).

Pinot uses distributed file systems for the following purposes:

  • Batch ingestion job: To read the input data (CSV, Avro, Thrift, etc.) and to write generated segments to the DFS.

  • Controller: When a segment is uploaded to the controller, the controller saves it in the configured DFS.

  • Server: When a server is notified of a new segment, the server copies the segment from the remote DFS to its local node using the DFS abstraction.

Supported file systems

Pinot lets you choose a distributed file system provider. The following file systems are supported by Pinot:

  • Amazon S3

  • Google Cloud Storage

  • HDFS

  • Azure Data Lake Storage

Enabling a file system

To use a distributed file system, you need to enable plugins. To do that, specify the plugin directory and include the required plugins:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-plugin-to-include-1,pinot-plugin-to-include-2

You can change the file system in the controller and server configuration. In the following configuration example, the URI is s3://bucket/path/to/file and scheme refers to the file system URI prefix s3.

#CONTROLLER

pinot.controller.storage.factory.class.[scheme]=className of the Pinot file system
pinot.controller.segment.fetcher.protocols=file,http,[scheme]
pinot.controller.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

#SERVER

pinot.server.storage.factory.class.[scheme]=className of the Pinot file system
pinot.server.segment.fetcher.protocols=file,http,[scheme]
pinot.server.segment.fetcher.[scheme].class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
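For instance, instantiated for the s3 scheme (using the S3PinotFS class shown in the Amazon S3 section above), the controller side becomes:

pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher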

You can also change the file system during ingestion. In the ingestion job spec, specify the file system with the following configuration:

pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

Azure Data Lake Storage

This guide shows you how to import data from files stored in Azure Data Lake Storage Gen2 (ADLS Gen2).

Enable the Azure Data Lake Storage file system backend using the pinot-adls plugin. In the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-adls

Note: By default, Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.

Azure Blob Storage provides the following options:

  • accountName: Name of the Azure account under which the storage is created.

  • accessKey: Access key required for authentication.

  • fileSystemName: Name of the file system to use, for example, the container name (similar to the bucket name in S3).

  • enableChecksum: Enable MD5 checksum for verification. Default is false.

Each of these properties should be prefixed by pinot.[node].storage.factory.class.adl2. where node is either controller or server depending on the config, like this:

pinot.controller.storage.factory.class.adl2.accountName=test-user

Examples

Job spec

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'adl2://path/to/input/directory/'
outputDirURI: 'adl2://path/to/output/directory/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: adl2
      className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
      configs:
        accountName: 'my-account'
        accessKey: 'foo-bar-1234'
        fileSystemName: 'fs-name'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=adl2://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.controller.storage.factory.adl2.accountName=my-account
pinot.controller.storage.factory.adl2.accessKey=foo-bar-1234
pinot.controller.storage.factory.adl2.fileSystemName=fs-name
pinot.controller.segment.fetcher.protocols=file,http,adl2
pinot.controller.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
pinot.server.storage.factory.adl2.accountName=my-account
pinot.server.storage.factory.adl2.accessKey=foo-bar-1234
pinot.server.storage.factory.adl2.fileSystemName=fs-name
pinot.server.segment.fetcher.protocols=file,http,adl2
pinot.server.segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher

Minion config

storage.factory.class.adl2=org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
storage.factory.adl2.accountName=my-account
storage.factory.adl2.fileSystemName=fs-name
storage.factory.adl2.accessKey=foo-bar-1234
segment.fetcher.protocols=file,http,adl2
segment.fetcher.adl2.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
