Amazon S3
This guide shows you how to import data from files stored in Amazon S3.
Enable the Amazon S3 file system backend by including the pinot-s3 plugin. In the controller or server configuration, add the config:
-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-s3
By default, Pinot loads all plugins in the plugins directory, so you can simply drop this plugin there. If you specify -Dplugins.include, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0...
You can configure the S3 file system using the following options:
region
The AWS Data center region in which the bucket is located
accessKey
(Optional) AWS access key required for authentication. Use this only for testing, because these keys are not stored securely.
secretKey
(Optional) AWS secret key required for authentication. Use this only for testing, because these keys are not stored securely.
endpoint
(Optional) Overrides the endpoint used by the S3 client.
disableAcl
If this is set to false, the bucket owner is granted full access to the objects created by Pinot. Default value is true.
serverSideEncryption
(Optional) The server-side encryption algorithm used when storing objects in Amazon S3. Currently supports aws:kms. Set to null to disable SSE.
ssekmsKeyId
(Optional, but required when serverSideEncryption=aws:kms) Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4.
ssekmsEncryptionContext
(Optional) Specifies the AWS KMS Encryption Context to use for object encryption. The value of this header is a base64-encoded UTF-8 string holding JSON with the encryption context key-value pairs.
requestChecksumCalculation
(Optional) Controls whether checksums are calculated for request payloads. Default: WHEN_SUPPORTED. Options: WHEN_SUPPORTED, WHEN_REQUIRED.
responseChecksumValidation
(Optional) Controls whether checksums are validated on response payloads. Default: WHEN_SUPPORTED. Options: WHEN_SUPPORTED, WHEN_REQUIRED.
useLegacyMd5Plugin
(Optional) When set to true, uses the LegacyMd5Plugin to restore pre-2.30.0 MD5 checksum behavior. Default: false.
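As an illustration, the encryption options above might be combined as follows for a controller. This is a sketch only: it assumes the pinot.controller.storage.factory.s3. prefix for controller properties, and the key ID is a placeholder rather than a real key.

```properties
# Region of the bucket that holds the segments
pinot.controller.storage.factory.s3.region=ap-southeast-1
# Encrypt objects written by Pinot with AWS KMS
pinot.controller.storage.factory.s3.serverSideEncryption=aws:kms
# Placeholder: replace with the ID of your own KMS key
pinot.controller.storage.factory.s3.ssekmsKeyId=<your-kms-key-id>
```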
Each of these properties should be prefixed by pinot.[node].storage.factory.s3. where node is either controller or server, depending on the config, e.g.
pinot.controller.storage.factory.s3.region=ap-southeast-1
The S3 file system supports authentication using the DefaultCredentialsProviderChain. The credential provider looks for credentials in the following order:
1. Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
2. Java system properties: aws.accessKeyId and aws.secretKey
3. Web Identity Token credentials from the environment or container
4. Credential profiles file at the default location (~/.aws/credentials), shared by all AWS SDKs and the AWS CLI
5. Credentials delivered through the Amazon EC2 container service, if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable
6. Instance profile credentials delivered through the Amazon EC2 metadata service
You can also specify the accessKey and secretKey using the properties. However, this method is not secure and should be used only for POC setups.
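For a proof-of-concept setup, static credentials might be configured as in the sketch below. The key values are placeholders, and the controller prefix is used only as an example; a server would use pinot.server.storage.factory.s3. instead.

```properties
# Region of the bucket
pinot.controller.storage.factory.s3.region=ap-southeast-1
# Placeholders: static credentials, suitable for POC setups only
pinot.controller.storage.factory.s3.accessKey=<your-access-key>
pinot.controller.storage.factory.s3.secretKey=<your-secret-key>
```

When these two properties are omitted, the credential provider chain described above is used instead.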
Checksum validation
Checksum configuration is available starting in Pinot 1.4.
Starting with AWS SDK 2.30.0, the S3 client enables request and response checksum validation by default. Pinot exposes configuration properties to control this behavior.
Request and response checksums
By default, Pinot sets both requestChecksumCalculation and responseChecksumValidation to WHEN_SUPPORTED, which means the S3 client calculates checksums on uploads and validates them on downloads whenever the API supports it. This provides data integrity verification for segment files stored in your deep store.
If you want to disable automatic checksums and only use them when the S3 API strictly requires it, set both properties to WHEN_REQUIRED:
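A sketch of that setting for a controller, assuming the same pinot.[node].storage.factory.s3. prefix that applies to the other S3 options (use pinot.server. for servers):

```properties
# Only calculate request checksums when the S3 API requires them
pinot.controller.storage.factory.s3.requestChecksumCalculation=WHEN_REQUIRED
# Only validate response checksums when the S3 API requires them
pinot.controller.storage.factory.s3.responseChecksumValidation=WHEN_REQUIRED
```

The two accepted values are: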
WHEN_SUPPORTED
Calculate/validate checksums whenever the API supports it (default)
WHEN_REQUIRED
Only calculate/validate checksums when the API requires it
LegacyMd5Plugin for S3-compatible stores
Some S3-compatible object stores (e.g. MinIO, Ceph, or older AWS configurations) require the legacy Content-MD5 header on requests. After the AWS SDK 2.30.0 upgrade, these stores may return errors like:
To restore the pre-2.30.0 MD5 checksum behavior, enable the useLegacyMd5Plugin option:
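A sketch of that setting for a controller, assuming the same pinot.[node].storage.factory.s3. prefix that applies to the other S3 options (use pinot.server. for servers):

```properties
# Send the legacy Content-MD5 header for S3-compatible stores that require it
pinot.controller.storage.factory.s3.useLegacyMd5Plugin=true
```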
This adds the LegacyMd5Plugin to the S3 client, which sends the Content-MD5 header that these stores expect.
Only enable useLegacyMd5Plugin if your S3-compatible store requires the legacy MD5 header. For standard AWS S3, the default checksum behavior is recommended.
Examples
Job spec
Controller config
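A minimal sketch of a controller configuration that uses S3 as the deep store. The bucket path is a placeholder. The property names outside the storage.factory.s3. options listed above (data directory, file system class, segment fetcher) are assumptions based on Pinot's general deep-store configuration, so verify them against your Pinot version.

```properties
# Placeholder: S3 location where the controller stores segments
controller.data.dir=s3://<your-bucket>/<path/to/data>
# Register the S3 file system implementation and its region
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=ap-southeast-1
# Allow the controller to fetch segments over the s3 scheme
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```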
Server config
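A minimal sketch of the matching server configuration. As with the controller, the property names outside the storage.factory.s3. options listed above are assumptions based on Pinot's general deep-store configuration, so verify them against your Pinot version.

```properties
# Register the S3 file system implementation and its region
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=ap-southeast-1
# Allow the server to fetch segments over the s3 scheme
pinot.server.segment.fetcher.protocols=file,http,s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```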
Minion config

