Ingestion Job Spec
The ingestion job spec is used while generating, running, and pushing segments from the input files.
The job spec can be written in either YAML or JSON format (0.5.0 onwards). Property names remain the same in both formats.
To use the JSON format, add the property `job-spec-format=json` to the properties file while launching the ingestion job. The properties file can be passed as follows:
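For instance, a sketch of such a launch (the `LaunchDataIngestionJob` command and its `-jobSpecFile`/`-propertyFile` flags are from the Pinot admin CLI; all paths are illustrative):

```properties
# job.config: tells the launcher the job spec is JSON rather than YAML
job-spec-format=json
```

```bash
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/jobSpec.json \
  -propertyFile /path/to/job.config
```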
Template your job spec file
You can define variables in the job spec file, turning it into a template, and then pass in the variable values at runtime.
Templating is based on Groovy's SimpleTemplateEngine.
For example, a user can specify the following in the job spec file:
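A hypothetical spec fragment (the variable names `year`, `month`, `day`, and `hour` are illustrative):

```yaml
inputDirURI: 'file:///path/to/input/${year}/${month}/${day}/${hour}'
```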
The values for the template strings in the job spec file can be passed in one of the following three ways, listed in order of precedence: for the same key, (1) overrides (2), which overrides (3).
Values from the `-values` array passed on the command line
Values from the environment variables
Values from the propertyFile
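The override order can be sketched in plain shell (this only mimics the precedence; Pinot resolves template values internally):

```shell
# 3) lowest precedence: a value from the property file
printf 'year=2019\n' > job.config
prop_year=$(grep '^year=' job.config | cut -d= -f2)

# 2) an environment variable overrides the property file
env_year=2020

# 1) a -values entry from the command line overrides both
cli_year=2021

# Resolve using first-non-empty, mirroring the precedence 1 > 2 > 3
resolved=${cli_year:-${env_year:-$prop_year}}
echo "$resolved"   # prints 2021

rm -f job.config
```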
Taking the `inputDirURI` above as an example, we can define a `job.config` file with the following content:
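For example (hypothetical keys and values):

```properties
year=2020
month=05
day=01
hour=00
```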
These properties can be overridden by environment variables:
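For example, overriding a hypothetical `month` key:

```shell
# Environment variables take precedence over property-file values
export month=06
echo "$month"
```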
From the command line, users can further override those keys using the `-values` flag:
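A sketch of such an invocation (flag shape per the Pinot admin CLI; keys and values are illustrative):

```bash
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /path/to/jobSpec.yaml \
  -propertyFile /path/to/job.config \
  -values day=03 hour=04
```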
After that, the actual ingestion spec passed to the ingestion job will have `inputDirURI` set to 'file:///path/to/input/2020/06/03/04'.
Ingestion Job Spec
The following configurations are supported by Pinot:
Execution Framework Spec
These configs specify the execution framework used to ingest data. Each supported framework has its own set of related configs.
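A representative fragment for the standalone framework (runner class names as found in Pinot's batch-ingestion plugins; verify them against your Pinot version):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
```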
Table Spec
The table spec specifies the table into which data should be populated, along with its schema.
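A minimal sketch (the table name and controller address are hypothetical):

```yaml
tableSpec:
  tableName: 'myTable'
  schemaURI: 'http://localhost:9000/tables/myTable/schema'
  tableConfigURI: 'http://localhost:9000/tables/myTable'
```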
Record Reader Spec
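For example, a CSV reader spec (class names as shipped in Pinot's CSV input-format plugin; verify against your version):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
```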
Segment Name Generator Spec
To set the segment name to be the same as the input file name (without the trailing .gz), use:
Note that `$` in the YAML file must be escaped, since Pinot uses Groovy's SimpleTemplateEngine to process the YAML file, and a raw `$` is treated as a template specifier.
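One way to express this (assuming the `inputFile` generator type and its `file.path.pattern` / `segment.name.template` configs); note the escaped `\$`:

```yaml
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
```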
Pinot Cluster Spec
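A minimal sketch (the controller address is hypothetical):

```yaml
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```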