Ingestion Job Spec

The ingestion job spec is used while generating, running, and pushing segments from the input files.

The job spec can be written in either YAML or JSON format (0.5.0 onwards). Property names remain the same in both formats.

To use the JSON format, add the `job-spec-format=json` property to the properties file while launching the ingestion job. The properties file can be passed as follows:

```bash
pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job_spec.json -propertyFile /path/to/job.properties
```
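For example, a minimal `job.properties` file that enables the JSON format contains just that one property:

```properties
# job.properties -- switches the job spec parser from YAML to JSON
job-spec-format=json
```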

The following configurations are supported by Pinot:

| Property | Description | Example |
| --- | --- | --- |
| `executionFrameworkSpec` | Contains config related to the executor used to ingest data. See Execution Config. | |
| `jobType` | Type of job to execute. The following types are supported:<br>• `SegmentCreation`<br>• `SegmentTarPush`<br>• `SegmentUriPush`<br>• `SegmentCreationAndTarPush`<br>• `SegmentCreationAndUriPush` | `jobType: SegmentCreationAndTarPush` |
| `inputDirURI` | Absolute path, including scheme, of the directory containing all the files to be ingested. | `inputDirURI: s3://bucket/path/to/input`<br>`inputDirURI: /path/to/local/input` |
| `includeFileNamePattern` | Files to include from the input directory. Supports both glob and regex. | `includeFileNamePattern: glob:**/*.csv`<br>`includeFileNamePattern: regex:^.*\.(csv)$` |
| `excludeFileNamePattern` | Files to exclude from the input directory. Supports both glob and regex. | `excludeFileNamePattern: glob:**/*.csv`<br>`excludeFileNamePattern: regex:^.*\.(csv)$` |
| `outputDirURI` | Absolute path, including scheme, of the directory to which all the segments are written. | `outputDirURI: s3://bucket/path/to/output`<br>`outputDirURI: /path/to/local/output` |
| `overwriteOutput` | Set to `true` to overwrite segments already present in the output directory, `false` otherwise. | `overwriteOutput: true` |
| `pinotFSSpecs` | List of all the filesystems to be used for ingestion. Multiple values can be provided when the input and output directories are on different filesystems. See File systems for all the supported filesystems along with their configs. | |
| `tableSpec` | Configs related to the table into which the data is populated. See Table Spec. | |
| `recordReaderSpec` | Parser used to read and decode the input data. See Input formats for all the supported formats along with their configurations. | |
| `segmentNameGeneratorSpec` | Configs used while naming segment files. See Segment Name Generator Spec for details. | |
| `pinotClusterSpecs` | Cluster(s) to which the job requests are sent. See Pinot Cluster Spec. | |
| `pushJobSpec` | Configuration for the job that will push the segments to the desired filesystems. See Push Job Spec. | |
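As an illustration of a record reader spec, a CSV reader matching the `includeFileNamePattern` examples above might look like the sketch below. The class names are assumptions based on Pinot's input-format plugin naming convention; verify them against the Input formats page.

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  # Assumed class names, following the inputformat plugin convention:
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
```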

Example

```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: hdfs://examples/batch/airlineStats/staging
jobType: SegmentCreationAndTarPush
inputDirURI: 'examples/batch/airlineStats/rawdata'
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 'hdfs:///examples/batch/airlineStats/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
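With `job-spec-format=json`, the same properties can be written in JSON. An abridged sketch showing only the scalar top-level properties (the nested specs mirror the YAML structure one-to-one):

```json
{
  "jobType": "SegmentCreationAndTarPush",
  "inputDirURI": "examples/batch/airlineStats/rawdata",
  "includeFileNamePattern": "glob:**/*.avro",
  "outputDirURI": "hdfs:///examples/batch/airlineStats/segments",
  "overwriteOutput": true
}
```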

Execution Config

These configs specify the execution framework to use to ingest data. Check out Batch Ingestion for configs related to all the supported frameworks.

| Property | Description |
| --- | --- |
| `name` | Name of the execution framework. Can be one of `spark`, `hadoop`, or `standalone`. |
| `segmentGenerationJobRunnerClassName` | The class name of the framework's job runner that generates the segments. |
| `segmentTarPushJobRunnerClassName` | The class name of the framework's job runner that pushes the segment TAR files. |
| `segmentUriPushJobRunnerClassName` | The class name of the framework's job runner that pushes the segment URIs. |
| `extraConfigs` | Key-value pairs of configs related to the execution framework. |

Example

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
```
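For comparison, a Hadoop-based spec follows the same shape. The class names below are assumptions extrapolated from the Spark and standalone runner names shown in this document; check Batch Ingestion for the exact values.

```yaml
executionFrameworkSpec:
  name: 'hadoop'
  # Assumed class names, extrapolated from the spark/standalone pattern:
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
```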

Table Spec

The table spec specifies the table into which the data should be populated, along with its schema.

| Property | Description |
| --- | --- |
| `tableName` | Name of the table into which to populate the data. |
| `schemaURI` | Location from which to read the schema for the table. Supports both file systems and HTTP URIs. |
| `tableConfigURI` | Location from which to read the config for the table. Supports both file systems and HTTP URIs. |

Example

```yaml
tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
```
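Since `schemaURI` and `tableConfigURI` also accept filesystem locations, a sketch that reads both from HDFS (the paths are hypothetical, and the scheme must be registered in `pinotFSSpecs`):

```yaml
tableSpec:
  tableName: 'airlineStats'
  # Hypothetical paths; any scheme configured in pinotFSSpecs works here.
  schemaURI: 'hdfs:///path/to/airlineStats_schema.json'
  tableConfigURI: 'hdfs:///path/to/airlineStats_table_config.json'
```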

Segment Name Generator Spec

| Property | Description |
| --- | --- |
| `type` | The type of name generator to use. The following values are supported:<br>• `simple` - this is the default spec.<br>• `normalizedDate` - use this type when the time column in your data is in string format instead of epoch time. |
| `configs` | Key-value pairs related to the name generator. The following configs are supported:<br>• `segment.name.prefix` - prefix to use for the segment name.<br>• `segment.name.postfix` - suffix to use for the segment name.<br>• `exclude.sequence.id` - set to `true` to exclude the sequence id from the segment name, `false` otherwise. |

Example

```yaml
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true
```
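A sketch using the default `simple` type with a postfix instead of a prefix (the postfix value is hypothetical):

```yaml
segmentNameGeneratorSpec:
  type: simple
  configs:
    segment.name.postfix: 'batch'  # hypothetical suffix value
```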

Pinot Cluster Spec

| Property | Description |
| --- | --- |
| `controllerURI` | URI to use to fetch table/schema information and push data. |

Example

```yaml
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
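Because `pinotClusterSpecs` is a list, more than one controller can presumably be listed so that the job request reaches each of them; the hosts below are hypothetical:

```yaml
pinotClusterSpecs:
  - controllerURI: 'http://controller-1.example.com:9000'  # hypothetical hosts
  - controllerURI: 'http://controller-2.example.com:9000'
```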

Push Job Spec

| Property | Description |
| --- | --- |
| `pushAttempts` | Number of attempts for the push job. Default is 1, which means no retry. |
| `pushParallelism` | Number of workers to use for the push job. Default is 1. |
| `pushRetryIntervalMillis` | Time in milliseconds to wait between retry attempts. Default is 1000 (1 second). |
| `segmentUriPrefix` | String appended before the path of the push destination. Generally, it is the scheme of the filesystem, e.g. `s3://`, `file://`, etc. |
| `segmentUriSuffix` | String appended after the path of the push destination. |
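Reading the two URI properties together, the final segment URI is presumably composed as `segmentUriPrefix` + segment path + `segmentUriSuffix`.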

Example

```yaml
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
  segmentUriPrefix: 'file://'
  segmentUriSuffix: my-dir/
```