The ingestion job spec configures how segments are generated from the input files, and how they are then pushed to the Pinot cluster.
The job spec can be written in either YAML or JSON format (0.5.0 onwards); property names are the same in both formats. To use the JSON format, add the property job-spec-format=json to the properties file when launching the ingestion job. The properties file can be passed as follows:
pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job_spec.json -propertyFile /path/to/job.properties
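For illustration, a fragment of the spec in JSON uses exactly the same property names as the YAML form; this snippet mirrors the tableSpec example shown later on this page:

{
  "tableSpec": {
    "tableName": "airlineStats",
    "schemaURI": "http://localhost:9000/tables/airlineStats/schema",
    "tableConfigURI": "http://localhost:9000/tables/airlineStats"
  }
}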
The values for the template strings in the jobSpecFile can be passed in the following ways, listed in order of precedence:
Values from environment variables
Values from the propertyFile
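For example, a job spec can reference a template variable; the variable name year below is purely illustrative:

# job_spec.yaml (excerpt) — ${year} is resolved by Groovy's SimpleTemplateEngine
inputDirURI: 'examples/batch/airlineStats/rawdata/${year}'

The value can then come from an environment variable (export year=2023) or from the properties file passed with -propertyFile:

# job.properties
year=2023

If both are set, the environment variable wins, per the precedence order above.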
The following configurations are supported by Pinot:
Top-Level Spec
Example
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: hdfs://examples/batch/airlineStats/staging

# Recommended to set jobType to SegmentCreationAndMetadataPush for production environments where Pinot Deep Store is configured
jobType: SegmentCreationAndTarPush

inputDirURI: 'examples/batch/airlineStats/rawdata'
includeFileNamePattern: 'glob:**/*.avro'
searchRecursively: true
outputDirURI: 'hdfs:///examples/batch/airlineStats/segments'
overwriteOutput: true

pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

recordReaderSpec:
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'

tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'

segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
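The jobType in this example is SegmentCreationAndTarPush; other jobType values supported by the batch ingestion framework include SegmentCreation, SegmentTarPush, SegmentUriPush, SegmentMetadataPush, SegmentCreationAndUriPush, and SegmentCreationAndMetadataPush. As the comment above notes, SegmentCreationAndMetadataPush is recommended for production environments where a Pinot Deep Store is configured.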
Execution Framework Spec
These configs specify the execution framework used to ingest data. Check out Batch Ingestion for configs related to all the supported frameworks.
Example
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
Pinot FS Spec
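The Pinot FS specs register a filesystem implementation for each URI scheme the job touches (input, staging, and output paths). A minimal sketch, reusing the two filesystems from the top-level example above:

Example

pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS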
Table Spec
The table spec specifies the table in which the data should be populated, along with the schema to use.
Example
tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
Record Reader Spec
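The record reader spec selects the reader class used to parse the input files. A minimal sketch for the Avro inputs used in the top-level example above:

Example

recordReaderSpec:
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'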
Segment Name Generator Spec
Example
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true
To set the segment name to be the same as the input file name (without the trailing .gz), use:
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    file.path.pattern: '.+/(.+)\.gz'
    segment.name.template: '\${filePathPattern:\1}'
Note that $ in the YAML file must be escaped, since Pinot uses Groovy's SimpleTemplateEngine to process the YAML file, and a raw $ is treated as a template specifier.
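For example, under this spec a hypothetical input file hdfs:///examples/rawdata/airlineStats.gz matches file.path.pattern with capture group 1 equal to airlineStats, so the template resolves and the generated segment is named airlineStats.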
Pinot Cluster Spec
Example
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
Push Job Spec
Example
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
  segmentUriPrefix: 'file://'
  segmentUriSuffix: my-dir/
  pushFileNamePattern: 'glob:**2023-01*'
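In this example the job pushes up to two segments in parallel, and pushAttempts counts total attempts per push, so a value of 2 allows one retry after a 1,000 ms wait. segmentUriPrefix and segmentUriSuffix are prepended and appended to the segment URI sent to the controller during a URI push, and pushFileNamePattern limits the push to segment files whose paths match the glob, here those containing 2023-01.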