Ingestion Job Spec
The ingestion job spec is used while generating, running, and pushing segments from the input files.
The job spec can be written in either YAML or JSON format (0.5.0 onwards). Property names are the same in both formats.
To use the JSON format, add the property job-spec-format=json to the properties file used when launching the ingestion job. The properties file is passed with the -propertyFile option of the LaunchDataIngestionJob command.
Pinot supports the following configurations. The execution framework spec selects the execution framework used to ingest data; see Batch Ingestion for the configs of all the supported frameworks. The table spec specifies the table to populate, along with its schema.
| Property | Description |
| --- | --- |
| executionFrameworkSpec | Contains the config for the executor used to ingest data. See Execution Framework Spec. |
| jobType | Type of job to execute. Supported types: SegmentCreation, SegmentTarPush, SegmentUriPush, SegmentMetadataPush, SegmentCreationAndTarPush, SegmentCreationAndUriPush, SegmentCreationAndMetadataPush. Note: for production environments where Pinot Deep Store is configured, SegmentCreationAndMetadataPush is recommended. |
| inputDirURI | Absolute path, including the scheme, of the directory containing all the files to be ingested, e.g. s3://bucket/path/to/input or /path/to/local/input. |
| includeFileNamePattern | Include file name pattern; glob and regex patterns are supported. E.g. 'glob:*.avro' or 'regex:^.*\.(avro)$' includes all Avro files directly under inputDirURI but not in its subdirectories, while 'glob:**/*.avro' includes all Avro files under inputDirURI recursively. |
| excludeFileNamePattern | Exclude file name pattern; glob patterns are supported. Usage is similar to includeFileNamePattern. |
| outputDirURI | Absolute path, including the scheme, of the directory where the segments are written. |
| overwriteOutput | Set to true to overwrite segments that are already present in the output directory, or to false to raise an exception instead. |
| pinotFSSpecs | List of all the filesystems used for ingestion. Specify multiple values when the input and output directories are on different filesystems. See Pinot FS Spec. |
| tableSpec | Defines the table name and where to fetch the corresponding table config and table schema. See Table Spec. |
| recordReaderSpec | Parser used to read and decode the input data. See Record Reader Spec. |
| segmentNameGeneratorSpec | Defines how segment names are generated. See Segment Name Generator Spec. |
| pinotClusterSpecs | Defines the Pinot cluster access point. See Pinot Cluster Spec. |
| pushJobSpec | Defines the configuration for the segment push job. See Push Job Spec. |
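As a rough sketch of how the scalar top-level properties fit together, a YAML job spec might begin as follows. The bucket, paths, and file pattern are placeholders chosen for illustration, and the nested specs (executionFrameworkSpec, pinotFSSpecs, tableSpec, and so on) are left out here because each has its own set of properties.

```yaml
# Illustrative fragment only: bucket names and paths are placeholders.
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://bucket/path/to/input'
# Pick up Avro files in inputDirURI and in all of its subdirectories.
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 's3://bucket/path/to/output'
# Replace segments that already exist in the output directory.
overwriteOutput: true
```

A complete job spec would also need the nested specs before it can be launched.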
Execution Framework Spec

| Property | Description |
| --- | --- |
| name | Name of the execution framework. One of spark, hadoop, or standalone. |
| segmentGenerationJobRunnerClassName | Name of the class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to run the segment generation job. |
| segmentTarPushJobRunnerClassName | Name of the class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to push the segment TAR file. |
| segmentUriPushJobRunnerClassName | Name of the class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to send the segment URI. |
| segmentMetadataPushJobRunnerClassName | Name of the class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to send the segment metadata. |
| extraConfigs | Key-value pairs of configs specific to the execution framework. |
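For illustration, an execution framework spec for the standalone framework could look like the sketch below. The runner class names are those of the standalone batch ingestion plugin as I understand it; treat them as assumptions and check them against the plugins shipped with your Pinot version.

```yaml
# Sketch of a standalone execution framework spec.
# Class names are assumed from the standalone batch ingestion plugin.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
```

Only the runner that matches the chosen jobType is exercised, so the other class names can usually be left in place without effect.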
Pinot FS Spec

| Field | Description |
| --- | --- |
| scheme | Used to identify a PinotFS, e.g. local, hdfs, dbfs. |
| className | Class name used to create the PinotFS instance. E.g. org.apache.pinot.spi.filesystem.LocalPinotFS is used for the local filesystem, and org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS. |
| configs | Configs used to initialize the PinotFS instance. |
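A minimal sketch of a pinotFSSpecs list with one local and one HDFS entry is shown below. It uses the key scheme, which is how the PinotFS identifier is spelled in job spec files. The hadoop.conf.path key and its value are assumptions for illustration; the configs a filesystem accepts depend on the PinotFS implementation.

```yaml
# One entry per filesystem used by the job.
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      # Assumed config key; the path is a placeholder.
      hadoop.conf.path: '/path/to/hadoop/conf'
```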
Table Spec

| Property | Description |
| --- | --- |
| tableName | Name of the table to populate with data. |
| schemaURI | Location from which to read the schema for the table. Supports both filesystem and HTTP URIs. |
| tableConfigURI | Location from which to read the config for the table. Supports both filesystem and HTTP URIs. |
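As an example, a table spec that reads the schema and table config over HTTP might look like this. The table name myTable and the controller address localhost:9000 are placeholders, and the URL paths assume the controller's table REST endpoints.

```yaml
# Placeholder table name and controller address.
tableSpec:
  tableName: 'myTable'
  schemaURI: 'http://localhost:9000/tables/myTable/schema'
  tableConfigURI: 'http://localhost:9000/tables/myTable'
```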
Record Reader Spec

| Field | Description |
| --- | --- |
| dataFormat | Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift'. |
| className | Corresponding RecordReader class name. E.g. org.apache.pinot.plugin.inputformat.avro.AvroRecordReader, org.apache.pinot.plugin.inputformat.csv.CSVRecordReader, org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader, org.apache.pinot.plugin.inputformat.json.JSONRecordReader, org.apache.pinot.plugin.inputformat.orc.ORCRecordReader, org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader. |
| configClassName | Corresponding RecordReaderConfig class name; mandatory for the CSV and Thrift file formats. E.g. org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig, org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig. |
| configs | Used to initialize the RecordReaderConfig class; required for the CSV and Thrift data formats. |
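To show how these fields combine for a format that needs a config class, here is a sketch of a record reader spec for CSV input. The delimiter key under configs is an assumption about what CSVRecordReaderConfig accepts; consult the reader's documentation for the full list of options.

```yaml
# CSV requires both configClassName and configs.
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    # Assumed config key for the field separator.
    delimiter: ','
```

For formats such as Avro or Parquet, configClassName and configs can be omitted.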
Segment Name Generator Spec

| Property | Description |
| --- | --- |
| type | Type of name generator to use. Supported values: simple (the default); normalizedDate (use this type when the time column in your data is a String rather than epoch time); fixed (the segment name is set explicitly by the user); inputFile (the segment name is derived from the input file path). |
| configs | Configs used to initialize the SegmentNameGenerator. The supported keys are listed below. |

| Config | Description |
| --- | --- |
| segment.name | For the fixed SegmentNameGenerator. Explicitly sets the segment name. |
| segment.name.postfix | For the simple SegmentNameGenerator. The postfix is appended to all segment names. |
| segment.name.prefix | For the normalizedDate SegmentNameGenerator. The prefix is prepended to all segment names. |
| exclude.sequence.id | Whether to exclude sequence ids from the segment name. Sequence ids are needed when there are multiple segments for the same time range. |
| file.path.pattern | For the inputFile SegmentNameGenerator. A regex that extracts parts of the input file path used to generate segment names, e.g. .+/(.+)\.csv |
| segment.name.template | For the inputFile SegmentNameGenerator. A template that places the parts extracted from the input file path into the segment name, e.g. ${filePathPattern:\1} |
| use.global.directory.sequence.id | Assigns sequence ids to input files based on all input files under the directory. Set to false to use per-directory sequence ids instead. This is useful when generating multiple segments for multiple days, since each day then starts from sequence id 0. |
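The two sketches below illustrate how the generator type and its configs might be combined; they are alternatives, written as two separate YAML documents. The prefix myTable_batch and the myTable_ part of the template are placeholder values, and the template is written without a space after the colon, which is my understanding of the expected form.

```yaml
# Variant 1: normalizedDate generator with a prefix (placeholder value).
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'myTable_batch'
    # Keep sequence ids so segments in the same time range get distinct names.
    exclude.sequence.id: false
---
# Variant 2: derive the segment name from the input file path.
segmentNameGeneratorSpec:
  type: inputFile
  configs:
    # Capture the file name without its .csv extension.
    file.path.pattern: '.+/(.+)\.csv'
    # Insert the first captured group into the segment name.
    segment.name.template: 'myTable_${filePathPattern:\1}'
```

Single quotes keep the backslashes literal in YAML, which matters for both the regex and the template.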
Pinot Cluster Spec

| Property | Description |
| --- | --- |
| controllerURI | URI used to fetch table and schema information and to push data. |
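Since pinotClusterSpecs is plural, it is written as a list with one entry per cluster. A minimal sketch, assuming a controller reachable at localhost:9000 (a placeholder address):

```yaml
# One list entry per Pinot cluster to push to.
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```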
Push Job Spec

| Property | Description |
| --- | --- |
| pushAttempts | Number of attempts for the push job. Default is 1, which means no retry. |
| pushParallelism | Number of workers to use for the push job. Default is 1. |
| pushRetryIntervalMillis | Time in milliseconds to wait between retry attempts. Default is 1000 (1 second). |
| segmentUriPrefix | String prepended to the path of the push destination. Generally, it is the scheme of the filesystem, e.g. s3://, file://. |
| segmentUriSuffix | String appended to the path of the push destination. |
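As an illustration of these properties, the following sketch retries a failed push up to three times, one second apart, using two workers. All values, including the bucket in segmentUriPrefix, are placeholders rather than recommendations.

```yaml
# Placeholder values for illustration.
pushJobSpec:
  # Up to 3 attempts in total, i.e. 2 retries after the first failure.
  pushAttempts: 3
  pushParallelism: 2
  pushRetryIntervalMillis: 1000
  # Prepended to the segment path to form the URI sent to the controller.
  segmentUriPrefix: 's3://bucket'
```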