LogoLogo
release-0.10.0
release-0.10.0
  • Introduction
  • Basics
    • Concepts
    • Architecture
    • Components
      • Cluster
      • Controller
      • Broker
      • Server
      • Minion
      • Tenant
      • Schema
      • Table
      • Segment
      • Deep Store
      • Pinot Data Explorer
    • Getting Started
      • Running Pinot locally
      • Running Pinot in Docker
      • Quick Start Examples
      • Running in Kubernetes
      • Running on public clouds
        • Running on Azure
        • Running on GCP
        • Running on AWS
      • Batch import example
      • Stream ingestion example
      • HDFS as Deep Storage
      • Troubleshooting Pinot
      • Frequently Asked Questions (FAQs)
        • General
        • Pinot On Kubernetes FAQ
        • Ingestion FAQ
        • Query FAQ
        • Operations FAQ
    • Import Data
      • Batch Ingestion
        • Spark
        • Hadoop
        • Backfill Data
        • Dimension Table
      • Stream ingestion
        • Apache Kafka
        • Amazon Kinesis
        • Apache Pulsar
      • Stream Ingestion with Upsert
      • File Systems
        • Amazon S3
        • Azure Data Lake Storage
        • HDFS
        • Google Cloud Storage
      • Input formats
      • Complex Type (Array, Map) Handling
    • Indexing
      • Forward Index
      • Inverted Index
      • Star-Tree Index
      • Bloom Filter
      • Range Index
      • Text search support
      • JSON Index
      • Geospatial
      • Timestamp Index
    • Releases
      • 0.10.0
      • 0.9.3
      • 0.9.2
      • 0.9.1
      • 0.9.0
      • 0.8.0
      • 0.7.1
      • 0.6.0
      • 0.5.0
      • 0.4.0
      • 0.3.0
      • 0.2.0
      • 0.1.0
    • Recipes
      • GitHub Events Stream
  • For Users
    • Query
      • Querying Pinot
      • Filtering with IdSet
      • Transformation Functions
      • Aggregation Functions
      • User-Defined Functions (UDFs)
      • Cardinality Estimation
      • Lookup UDF Join
      • Querying JSON data
      • Explain Plan
      • Grouping Algorithm
    • APIs
      • Broker Query API
        • Query Response Format
      • Controller Admin API
    • External Clients
      • JDBC
      • Java
      • Python
      • Golang
    • Tutorials
      • Use OSS as Deep Storage for Pinot
      • Ingest Parquet Files from S3 Using Spark
      • Creating Pinot Segments
      • Use S3 as Deep Storage for Pinot
      • Use S3 and Pinot in Docker
      • Batch Data Ingestion In Practice
      • Schema Evolution
  • For Developers
    • Basics
      • Extending Pinot
        • Writing Custom Aggregation Function
        • Segment Fetchers
      • Contribution Guidelines
      • Code Setup
      • Code Modules and Organization
      • Update Documentation
    • Advanced
      • Data Ingestion Overview
      • Ingestion Transformations
      • Null Value Support
      • Advanced Pinot Setup
    • Plugins
      • Write Custom Plugins
        • Input Format Plugin
        • Filesystem Plugin
        • Batch Segment Fetcher Plugin
        • Stream Ingestion Plugin
    • Design Documents
      • Segment Writer API
  • For Operators
    • Deployment and Monitoring
      • Setup cluster
      • Setup table
      • Setup ingestion
      • Decoupling Controller from the Data Path
      • Segment Assignment
      • Instance Assignment
      • Rebalance
        • Rebalance Servers
        • Rebalance Brokers
      • Tiered Storage
      • Pinot managed Offline flows
      • Minion merge rollup task
      • Access Control
      • Monitoring
      • Tuning
        • Realtime
        • Routing
      • Upgrading Pinot with confidence
    • Command-Line Interface (CLI)
    • Configuration Recommendation Engine
    • Tutorials
      • Authentication, Authorization, and ACLs
      • Configuring TLS/SSL
      • Build Docker Images
      • Running Pinot in Production
      • Kubernetes Deployment
      • Amazon EKS (Kafka)
      • Amazon MSK (Kafka)
      • Monitor Pinot using Prometheus and Grafana
  • Configuration Reference
    • Cluster
    • Controller
    • Broker
    • Server
    • Table
    • Schema
    • Ingestion Job Spec
    • Functions
      • ABS
      • ADD
      • arrayConcatInt
      • arrayConcatString
      • arrayContainsInt
      • arrayContainsString
      • arrayDistinctString
      • arrayDistinctInt
      • arrayIndexOfInt
      • arrayIndexOfString
      • ARRAYLENGTH
      • arrayRemoveInt
      • arrayRemoveString
      • arrayReverseInt
      • arrayReverseString
      • arraySliceInt
      • arraySliceString
      • arraySortInt
      • arraySortString
      • arrayUnionInt
      • arrayUnionString
      • AVGMV
      • ceil
      • CHR
      • codepoint
      • concat
      • count
      • COUNTMV
      • day
      • dayOfWeek
      • dayOfYear
      • DISTINCT
      • DISTINCTCOUNT
      • DISTINCTCOUNTBITMAP
      • DISTINCTCOUNTBITMAPMV
      • DISTINCTCOUNTHLL
      • DISTINCTCOUNTHLLMV
      • DISTINCTCOUNTMV
      • DISTINCTCOUNTRAWHLL
      • DISTINCTCOUNTRAWHLLMV
      • DISTINCTCOUNTRAWTHETASKETCH
      • DISTINCTCOUNTTHETASKETCH
      • DIV
      • DATETIMECONVERT
      • DATETRUNC
      • exp
      • FLOOR
      • FromDateTime
      • FromEpoch
      • FromEpochBucket
      • hour
      • JSONFORMAT
      • JSONPATH
      • JSONPATHARRAY
      • JSONPATHARRAYDEFAULTEMPTY
      • JSONPATHDOUBLE
      • JSONPATHLONG
      • JSONPATHSTRING
      • jsonextractkey
      • jsonextractscalar
      • length
      • ln
      • lower
      • lpad
      • ltrim
      • max
      • MAXMV
      • MD5
      • millisecond
      • min
      • minmaxrange
      • MINMAXRANGEMV
      • MINMV
      • minute
      • MOD
      • mode
      • month
      • mult
      • now
      • percentile
      • percentileest
      • percentileestmv
      • percentilemv
      • percentiletdigest
      • percentiletdigestmv
      • quarter
      • regexpExtract
      • remove
      • replace
      • reverse
      • round
      • rpad
      • rtrim
      • second
      • SEGMENTPARTITIONEDDISTINCTCOUNT
      • sha
      • sha256
      • sha512
      • sqrt
      • startswith
      • ST_AsBinary
      • ST_AsText
      • ST_Contains
      • ST_Distance
      • ST_GeogFromText
      • ST_GeogFromWKB
      • ST_GeometryType
      • ST_GeomFromText
      • ST_GeomFromWKB
      • STPOINT
      • ST_Polygon
      • strpos
      • ST_Union
      • SUB
      • substr
      • sum
      • summv
      • TIMECONVERT
      • timezoneHour
      • timezoneMinute
      • ToDateTime
      • ToEpoch
      • ToEpochBucket
      • ToEpochRounded
      • TOJSONMAPSTR
      • toGeometry
      • toSphericalGeography
      • trim
      • upper
      • VALUEIN
      • week
      • year
      • yearOfWeek
  • RESOURCES
    • Community
    • Team
    • Blogs
    • Presentations
    • Videos
  • Integrations
    • Tableau
    • Trino
    • ThirdEye
    • Superset
    • Presto
Powered by GitBook
On this page
  • Top Level Spec
  • Execution Framework Spec
  • Pinot FS Spec
  • Table Spec
  • Record Reader Spec
  • Segment Name Generator Spec
  • Pinot Cluster Spec
  • Push Job Spec

Was this helpful?

Export as PDF
  1. Configuration Reference

Ingestion Job Spec

The ingestion job spec is used while generating, running, and pushing segments from the input files.

The Job spec can be in either YAML or JSON format (0.5.0 onwards). Property names remain the same in both formats.

To use the JSON format, add the propertyjob-spec-format=jsonin the properties file while launching the ingestion job. The properties file can be passed as follows

pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job_spec.json -propertyFile /path/to/job.properties

The following configurations are supported by Pinot

Top Level Spec

Property

Description

executionFrameworkSpec

jobType

Type of job to execute. The following types are supported

  • SegmentCreation -

  • SegmentTarPush

  • SegmentUriPush

  • SegmentMetadataPush

  • SegmentCreationAndTarPush

  • SegmentCreationAndUriPush

  • SegmentCreationAndMetadataPush

Note: For production environments where Pinot Deep Store is configured, it's recommended to use SegmentCreationAndMetadataPush

inputDirURI

Absolute Path along with scheme of the directory containing all the files to be ingested, e.g. s3://bucket/path/to/input, /path/to/local/input

includeFileNamePattern

Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported. E.g. use

'glob:*.avro'or 'regex:^.*\.(avro)$' to include all avro files one level deep in the inputDirURI

excludeFileNamePattern

Exclude file name pattern, supported glob pattern. Similar usage as includeFilePatternName

outputDirURI

Absolute Path along with scheme of the directory where to output all the segments.

overwriteOutput

Set to true to overwrite segments if already present in the output directory. Or set tofalseto raise exceptions.

pinotFSSpecs

tableSpec

recordReaderSpec

segmentNameGeneratorSpec

pinotClusterSpecs

pushJobSpec

Example

executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: hdfs://examples/batch/airlineStats/staging

# Recommended to set jobType to SegmentCreationAndMetadataPush for production environments where Pinot Deep Store is configured
jobType: SegmentCreationAndTarPush

inputDirURI: 'examples/batch/airlineStats/rawdata'
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 'hdfs:///examples/batch/airlineStats/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS 
recordReaderSpec:
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000

Execution Framework Spec

Property

Description

name

name of the execution framework. can be one of spark,hadoop or standalone

segmentGenerationJobRunnerClassName

The class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to run the segment generation job

segmentTarPushJobRunnerClassName

The class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to push the segment TAR file

segmentUriPushJobRunnerClassName

The class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to send segment URI

segmentMetadataPushJobRunnerClassName

The class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface to send segment Metadata

extraConfigs

Key-value pairs of configs related to the framework of the executions

Example

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'

Pinot FS Spec

field

description

schema

used to identify a PinotFS. E.g. local, hdfs, dbfs, etc

className

Class name used to create the PinotFS instance. E.g.

org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem

org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS

configs

configs used to init PinotFS instance

Table Spec

Table spec is used to specify the table in which data should be populated along with schema.

Property

Description

tableName

name of the table in which to populate the data

schemaURI

tableConfigURI

Example

tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://localhost:9000/tables/airlineStats'
​

Record Reader Spec

field

description

dataFormat

Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.

className

Corresponding RecordReader class name. E.g.

org.apache.pinot.plugin.inputformat.avro.AvroRecordReader

org.apache.pinot.plugin.inputformat.csv.CSVRecordReader

org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader

org.apache.pinot.plugin.inputformat.json.JSONRecordReader

org.apache.pinot.plugin.inputformat.orc.ORCRecordReader

org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader

configClassName

Corresponding RecordReaderConfig class name, it's mandatory for CSV and Thrift file format. E.g.

org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig

org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig

configs

Used to init RecordReaderConfig class name, this config is required for CSV and Thrift data format.

Segment Name Generator Spec

Property

Description

type

the type of name generator to use. Following values are supported -

  • simple - this is the default spec.

  • normalizedDate - use this type when the time column in your data is in the String format instead of epoch time.

  • fixed - configure the segment name by the user.

configs

Configs to init SegmentNameGenerator

segment.name

For fixed SegmentNameGenerator. Explicitly set the segment name.

segment.name.postfix

For simple SegmentNameGenerator. Postfix will be appended to all the segment names.

segment.name.prefix

For normalizedDate SegmentNameGenerator. The Prefix will be prepended to all the segment names.

exclude.sequence.id

Whether to include sequence ids in segment name.

Needed when there are multiple segments for the same time range.

use.global.directory.sequence.id

Assign sequence ids to input files based on all input files under the directory.

Set to false to use local directory sequence id. This is useful when generating multiple segments for multiple days. In that case, each of the days will start from sequence id 0.

Example

segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true

Pinot Cluster Spec

Property

Description

controllerURI

URI to use to fetch table/schema information and push data

Example

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Push Job Spec

Property

Description

pushAttempts

Number of attempts for push job. Default is 1, which means no retry

pushParallelism

Workers to use for push job. Default is 1

pushRetryIntervalMillis

Time in milliseconds to wait for between retry attempts Default is 1 second.

segmentUriPrefix

append this string before the path of the push destination. Generally, it is the scheme of the filesystem e.g. s3:// , file:// etc.

segmentUriSuffix

append this string after the path of the push destination.

Example

pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
  segmentUriPrefix : 'file://'
  segmentUriSuffix : my-dir/
PreviousSchemaNextFunctions

Last updated 3 years ago

Was this helpful?

Contains config related to the executor to use to ingest data. See

'glob:**/*.avro' to include all the avro files in inputDirURI as well as its subdirectories. You can use or to test out your patterns.

List of all the filesystems to be used for ingestions. You can mention multiple values in case input and output directories are present in different filesystems. For more details, scroll down to .

Defines table name and where to fetch corresponding table config and table schema. For more details, scroll down to .

Parser to use to read and decode input data. For more details, scroll down to .

Defines how the names of the segments will be. For more details, scroll down to .

Defines the Pinot Cluster Access Point. For more details, scroll down to .

Defines segment push job-related configuration. For more details, scroll down to .

The configs specify the execution framework to use to ingest data. Check out for configs related to all the supported frameworks

location from which to read the schema for the table. Supports both as well as HTTP URI

location from which to read the config for the table. Supports both as well as HTTP URI

Batch Ingestion
Glob tool
Regex Tool
File systems
File systems
Execution Framework Spec
Pinot FS Spec
Table Spec
Record Reader Spec
Segment Name Generator Spec
Pinot Cluster Spec
Push Job Spec