Batch Ingestion

Batch ingestion allows users to create a table from data already present in a file system such as S3. This is particularly useful when you want to use Pinot's ability to query large data sets with minimal latency, or to test new features using a simple data file.

Ingesting data from a filesystem involves the following steps -

Define Schema
Define Table Config
Upload Schema and Table configs
Upload data

Batch Ingestion currently supports the following mechanisms to upload the data -

Standalone

Here we'll take a look at the standalone local processing to get you started.

Let's create a table for the following CSV data source.

Create Schema Configuration

In our data, the only column on which aggregations can be performed is score, and timestampInEpoch is the only timestamp column. So, in our schema, we keep score as a metric and timestampInEpoch as the timestamp column.

Here, we have also defined two extra fields - format and granularity. The format specifies the formatting of our timestamp column in the data source. Currently, it is in milliseconds, hence we have specified 1:MILLISECONDS:EPOCH.
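As a minimal sketch of the schema described above, the Python snippet below builds it as a dictionary and serializes it to JSON. Only score, timestampInEpoch, and the 1:MILLISECONDS:EPOCH format come from the text. The schema name, the dimension columns, the data types, and the granularity value are assumptions made for illustration.

```python
import json

# Hypothetical schema for the transcript example. Only `score`,
# `timestampInEpoch`, and the format string are taken from the text;
# every other name and value is an assumed placeholder.
schema = {
    "schemaName": "transcript",
    "dimensionFieldSpecs": [
        {"name": "studentID", "dataType": "INT"},    # assumed column
        {"name": "subject", "dataType": "STRING"},   # assumed column
    ],
    "metricFieldSpecs": [
        # The only column on which aggregations are performed.
        {"name": "score", "dataType": "FLOAT"},
    ],
    "dateTimeFieldSpecs": [
        {
            "name": "timestampInEpoch",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",  # assumed value
        }
    ],
}

# Serialize so the result can be saved to a file and uploaded later.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The printed JSON can be saved to a file and used as the schema config in the upload step.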

Create Table Configuration

We define a table transcript and map the schema created in the previous step to the table. For batch data, we keep the tableType as OFFLINE.
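A table config along these lines could be sketched as follows. The table name transcript and the OFFLINE table type come from the text. The tenant names, replication, load mode, and time column setting are assumed defaults for a small test cluster, not values taken from this page.

```python
import json

# Hypothetical offline table config for the `transcript` table.
# `tableName` and `tableType` come from the text; the remaining values
# are assumptions for a single-node test setup.
table_config = {
    "tableName": "transcript",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "schemaName": "transcript",          # maps the table to the schema
        "timeColumnName": "timestampInEpoch",
        "replication": "1",                  # assumed
    },
    "tenants": {"broker": "DefaultTenant", "server": "DefaultTenant"},  # assumed
    "tableIndexConfig": {"loadMode": "MMAP"},                            # assumed
    "metadata": {},
}

table_config_json = json.dumps(table_config, indent=2)
print(table_config_json)
```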

Upload Schema and Table

Now that we have both the configs, we can simply upload them and create a table. To achieve that, just run the command -

Check out the table config and schema in the [Rest API] to make sure it was successfully uploaded.

Upload data

We now have an empty table in Pinot. As the next step, we will upload our CSV file to this table.

A table is composed of multiple segments. Segments can be created in three ways:

Minion based ingestion
Upload API
Ingestion jobs

Minion Based Ingestion

Refer to

Upload API

There are 2 Controller APIs that can be used for a quick ingestion test using a small file.

When these APIs are invoked, the controller has to download the file and build the segment locally.

Hence, these APIs are NOT meant for production environments or for large input files.

/ingestFromFile

This API creates a segment using the given file and pushes it to Pinot. All steps happen on the controller. Example usage:

To upload a JSON file data.json to a table called foo_OFFLINE, use the command below.

Note that query params need to be URLEncoded. For example, {"inputFormat":"json"} in the command below needs to be converted to %7B%22inputFormat%22%3A%22json%22%7D.

The batchConfigMapStr can be used to pass in additional properties needed for decoding the file. For example, in the case of CSV, you may need to provide the delimiter.
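The URL encoding described above can be sketched with the Python standard library. The controller address localhost:9000, the tableNameWithType parameter name, and the recordReader.prop.delimiter property key are assumptions for illustration; check them against your Pinot version before relying on them.

```python
import json
from urllib.parse import quote


def encode_batch_config(batch_config: dict) -> str:
    """Serialize a batch config map to compact JSON and percent-encode it."""
    compact = json.dumps(batch_config, separators=(",", ":"))
    # safe="" makes quote() encode every reserved character, including "/".
    return quote(compact, safe="")


# The JSON example from the text.
json_param = encode_batch_config({"inputFormat": "json"})
print(json_param)  # %7B%22inputFormat%22%3A%22json%22%7D

# A pipe-delimited CSV file; the property key is an assumption.
csv_param = encode_batch_config(
    {"inputFormat": "csv", "recordReader.prop.delimiter": "|"}
)

# Hypothetical controller address and query parameter name.
url = (
    "http://localhost:9000/ingestFromFile"
    "?tableNameWithType=foo_OFFLINE"
    f"&batchConfigMapStr={json_param}"
)
print(url)
```

Encoding the whole JSON value in one step avoids hand-escaping braces, quotes, and delimiters one by one.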

/ingestFromURI

This API creates a segment using the file at the given URI and pushes it to Pinot. Properties to access the FS need to be provided in the batchConfigMap. All steps happen on the controller. Example usage:

Ingestion Jobs

Segments can be created and uploaded using tasks known as DataIngestionJobs. A job also needs a config of its own. We call this config the JobSpec.

For our CSV file and table, the job spec should look like below.
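As a rough sketch only, the snippet below renders a standalone job spec for the transcript table from a template. The directory paths, the controller URI, the jobType value, and the standalone runner and record reader class names are assumptions modeled on the Spark spec shown later on this page; verify them against the job spec reference for your Pinot version.

```python
from string import Template

# Assumed skeleton of a standalone job spec. Paths, class names, and the
# controller URI are illustrative placeholders, not values from this page.
JOB_SPEC_TEMPLATE = Template("""\
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '$input_dir'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '$output_dir'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: '$table'
pinotClusterSpecs:
  - controllerURI: '$controller'
""")

job_spec = JOB_SPEC_TEMPLATE.substitute(
    input_dir="/tmp/pinot-quick-start/rawdata/",     # hypothetical path
    output_dir="/tmp/pinot-quick-start/segments/",   # hypothetical path
    table="transcript",
    controller="http://localhost:9000",              # hypothetical address
)
print(job_spec)
```

Rendering the spec from a template keeps the table name and paths in one place when the same job is reused for several tables.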

You can refer to for more details.

Now that we have the job spec for our table transcript, we can trigger the job using the following command

Once the job has successfully finished, you can head over to the [query console] and start playing with the data.

Segment Push Job Type

There are 3 ways to upload a Pinot segment:

1. Segment Tar Push

This is the original and default push mechanism.

Tar push requires the segment to be stored locally or to be readable as an InputStream on PinotFS, so that the entire segment tar file can be streamed to the controller.

The push job will:

Upload the entire segment tar file to the Pinot controller.

Pinot controller will:

Save the segment into the controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.

2. Segment URI Push

This push mechanism requires the segment tar file to be stored on a deep store with a globally accessible segment tar URI.

URI push is lightweight on the client side, while the controller does the same amount of work as for Tar push.

The push job will:

POST this segment Tar URI to the Pinot controller.

Pinot controller will:

Download the segment from the URI and save it to the controller segment directory (local or any PinotFS).
Extract segment metadata.
Add the segment to the table.

3. Segment Metadata Push

This push mechanism also requires the segment tar file to be stored on a deep store with a globally accessible segment tar URI.

Metadata push is lightweight on the controller side: the controller does not download anything from the deep store.

The push job will:

Download the segment based on URI.
Extract metadata.
Upload metadata to the Pinot Controller.

Pinot Controller will:

Add the segment to the table based on the metadata.

Segment Fetchers

When Pinot segment files are created in external systems (Hadoop, Spark, etc.), there are several ways to push that data to the Pinot Controller and Server:

Push segments to a shared NFS and let Pinot pull segment files from that NFS location.
Push segments to a web server and let Pinot pull segment files from the web server via an HTTP/HTTPS link.
Push segments to PinotFS (HDFS/S3/GCS/ADLS) and let Pinot pull segment files from the PinotFS URI.

These three options are supported out of the box within the Pinot package. As long as your remote jobs send the Pinot controller the corresponding URI of the files, it will pick up the files and assign them to the proper Pinot servers and brokers. To enable Pinot support for PinotFS, you will need to provide the PinotFS configuration and the proper Hadoop dependencies.

Persistence

By default, Pinot does not come with a storage layer, so the data you send will not survive a system crash. To store the generated segments persistently, you will need to change the controller and server configs to add deep storage. Check out the deep storage documentation for all the info and related configs.

Tuning

Standalone

Since Pinot is written in Java, you can set the following basic Java configurations to tune the segment runner job -

Log4j2 file location with -Dlog4j2.configurationFile
Plugin directory location with -Dplugins.dir=/opt/pinot/plugins
JVM props, like -Xmx8g -Xms4G

If you are using Docker, you can set these options in the JAVA_OPTS variable.
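The options above can be combined into a single JAVA_OPTS value. The sketch below assembles that string in Python; the helper name build_java_opts and the log4j2 config path are hypothetical, while the heap sizes and plugin directory are the ones listed above.

```python
import os


def build_java_opts(heap_max="8g", heap_min="4G",
                    plugins_dir="/opt/pinot/plugins",
                    log4j_config=None):
    """Join the JVM tuning flags listed above into one JAVA_OPTS string."""
    opts = [
        f"-Xmx{heap_max}",
        f"-Xms{heap_min}",
        f"-Dplugins.dir={plugins_dir}",
    ]
    if log4j_config:
        opts.append(f"-Dlog4j2.configurationFile={log4j_config}")
    return " ".join(opts)


# The log4j2 path is a hypothetical placeholder.
java_opts = build_java_opts(log4j_config="conf/log4j2.xml")

# Environment to hand to a child process (or to `docker run -e JAVA_OPTS=...`).
env = dict(os.environ, JAVA_OPTS=java_opts)
print(java_opts)
```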

Hadoop

You can set -D mapreduce.map.memory.mb=8192 to set the mapper memory size when submitting the Hadoop job.

Spark

You can add config spark.executor.memory to tune the memory usage for segment creation when submitting the Spark job.

Spark

Pinot supports Apache Spark as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Spark code to process your files, convert them to segments, and upload them to Pinot.

You can follow the wiki to build the Pinot distribution from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar

Next, you need to change the execution config in the job spec to the following -

# executionFrameworkSpec: defines how ingestion jobs are run.
executionFrameworkSpec:

  # name: execution framework name
  name: 'spark'

  # segmentGenerationJobRunnerClassName: name of a class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'

  # segmentTarPushJobRunnerClassName: name of a class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'

  # segmentUriPushJobRunnerClassName: name of a class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'

  # segmentMetadataPushJobRunnerClassName: name of a class that implements the org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'

  # extraConfigs: extra configs for the execution framework.
  extraConfigs:

    # stagingDir: directory on the distributed filesystem that hosts all generated segments before they are moved, as a whole, to the output directory.
    stagingDir: your/local/dir/staging

You can check out the sample job spec here.

Now, add the Pinot JAR to Spark's classpath using the following options -

Please ensure environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.

Finally, execute the Spark job using the command -

Note: You should change the master to yarn and deploy-mode to cluster for production.

Hadoop

Segment Creation and Push

Pinot supports Apache Hadoop as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Hadoop code to process your files, convert them to segments, and upload them to Pinot.

You can follow the [wiki] to build the Pinot distribution from source. The resulting JAR file can be found in pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar

Backfill Data

Introduction

Pinot batch ingestion involves two parts: routine ingestion jobs (hourly/daily) and backfill. Here are some tutorials on how routine batch ingestion works for Pinot offline tables:

Batch Ingestion Overview

High Level Idea

Organize raw data into buckets (e.g. /var/pinot/airlineStats/rawdata/2014/01/01). Each bucket typically contains several files (e.g. /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01_0.avro).
Run a Pinot batch ingestion job that points to a specific date folder, such as ‘/var/pinot/airlineStats/rawdata/2014/01/01’. The segment generation job converts each Avro file into a Pinot segment for that day and gives it a unique name.
Run a Pinot segment push job to upload those segments, with their unique names, via a Controller API.

IMPORTANT: The segment name is the unique identifier of a segment in Pinot. If the controller gets an upload request for a segment whose name already exists, it will attempt to replace the existing segment with the new one.

This newly uploaded data can now be queried in Pinot. However, sometimes users will make changes to the raw data which need to be reflected in Pinot. This process is known as 'Backfill'.

How to Backfill data in Pinot

Pinot supports data modification only at the segment level, which means we should update entire segments for doing backfills. The high level idea is to repeat steps 2 (segment generation) and 3 (segment upload) mentioned above:

Backfill jobs must run at the same granularity as the daily job. E.g., if you need to backfill data for 2014/01/01, specify that input folder for your backfill job (e.g.: ‘/var/pinot/airlineStats/rawdata/2014/01/01’)
The backfill job will then generate segments with the same name as the original job (with the new data).
When uploading those segments to Pinot, the controller will replace the old segments with the new ones (segment names act like primary keys within Pinot) one by one.

Edge case

Backfill jobs expect the same number of (or more) data files on the backfill date, so that the segment generation job creates the same number of (or more) segments as the original run.

E.g., assume table airlineStats has 2 segments (airlineStats_2014-01-01_2014-01-01_0, airlineStats_2014-01-01_2014-01-01_1) for date 2014/01/01 and the backfill input directory contains only 1 input file. The segment generation job will then create just one segment: airlineStats_2014-01-01_2014-01-01_0. After the segment push job, only segment airlineStats_2014-01-01_2014-01-01_0 is replaced, and the stale data in segment airlineStats_2014-01-01_2014-01-01_1 is still there.

If the raw data is modified in such a way that the original time bucket has fewer input files than the first ingestion run, the backfill will fail to replace all of the old data.
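The edge case above can be illustrated with a small simulation. It is a toy model, not Pinot code: the table is a plain dictionary keyed by segment name, and the push_segments and segment_names helpers are hypothetical names that mimic replace-by-name behaviour and the naming pattern used in the example.

```python
def push_segments(table: dict, new_segments: dict) -> None:
    """Mimic the controller: a pushed segment replaces any segment with the same name."""
    for name, data in new_segments.items():
        table[name] = data


def segment_names(table_name: str, day: str, num_files: int) -> list:
    """One segment per input file, named <table>_<start>_<end>_<sequence>."""
    return [f"{table_name}_{day}_{day}_{i}" for i in range(num_files)]


table = {}

# Original daily run: two input files produce two segments.
original = {n: "original" for n in segment_names("airlineStats", "2014-01-01", 2)}
push_segments(table, original)

# Backfill run with only one input file produces one segment.
backfill = {n: "backfill" for n in segment_names("airlineStats", "2014-01-01", 1)}
push_segments(table, backfill)

# Segments that still hold data from the original run are stale.
stale = sorted(name for name, data in table.items() if data == "original")
print(stale)  # ['airlineStats_2014-01-01_2014-01-01_1']
```

Because replacement is keyed purely on the segment name, nothing in the push removes a segment that has no counterpart in the backfill run.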

Dimension Table

Dimension tables in Apache Pinot.

Dimension tables are a special kind of offline table from which data can be looked up via the lookup UDF, providing join-like functionality. These dimension tables are replicated on all the hosts for a given tenant to allow faster lookups.

To mark an offline table as a dimension table, set the configuration isDimTable to true in the table config.

As dimension tables are used to perform lookups of dimension values, they are required to have a primary key (which can be a composite key).

As mentioned above, a table marked as a dimension table is replicated on all the hosts, so the size of the dimension table has to be small. The maximum size quota for a dimension table in a cluster is controlled by the controller.dimTable.maxSize controller property. Table creation will fail if the storage quota exceeds this maximum size.
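A dimension table setup along the lines described in this section might look like the sketch below. The isDimTable flag and the primary key requirement come from the text. The table name, column names, quota value, and the primaryKeyColumns and quota.storage field names are assumptions for illustration.

```python
import json

# Hypothetical dimension table config. Only `isDimTable` is taken from
# the text; names and values are assumed placeholders.
dim_table_config = {
    "tableName": "dimBaseballTeams",
    "tableType": "OFFLINE",
    "isDimTable": True,
    "segmentsConfig": {"schemaName": "dimBaseballTeams", "replication": "1"},
    "tenants": {"broker": "DefaultTenant", "server": "DefaultTenant"},
    "tableIndexConfig": {"loadMode": "MMAP"},
    # Assumed storage quota; it must stay below controller.dimTable.maxSize.
    "quota": {"storage": "200M"},
    "metadata": {},
}

# Hypothetical schema with the primary key required for lookups.
dim_schema = {
    "schemaName": "dimBaseballTeams",
    "primaryKeyColumns": ["teamID"],
    "dimensionFieldSpecs": [
        {"name": "teamID", "dataType": "STRING"},
        {"name": "teamName", "dataType": "STRING"},
    ],
}

dim_table_json = json.dumps(dim_table_config, indent=2)
print(dim_table_json)
print(json.dumps(dim_schema, indent=2))
```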

