Input formats

This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.

Pinot supports several popular input formats for ingestion. Choosing a suitable input format can reduce the time spent on serialization and deserialization and speed up ingestion.

Supported input formats

Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.

CSV

The CSV record reader supports the following configs:

fileFormat: One of default, rfc4180, excel, tdf, or mysql.

Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config:

multiValueDelimiter: ''
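
As a minimal sketch, a recordReaderSpec for a CSV file with free-text fields could look like the following. The class names are the CSV reader and config classes from the org.apache.pinot.plugin.inputformat.csv package, and the fileFormat value is only an illustrative choice from the list above.

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    # One of: default, rfc4180, excel, tdf, mysql (illustrative choice)
    fileFormat: 'rfc4180'
    # Set to empty when raw text fields cannot be reliably delimited
    multiValueDelimiter: ''
```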

Avro

The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, the Avro record reader only supports primitive types. To enable support for the rest of the Avro data types, set enableLogicalTypes to true.
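
A hedged sketch of how enableLogicalTypes might be set in the job specification. The className value is an assumption that follows the org.apache.pinot.plugin.inputformat.* package pattern used by the other readers; verify it against your Pinot distribution.

```yaml
recordReaderSpec:
  dataFormat: 'avro'
  # Assumed class name, following the plugin package naming pattern
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
  configs:
    # Enables Avro data types beyond the primitive ones
    enableLogicalTypes: true
```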

We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in org.apache.avro.Conversions.

Avro Data Type | Pinot Data Type | Comment

JSON

Thrift

Thrift requires the class generated from the .thrift file to parse the data. The .class file should be available on Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.
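
The following is a sketch only. It assumes the reader class follows the org.apache.pinot.plugin.inputformat.* package pattern and that the generated class is passed through a thriftClass config key; com.example.thrift.MyRecord is a hypothetical class name standing in for your own generated class.

```yaml
recordReaderSpec:
  dataFormat: 'thrift'
  # Assumed class name, following the plugin package naming pattern
  className: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader'
  configs:
    # Hypothetical generated class; it must be on Pinot's classpath (e.g. in lib/)
    thriftClass: 'com.example.thrift.MyRecord'
```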

Parquet

Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the Parquet file footer and, if present, uses the Avro reader.

You can change the record reader manually in case of a misconfiguration.

To support DECIMAL and other Parquet native data types, always use ParquetNativeRecordReader.
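
One way to select the reader manually is to name it in className, sketched below. The fully qualified class name is an assumption based on the org.apache.pinot.plugin.inputformat.* package pattern; only the simple name ParquetNativeRecordReader is given in the text above.

```yaml
recordReaderSpec:
  dataFormat: 'parquet'
  # Assumed package path; pins the native reader instead of automatic selection
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
```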

For ParquetAvroRecordReader, you can refer to the Avro section above for the type conversions.

ORC

The ORC record reader supports the following data types:

ORC Data Type | Java Data Type

In LIST and MAP types, each contained object must belong to one of the data types supported by Pinot.
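
For completeness, a minimal sketch of an ORC recordReaderSpec. The class name is an assumption following the org.apache.pinot.plugin.inputformat.* package pattern, and no reader-specific configs are assumed here.

```yaml
recordReaderSpec:
  dataFormat: 'orc'
  # Assumed class name, following the plugin package naming pattern
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
```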

Protocol Buffers

The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using the command -
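
As a sketch under stated assumptions: the comment shows how a descriptor set is typically produced with the standard protoc compiler (the file names sample.proto and sample.desc are hypothetical), and the spec then points the reader at that file. The className and the descriptorFile key are assumptions based on the org.apache.pinot.plugin.inputformat.* package pattern; check them against your Pinot version.

```yaml
# Assumed prior step, using standard protoc flags (file names are hypothetical):
#   protoc --include_imports --descriptor_set_out=sample.desc sample.proto
recordReaderSpec:
  dataFormat: 'proto'
  # Assumed class name, following the plugin package naming pattern
  className: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReader'
  configs:
    # Assumed config key; path to the generated descriptor file
    descriptorFile: 'file:///path/to/sample.desc'
```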

Configuring input formats

To change the input format, adjust the recordReaderSpec config in the ingestion job specification.

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    key1: 'value1'
    key2: 'value2'

The configuration consists of the following keys:

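dataFormat: Name of the data format to consume.
className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.
configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values given in configs.
configs: Key-value pairs for format-specific configurations. This field is optional.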

