Supported Data Formats

This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.

Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent doing serialization-deserialization and speed up the ingestion.

Configuring input formats

To change the input format, adjust the recordReaderSpec config in the ingestion job specification.

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    key1: 'value1'
    key2: 'value2'

The configuration consists of the following keys:

  • dataFormat: Name of the data format to consume.

  • className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.

  • configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values specified in configs.

  • configs: Key-value pairs of format-specific configurations. This field is optional.

Supported input formats

Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.

CSV

CSV Record Reader supports the following configs:

  • fileFormat: The CSV file format. One of default, rfc4180, excel, tdf, mysql.

  • header: Header of the file. The column names should be separated by the delimiter specified in the configuration.

  • delimiter: The character separating the columns.

  • multiValueDelimiter: The character separating multiple values in a single column. This can be used to split a column into a list.

  • skipHeader: Skip header record in the file. Boolean.

  • ignoreEmptyLines: Ignore empty lines (instead of filling them with default values). Boolean.

  • ignoreSurroundingSpaces: Ignore spaces around column names and values. Boolean.

  • quoteCharacter: Single character used for quotes in CSV files.

  • recordSeparator: Character used to separate records in the input file. Default is \n or \r\n depending on the platform.

  • nullStringValue: String value that represents null in CSV files. Default is empty string.

  • stopOnError: Stop processing the file when Pinot encounters a malformed CSV record. Boolean. Default is false.

By default, Pinot attempts to recover from malformed data rows and continue reading the rest of the file. Set stopOnError: true if you want batch ingestion to stop at the first malformed record instead. Pinot still validates the CSV header during initialization, so an invalid header or first record fails fast.

Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to an empty string in the ingestion config: multiValueDelimiter: ''
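To see why free-text columns need an empty multiValueDelimiter, here is a minimal Python sketch of the splitting behavior described above. It illustrates the idea only and is not Pinot's CSVRecordReader. The function name parse_csv_rows, the sample data, and the ';' default are assumptions made for this example.

```python
import csv
import io


def parse_csv_rows(text, delimiter=",", multi_value_delimiter=";"):
    """Parse CSV text into dict rows, splitting multi-value cells into lists.

    Illustrative only: an empty multi_value_delimiter disables splitting,
    mirroring the documented effect of multiValueDelimiter: ''.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for record in reader:
        row = {}
        for column, value in record.items():
            if multi_value_delimiter and multi_value_delimiter in value:
                # The cell contains the delimiter, so treat it as a list.
                row[column] = value.split(multi_value_delimiter)
            else:
                row[column] = value
        rows.append(row)
    return rows


sample = "id,tags,comment\n1,a;b;c,first; second\n"

# With splitting enabled, the free-text comment is split by accident.
print(parse_csv_rows(sample))

# With an empty multi-value delimiter, every cell stays a single string.
print(parse_csv_rows(sample, multi_value_delimiter=""))
```

In the first call the comment column becomes a two-element list because it happens to contain a semicolon. The second call leaves it intact, at the cost of no longer splitting the tags column.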

Avro

Use extractRawTimeValues when you need raw Avro temporal logical-type values instead of Pinot's default converted values.

The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, extractRawTimeValues is false, so Pinot converts Avro temporal logical types during extraction. Set extractRawTimeValues to true to keep the raw Avro integer values for date, time-millis, time-micros, timestamp-millis, timestamp-micros, and timestamp-nanos. The decimal and uuid logical types are always converted.

Pinot uses the following conversion table to translate Avro data types to Pinot data types. The conversions are done using the official Avro methods in org.apache.avro.Conversions.

| Avro Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT | INT | |
| LONG | LONG | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| BOOLEAN | BOOLEAN | |
| STRING | STRING | |
| ENUM | STRING | |
| BYTES | BYTES | |
| FIXED | BYTES | |
| MAP | JSON | |
| ARRAY | JSON | |
| RECORD | JSON | |
| UNION | JSON | |
| DECIMAL | BYTES | |
| UUID | STRING | |
| DATE | STRING | yyyy-MM-dd format |
| TIME_MILLIS | STRING | HH:mm:ss.SSS format |
| TIME_MICROS | STRING | HH:mm:ss.SSSSSS format |
| TIMESTAMP_MILLIS | TIMESTAMP | |
| TIMESTAMP_MICROS | TIMESTAMP | |
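The difference between raw and converted temporal values can be sketched in Python. The sketch relies on the Avro specification's encodings (date is a count of days since 1970-01-01, time-millis is a count of milliseconds after midnight) and on the string formats listed in the table. The function names and the extract helper are hypothetical and do not reflect Pinot's implementation.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def convert_date(days_since_epoch):
    """Format an Avro `date` (days since epoch) as yyyy-MM-dd."""
    return (EPOCH + timedelta(days=days_since_epoch)).isoformat()


def convert_time_millis(millis_after_midnight):
    """Format an Avro `time-millis` value as HH:mm:ss.SSS."""
    seconds, millis = divmod(millis_after_midnight, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def extract(raw_value, logical_type, extract_raw_time_values=False):
    """Hypothetical helper: return the raw integer or the converted string."""
    if extract_raw_time_values:
        return raw_value
    if logical_type == "date":
        return convert_date(raw_value)
    if logical_type == "time-millis":
        return convert_time_millis(raw_value)
    raise ValueError(f"logical type not covered by this sketch: {logical_type}")


print(extract(19000, "date"))                                # 2022-01-08
print(extract(45296789, "time-millis"))                      # 12:34:56.789
print(extract(19000, "date", extract_raw_time_values=True))  # 19000
```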

JSON

Thrift

The Thrift reader requires the class generated from the .thrift file to parse the data. The .class file must be available on Pinot's classpath. You can place it in the lib/ folder of the Pinot distribution directory.

Parquet

Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the Parquet file footer and, if either is present, uses the Avro reader.
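The selection rule can be sketched in Python as follows. This is an illustration of the documented rule and not Pinot's code. The function name choose_parquet_reader is hypothetical, and the footer metadata is modeled as a plain dict of key-value pairs.

```python
# Footer keys that signal an Avro-written Parquet file, per the text above.
AVRO_SCHEMA_KEYS = ("parquet.avro.schema", "avro.schema")


def choose_parquet_reader(footer_metadata):
    """Return the reader name implied by the footer key-value metadata."""
    if any(key in footer_metadata for key in AVRO_SCHEMA_KEYS):
        return "ParquetAvroRecordReader"
    return "ParquetNativeRecordReader"


print(choose_parquet_reader({"parquet.avro.schema": "{...}"}))
print(choose_parquet_reader({"writer.model.name": "example"}))
```

The first call selects the Avro reader and the second falls back to the native reader.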

You can change the record reader manually in case of a misconfiguration.

To keep Parquet temporal values in their raw integer form instead of Pinot's default converted values, set extractRawTimeValues on ParquetRecordReader.

When extractRawTimeValues is false (the default), Pinot converts Parquet DATE, TIME_*, and TIMESTAMP_* values during extraction. Set it to true to keep their raw integer values instead. When you use ParquetRecordReader, Pinot forwards this config to whichever underlying reader it selects (ParquetAvroRecordReader or ParquetNativeRecordReader). DECIMAL and UUID always convert.

ParquetNativeRecordReader preserves primitive values in their native Pinot-compatible form during extraction. For example, a Parquet BOOLEAN stays a Pinot BOOLEAN instead of being stringified.

| Parquet Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| BOOLEAN | BOOLEAN | Preserved as a native boolean value. |
| INT96 | LONG | Parquet INT96 nanosecond timestamps are converted to a Pinot LONG (INT64) in milliseconds. |
| INT64 | LONG | |
| INT32 | INT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| BINARY | BYTES | |
| FIXED-LEN-BYTE-ARRAY | BYTES | |
| DECIMAL | DOUBLE | |
| ENUM | STRING | |
| UTF8 | STRING | |
| REPEATED | MULTIVALUE/MAP (represented as MV) | If the Parquet original type is LIST, the field is converted to a MULTIVALUE column; otherwise it is converted to a MAP column. |

For ParquetAvroRecordReader, refer to the Avro section above for the type conversions.
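The INT96 row in the table can be made concrete with a short Python sketch. It assumes the conventional INT96 timestamp layout used by writers such as Impala, Hive, and Spark: 12 little-endian bytes holding 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number. That layout is an assumption of this example, and the function name is hypothetical.

```python
import struct

JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number of 1970-01-01
MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


def int96_to_epoch_millis(raw):
    """Convert a 12-byte INT96 timestamp to milliseconds since the epoch.

    Assumed layout: little-endian int64 nanoseconds-of-day, then
    little-endian int32 Julian day.
    """
    if len(raw) != 12:
        raise ValueError("INT96 values must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH
    return days_since_epoch * MILLIS_PER_DAY + nanos_of_day // NANOS_PER_MILLI


# One day after the epoch plus 1.5 seconds, encoded as INT96.
encoded = struct.pack("<qi", 1_500_000_000, JULIAN_DAY_OF_EPOCH + 1)
print(int96_to_epoch_millis(encoded))  # 86401500
```

Sub-millisecond precision is dropped by the integer division, which matches the table's note that the result is in milliseconds.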

LIST and MAP wrapper extraction

Parquet LIST and MAP wrapper structs are now properly unwrapped when ingesting via ParquetNativeRecordReader and ParquetAvroRecordReader. Previously, schema-identified LIST and MAP columns had their wrapper elements exposed in the data:

  • An array<string> field previously came back as [{"element": "abc"}, {"element": "xyz"}] — now it correctly comes back as ["abc", "xyz"].

  • A map<string,string> field previously came back as {"key_value":[{"key":"k","value":"v"}]} — now it correctly comes back as {"k":"v"}.

Real struct fields named element are preserved; only schema-identified LIST and MAP wrappers are normalized. The readers support both the standard 3-level Parquet LIST encoding and legacy 2-level encodings (repeated primitive, repeated multi-field group, or repeated single-field group not named element).
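If you need to reconcile data that was ingested with the old wrapper shape, the normalization can be sketched in Python. This is a shape-based illustration with hypothetical function names. Unlike the Pinot readers, it has no access to the Parquet schema, so apply it only to columns you already know are LIST or MAP columns.

```python
def unwrap_list(value):
    """Turn [{"element": x}, ...] into [x, ...]; leave other items as they are."""
    return [
        item["element"]
        if isinstance(item, dict) and set(item) == {"element"}
        else item
        for item in value
    ]


def unwrap_map(value):
    """Turn {"key_value": [{"key": k, "value": v}, ...]} into {k: v, ...}."""
    if isinstance(value, dict) and set(value) == {"key_value"}:
        return {entry["key"]: entry["value"] for entry in value["key_value"]}
    return value


print(unwrap_list([{"element": "abc"}, {"element": "xyz"}]))
print(unwrap_map({"key_value": [{"key": "k", "value": "v"}]}))
```

The two calls print ['abc', 'xyz'] and {'k': 'v'}, matching the corrected shapes listed above.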

Backward incompatibility: If you have ingestion pipelines or transform expressions that worked around the previous broken shape (for example, selecting data.element instead of data for an array column), you will need to update those queries and transforms.

MAP ordering: Parquet itself does not preserve source MAP entry order. Pinot canonicalizes ingested MAP and JSON output by sorting map keys when it serializes the value, so query results are deterministic but do not preserve the original insertion order. If the original pair order matters, model the field as LIST<STRUCT<key, value>> instead.
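The ordering trade-off can be illustrated in Python with the standard json module. This only mimics sorted-key serialization and is not Pinot's serializer. The sample pairs and function names are made up for the example.

```python
import json

# Source pairs in their original insertion order.
source_pairs = [("zeta", 1), ("alpha", 2), ("mid", 3)]


def serialize_as_map(pairs):
    """Serialize as a map with sorted keys: deterministic, order is lost."""
    return json.dumps(dict(pairs), sort_keys=True)


def serialize_as_pair_list(pairs):
    """Serialize as a list of key/value structs: original order is kept."""
    return json.dumps([{"key": k, "value": v} for k, v in pairs])


print(serialize_as_map(source_pairs))
# {"alpha": 2, "mid": 3, "zeta": 1}
print(serialize_as_pair_list(source_pairs))
# [{"key": "zeta", "value": 1}, {"key": "alpha", "value": 2}, {"key": "mid", "value": 3}]
```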

ORC

The ORC record reader supports the following data types:

| ORC Data Type | Java Data Type |
| --- | --- |
| BOOLEAN | String |
| SHORT | Integer |
| INT | Integer |
| LONG | Integer |
| FLOAT | Float |
| DOUBLE | Double |
| STRING | String |
| VARCHAR | String |
| CHAR | String |
| LIST | Object[] |
| MAP | Map<Object, Object> |
| DATE | Long |
| TIMESTAMP | Long |
| BINARY | byte[] |
| BYTE | Integer |

In LIST and MAP types, each object must be of a data type that Pinot supports.

Protocol Buffers

The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using the command -

Apache Arrow

The Arrow input format plugin supports reading data in the Apache Arrow IPC format. This is useful for ingesting data from systems that produce Arrow-formatted output.

Batch ingestion

For batch ingestion from Arrow IPC files:

The ArrowRecordReader reads Arrow IPC files for batch ingestion. Note that Arrow IPC files require seekable channels, so gzip compression is not supported.

To preserve raw Arrow temporal values instead of Pinot's default converted values, set extractRawTimeValues on ArrowRecordReader:

When extractRawTimeValues is false (the default), Pinot converts Arrow Date, Time, and Timestamp values during extraction. Set it to true to keep raw integers instead: Date stays as days since epoch, while Time and Timestamp stay in the schema's declared Arrow unit.
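When you keep raw values, downstream logic has to account for the declared unit. The following Python sketch shows one way to normalize a raw Timestamp integer to epoch milliseconds. The unit names follow Arrow's TimeUnit values (SECOND, MILLISECOND, MICROSECOND, NANOSECOND). The function name is hypothetical and this is not part of Pinot.

```python
# Number of ticks per second for each Arrow time unit.
TICKS_PER_SECOND = {
    "SECOND": 1,
    "MILLISECOND": 1_000,
    "MICROSECOND": 1_000_000,
    "NANOSECOND": 1_000_000_000,
}


def raw_timestamp_to_epoch_millis(raw_value, unit):
    """Convert a raw Arrow Timestamp integer in `unit` to epoch milliseconds."""
    if unit not in TICKS_PER_SECOND:
        raise ValueError(f"unknown Arrow time unit: {unit}")
    # Multiply first so second-resolution values do not lose precision.
    return raw_value * 1_000 // TICKS_PER_SECOND[unit]


print(raw_timestamp_to_epoch_millis(1_700_000_000, "SECOND"))
# 1700000000000
print(raw_timestamp_to_epoch_millis(1_700_000_000_123_456, "MICROSECOND"))
# 1700000000123
```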

Stream ingestion

For stream ingestion, the Arrow decoder converts Arrow columnar batches to Pinot rows:

Configuration properties:

| Property | Default | Description |
| --- | --- | --- |
| arrow.allocator.limit | 268435456 (256 MB) | Memory limit for Arrow's off-heap allocator, in bytes |
| extractRawTimeValues | false | Keep Arrow Date, Time, and Timestamp values as raw integers instead of Pinot's default converted values |

Arrow type conversions are handled automatically: UTF-8 text becomes String, Date becomes LocalDate, Time becomes LocalTime, Timestamp becomes Timestamp, Arrow Maps become flattened Map<String, Object>, and Arrow Lists become Object[]. Dictionary-encoded columns are decoded against their logical type before extraction.

Each Arrow Kafka message should contain a complete IPC stream. Empty batches are skipped, single-row batches ingest as one Pinot row, and multi-row batches fan out into multiple Pinot rows.
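The fan-out from a columnar batch to rows can be sketched in plain Python. The batch is modeled as a dict of equal-length column lists rather than a real Arrow record batch. The function name and sample columns are hypothetical.

```python
def batch_to_rows(batch):
    """Convert a columnar batch (column name -> list of values) to row dicts.

    An empty batch yields no rows, a single-row batch yields one row, and a
    multi-row batch yields one row per index.
    """
    if not batch:
        return []
    lengths = {len(column) for column in batch.values()}
    if len(lengths) != 1:
        raise ValueError("all columns in a batch must have the same length")
    (num_rows,) = lengths
    return [
        {name: column[index] for name, column in batch.items()}
        for index in range(num_rows)
    ]


print(batch_to_rows({"id": [], "city": []}))             # []
print(batch_to_rows({"id": [1, 2], "city": ["a", "b"]}))
# [{'id': 1, 'city': 'a'}, {'id': 2, 'city': 'b'}]
```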
