Supported Data Formats
This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.
Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent doing serialization-deserialization and speed up the ingestion.
Configuring input formats
To change the input format, adjust the recordReaderSpec config in the ingestion job specification.
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    key1 : 'value1'
    key2 : 'value2'

The configuration consists of the following keys:
dataFormat: Name of the data format to consume.
className: Name of the class that implements the RecordReader interface. This class is used for parsing the data.
configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configs.
configs: Key-value pairs for format-specific configurations. This field is optional.
Supported input formats
Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.
CSV
CSV Record Reader supports the following configs:
fileFormat: default, rfc4180, excel, tdf, mysql
header: Header of the file. The columnNames should be separated by the delimiter mentioned in the configuration.
delimiter: The character separating the columns.
multiValueDelimiter: The character separating multiple values in a single column. This can be used to split a column into a list.
skipHeader: Skip the header record in the file. Boolean.
ignoreEmptyLines: Ignore empty lines (instead of filling them with default values). Boolean.
ignoreSurroundingSpaces: Ignore spaces around column names and values. Boolean.
quoteCharacter: Single character used for quotes in CSV files.
recordSeparator: Character used to separate records in the input file. Default is \n or \r\n, depending on the platform.
nullStringValue: String value that represents null in CSV files. Default is the empty string.
stopOnError: Stop processing the file when Pinot encounters a malformed CSV record. Boolean. Default is false.
By default, Pinot attempts to recover from malformed data rows and continue reading the rest of the file. Set stopOnError: true if you want batch ingestion to stop at the first malformed record instead. Pinot still validates the CSV header during initialization, so an invalid header or first record fails fast.
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config.
multiValueDelimiter: ''
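As a rough sketch, a recordReaderSpec that combines several of the CSV options listed above might look like the following. The option names come from the list above; the values (comma delimiter, semicolon multi-value delimiter, and so on) are illustrative assumptions, not recommended defaults.

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    fileFormat: 'default'
    delimiter: ','              # illustrative: columns separated by commas
    multiValueDelimiter: ';'    # illustrative: "a;b;c" becomes a three-value list
    ignoreEmptyLines: 'true'
    stopOnError: 'true'         # fail at the first malformed record instead of recovering
```

Leaving stopOnError at its default of false keeps the recover-and-continue behavior described above.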
Avro
Use extractRawTimeValues when you need raw Avro temporal logical-type values instead of Pinot's default converted values.
The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required. By default, extractRawTimeValues is false, so Pinot converts Avro temporal logical types during extraction. Set extractRawTimeValues to true to keep the raw Avro integer values for date, time-millis, time-micros, timestamp-millis, timestamp-micros, and timestamp-nanos. The decimal and uuid logical types are always converted.
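A minimal sketch of an Avro reader spec that keeps raw temporal values. The className and configClassName below are assumptions that follow the naming pattern of the CSV example earlier in this section; check them against your Pinot distribution before relying on them.

```yaml
recordReaderSpec:
  dataFormat: 'avro'
  # Assumed class names, modeled on the CSV reader's package layout.
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReaderConfig'
  configs:
    # Keep date/time/timestamp logical types as raw Avro integers.
    extractRawTimeValues: 'true'
```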
We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in org.apache.avro.Conversions.
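| Avro data type | Pinot data type | Comment |
| --- | --- | --- |
| INT | INT | |
| LONG | LONG | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| BOOLEAN | BOOLEAN | |
| STRING | STRING | |
| ENUM | STRING | |
| BYTES | BYTES | |
| FIXED | BYTES | |
| MAP | JSON | |
| ARRAY | JSON | |
| RECORD | JSON | |
| UNION | JSON | |
| DECIMAL | BYTES | |
| UUID | STRING | |
| DATE | STRING | yyyy-MM-dd format |
| TIME_MILLIS | STRING | HH:mm:ss.SSS format |
| TIME_MICROS | STRING | HH:mm:ss.SSSSSS format |
| TIMESTAMP_MILLIS | TIMESTAMP | |
| TIMESTAMP_MICROS | TIMESTAMP | |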
JSON
Thrift
The Thrift reader requires the class generated from the .thrift file to parse the data. The .class file must be available on Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.
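A minimal sketch of how the generated class might be wired into the reader spec. The class names and the thriftClass key are assumptions modeled on the CSV example earlier in this section, and com.example.thrift.Event is a hypothetical generated class; substitute your own.

```yaml
recordReaderSpec:
  dataFormat: 'thrift'
  # Assumed class names, following the CSV reader's naming pattern.
  className: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig'
  configs:
    # Assumed key: fully qualified name of the class generated from the .thrift file.
    thriftClass: 'com.example.thrift.Event'
```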
Parquet
Since the 0.11.0 release, the Parquet record reader determines whether to use ParquetAvroRecordReader or ParquetNativeRecordReader to read records. The reader looks for the parquet.avro.schema or avro.schema key in the Parquet file footer and, if either is present, uses the Avro reader.
You can change the record reader manually in case of a misconfiguration.
To support DECIMAL and other Parquet-native data types, always use ParquetNativeRecordReader.
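One way to pick the reader manually is to name it directly in className. A minimal sketch, assuming the reader lives in the org.apache.pinot.plugin.inputformat.parquet package (an assumption based on the CSV reader's naming pattern; verify it against your distribution):

```yaml
recordReaderSpec:
  dataFormat: 'parquet'
  # Bypass automatic selection and always use the native reader,
  # for example when the files contain DECIMAL columns.
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
```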
To keep Parquet temporal values in their raw integer form instead of Pinot's default converted values, set extractRawTimeValues on ParquetRecordReader.
When extractRawTimeValues is false (the default), Pinot converts Parquet DATE, TIME_*, and TIMESTAMP_* values during extraction. Set it to true to keep their raw integer values instead. When you use ParquetRecordReader, Pinot forwards this config to whichever underlying reader it selects (ParquetAvroRecordReader or ParquetNativeRecordReader). DECIMAL and UUID always convert.
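A minimal sketch of the extractRawTimeValues setting on ParquetRecordReader. The package and the configClassName are assumptions that follow the CSV reader's naming pattern and should be checked against your distribution.

```yaml
recordReaderSpec:
  dataFormat: 'parquet'
  # Assumed class names; ParquetRecordReader selects the Avro or native reader itself.
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReaderConfig'
  configs:
    # Forwarded to whichever underlying reader is selected.
    extractRawTimeValues: 'true'
```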
ParquetNativeRecordReader preserves primitive values in their native Pinot-compatible form during extraction. For example, a Parquet BOOLEAN stays a Pinot BOOLEAN instead of being stringified.
| Parquet type | Pinot type | Comment |
| --- | --- | --- |
| BOOLEAN | BOOLEAN | Preserved as a native boolean value. |
| INT96 | LONG | Parquet INT96 nanosecond values are converted to Pinot INT64 milliseconds. |
| INT64 | LONG | |
| INT32 | INT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| BINARY | BYTES | |
| FIXED-LEN-BYTE-ARRAY | BYTES | |
| DECIMAL | DOUBLE | |
| ENUM | STRING | |
| UTF8 | STRING | |
| REPEATED | MULTIVALUE/MAP (represented as MV) | If the Parquet original type is LIST, it is converted to a MULTIVALUE column; otherwise, to a MAP column. |
For ParquetAvroRecordReader, refer to the Avro section above for the type conversions.
LIST and MAP wrapper extraction
Parquet LIST and MAP wrapper structs are now properly unwrapped when ingesting via ParquetNativeRecordReader and ParquetAvroRecordReader. Previously, schema-identified LIST and MAP columns had their wrapper elements exposed in the data:
An array<string> field previously came back as [{"element": "abc"}, {"element": "xyz"}]; it now correctly comes back as ["abc", "xyz"].
A map<string,string> field previously came back as {"key_value":[{"key":"k","value":"v"}]}; it now correctly comes back as {"k":"v"}.
Real struct fields named element are preserved; only schema-identified LIST and MAP wrappers are normalized. The readers support both the standard 3-level Parquet LIST encoding and legacy 2-level encodings (repeated primitive, repeated multi-field group, or repeated single-field group not named element).
Backward incompatibility: If you have ingestion pipelines or transform expressions that worked around the previous broken shape (for example, selecting data.element instead of data for an array column), you will need to update those queries and transforms.
MAP ordering: Parquet itself does not preserve source MAP entry order. Pinot canonicalizes ingested MAP and JSON output by sorting map keys when it serializes the value, so query results are deterministic but do not preserve the original insertion order. If the original pair order matters, model the field as LIST<STRUCT<key, value>> instead.
ORC
The ORC record reader supports the following data types:
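| ORC data type | Java data type |
| --- | --- |
| BOOLEAN | String |
| SHORT | Integer |
| INT | Integer |
| LONG | Integer |
| FLOAT | Float |
| DOUBLE | Double |
| STRING | String |
| VARCHAR | String |
| CHAR | String |
| LIST | Object[] |
| MAP | Map<Object, Object> |
| DATE | Long |
| TIMESTAMP | Long |
| BINARY | byte[] |
| BYTE | Integer |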
In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.
Protocol Buffers
The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file using protoc with the --descriptor_set_out option (add --include_imports if the .proto file imports other files).
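A minimal sketch of a reader spec that points at the generated descriptor. The class names and the descriptorFile key are assumptions modeled on the CSV example earlier in this section, and the file path is a hypothetical placeholder.

```yaml
recordReaderSpec:
  dataFormat: 'proto'
  # Assumed class names, following the CSV reader's naming pattern.
  className: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReaderConfig'
  configs:
    # Assumed key: location of the .desc file generated from the .proto file.
    descriptorFile: 'file:///path/to/sample.desc'
```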
Apache Arrow
The Arrow input format plugin supports reading data in Apache Arrow IPC format. This is useful for ingesting data from systems that produce Arrow-formatted output.
The pinot-arrow plugin is included in the standard Pinot binary distribution (tarball and Docker image). No additional installation steps are required to use Apache Arrow format for data ingestion.
Batch ingestion
For batch ingestion, the ArrowRecordReader reads Arrow IPC files. Arrow IPC files require seekable channels, so gzip compression is not supported.
To preserve raw Arrow temporal values instead of Pinot's default converted values, set extractRawTimeValues on ArrowRecordReader.
When extractRawTimeValues is false (the default), Pinot converts Arrow Date, Time, and Timestamp values during extraction. Set it to true to keep raw integers instead: Date stays as days since epoch, while Time and Timestamp stay in the schema's declared Arrow unit.
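A minimal sketch of a batch reader spec for Arrow IPC files with raw temporal values enabled. The dataFormat value and the package name are assumptions that follow the CSV reader's naming pattern; only the ArrowRecordReader class name and the extractRawTimeValues key come from this section.

```yaml
recordReaderSpec:
  # Assumed format name and package; verify against your Pinot distribution.
  dataFormat: 'arrow'
  className: 'org.apache.pinot.plugin.inputformat.arrow.ArrowRecordReader'
  configs:
    # Date stays as days since epoch; Time and Timestamp keep their declared Arrow unit.
    extractRawTimeValues: 'true'
```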
Stream ingestion
For stream ingestion, the Arrow decoder converts Arrow columnar batches to Pinot rows.
Configuration properties:

| Property | Default | Description |
| --- | --- | --- |
| arrow.allocator.limit | 268435456 (256 MB) | Memory limit for Arrow's off-heap allocator in bytes |
| extractRawTimeValues | false | Keep Arrow Date, Time, and Timestamp values as raw integers instead of Pinot's default converted values |
Arrow type conversions are handled automatically: UTF-8 text becomes String, Date becomes LocalDate, Time becomes LocalTime, Timestamp becomes Timestamp, Arrow Maps become flattened Map<String, Object>, and Arrow Lists become Object[]. Dictionary-encoded columns are decoded against their logical type before extraction.
Each Arrow Kafka message should contain a complete IPC stream. Empty batches are skipped, single-row batches ingest as one Pinot row, and multi-row batches fan out into multiple Pinot rows.

