Input formats
This section contains a collection of guides that will show you how to import data from a Pinot-supported input format.
Pinot offers support for various popular input formats during ingestion. By changing the input format, you can reduce the time spent on serialization and deserialization and speed up ingestion.
Configuring input formats
To change the input format, adjust the `recordReaderSpec` config in the ingestion job specification.
The configuration consists of the following keys:
- `dataFormat`: Name of the data format to consume.
- `className`: Name of the class that implements the `RecordReader` interface. This class is used for parsing the data.
- `configClassName`: Name of the class that implements the `RecordReaderConfig` interface. This class is used to parse the values mentioned in `configs`.
- `configs`: Key-value pairs for format-specific configurations. This field is optional.
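Putting these keys together, a `recordReaderSpec` for CSV data might look like the following sketch (the reader class names shown are the CSV classes shipped with Pinot; verify them against your Pinot version):

```yaml
recordReaderSpec:
  # name of the data format to consume
  dataFormat: 'csv'
  # RecordReader implementation that parses the data
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  # RecordReaderConfig implementation that parses the values in configs
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  # optional format-specific configurations
  configs:
    delimiter: ','
```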
Supported input formats
Pinot supports multiple input formats out of the box. Specify the corresponding readers and the associated custom configurations to switch between formats.
CSV
The CSV record reader supports the following configs:

- `fileFormat`: `default`, `rfc4180`, `excel`, `tdf`, or `mysql`.
- `header`: Header of the file. The column names should be separated by the delimiter mentioned in the configuration.
- `delimiter`: The character separating the columns.
- `multiValueDelimiter`: The character separating multiple values in a single column. This can be used to split a column into a list.
- `skipHeader`: Skip the header record in the file. Boolean.
- `ignoreEmptyLines`: Ignore empty lines (instead of filling them with default values). Boolean.
- `ignoreSurroundingSpaces`: Ignore spaces around column names and values. Boolean.
- `quoteCharacter`: Single character used for quotes in CSV files.
- `recordSeparator`: Character used to separate records in the input file. Default is `\n` or `\r\n` depending on the platform.
- `nullStringValue`: String value that represents null in CSV files. Default is the empty string.
- `skipUnParseableLines`: Skip lines that cannot be parsed. Note that this results in data loss. Boolean.
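As a sketch of how several of these options combine in the `configs` block of the ingestion spec (the header and values below are purely illustrative):

```yaml
configs:
  fileFormat: 'default'
  # illustrative header with column names separated by the delimiter
  header: 'id,name,score'
  delimiter: ','
  multiValueDelimiter: ';'
  skipHeader: 'true'
  ignoreEmptyLines: 'true'
```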
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the `multiValueDelimiter` field to empty in the ingestion config:

```yaml
multiValueDelimiter: ''
```
Avro
The Avro record reader converts the data in the file to a `GenericRecord`. A Java class or `.avro` file is not required. By default, the Avro record reader only supports primitive types. To enable support for the rest of the Avro data types, set `enableLogicalTypes` to `true`.
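As a sketch, an Avro `recordReaderSpec` with logical types enabled might look like this (the Avro reader and config class names are assumptions based on Pinot's bundled Avro plugin; verify them against your Pinot version):

```yaml
recordReaderSpec:
  dataFormat: 'avro'
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReaderConfig'
  configs:
    # enables DECIMAL, UUID, DATE, TIME and TIMESTAMP logical types
    enableLogicalTypes: true
```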
We use the following conversion table to translate between Avro and Pinot data types. The conversions are done using the official Avro methods present in `org.apache.avro.Conversions`.
| Avro Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT | INT | |
| LONG | LONG | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| BOOLEAN | BOOLEAN | |
| STRING | STRING | |
| ENUM | STRING | |
| BYTES | BYTES | |
| FIXED | BYTES | |
| MAP | JSON | |
| ARRAY | JSON | |
| RECORD | JSON | |
| UNION | JSON | |
| DECIMAL | BYTES | |
| UUID | STRING | |
| DATE | STRING | `yyyy-MM-dd` format |
| TIME_MILLIS | STRING | `HH:mm:ss.SSS` format |
| TIME_MICROS | STRING | `HH:mm:ss.SSSSSS` format |
| TIMESTAMP_MILLIS | TIMESTAMP | |
| TIMESTAMP_MICROS | TIMESTAMP | |
JSON
Thrift
Thrift requires the class generated from the `.thrift` file to parse the data. The `.class` file should be available in Pinot's classpath. You can put the files in the `lib/` folder of the Pinot distribution directory.
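A sketch of a Thrift `recordReaderSpec` is shown below; the `thriftClass` key is the config used by the bundled Thrift plugin (verify it against your Pinot version), and the example class name is a placeholder for your own generated Thrift class:

```yaml
recordReaderSpec:
  dataFormat: 'thrift'
  className: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig'
  configs:
    # fully qualified name of the generated Thrift class (placeholder)
    thriftClass: 'com.example.MyThriftRecord'
```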
Parquet
Since the 0.11.0 release, the Parquet record reader determines whether to use `ParquetAvroRecordReader` or `ParquetNativeRecordReader` to read records. The reader looks for the `parquet.avro.schema` or `avro.schema` key in the Parquet file footer and, if present, uses the Avro reader.
You can change the record reader manually in case of a misconfiguration.
For support of DECIMAL and other Parquet native data types, always use `ParquetNativeRecordReader`.
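To pin the native reader explicitly, the `recordReaderSpec` could be set as in this sketch:

```yaml
recordReaderSpec:
  dataFormat: 'parquet'
  # force the native Parquet reader instead of the Avro-based one
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
```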
| Parquet Data Type | Pinot Data Type | Comment |
| --- | --- | --- |
| INT96 | LONG | Parquet `INT96` nanosecond timestamps are converted to Pinot `LONG` (INT64) values in milliseconds |
| INT64 | LONG | |
| INT32 | INT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| BINARY | BYTES | |
| FIXED-LEN-BYTE-ARRAY | BYTES | |
| DECIMAL | DOUBLE | |
| ENUM | STRING | |
| UTF8 | STRING | |
| REPEATED | MULTIVALUE/MAP (represented as MV) | If the Parquet original type is LIST, it is converted to a multi-value column; otherwise, a MAP column. |
For `ParquetAvroRecordReader`, you can refer to the Avro section above for the type conversions.
ORC
The ORC record reader supports the following data types:
| ORC Data Type | Java Data Type |
| --- | --- |
| BOOLEAN | String |
| SHORT | Integer |
| INT | Integer |
| LONG | Integer |
| FLOAT | Float |
| DOUBLE | Double |
| STRING | String |
| VARCHAR | String |
| CHAR | String |
| LIST | Object[] |
| MAP | Map<Object, Object> |
| DATE | Long |
| TIMESTAMP | Long |
| BINARY | byte[] |
| BYTE | Integer |
In LIST and MAP types, the object should only belong to one of the data types supported by Pinot.
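A sketch of an ORC `recordReaderSpec` might look like this (the ORC reader class name assumes Pinot's bundled ORC plugin; verify it against your Pinot version):

```yaml
recordReaderSpec:
  dataFormat: 'orc'
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
```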
Protocol Buffers
The reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (`.desc`) from the `.proto` file using the `protoc` command.
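A typical invocation looks like the following sketch; the input and output paths are placeholders:

```
protoc --include_imports --descriptor_set_out=/path/to/output.desc /path/to/input.proto
```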