This section contains a collection of guides that show you how to import data from a Pinot-supported input format.
Pinot supports several popular input formats during ingestion. Choosing an input format that matches your data can reduce the time spent on serialization and deserialization and speed up ingestion.
The input format can be changed using the recordReaderSpec config in the ingestion job spec.
The config consists of the following keys:
dataFormat
- Name of the data format to consume.
className
- Name of the class that implements the RecordReader interface. This class is used for parsing the data.
configClassName
- Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configs.
configs
- Key-value pairs for format-specific configs. This field can be left out.
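As a minimal sketch of how these keys fit together, the fragment below shows only the record reader section of a job spec, using JSON as the format. The class name is an assumption based on the Pinot JSON input-format plugin package, so verify it against your Pinot version. The optional configClassName and configs keys are left out.

```yaml
# Fragment of an ingestion job spec; only the record reader section is shown.
recordReaderSpec:
  # Name of the data format to consume
  dataFormat: 'json'
  # Assumed class name; check the plugin shipped with your Pinot distribution
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
```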
Pinot supports multiple input formats out of the box. To switch between formats, specify the corresponding record reader and any associated custom configs.
The CSV record reader supports the following configs:
fileFormat
- Can be one of default, rfc4180, excel, tdf, mysql.
header
- Header of the file. The column names should be separated by the delimiter mentioned in the config.
delimiter
- The character separating the columns.
multiValueDelimiter
- The character separating multiple values in a single column. This can be used to split a column into a list.
Your CSV file may have raw text fields that cannot be reliably delimited using any character. In this case, explicitly set the multiValueDelimiter field to empty in the ingestion config.
multiValueDelimiter: ''
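Putting the CSV configs together, a record reader spec might look like the sketch below. The two class names are assumptions based on the Pinot CSV input-format plugin package, and the header column names are hypothetical placeholders.

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  # Assumed class names; verify against your Pinot version
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    fileFormat: 'default'               # one of default, rfc4180, excel, tdf, mysql
    header: 'columnName1,columnName2'   # hypothetical column names, joined by the delimiter
    delimiter: ','
    multiValueDelimiter: ';'            # set to '' for raw text fields that must not be split
```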
The Avro record reader converts the data in the file to a GenericRecord. A Java class or .avro file is not required.
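Because no extra class or schema file is needed, an Avro record reader spec can stay short. The sketch below assumes the class name from the Pinot Avro input-format plugin package.

```yaml
recordReaderSpec:
  dataFormat: 'avro'
  # Assumed class name; verify against your Pinot version
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
```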
Note: Thrift requires the class generated from the .thrift file to parse the data. The .class file should be available on Pinot's classpath. You can put the files in the lib/ folder of the Pinot distribution directory.
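A Thrift record reader spec then needs to name that generated class. In the sketch below, the reader and config class names and the thriftClass key are assumptions based on the Pinot Thrift input-format plugin, and com.example.MyThriftRecord is a hypothetical generated class.

```yaml
recordReaderSpec:
  dataFormat: 'thrift'
  # Assumed class names; verify against your Pinot version
  className: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig'
  configs:
    # Assumed config key; hypothetical class generated from your .thrift file
    thriftClass: 'com.example.MyThriftRecord'
```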
The ParquetRecordReader class doesn't read the Parquet INT96 and DECIMAL types.
Please use the ParquetNativeRecordReader class to handle the INT96 and DECIMAL types.
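As a sketch of the two Parquet options, the spec below uses the native reader and keeps the other class as a commented-out alternative. Both class names are assumptions based on the Pinot Parquet input-format plugin package.

```yaml
recordReaderSpec:
  dataFormat: 'parquet'
  # Assumed class name for the reader that handles INT96 and DECIMAL
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
  # Alternative (assumed) class for files without INT96 or DECIMAL columns:
  # className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
```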
The ORC record reader supports the data types listed in the ORC Data Type table in this section.
In LIST and MAP types, each object must belong to one of the data types supported by Pinot.
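An ORC record reader spec needs no format-specific configs. The sketch below assumes the class name from the Pinot ORC input-format plugin package.

```yaml
recordReaderSpec:
  dataFormat: 'orc'
  # Assumed class name; verify against your Pinot version
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
```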
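The Protobuf record reader requires a descriptor file to deserialize the data present in the files. You can generate the descriptor file (.desc) from the .proto file with the protoc compiler, using its --descriptor_set_out option (add --include_imports if the .proto file imports other files).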
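After the descriptor file has been generated, a record reader spec along these lines could reference it. The class names and the descriptorFile key are assumptions based on the Pinot Protobuf input-format plugin, and the path is a hypothetical placeholder.

```yaml
recordReaderSpec:
  dataFormat: 'proto'
  # Assumed class names; verify against your Pinot version
  className: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReaderConfig'
  configs:
    # Assumed config key; hypothetical location of the generated .desc file
    descriptorFile: 'file:///path/to/sample.desc'
```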
| Parquet Data Type | Java Data Type | Comment |
| --- | --- | --- |
| INT96 | INT64 | Parquet INT96 type converts nanoseconds to Pinot INT64 type of milliseconds |
| DECIMAL | DOUBLE | |
| ORC Data Type | Java Data Type |
| --- | --- |
| BOOLEAN | String |
| SHORT | Integer |
| INT | Integer |
| LONG | Integer |
| FLOAT | Float |
| DOUBLE | Double |
| STRING | String |
| VARCHAR | String |
| CHAR | String |
| LIST | Object[] |
| MAP | Map<Object, Object> |
| DATE | Long |
| TIMESTAMP | Long |
| BINARY | byte[] |
| BYTE | Integer |