Storing records with dynamic schemas in a table with a fixed schema.
{
"timestamp": 1687786535928,
"hostname": "host1",
"HOSTNAME": "host1",
"level": "INFO",
"message": "Started processing job1",
"tags": {
"platform": "data",
"service": "serializer",
"params": {
"queueLength": 5,
"timeout": 299,
"userData_noIndex": {
"nth": 99
}
}
}
}{
"timestamp": 1687786535928,
"hostname": "host1",
"level": "INFO",
"message": "Started processing job1",
"tags.platform": "data",
"tags.service": "serializer",
"indexableExtras": {
"tags": {
"params": {
"queueLength": 5,
"timeout": 299
}
}
},
"unindexableExtras": {
"tags": {
"userData_noIndex": {
"nth": 99
}
}
}
}{
"ingestionConfig": {
"schemaConformingTransformerConfig": {
"indexableExtrasField": "extras",
"unindexableExtrasField": "extrasNoIndex",
"unindexableFieldSuffix": "_no_index",
"fieldPathsToDrop": [
"HOSTNAME"
]
}
}
}group_topics under group is unnested into the top level, converting the output into a collection of two rows. Note the handling of the nested field within group_topics, and the eventual top-level field of group.group_topics.urlkey. All the collections to unnest shall be included in the configuration fieldsToUnnest.{
"ingestionConfig":{
"transformConfigs": [
{
"columnName": "group_json",
"transformFunction": "jsonFormat(\"group\")"
}
]
},
...
"tableIndexConfig": {
"loadMode": "MMAP",
"noDictionaryColumns": [
"group_json"
],
"jsonIndexColumns": [
"group_json"
]
}
}{
{
"name": "group_json",
"dataType": "JSON",
}
...
}{
"ingestionConfig": {
"complexTypeConfig": {
"delimiter": '.',
"fieldsToUnnest": ["group.group_topics"],
"collectionNotUnnestedToJson": "NON_PRIMITIVE"
}
}
}SELECT "group.group_topics.urlkey",
"group.group_topics.topic_name",
"group.group_id"
FROM meetupRsvp
LIMIT 10bin/pinot-admin.sh AvroSchemaToPinotSchema \
-timeColumnName fields.hoursSinceEpoch \
-avroSchemaFile /tmp/test.avsc \
-pinotSchemaName myTable \
-outputDir /tmp/test \
-fieldsToUnnest entriesbin/pinot-admin.sh JsonToPinotSchema \
-timeColumnName hoursSinceEpoch \
-jsonFile /tmp/test.json \
-pinotSchemaName myTable \
-outputDir /tmp/test \
-fieldsToUnnest payload.commitsrecordReaderSpec:
dataFormat: 'csv'
className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
configs:
key1 : 'value1'
key2 : 'value2'configClassName: Name of the class that implements the RecordReaderConfig interface. This class is used to parse the values mentioned in configscolumnNamesdataFormat: 'csv'
className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
configs:
fileFormat: 'default' #should be one of default, rfc4180, excel, tdf, mysql
header: 'columnName separated by delimiter'
delimiter: ','
multiValueDelimiter: '-'dataFormat: 'avro'
className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
configs:
enableLogicalTypes: truedataFormat: 'json'
className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'dataFormat: 'thrift'
className: 'org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader'
configs:
thriftClass: 'ParserClassName'dataFormat: 'parquet'
className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'dataFormat: 'parquet'
className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'dataFormat: 'orc'
className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'dataFormat: 'proto'
className: 'org.apache.pinot.plugin.inputformat.protobuf.ProtoBufRecordReader'
configs:
descriptorFile: 'file:///path/to/sample.desc'protoc --include_imports --descriptor_set_out=/absolute/path/to/output.desc /absolute/path/to/input.proto
