Complex Type (Array, Map) Handling
Complex-type handling in Apache Pinot.
It's common for the ingested data to have complex structure. For example, Avro schema has records and arrays, and JSON data has objects and arrays. In Apache Pinot, the data model supports primitive data types (including int, long, float, double, string, bytes), as well as limited multi-value types such as an array of primitive types. Such simple data types allow Pinot to build fast indexing structures for good query performance, but it requires some handling on the complex structures. There are in general two options for such handling: convert the complex-type data into JSON string and then build JSON index; or use the inbuilt complex-type handling rules in the ingestion config.
In this page, we'll show how to handle this complex-type structure with these two approaches, to process the example data in the following figure, which is a field
groupfrom the Meetup events Quickstart example. Note this object has two child fields, and the child
groupis a nested array with the element of object type.
Example JSON data
Apache Pinot provides powerful JSON index to accelerate the value lookup and filtering for the column. To convert an object
groupwith complex type to JSON, you can add the following config to table config.
Note the config
transformConfigstransforms the object
groupto a JSON string
group_json, which then creates the JSON indexing with config
jsonIndexColumns. To read the full spec, see this file. Also note that
groupis a reserved keyword in SQL and therefore needs to be quoted in
columnNamecan't use the same name as any of the fields in the source JSON data e.g. if our source data contains the field
groupand we want to transform the data in that field before persisting it, the destination column name would need to be something different, like
Additionally, you need to overwrite the
maxLengthof the field
group_jsonon the schema, because by default, a string column has a limited length. For example,
Though JSON indexing is a handy way to process the complex types, there are some limitations:
- It’s not performant to group by or order by a JSON field, because
JSON_EXTRACT_SCALARis needed to extract the values in the GROUP BY and ORDER BY clauses, which invokes the function evaluation.
Alternatively, from Pinot 0.8, you can use the complex-type handling in ingestion configurations to flatten and unnest the complex structure and convert them into primitive types. Then you can reduce the complex-type data into a flattened Pinot table, and query it via SQL. With the inbuilt processing rules, you do not need to write ETL jobs in another compute framework such as Flink or Spark.
To process this complex-type, you can add the configuration
ingestionConfig. For example:
complexTypeConfig, all the map objects will be flattened to direct fields automatically. And with
unnestFields, a record with the nested collection will unnest into multiple records. For instance, the example in the beginning will transform into two rows with this configuration example.
- The nested field
groupis flattened to field
group.group_id. The default value of the delimiter is
., you can choose other delimiter by changing the configuration
complexTypeConfig. This flattening rule also apllies on the maps in the collections to be unnested.
- The nested array
groupis unnested into the top-level, and convert the output to a collection of two rows. Note the handling of the nested field within
group_topics, and the eventual top-level field of
group.group_topics.urlkey. All the collections to unnest shall be included in configuration
- For the collections not in specified in
fieldsToUnnest, the ingestion by default will serialize them into JSON string, except for the array of primitive values, which will be ingested as multi-value column by default. The behavior is defined in config
collectionNotUnnestedToJsonwith default value to
NON_PRIMITIVE. Other behaviors include (1)
ALL, which aslo convert the array of primitive values to JSON string; (2)
NONE, this does not do conversion, but leave it to the users to use transform functions for handling.
With the flattening/unnesting, you can then query the table with primitive values using the SQL query like:
.is a reserved character in SQL, so you need to quote the flattened column.
When there are complex structures, it could be challenging and tedious to figure out the Pinot schema manually. To help the schema inference, Pinot provides utility tools to take the Avro schema or JSON data as input and output the inferred Pinot schema.
To infer the Pinot schema from Avro schema, you can use the command like the following
bin/pinot-admin.sh AvroSchemaToPinotSchema \
-timeColumnName fields.hoursSinceEpoch \
-avroSchemaFile /tmp/test.avsc \
-pinotSchemaName myTable \
-outputDir /tmp/test \
Note you can input configurations like
fieldsToUnnestsimilar to the ones in
complexTypeConfig. And this will simulate the complex-type handling rules on the Avro schema and output the Pinot schema in the file specified in
Similarly, you can use the command like the following to infer the Pinot schema from a file of JSON objects.
bin/pinot-admin.sh JsonToPinotSchema \
-timeColumnName hoursSinceEpoch \
-jsonFile /tmp/test.json \
-pinotSchemaName myTable \
-outputDir /tmp/test \