JSON index
This page describes configuring the JSON index for Apache Pinot.
The JSON index can be applied to JSON string columns to accelerate value lookups and filtering for the column.
When to use JSON index
JSON strings can be used to represent arrays, maps, and nested fields without forcing a fixed schema. While JSON strings are flexible, filtering on JSON string columns is expensive, so consider the use case.
Suppose we have some JSON records similar to the following sample record stored in the person column:
{
  "name": "adam",
  "age": 30,
  "country": "us",
  "addresses":
  [
    {
      "number" : 112,
      "street" : "main st",
      "country" : "us"
    },
    {
      "number" : 2,
      "street" : "second st",
      "country" : "us"
    },
    {
      "number" : 3,
      "street" : "third st",
      "country" : "ca"
    }
  ]
}Without an index, to look up the key and filter records based on the value, Pinot must scan and reconstruct the JSON object from the JSON string for every record, look up the key and then compare the value.
For example, in order to find all persons whose name is "adam", the query will look like:
SELECT *
FROM mytable
WHERE JSON_EXTRACT_SCALAR(person, '$.name', 'STRING') = 'adam'The JSON index is designed to accelerate the filtering on JSON string columns without scanning and reconstructing all the JSON objects.
Enable and configure a JSON index
To enable the JSON index, you can configure the following options in the table configuration:
maxLevels
Max levels to flatten the json object (array is also counted as one level)
int
-1 (unlimited)
excludeArray
Whether to exclude array when flattening the object
boolean
false (include array)
disableCrossArrayUnnest
Whether to not unnest multiple arrays (unique combination of all elements in those arrays). If document contains two arrays holding, respectively M and N elements, then flattening produces M*N documents. If number of such combinations reaches 100k, error with "Got too many combinations" message is thrown.
boolean
false (calculate unique combination of all elements)
includePaths
Only include the given paths, e.g. "$.a.b", "$.a.c[*]" (mutual exclusive with excludePaths). Paths under the included paths will be included, e.g. "$.a.b.c" will be included when "$.a.b" is configured to be included.
Set<String>
null (include all paths)
excludePaths
Exclude the given paths, e.g. "$.a.b", "$.a.c[*]" (mutual exclusive with includePaths). Paths under the excluded paths will also be excluded, e.g. "$.a.b.c" will be excluded when "$.a.b" is configured to be excluded.
Set<String>
null (include all paths)
excludeFields
Exclude the given fields, e.g. "b", "c", even if it is under the included paths.
Set<String>
null (include all fields)
indexPaths
Index the given paths, e.g. *.*, a.**. Paths matches the indexed paths will be indexed, e.g. a.** will index everything whose first layer is "a", *.* will index everything with maxLevels=2. This config could work together with other configs, e.g. includePaths, excludePaths, maxLevels but usually does not have to because it should be flexible enough to catch any scenarios.
Set<String>
null that is equivalent to ** (include all fields)
maxValueLength
If the value of a json node (not the whole document)  is longer  than given value then replace it with $SKIPPED$ before indexing. 
int
0 (disabled)
skipInvalidJson
If set, while adding json to index, instead of throwing exception, replace ill-formed json with empty key/path and $SKIPPED$ value .
boolean
false (disabled)
Recommended way to configure
The recommended way to configure a JSON index is in the fieldConfigList.indexes object, within the json key.
{
  "fieldConfigList": [
    {
      "name": "person",
      "indexes": {
        "json": {
          "maxLevels": 2,
          "excludeArray": false,
          "disableCrossArrayUnnest": true,
          "includePaths": null,
          "excludePaths": null,
          "excludeFields": null,
          "indexPaths": null
        }
      }
    }
  ],
  ...
}All options are optional, so the following is a valid configuration that use the default parameter values:
{
  "fieldConfigList": [
    {
      "name": "person",
      "indexes": {
        "json": {}
      }
    }
  ],
  ...
}Deprecated ways to configure JSON indexes
There are two older ways to configure the indexes that can be configured in the tableIndexConfig section inside table config.
The first one uses the same JSON explained above, but it is defined inside tableIndexConfig.jsonIndexConfigs.<column name>:
{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "person": {
        "maxLevels": 2,
        "excludeArray": false,
        "disableCrossArrayUnnest": true,
        "includePaths": null,
        "excludePaths": null,
        "excludeFields": null,
        "indexPaths": null
      },
      ...
    },
    ...
  }
}Like in the previous case, all parameters are optional, so the following is also valid:
{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "person": {},
      ...
    },
    ...
  }
}The last option does not support to configure any parameter. In order to use this option, add the name of the column in tableIndexConfig.jsonIndexColumns like in this example:
{
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "person",
      ...
    ],
    ...
  }
}Example:
With the following JSON document:
{
  "name": "adam",
  "age": 20,
  "addresses": [
    {
      "country": "us",
      "street": "main st",
      "number": 1
    },
    {
      "country": "ca",
      "street": "second st",
      "number": 2
    }
  ],
  "skills": [
    "english",
    "programming"
  ]
}Using the default setting, we will flatten the document into the following records:
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[1]": "programming"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[1]": "programming"
}With maxValueLength set to 9:
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1,
  "skills[1]": "$SKIPPED$"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "addresses[1].number": 2,
  "skills[1]": "$SKIPPED$"
}With maxLevels set to 1:
{
  "name": "adam",
  "age": 20
}With maxLevels set to 2:
{
  "name": "adam",
  "age": 20,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "skills[1]": "programming"
}With excludeArray set to true:
{
  "name": "adam",
  "age": 20
}With disableCrossArrayUnnest set to true:
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "addresses[0].number": 1
},
{
  "name": "adam",
  "age": 20,
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "skills[1]": "programming"
}When cross array un-nesting is disabled, then number of documents produced during JSON flattening is the sum of all array sizes, e.g. 2+2 = 4 in the example above.
With disableCrossArrayUnnest set to false:
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "addresses[0].street": "main st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "addresses[0].street": "main st",
  "skills[1]": "programming"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "addresses[1].street": "second st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "addresses[1].street": "second st",
  "skills[1]": "programming"
}When cross array un-nesting is enabled, then number of documents produced during JSON flattening is the product of all array sizes, e.g. 2*2 = 4 in the example above. If JSON contains multiple large nested arrays, it might be necessary to disable cross array un-nesting (disableCrossArrayUnnest=true) to avoid hitting the 100k flattened documents limit and triggering 'Got to many combinations' error.
With includePaths set to ["$.name", "$.addresses[*].country"]:
{
  "name": "adam",
  "addresses[0].country": "us"
},
{
  "name": "adam",
  "addresses[1].country": "ca"
}With excludePaths set to ["$.age", "$.addresses[*].number"]:
{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].street": "main st",
  "skills[1]": "programming"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].street": "second st",
  "skills[1]": "programming"
}With excludeFields set to ["age", "street"]:
{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[0].country": "us",
  "addresses[0].number": 1,
  "skills[1]": "programming"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "skills[0]": "english"
},
{
  "name": "adam",
  "addresses[1].country": "ca",
  "addresses[1].number": 2,
  "skills[1]": "programming"
}With indexPaths set to ["*", "address..country"]:
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
},
{
  "name": "adam",
  "age": 20,
  "addresses[0].country": "us",
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
},
{
  "name": "adam",
  "age": 20,
  "addresses[1].country": "ca",
}With skipInvalidJson set to true, if we corrupt the original JSON, e.g. to
{ _invalid_json_
  "name": "adam",
  "age": 20,
  "addresses": [...]
  "skills": [...]
}then flattening will be produce:
{ "": "$SKIPPED$" }Note that the JSON index can only be applied to STRING/JSON columns whose values are JSON strings.
How to use the JSON index
The JSON index can be used via the JSON_MATCH predicate for filtering: JSON_MATCH(<column>, '<filterExpression>'). For example, to find every entry with the name "adam":
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')Note that the quotes within the filter expression need to be escaped.
The JSON index can also be used via the JSON_EXTRACT_INDEX predicate for value extraction (optionally with filtering): JSON_EXTRACT_INDEX(<column>, '<jsonPath>', ['resultsType'], ['filter']). For example, to extract every value for path $.name when the path $.id is less than 10:
SELECT jsonextractindex(repo, '$.name', 'STRING', 'dummyValue', '"$.id" < 10')
FROM mytableMore in-depth examples can be found in the JSON_EXTRACT_INDEX function documentation.
Supported filter expressions
Simple key lookup
Find all persons whose name is "adam":
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam''')or
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name" IN (''adam'')')Chained key lookup
Find all persons who have an address (one of the addresses) with number 112:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].number"=112')Find all persons who have at least one address that is not in the US:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country" != ''us''')or
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country" NOT IN (''us'') ')Regex based lookup
Find all persons who have an address (one of the addresses) where the street contains the term 'st':
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, 'REGEXP_LIKE("$.addresses[*].street", ''.*st.*'')')Range lookup
Find all persons whose age is greater than 18:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.age" > 18')Find all persons whose age is between 20 and 40 (inclusive):
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.age" BETWEEN 20 AND 40')Nested filter expression
Find all persons whose name is "adam" and also have an address (one of the addresses) with number 112:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.name"=''adam'' AND "$.addresses[*].number"=112')Array access
Find all persons whose first address has number 112:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].number"=112')Since JSON index works based on flattened JSON documents, if cross array un-nesting is disabled ( disableCrossArrayUnnest = true ), then querying more than one array in a single JSON_MATCH function call returns empty result, e.g.
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us'' AND "$.skills[*]"=''english''')In such cases expression should be split into multiple JSON_MATCH calls, e.g.
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].country"=''us''')
AND   JSON_MATCH(person, '"$.skills[*]"=''english''')Existence check
Find all persons who have a phone field within the JSON:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.phone" IS NOT NULL')Find all persons whose first address does not contain floor field within the JSON:
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].floor" IS NULL')JSON context is maintained
The JSON context is maintained for object elements within an array, meaning the filter won't cross-match different objects in the array.
To find all persons who live on "main st" in "ca":
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st'' AND "$.addresses[*].country"=''ca''')This query won't match "adam" because none of his addresses matches both the street and the country.
If you don't want JSON context, use multiple separate JSON_MATCH predicates. For example, to find all persons who have addresses on "main st" and have addresses in "ca" (matches need not have the same address):
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].street"=''main st''') 
  AND JSON_MATCH(person, '"$.addresses[*].country"=''ca''')This query will match "adam" because one of his addresses matches the street and another one matches the country.
The array index is maintained as a separate entry within the element, so in order to query different elements within an array, multiple JSON_MATCH predicates are required. For example, to find all persons who have first address on "main st" and second address on "second st":
SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[0].street"=''main st''') 
  AND JSON_MATCH(person, '"$.addresses[1].street"=''second st''')Supported JSON values
Object
See examples above.
Array
["item1", "item2", "item3"]To find the records with array element "item1" in "arrayCol":
SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[*]"=''item1''')To find the records with second array element "item2" in "arrayCol":
SELECT ...
FROM mytable
WHERE JSON_MATCH(arrayCol, '"$[1]"=''item2''')Value
123
1.23
"Hello World"To find the records with value 123 in "valueCol":
SELECT ...
FROM mytable
WHERE JSON_MATCH(valueCol, '"$"=123')Null
nullTo find the records with null in "nullableCol":
SELECT ...
FROM mytable
WHERE JSON_MATCH(nullableCol, '"$" IS NULL')Limitations
- The key (left-hand side) of the filter expression must be the leaf level of the JSON object, for example, - "$.addresses[*]"='main st'won't work.
Last updated
Was this helpful?

