1 of 1

Text search support

This page talks about support for text search in Pinot.

This text index method is recommended over the experimental native text index.

Click to skip the background info and go straight to the procedure to enable this text index.

Why do we need text search?

Pinot supports super-fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index, and sorted index.

This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:

For arbitrary text data that falls into the BLOB/CLOB territory, we need more than exact matches. This often involves using regex, phrase, fuzzy queries on BLOB like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large BLOB of text using the TEXT_MATCH function, like this:

where <column_name> is the column text index is created on and <search_expression> conforms to one of the following:

Current restrictions

Pinot supports text search with the following requirements:

The column type should be STRING.
The column should be single-valued.
Using a text index in coexistence with other Pinot indexes is not supported.

Sample Datasets

Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.

Apache Access Log

Consider the following snippet from an Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.

Let's say the following snippet of data is stored in the ACCESS\_LOG\_COL column in a Pinot table.

Here are some examples of search queries on this data:

Count the number of GET requests.

Count the number of POST requests that have administrator in the URL (administrator/index)

Count the number of POST requests that have a particular URL and handled by Firefox browser

Resume text

Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.

This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.

Here are some examples of search queries on this data:

Count the number of candidates that have "machine learning" and "gpu processing": This is a phrase search (more on this further in the document) where we are looking for exact match of phrases "machine learning" and "gpu processing", not necessarily in the same order in the original data.

Count the number of candidates that have "distributed systems" and either 'Java' or 'C++': This is a combination of searching for exact phrase "distributed systems" along with other terms.

Query Log

Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY\_LOG\_COL column in a Pinot table.

Here are some examples of search queries on this data:

Count the number of queries that have GROUP BY

Count the number of queries that have the SELECT count... pattern

Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY

Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.

A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.

The text index is an addition to the type of per-column indexes users can create in Pinot. However, it only supports text index on a RAW column, not a dictionary-encoded column.

Enable a text index

Enable a text index on a column in the by adding a new section with the name "fieldConfigList".

Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:

You can configure text indexes in the following scenarios:

Adding a new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling a text index on an existing column.

When you're using a text index, add the indexed column to the noDictionaryColumns columns list to reduce unnecessary storage overhead.

For instructions on that configuration property, see the documentation.

Text index creation

Once the text index is enabled on one or more columns through a , segment generation code will automatically create the text index (per column).

Text index is supported for both offline and real-time segments.

Text parsing and tokenization

The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.

Pinot's text index is built on top of Lucene. Lucene's standard english text tokenizer generally works well for most classes of text. To build a custom text parser and tokenizer to suit particular user requirements, this can be made configurable for the user to specify on a per-column text-index basis.

There is a default set of "stop words" built in Pinot's text index. This is a set of high frequency words in English that are excluded for search efficiency and index size, including:

Any occurrence of these words will be ignored by the tokenizer during index creation and search.

In some cases, users might want to customize the set. A good example would be when IT (Information Technology) appears in the text that collides with "it", or some context-specific words that are not informative in the search. To do this, one can config the words in fieldConfig to include/exclude from the default stop words:

The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.

Writing text search queries

The TEXT_MATCH function enables using text search in SQL/PQL.

TEXT_MATCH(text_column_name, search_expression)

text_column_name - name of the column to do text search on.
search_expression - search query

You can use TEXT_MATCH function as part of queries in the WHERE clause, like this:

You can also use the TEXT_MATCH filter clause with other filter operators. For example:

You can combine multiple TEXT_MATCH filter clauses:

TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot.

Selection query which projects one or more columns
- User can also include the text column name in select list
Aggregation query

The search expression (the second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.

Phrase query

This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.

The following example reuses the earlier example of resume text data containing 14 documents to walk through queries. In this sentence, "document" means the column value. The data is stored in the SKILLS_COL column and we have created a text index on this column.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "Distributed systems":

The search expression is '\"Distributed systems\"'

The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
- '\"<your phrase>\"'

The above query will match the following documents:

But it won't match the following document:

This is because the phrase query looks for the phrase occurring in the original document "as is". The terms as specified by the user in phrase should be in the exact same order in the original document for the document to be considered as a match.

NOTE: Matching is always done in a case-insensitive manner.

The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "query processing":

The above query will match the following documents:

Term query

Term queries are used to search for individual terms.

This example will query the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.

As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.

Composite query using Boolean operators

The Boolean operators AND and OR are supported and we can use them to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrases "distributed systems" and "tensor flow". This combines two phrases using the AND Boolean operator.

The above query will match the following documents:

This example queries the SKILLS_COL column to look for documents where each document MUST contain the phrase "machine learning" and the terms 'gpu' and 'python'. This combines a phrase and two terms using Boolean operators.

The above query will match the following documents:

When using Boolean operators to combine term(s) and phrase(s) or both, note that:

The matching document can contain the terms and phrases in any order.
The matching document may not have the terms adjacent to each other (if this is needed, use appropriate phrase query).

Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using AND operator in the search expression, the OR operator is used by default:

This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:

phrase "distributed systems" OR
term 'java' OR
term 'C++'.

Grouping using parentheses is supported:

This example queries the SKILLS_COL column to look for documents where each document MUST contain

phrase "distributed systems" AND
at least one of the terms Java or C++

Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND is used to combine this with phrase "distributed systems"

Prefix query

Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.

This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams etc

The above query will match the following documents:

Regular Expression Query

Phrase and term queries work on the fundamental logic of looking up the terms in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized, and individual "indexable" terms are extracted. These terms are inserted into the index.

Based on the nature of the original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.

Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.

Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).

The above query will match any text document containing "exception".

Phrase search with wildcard term matching

Phrase search with wildcard and prefix term matching can match patterns like "pache pino" to the text "Apache Pinot" directly. The kind of queries is very common in use case like log search where user needs to search substrings across term boundary in long text. To enable such search (which can be more costly because Lucene by default does not allow * to start a pattern to avoid costly term matching), one can add a new config key to the column text index config:

With this config enabled, one can now perform the pharse wildcard search using the following syntax like

to match the string "Apache pinot" in the SIKLLS_COL. Boolean expressions like 'pache pino AND apche luce' are are supported.

Deciding Query Types

Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.

The key thing to remember is that phrases should be used when the order of terms in the document is important and when separating the phrase into individual terms doesn't make sense from end user's perspective.

An example would be phrase "machine learning".

However, if we are searching for documents matching Java and C++ terms, using phrase query "Java C++" will actually result in in partial results (could be empty too) since now we are relying the on the user specifying these skills in the exact same order (adjacent to each other) in the resume text.

Term query using Boolean AND operator is more appropriate for such cases

Text Index Tuning

To improve Lucene index creation time, some configs have been provided. Field Config properties luceneUseCompoundFile and luceneMaxBufferSizeMB can provide faster index writing at but may increase file descriptors and/or memory pressure.

Text search support

This page talks about support for text search in Pinot.

This text index method is recommended over the experimental native text index.

Click to skip the background info and go straight to the procedure to enable this text index.

Why do we need text search?

This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:

where <column_name> is the column text index is created on and <search_expression> conforms to one of the following:

Current restrictions

Pinot supports text search with the following requirements:

The column type should be STRING.
The column should be single-valued.
Using a text index in coexistence with other Pinot indexes is not supported.

Sample Datasets

Apache Access Log

Let's say the following snippet of data is stored in the ACCESS\_LOG\_COL column in a Pinot table.

Here are some examples of search queries on this data:

Count the number of GET requests.

Count the number of POST requests that have administrator in the URL (administrator/index)

Count the number of POST requests that have a particular URL and handled by Firefox browser

Resume text

Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.

This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.

Here are some examples of search queries on this data:

Count the number of candidates that have "distributed systems" and either 'Java' or 'C++': This is a combination of searching for exact phrase "distributed systems" along with other terms.

Query Log

Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY\_LOG\_COL column in a Pinot table.

Here are some examples of search queries on this data:

Count the number of queries that have GROUP BY

Count the number of queries that have the SELECT count... pattern

Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY

Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.

A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.

The text index is an addition to the type of per-column indexes users can create in Pinot. However, it only supports text index on a RAW column, not a dictionary-encoded column.

Enable a text index

Enable a text index on a column in the by adding a new section with the name "fieldConfigList".

Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:

You can configure text indexes in the following scenarios:

Adding a new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling a text index on an existing column.

When you're using a text index, add the indexed column to the noDictionaryColumns columns list to reduce unnecessary storage overhead.

For instructions on that configuration property, see the documentation.

Text index creation

Once the text index is enabled on one or more columns through a , segment generation code will automatically create the text index (per column).

Text index is supported for both offline and real-time segments.

Text parsing and tokenization

The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.

There is a default set of "stop words" built in Pinot's text index. This is a set of high frequency words in English that are excluded for search efficiency and index size, including:

Any occurrence of these words will be ignored by the tokenizer during index creation and search.

The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.

Writing text search queries

The TEXT_MATCH function enables using text search in SQL/PQL.

TEXT_MATCH(text_column_name, search_expression)

text_column_name - name of the column to do text search on.
search_expression - search query

You can use TEXT_MATCH function as part of queries in the WHERE clause, like this:

You can also use the TEXT_MATCH filter clause with other filter operators. For example:

You can combine multiple TEXT_MATCH filter clauses:

TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot.

Selection query which projects one or more columns
- User can also include the text column name in select list
Aggregation query

The search expression (the second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.

Phrase query

This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "Distributed systems":

The search expression is '\"Distributed systems\"'

The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
- '\"<your phrase>\"'

The above query will match the following documents:

But it won't match the following document:

NOTE: Matching is always done in a case-insensitive manner.

The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "query processing":

The above query will match the following documents:

Term query

Term queries are used to search for individual terms.

This example will query the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.

As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.

Composite query using Boolean operators

The Boolean operators AND and OR are supported and we can use them to build a composite query. Boolean operators can be used to combine phrase and term queries in any arbitrary manner

The above query will match the following documents:

When using Boolean operators to combine term(s) and phrase(s) or both, note that:

The matching document can contain the terms and phrases in any order.
The matching document may not have the terms adjacent to each other (if this is needed, use appropriate phrase query).

Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using AND operator in the search expression, the OR operator is used by default:

This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:

phrase "distributed systems" OR
term 'java' OR
term 'C++'.

Grouping using parentheses is supported:

This example queries the SKILLS_COL column to look for documents where each document MUST contain

phrase "distributed systems" AND
at least one of the terms Java or C++

Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND is used to combine this with phrase "distributed systems"

Prefix query

Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.

This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams etc

The above query will match the following documents:

Regular Expression Query

Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.

Syntax of a regex query is slightly different from queries mentioned earlier. The regular expression is written between a pair of forward slashes (/).

The above query will match any text document containing "exception".

Phrase search with wildcard term matching

With this config enabled, one can now perform the pharse wildcard search using the following syntax like

to match the string "Apache pinot" in the SIKLLS_COL. Boolean expressions like 'pache pino AND apche luce' are are supported.

Deciding Query Types

Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.

An example would be phrase "machine learning".

Term query using Boolean AND operator is more appropriate for such cases

Text search support

hashtagWhy do we need text search?

hashtagCurrent restrictions

hashtagSample Datasets

hashtagApache Access Log

hashtagResume text

hashtagQuery Log

hashtagEnable a text index

hashtagText index creation

hashtagText parsing and tokenization

hashtagWriting text search queries

hashtagPhrase query

hashtagTerm query

hashtagComposite query using Boolean operators

hashtagPrefix query

hashtagRegular Expression Query

hashtagPhrase search with wildcard term matching

hashtagDeciding Query Types

hashtagText Index Tuning

Text search support

hashtagWhy do we need text search?

hashtagCurrent restrictions

hashtagSample Datasets

hashtagApache Access Log

hashtagResume text

hashtagQuery Log

hashtagEnable a text index

hashtagText index creation

hashtagText parsing and tokenization

hashtagWriting text search queries

hashtagPhrase query

hashtagTerm query

hashtagComposite query using Boolean operators

hashtagPrefix query

hashtagRegular Expression Query

hashtagPhrase search with wildcard term matching

hashtagDeciding Query Types

hashtagText Index Tuning

Why do we need text search?

Current restrictions

Sample Datasets

Apache Access Log

Resume text

Query Log

Enable a text index

Text index creation

Text parsing and tokenization

Writing text search queries

Phrase query

Term query

Composite query using Boolean operators

Prefix query

Regular Expression Query

Phrase search with wildcard term matching

Deciding Query Types

Text Index Tuning

Why do we need text search?

Current restrictions

Sample Datasets

Apache Access Log

Resume text

Query Log

Enable a text index

Text index creation

Text parsing and tokenization

Writing text search queries

Phrase query

Term query

Composite query using Boolean operators

Prefix query

Regular Expression Query

Phrase search with wildcard term matching

Deciding Query Types

Text Index Tuning