Text search support

This page describes support for text search in Pinot.

This text index method is recommended over the experimental native text index.


Pinot supports super-fast query processing through its indexes on non-BLOB-like columns. Queries with exact-match filters run efficiently through a combination of dictionary encoding, inverted index, and sorted index.

This is useful for a query like the following, which looks for exact matches on two columns of type STRING and INT respectively:

SELECT COUNT(*) 
FROM Foo 
WHERE STRING_COL = 'ABCDCD' 
AND INT_COL > 2000

For arbitrary text data that falls into the BLOB/CLOB territory, we need more than exact matches. This often involves regex, phrase, and fuzzy queries on BLOB-like data. Text indexes can efficiently perform arbitrary search on STRING columns where each column value is a large BLOB of text, using the TEXT_MATCH function, like this:

SELECT COUNT(*) 
FROM Foo 
WHERE TEXT_MATCH (<column_name>, '<search_expression>')

where <column_name> is the column the text index is created on and <search_expression> conforms to one of the following:

| Search Expression Type | Example |
|---|---|
| Phrase query | TEXT_MATCH (<column_name>, '"distributed system"') |
| Term query | TEXT_MATCH (<column_name>, 'Java') |
| Boolean query | TEXT_MATCH (<column_name>, 'Java AND c++') |
| Prefix query | TEXT_MATCH (<column_name>, 'stream*') |
| Regex query | TEXT_MATCH (<column_name>, '/Exception.*/') |
| Not query | TEXT_MATCH (<column_name>, ': NOT c%') or NOT TEXT_MATCH (<column_name>, 'c%') |

Current restrictions

Pinot supports text search with the following requirements:

  • The column type should be STRING, or stored as STRING (e.g. JSON).

Sample Datasets

Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.

Apache Access Log

Consider the following snippet from an Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.

Let's say the following snippet of data is stored in the ACCESS_LOG_COL column in a Pinot table.
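The original sample data is in the Apache access log format; for illustration, individual ACCESS_LOG_COL values might look like these hypothetical entries (IP addresses and URLs are made up):

```
192.168.1.25 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0"
192.168.1.25 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://example.com/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0"
```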

Here are some examples of search queries on this data:

Count the number of GET requests.

Count the number of POST requests that have administrator in the URL (administrator/index)

Count the number of POST requests that have a particular URL and handled by Firefox browser
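The three searches above can be sketched with TEXT_MATCH roughly as follows (table name and exact search expressions are illustrative and depend on your data):

```sql
-- Count the number of GET requests
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(ACCESS_LOG_COL, 'GET')

-- Count the number of POST requests with administrator in the URL
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(ACCESS_LOG_COL, 'POST AND administrator AND index')

-- Count the number of POST requests with a particular URL, handled by the Firefox browser
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(ACCESS_LOG_COL, 'POST AND administrator AND index AND firefox')
```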

Resume text

Let's consider another example using text from job candidate resumes. Each line in this file represents skill-data from resumes of different candidates.

This data is stored in the SKILLS_COL column in a Pinot table. Each line in the input text represents a column value.

Here are some examples of search queries on this data:

Count the number of candidates that have "machine learning" and "gpu processing": This is a phrase search (more on this later in the document) where we look for exact matches of the phrases "machine learning" and "gpu processing", though the two phrases need not appear next to each other or in any particular order in the original data.

Count the number of candidates that have "distributed systems" and either 'Java' or 'C++': This is a combination of searching for exact phrase "distributed systems" along with other terms.
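The two searches above might be written like this (table name is a placeholder):

```sql
-- Candidates with both "machine learning" and "gpu processing"
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"machine learning\" AND \"gpu processing\"')

-- Candidates with "distributed systems" and either Java or C++
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"distributed systems\" AND (Java C++)')
```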

Query Log

Next, consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in the QUERY_LOG_COL column in a Pinot table.

Here are some examples of search queries on this data:

Count the number of queries that have GROUP BY

Count the number of queries that have the SELECT count... pattern

Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY
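These searches might look like the following (table name and exact expressions are illustrative):

```sql
-- Queries that have GROUP BY
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(QUERY_LOG_COL, '\"group by\"')

-- Queries that have the SELECT count... pattern
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(QUERY_LOG_COL, '\"select count\"')

-- Queries that use a BETWEEN filter on a timestamp column along with GROUP BY
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(QUERY_LOG_COL, '\"timestamp between\" AND \"group by\"')
```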

Read on for concrete examples on each kind of query and step-by-step guides covering how to write text search queries in Pinot.

A column in Pinot can be dictionary-encoded or stored RAW. In addition, we can create an inverted index and/or a sorted index on a dictionary-encoded column.

The text index is an addition to the types of per-column indexes users can create in Pinot. However, a text index can only be created on a RAW column, not a dictionary-encoded column.

Multi-column text index

Since version 1.4.0, Pinot offers two types of text indexes:

  • per-column / single-column text index - stores data separately for each indexed column. This is the only type available prior to version 1.4.0.

  • per-segment / multi-column text index - stores all indexed columns' data together. Doing so reduces both RAM and disk usage and speeds up index creation, allowing efficient indexing of tens or hundreds of columns.

Aside from configuration, the new index type behaves the same as per-column index at query time.

When choosing between the two index types, you might consider the following:

| Property | Per-column | Per-segment |
|---|---|---|
| Querying speed | slower, especially when querying multiple columns | faster |
| Disk and memory usage | higher: each column uses a separate set of Lucene files and document id mapping | lower: Lucene file size is smaller, and only one document id mapping is used for all columns |
| Initial build time | higher, because each column uses separate Lucene files | lower: one set of Lucene files and one document id mapping is generated |
| Rebuild time | lower: only affected columns are rebuilt, other indexes are copied | higher: removes all files and rebuilds from scratch |

Enable a per-column text index

Enable a text index on a column in the table configuration by adding a new section with the name "fieldConfigList".

Each column that has a text index should also be specified as noDictionaryColumns in tableIndexConfig:
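A minimal sketch of such a table config (column name is illustrative):

```json
{
  "fieldConfigList": [
    {
      "name": "text_col_1",
      "encodingType": "RAW",
      "indexType": "TEXT"
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": ["text_col_1"]
  }
}
```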

You can configure text indexes in the following scenarios:

  • Adding a new table with text index enabled on one or more columns.

  • Adding a new column with text index enabled to an existing table.

  • Enabling a text index on an existing column.

When you're using a text index, add the indexed column to the noDictionaryColumns columns list to reduce unnecessary storage overhead.

For instructions on that configuration property, see the Raw value forward index documentation.

Enable a per-segment text index

Unlike the per-column text index, the per-segment text index can only be configured once, in the table index configuration, by adding the multiColumnTextIndexConfig element:
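A sketch of that element, following the structure described below (column names and property values are illustrative; consult the 1.4.0 release documentation for the exact schema):

```json
"tableIndexConfig": {
  "multiColumnTextIndexConfig": {
    "columns": ["skills", "logs"],
    "properties": {
      "luceneUseCompoundFile": "true"
    },
    "perColumnProperties": {
      "skills": {
        "caseSensitive": "false"
      }
    }
  }
}
```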

The config contains a list of columns to index (columns), settings applied to all columns (properties), and settings applied to particular columns (perColumnProperties).

As shown in example above, index configuration allows for both:

  • setting shared index properties that apply to all columns with "properties". Allowed keys are: enableQueryCacheForTextIndex, luceneUseCompoundFile, luceneMaxBufferSizeMB, reuseMutableIndex, and all keys allowed in perColumnProperties.

  • setting column-specific properties (overriding shared ones) with perColumnProperties. Allowed keys: useANDForMultiTermTextIndexQueries, enablePrefixSuffixMatchingInPhraseQueries, stopWordInclude, stopWordExclude, caseSensitive, luceneAnalyzerClass, luceneAnalyzerClassArgs, luceneAnalyzerClassArgTypes, luceneQueryParserClass.

Shared-only properties (e.g. luceneMaxBufferSizeMB) set in perColumnProperties have no effect and will be ignored.

Text index creation

Once the text index is enabled on one or more columns through a table configuration, segment generation code will automatically create the text index (per column).

Text index is supported for both offline and real-time segments.

Text parsing and tokenization

The original text document (denoted by a value in the column that has text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.

Pinot's text index is built on top of Lucene. Lucene's standard English text tokenizer generally works well for most classes of text. A custom text parser and tokenizer to suit particular requirements can be configured by the user on a per-column text-index basis.

There is a default set of "stop words" built in Pinot's text index. This is a set of high frequency words in English that are excluded for search efficiency and index size, including:
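Pinot's built-in set is based on Lucene's standard English stop-word set, which contains words such as the following (the exact list in your Pinot version may differ; check the source if the distinction matters):

```
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
```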

Any occurrence of these words will be ignored by the tokenizer during index creation and search.

In some cases, users might want to customize the set. A good example is when "IT" (Information Technology) appears in the text and collides with the stop word "it", or when some context-specific words are not informative for search. To do this, configure the words to include in or exclude from the default stop words in fieldConfig:

The words should be comma separated and in lowercase. Words appearing in both lists will be excluded as expected.
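A sketch of such a fieldConfig entry (column name and the included words are illustrative):

```json
"fieldConfigList": [
  {
    "name": "text_col_1",
    "encodingType": "RAW",
    "indexType": "TEXT",
    "properties": {
      "stopWordInclude": "incl1, incl2, incl3",
      "stopWordExclude": "it"
    }
  }
]
```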

Writing text search queries

The TEXT_MATCH function enables using text search in SQL/PQL.

TEXT_MATCH(text_column_name, search_expression)

  • text_column_name - name of the column to do text search on.

  • search_expression - search query

You can use TEXT_MATCH function as part of queries in the WHERE clause, like this:

You can also use the TEXT_MATCH filter clause with other filter operators. For example:

You can combine multiple TEXT_MATCH filter clauses:
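For illustration (table and column names are placeholders):

```sql
-- TEXT_MATCH in the WHERE clause
SELECT COUNT(*) FROM Foo WHERE TEXT_MATCH(text_col_1, '\"search this phrase\"')

-- Combined with other filter operators
SELECT * FROM Foo WHERE TEXT_MATCH(text_col_1, '\"search this phrase\"') AND int_col > 20000

-- Multiple TEXT_MATCH filter clauses combined
SELECT * FROM Foo WHERE TEXT_MATCH(text_col_1, '\"search this phrase\"') AND TEXT_MATCH(text_col_2, 'term')
```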

TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot.

  • Selection query which projects one or more columns

    • User can also include the text column name in select list

  • Aggregation query

  • Aggregation GROUP BY query

The search expression (the second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index.

TEXT_MATCH Query Options

The TEXT_MATCH function supports an optional third parameter for specifying Lucene query parser options at query time. This allows for flexible and advanced text search without changing table configuration.

Function Signature:
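With the optional third parameter, the signature becomes:

```sql
TEXT_MATCH(text_column_name, search_expression, options)
```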

  • text_column_name: Name of the column to perform text search on.

  • search_expression: The query string for text search.

  • options (optional): Comma-separated string of key-value pairs to control query parsing and search behavior.

Available Options:

| Option | Values | Description |
|---|---|---|
| parser | CLASSIC, STANDARD, COMPLEX | Selects the Lucene query parser to use. Default is CLASSIC. |
| allowLeadingWildcard | true, false | Allows queries to start with a wildcard (e.g., *term). Default is false. |
| defaultOperator | AND, OR | Sets the default boolean operator for multi-term queries. Default is OR. |

Examples:
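For illustration, assuming the options string uses comma-separated key=value pairs as described above (table and column names are placeholders):

```sql
-- Use AND as the default operator for multi-term queries
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, 'machine learning', 'defaultOperator=AND')

-- Allow a leading wildcard in the search expression
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '*ava', 'parser=CLASSIC,allowLeadingWildcard=true')
```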

Phrase query

This query is used to seek out an exact match of a given phrase, where terms in the user-specified phrase appear in the same order in the original text document.

The following examples reuse the earlier resume text data containing 14 documents to walk through queries. Here, "document" refers to a column value. The data is stored in the SKILLS_COL column, on which we have created a text index.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "Distributed systems":
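Such a query might look like this (table name is a placeholder):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"Distributed systems\"')
```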

The search expression is '\"Distributed systems\"'

  • The search expression is always specified within single quotes '<your expression>'

  • Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped

    • '\"<your phrase>\"'

The above query will match the following documents:

But it won't match the following document:

This is because the phrase query looks for the phrase occurring in the original document "as is". The terms as specified by the user in phrase should be in the exact same order in the original document for the document to be considered as a match.

NOTE: Matching is always done in a case-insensitive manner.

The next example queries the SKILLS_COL column to look for documents where each matching document MUST contain phrase "query processing":
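For illustration (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"query processing\"')
```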

The above query will match the following documents:

Term query

Term queries are used to search for individual terms.

This example will query the SKILLS_COL column to look for documents where each matching document MUST contain the term 'Java'.

As mentioned earlier, the search expression is always within single quotes. However, since this is a term query, we don't have to use double quotes within single quotes.
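Such a term query might look like this (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, 'Java')
```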

Composite query using Boolean operators

The Boolean operators AND and OR are supported, and we can use them to build a composite query. Boolean operators can combine phrase and term queries in any arbitrary manner.

This example queries the SKILLS_COL column to look for documents where each matching document MUST contain the phrases "machine learning" and "tensor flow". This combines two phrases using the AND Boolean operator.
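Such a query might look like this (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"machine learning\" AND \"tensor flow\"')
```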

The above query will match the following documents:

This example queries the SKILLS_COL column to look for documents where each document MUST contain the phrase "machine learning" and the terms 'gpu' and 'python'. This combines a phrase and two terms using Boolean operators.
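For illustration (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"machine learning\" AND gpu AND python')
```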

The above query will match the following documents:

When using Boolean operators to combine term(s) and phrase(s) or both, note that:

  • The matching document can contain the terms and phrases in any order.

  • The matching document may not have the terms adjacent to each other (if this is needed, use appropriate phrase query).

Use of the OR operator is implicit. In other words, if phrase(s) and term(s) are not combined using AND operator in the search expression, the OR operator is used by default:

This example queries the SKILLS_COL column to look for documents where each document MUST contain ANY one of:

  • phrase "distributed systems" OR

  • term 'java' OR

  • term 'C++'.
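Such an implicit-OR query might be written as (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"distributed systems\" Java C++')
```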

Grouping using parentheses is supported:

This example queries the SKILLS_COL column to look for documents where each document MUST contain

  • phrase "distributed systems" AND

  • at least one of the terms Java or C++

Here the terms Java and C++ are grouped without any operator, which implies the use of OR. The root operator AND combines this group with the phrase "distributed systems".
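Such a grouped query might look like this (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"distributed systems\" AND (Java C++)')
```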

Prefix query

Prefix queries can be done in the context of a single term. We can't use prefix matches for phrases.

This example queries the SKILLS_COL column to look for documents where each document MUST contain text like stream, streaming, streams etc
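Such a prefix query might look like this (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, 'stream*')
```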

The above query will match the following documents:

Regular Expression Query

Phrase and term queries work on the fundamental logic of looking up the terms in the text index. The original text document (a value in the column with text index enabled) is parsed, tokenized, and individual "indexable" terms are extracted. These terms are inserted into the index.

Based on the nature of the original text and how the text is segmented into tokens, it is possible that some terms don't get indexed individually. In such cases, it is better to use regular expression queries on the text index.

Consider a server log as an example where we want to look for exceptions. A regex query is suitable here as it is unlikely that 'exception' is present as an individual indexed token.

The syntax of a regex query is slightly different from that of the queries mentioned earlier: the regular expression is written between a pair of forward slashes (/).
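A sketch of such a query (table and column names are placeholders):

```sql
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(LOGS_COL, '/.*exception.*/')
```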

The above query will match any text document containing "exception".

Phrase search with wildcard term matching

Phrase search with wildcard and prefix term matching can match patterns like "pache pino" directly to the text "Apache Pinot". This kind of query is very common in use cases like log search, where users need to search for substrings across term boundaries in long text. To enable such searches (which can be more costly, because by default Lucene does not allow a pattern to start with * in order to avoid expensive term matching), add a new config key to the column's text index config:
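A sketch of the field config with this property enabled (column name is illustrative; the property key comes from the per-column properties listed earlier):

```json
"fieldConfigList": [
  {
    "name": "SKILLS_COL",
    "encodingType": "RAW",
    "indexType": "TEXT",
    "properties": {
      "enablePrefixSuffixMatchingInPhraseQueries": "true"
    }
  }
]
```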

With this config enabled, you can perform a phrase wildcard search to match the string "Apache pinot" in SKILLS_COL. Boolean expressions like 'pache pino AND apche luce' are also supported.
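For illustration, such a search might look like (placeholder table name):

```sql
SELECT COUNT(*) FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, '\"*pache pino*\"')
```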

Deciding Query Types

Combining phrase and term queries using Boolean operators and grouping lets you build a complex text search query expression.

The key thing to remember is that phrases should be used when the order of terms in the document is important and when separating the phrase into individual terms doesn't make sense from end user's perspective.

An example would be phrase "machine learning".

However, if we are searching for documents matching the terms Java and C++, using the phrase query "Java C++" will actually produce partial results (possibly empty), since we would be relying on the user specifying these skills in exactly that order, adjacent to each other, in the resume text.

A term query using the Boolean AND operator is more appropriate for such cases.
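For example (placeholder table name):

```sql
SELECT SKILLS_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_COL, 'Java AND C++')
```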

Text Index Tuning

To improve Lucene index creation time, some configs are provided. The field config properties luceneUseCompoundFile and luceneMaxBufferSizeMB can provide faster index writing, at the cost of potentially more file descriptors and/or memory pressure.
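A sketch of these properties in a fieldConfig entry (column name and values are illustrative; defaults may differ by version):

```json
"fieldConfigList": [
  {
    "name": "text_col_1",
    "encodingType": "RAW",
    "indexType": "TEXT",
    "properties": {
      "luceneUseCompoundFile": "false",
      "luceneMaxBufferSizeMB": "128"
    }
  }
]
```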

When text search queries contain too many terms or clauses, Lucene may throw TooManyClauses exceptions, causing query failures. This commonly occurs with:

  • Complex boolean queries with many OR conditions

  • Wildcard queries that expand to many terms

  • Queries with large numbers of search terms

To handle such cases, you can increase the maximum clause count at the cluster level. See the cluster configuration reference for the pinot.lucene.max.clause.count setting.
