Add Prefix, Suffix & Ngram UDFs

This section contains reference documentation for the add prefix, suffix and ngram UDFs.

Adding ngram, prefix, postfix UDFs to improve query throughput by enabling efficient text search through derived columns with inverted indexes.

Context

We were aiming to increase query throughput for a matching use case. Initial testing shows that the current approaches using REGEXP_LIKE and TEXT_MATCH queries have reached their performance limits, and further improvements in QPS cannot be achieved with these methods. The representative queries under evaluation are as follows:

SELECT col1, col2 FROM table WHERE REGEXP_LIKE(col3, '^data*')
SELECT col1, col2 FROM table WHERE REGEXP_LIKE(col3, 'data$')
SELECT col1, col2 FROM table WHERE REGEXP_LIKE(col3, '*data*')
SELECT col1, col2 FROM table WHERE TEXT_MATCH(col3, '/data*/')

The plan is to generate derived columns that persist prefix, postfix, and ngram to use inverted indexes to filter results fast, and add text match indexes to do validation after filtering to avoid false positive results.

What is N-gram

N-gram breaks down text into overlapping sequences of characters. For example:

"Apache pinot" produces:

ap, pa, ac, ch, he, e , p, ...

When you match ngram(col, "ap"), it searches for the substring "ap" within the generated n-grams.

When you match ngram(col, 'apache'), it generates n-grams from "apache" (ap, pa, ac, etc.) and matches against the column's n-grams.

When to Use N-gram

N-grams solve performance issues with:

Text match - Avoid full scans
REGEXP_LIKE - Currently requires full scan
Prefix matching - Reduce O(N * length of string) complexity
Wildcard matching - Reduce O(N * length) complexity

How to Make Wildcards Fast

Need to do prefiltering to avoid full scan of all rows. How to prefilter → use ngram!

Function Parameters

Function

Parameters

Description

prefixes

String input, int maxlength

Generate array of prefix strings shorter than maxlength

prefixesWithPrefix

String input, int maxlength, @Nullable String prefix

Generate array of prefix matchers with prepended prefix (e.g. '^' for regex)

suffixes

String input, int maxlength

Generate array of suffix strings shorter than maxlength

suffixesWithSuffix

String input, int maxlength, @Nullable String suffix

Generate array of suffix matchers with appended suffix (e.g. '$' for regex)

uniqueNgrams

String input, int length

Generate array of unique ngrams of exact specified length

uniqueNgrams

String input, int minGram, int maxGram

Generate array of unique ngrams within length range [minGram, maxGram]

Usage Examples

Example Query

SELECT organizationUUID, confirmedEmployeeCount, name
FROM rta.rta.u4b_organizations
WHERE deletedAt IS NULL
  AND (IN_SERIALIZED_LIST(entityTypes, 'org:u4b:organization'))
  AND (TEXT_MATCH(name, '/.*pacific.*/ AND /.*equity.*/')
       OR TEXT_MATCH(profileName, '/.*pacific.*/ AND /.*equity.*/'))
ORDER BY confirmedEmployeeCount DESC
OFFSET 0 LIMIT 10

Prefix Operations

SELECT prefixes('data', 3) AS result
FROM myTable

result

["d", "da", "dat"]

SELECT prefixesWithPrefix('data', 3, '^') AS result
FROM myTable

result

["^d", "^da", "^dat"]

Suffix Operations

SELECT suffixes('data', 3) AS result
FROM myTable

result

["a", "ta", "ata"]

SELECT suffixesWithSuffix('data', 3, '$') AS result
FROM myTable

result

["a$", "ta$", "ata$"]

N-gram Generation

Table Config with Transformation

Instead of using SQL UDF, configure n-gram generation during ingestion using transformationConfig:

{
  "tableName": "myTable",
  "tableType": "OFFLINE",
  "ingestionConfig": {
    "transformationConfigs": [
      {
        "columnName": "content_bigrams",
        "transformFunction": "uniqueNgrams(content, 2)"
      },
      {
        "columnName": "content_ngrams",
        "transformFunction": "uniqueNgrams(content, 2, 4)"
      }
    ]
  }
}

Query Examples

SELECT uniqueNgrams('Apache pinot', 2) AS bigrams
FROM myTable

bigrams

["Ap", "pa", "ac", "ch", "he", "e ", " p", "pi", "in", "no", "ot"]

SELECT uniqueNgrams('data', 2, 4) AS ngrams
FROM myTable

ngrams

["da", "at", "ta", "dat", "ata", "data"]

Using Generated N-grams

-- Traditional approach (slow)
SELECT * FROM T WHERE REGEXP_LIKE(field, '*pino.*')

-- Optimized approach with n-gram prefiltering
SELECT * FROM T
WHERE ngram = 'pin' AND ngram = 'ino' AND ngram = 'not'
  AND REGEXP_LIKE(field, '*pino.*')

-- Traditional approach for short strings
SELECT * FROM T WHERE REGEXP_LIKE(field, '*pi.*')

-- Optimized approach (no validation needed when search string ≤ ngram length)
SELECT * FROM T WHERE ngram = 'pi'

How to Use N-gram

Simulated N-gram Approach

This patch creates a simulated ngram approach by storing the grams into a column:

First, use the UDF to generate a ngram column during data ingestion
Translate the query with prefilters to use the ngram column for fast filtering
Trade-off: Disk size will grow, but query performance improves significantly

Implementation Steps

Generate derived columns with prefix, postfix, and ngram values
Create inverted indexes on these derived columns
Use prefiltering with ngram indexes to reduce candidate rows
Apply original text match validation to avoid false positives

For more details, see GitHub PR #12392.

PreviousString Functions Nextchr

Last updated 1 hour ago

Was this helpful?

hashtagContext

hashtagWhat is N-gram

hashtagWhen to Use N-gram

hashtagHow to Make Wildcards Fast

hashtagFunction Parameters

hashtagUsage Examples

hashtagExample Query

hashtagPrefix Operations

hashtagSuffix Operations

hashtagN-gram Generation

hashtagTable Config with Transformation

hashtagQuery Examples

hashtagHow to Use N-gram

hashtagSimulated N-gram Approach

hashtagImplementation Steps

Context

What is N-gram

When to Use N-gram

How to Make Wildcards Fast

Function Parameters

Usage Examples

Example Query

Prefix Operations

Suffix Operations

N-gram Generation

Table Config with Transformation

Query Examples

How to Use N-gram

Simulated N-gram Approach

Implementation Steps