Vector index
Overview
Apache Pinot now supports a Vector Index for efficient similarity search over high-dimensional vector embeddings. Embeddings are stored in multi-valued float array columns and queried with a vector similarity algorithm.
Key Features
• Vector Index is implemented using HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor (ANN) search.
• Adds support for a predicate and function, VECTOR_SIMILARITY(v1, v2, [optional topK]), which retrieves the topK closest vectors based on similarity.
• The similarity function can be used as part of a query to filter and rank results.
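To make "topK closest vectors" concrete, here is a minimal brute-force sketch in Python. It is not Pinot's implementation: it scans every vector and returns the exact answer, which is the result an ANN index such as HNSW approximates without a full scan. The function name `exact_top_k` and the sample data are hypothetical, and Euclidean distance is assumed as the metric.

```python
import heapq
import math


def exact_top_k(query, vectors, k=10):
    """Return the indices of the k vectors closest to `query` (Euclidean distance)."""
    # math.dist computes the Euclidean (L2) distance between two points.
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])
    )


# Hypothetical two-dimensional "embeddings", kept tiny for readability.
docs = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
nearest = exact_top_k([0.0, 0.1], docs, k=2)
print(nearest)
```

The exact scan costs one distance computation per stored vector, which is why an approximate index becomes attractive as the number of embeddings grows.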
Examples
Below is an example schema designed for a use case involving product reviews with vector embeddings for each review.
Schema
In this schema:
• The embedding column is a multi-valued float array designed to store high-dimensional vector embeddings (e.g., 1536 dimensions from an NLP model).
• Other fields, such as ProductId, UserId, and Text, store metadata and review text.
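A schema matching this description might look like the following sketch, written as a Python dict and serialized to JSON. The schema name `productReviews` and the STRING data types for the metadata fields are assumptions for illustration; only the field names and the multi-valued float embedding column come from the description above.

```python
import json

# Hypothetical schema name; field names follow the description above.
schema = {
    "schemaName": "productReviews",
    "dimensionFieldSpecs": [
        # Metadata and review text (STRING types are an assumption).
        {"name": "ProductId", "dataType": "STRING"},
        {"name": "UserId", "dataType": "STRING"},
        {"name": "Text", "dataType": "STRING"},
        # Multi-valued FLOAT column that holds the embedding vector.
        {"name": "embedding", "dataType": "FLOAT", "singleValueField": False},
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```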
Table Config
To enable the Vector Index, configure the table with the appropriate fieldConfigList. The embedding column is specified to use the Vector Index with HNSW for similarity searches.
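A minimal sketch of such a fieldConfigList entry, again built as a Python dict and printed as JSON. The four keys under `properties` are the ones this page documents. The `encodingType` and `indexType` values, the choice of COSINE, and passing the property values as strings are assumptions that should be checked against the Pinot version in use.

```python
import json

field_config_list = [
    {
        "name": "embedding",
        "encodingType": "RAW",   # assumption, not stated on this page
        "indexType": "VECTOR",   # assumption, not stated on this page
        "properties": {
            "vectorIndexType": "HNSW",
            "vectorDimension": "1536",           # must match the embedding length
            "vectorDistanceFunction": "COSINE",  # illustrative choice
            "version": "1",                      # illustrative value
        },
    }
]

# Only the fragment relevant to the vector index; a full table config has more keys.
table_config_fragment = json.dumps({"fieldConfigList": field_config_list}, indent=2)
print(table_config_fragment)
```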
Explanation of Properties:
vectorIndexType:
Specifies the type of vector index to use. Currently supports HNSW.
vectorDimension:
Defines the dimensionality of the vectors stored in the column (e.g., 1536 for OpenAI text-embedding-ada-002 embeddings, or 768 for BERT-base).
vectorDistanceFunction:
Specifies the distance metric for similarity computation. Options include:
INNER_PRODUCT:
• Computes the inner product (dot product) of the two vectors.
• Typically used when vectors are normalized and higher scores indicate greater similarity.
L2:
• Measures the Euclidean distance between vectors.
• Suitable for tasks where spatial closeness in high-dimensional space indicates similarity.
L1:
• Measures the Manhattan distance between vectors (sum of absolute differences of coordinates).
• Useful for some scenarios where simpler distance metrics are preferred.
COSINE:
• Measures cosine similarity, which considers the angle between vectors.
• Ideal when orientation matters more than magnitude, since the result does not depend on vector length.
version:
Specifies the version of the Vector Index implementation.
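The four vectorDistanceFunction options correspond to standard formulas. The sketch below writes them out in plain Python to show how they differ; it illustrates the math only and is not Pinot's internal code. The function names are chosen here for readability.

```python
import math


def inner_product(a, b):
    # Sum of element-wise products (dot product).
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    # Euclidean distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a, b):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(inner_product(a, b), l2_distance(a, b), l1_distance(a, b), cosine_similarity(a, b))
```

Because `b` is a scaled copy of `a`, the cosine similarity is 1 even though the L2 and L1 distances are nonzero. The inner product, by contrast, grows with vector length, which is why it is usually paired with normalized vectors.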
Query
VECTOR_SIMILARITY:
A predicate that retrieves the topK vectors closest to the query vector.
Inputs:
• embedding: the vector column.
• Query vector: a literal array.
• topK (optional): the number of closest vectors to retrieve (default: 10).
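As a sketch of how these inputs fit together, the helper below assembles a query string that uses VECTOR_SIMILARITY in the WHERE clause. The helper name `build_vector_query`, the table name `productReviews`, the selected columns, and the `ARRAY[...]` literal syntax are assumptions for illustration. The three-element vector is shortened for readability; a real query vector needs as many elements as vectorDimension.

```python
def build_vector_query(table, column, query_vector, top_k=10):
    """Build a SQL string that filters rows with VECTOR_SIMILARITY."""
    # Render the query vector as an array literal (assumed syntax: ARRAY[...]).
    literal = "ARRAY[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    # Inputs are assumed to be trusted; this does no escaping of identifiers.
    return (
        f"SELECT ProductId, UserId, Text FROM {table} "
        f"WHERE VECTOR_SIMILARITY({column}, {literal}, {int(top_k)}) "
        f"LIMIT {int(top_k)}"
    )


sql = build_vector_query("productReviews", "embedding", [0.1, 0.2, 0.3], top_k=5)
print(sql)
```

Omitting the third argument in the generated predicate would fall back to the default topK of 10 described above.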