This section contains articles that provide technical and implementation details of Pinot features
Loading...
Loading...
Loading...
Explore the data on our Pinot cluster
Now that the QuickStartCluster is setup, we can start exploring the data and the APIs. Head over to http://localhost:9000 in your browser.
You are now connected to the Pinot controller. Let's take a look at the following two features.
Query Console let's us run queries on the data in the Pinot cluster
We can see our baseballStats
table listed on the left (you will see meetupRSVP
or airlineStats
if you used the streaming or the hybrid quick start). Clicking on the table name should display all the names and data types of the columns of the table, and also execute a sample query select * from baseballStats limit 10
. You can query this table by typing your query in the text box and clicking the Run Query
button.
Here's some other queries you can try out:
select playerName, max(hits) from baseballStats group by playerName order by max(hits) desc
select sum(hits), sum(homeRuns), sum(numberOfGames) from baseballStats where yearID > 2010
select * from baseballStats order by league
Pinot supports a subset of standard SQL. See Pinot Query Language for more information.
The Pinot Admin UI contains all the APIs that you will need to operate and manage your cluster. It provides a set of APIs for Pinot cluster management including health check, instances management, schema and table management, data segments management.
Let's check out the tables in this cluster by going to Table -> List all tables in cluster and click on Try it out!
. We can see the baseballStats
table listed here. We can also see the exact curl
call made to the controller API.
You can look at the configuration of this table by going to Tables -> Get/Enable/Disable/Drop a table, type in baseballStats
in the table name, and click Try it out!
Let's check out the schemas in the cluster by going to Schema -> List all schemas in the cluster and click Try it out!
. We can see a schema called baseballStats
in this list.
Take a look at the schema by going to Schema -> Get a schema, type baseballStats
in the schema name, and click Try it out!
.
Finally, let's checkout the data segments in the cluster by going to Segment -> List all segments, type in baseballStats
in the table name, and click Try it out!
. There's 1 segment for this table, called baseballStats_OFFLINE_0
.
You might have figured out by now, in order to get data into the Pinot cluster, we need a table, a schema and segments. Let's head over to Batch upload sample data, to find out more about these components and learn how to create them for your own data.
This page talks about support for text search functionality in Pinot.
Pinot supports super fast query processing through its indexes on non-BLOB like columns. Queries with exact match filters are run efficiently through a combination of dictionary encoding, inverted index and sorted index. An example:
In the above query, we are doing exact match on two columns of type STRING and INT respectively.
For arbitrary text data which falls into the BLOB/CLOB territory, we need more than exact matches. Users are interested in doing regex, phrase, fuzzy queries on BLOB like data. Before 0.3.0, one had to use regexp_like to achieve this. However, this was scan based which was not performant and features like fuzzy search (edit distance search) were not possible.
In version 0.3.0, we added support for text indexes to efficiently do arbitrary search on STRING columns where each column value is a large BLOB of text. This can be achieved by using the new built-in function TEXT_MATCH.
where <column_name> is the column text index is created on and <search_expression> can be:
Search Expression Type
Example
Phrase query
TEXT_MATCH (<column_name>, '\"distributed system\"')
Term Query
TEXT_MATCH (<column_name>, 'Java')
Boolean Query
TEXT_MATCH (<column_name>, 'Java and c++')
Prefix Query
TEXT_MATCH (<column_name>, 'stream*')
Regex Query
TEXT_MATCH (<column_name>, '/Exception.*/')
Text search should ideally be used on STRING columns where doing standard filter operations (EQUALITY, RANGE, BETWEEN) doesn't fit the bill because each column value is a reasonably large blob of text.
Consider the following snippet from Apache access log. Each line in the log consists of arbitrary data (IP addresses, URLs, timestamps, symbols etc) and represents a column value. Data like this is a good candidate for doing text search.
Let's say the following snippet of data is stored in ACCESS_LOG_COL column in Pinot table.
Few examples of search queries on this data:
Count the number of GET requests.
Count the number of POST requests that have administrator in the URL (administrator/index)
Count the number of POST requests that have a particular URL and handled by Firefox browser
Consider another example of simple resume text. Each line in the file represents skill-data from resumes of different candidates
Let's say the following snippet of data is stored in SKILLS_COL column in Pinot table. Each line in the input text represents a column value.
Few examples of search queries on this data:
Count the number of candidates that have "machine learning" and "gpu processing" - a phrase search (more on this further in the document) where we are looking for exact match of phrases "machine learning" and "gpu processing" not necessarily in the same order in original data.
Count the number of candidates that have "distributed systems" and either 'Java' or 'C++' - a combination of searching for exact phrase "distributed systems" along with other terms.
Consider a snippet from a log file containing SQL queries handled by a database. Each line (query) in the file represents a column value in QUERY_LOG_COL column in Pinot table.
Few examples of search queries on this data:
Count the number of queries that have GROUP BY
Count the number of queries that have the SELECT count... pattern
Count the number of queries that use BETWEEN filter on timestamp column along with GROUP BY
Further sections in the document cover several concrete examples on each kind of query and step-by-step guide on how to write text search queries in Pinot.
Currently we support text search in a restricted manner. More specifically, we have the following constraints:
The column type should be STRING.
The column should be single-valued.
Co-existence of text index with other Pinot indexes is currently not supported.
The last two restrictions are going to be relaxed very soon in the upcoming releases.
Currently, a column in Pinot can be dictionary encoded or stored RAW. Furthermore, we can create inverted index on the dictionary encoded column. We can also create a sorted index on the dictionary encoded column.
Text index is an addition to the type of per-column indexes users can create in Pinot. However, the current implementation supports text index on RAW column. In other words, the column should not be dictionary encoded. As we relax this constraint in upcoming releases, text index can be created on a dictionary encoded column that also has other indexes (inverted, sorted etc).
Similar to other indexes, users can enable text index on a column through table config. As part of text-search feature, we have also introduced a new generic way of specifying the per-column encoding and index information. In the table config, there will be a new section with name "fieldConfigList".
IMPORTANT: This mechanism of using "fieldConfigList" is currently ONLY used for text indexes. Our plan is to migrate all other indexes to this model. We are going to do that in upcoming releases and accordingly user documentation and new guidelines will be published. So please continue to specify other index info in table config as you have done till now and use the "fieldConfigList" only for text indexes.
"fieldConfigList" will be a new section in table config. It is essentially a list of per-column encoding and index information. In the above example, the list contains text index information for two columns text_col_1 and text_col_2. Each object in fieldConfigList contains the following information
name - Name of the column text index is enabled on
encodingType - As mentioned earlier, we can store a column either as RAW or dictionary encoded. Since for now we have a restriction on the text index, this should always be RAW.
indexType - This should be TEXT.
Also, since we haven't yet removed the old way of specifying the index info, each column that text index is enabled on should also be specified in noDictionaryColumns in tableIndexConfig
The above mechanism should allow the user to use text index in all of the following scenarios:
Adding new table with text index enabled on one or more columns.
Adding a new column with text index enabled to an existing table.
Enabling text index on an existing column.
Since we haven't yet removed the old way of specifying the
Once the text index is enabled on one or more columns through table config, our segment generation code will pick up the config and automatically create text index (per column). This is exactly how other indexes in Pinot are created.
Text index is supported for both offline and realtime segments.
The original text document (a value in the column with text index enabled) is parsed, tokenized and individual "indexable" terms are extracted. These terms are inserted into the index.
Pinot's text index is built on top of Lucene. Lucene's standard english text tokenizer generally works well for most classes of text. We might want to build custom text parser and tokenizer to suit particular user requirements. Accordingly, we can make this configurable for the user to specify on per column text index basis.
A new built-in function TEXT_MATCH has been introduced for using text search in SQL/PQL.
TEXT_MATCH(text_column_name, search_expression)
text_column_name - name of the column to do text search on.
search_expression - search query
We can use TEXT_MATCH function as part of our queries in the WHERE clause. Examples:
We can also use the TEXT_MATCH filter clause with other filter operators. For example:
Combining multiple TEXT_MATCH filter clauses
TEXT_MATCH can be used in WHERE clause of all kinds of queries supported by Pinot
Selection query which projects one or more columns
User can also include the text column name in select list
Aggregation query
Aggregation GROUP BY query
The search expression (second argument to TEXT_MATCH function) is the query string that Pinot will use to perform text search on the column's text index. **Following expression types are supported
This query is used to do exact match of a given phrase. Exact match implies that terms in the user specified phrase should appear in the exact same order in the original text document. Note that document is referred to as the column value.
Let's take the example of resume text data containing 14 documents to walk through queries. The data is stored in column named SKILLS_COL and we have created a text index on this column.
Example 1 - Search in SKILL_COL column to look for documents where each matching document MUST contain phrase "distributed systems" as is
The search expression is '\"Distributed systems\"'
The search expression is always specified within single quotes '<your expression>'
Since we are doing a phrase search, the phrase should be specified within double quotes inside the single quotes and the double quotes should be escaped
'\"<your phrase>\"'
The above query will match the following documents: