Configuration Recommendation Engine
This page describes the automated mechanisms we have for recommending a suitable configuration for your deployment.
Recommendation Engine is a rule based engine that recommends optimal configuration options for Pinot tables. The configuration options currently covered by the engine are mostly TableConfig related (e.g indexes, realtime config). Please note that not all configuration options in TableConfig are currently covered. The following table shows the ones that are currently covered.
Recommendation Engine can be used to optimize the configuration parameters for both new and existing tables. Also since the recommendation engine tries to generate near-optimal configurations, users are strongly encouraged to provide the input information to the best of their knowledge. It is ok if the information is not fully accurate. However, random/arbitrary and incomplete information will not help the Recommendation Engine’s algorithms.
How to use the engine
The engine is currently accessible by a REST endpoint on Pinot Controller. It's a PUT endpoint under /tables/recommender path which takes a json as its input and produce a json as its output. You can try it out in Swagger REST API section of Pinot Web UI.
The input needs to be provided in json format. Different fields of the input can be categorized into the following four groups.
1. Data Characteristics
Data characteristics is defined in "schema" field. The content of this field is an extended version of Pinot Schema which has some additional metadata that's inserted into the definition of each column:
cardinality Total number of unique values for this dimension. Please provide cardinality to the best of your knowledge to help generate the best possible recommendations.
numValuesPerEntry For multi-value columns only, this is the average number of values per column value. Please note that:
For multi-value columns, other than numValuesPerEntry, singleValueField has to be defined as false.
2. Table Characteristics
For realtime/hybrid tables,
Kafka partitions [Optional] Fill this field with the partition count suggested by Kafka team. If this field is not provided, the engine tries to recommend the optimal number of partitions for the Kafka topic.
Kafka messages per second [Required] Average number of messages go to the Kafka topic per second.
In this section, lets describe the rules which generate recommendations for different configurations:
Segment Size - This rule recommends the following parameters for offline tables: 1) number of segments, 2) number of records in each segment, and 3) size of each segment. For new tables, the rule uses the provided data characteristics to find the optimal values for these parameters. If your table already exists in production, you can obtain the following information from existing segments: number of rows in segment and segment size. Then add them as actualSegmentSize and numRowsInActualSegment parameters of segment size rule. You'll see an example for this in Overrides section.
Kafka Partitions - If the number of Kafka partitions is not already determined/provided, this rules recommends a value for it. It requires topic aggregate message rate (number of messages per seconds across all partitions of the topic) to drive the optimal number of partitions.
All rules run by default. You have the option to select the ones you want to run.
Note that the following rules only apply to realtime or hybrid tables:
And the following rule only applies to offline or hybrid tables:
Each rule may have some parameters with default values. You can change these default values and consequently change the behavior of the rules. As an example, in the Recommend Table Partitioning rule, we do not recommend partitioning for low QPS tables. To change the default behavior, you can use the followings:
As a second example, let’s look at Segment Size rule. This rule generates a segment based on the provided data characteristics. Parameter numRowsInGeneratedSegment controls the size of the generated segment. Parameter desiredSegmentSizeMB specifies the desired ideal segment size. The default values are 50,000 rows and 500MB:
Or in case your table is already in production and you know the actual segment size and number of rows in an actual segment, you can do:
The detailed list and usages of the parameters are documented in detail in and this page.
One last item to explain for the input json is the overridden configs. You can instruct the recommendation engine to not recommend some config values and instead use your provided values for those configs. This way, the Rule Engine tries to find the optimal value for other parameter with respsect to the fixed parameters that you have provided. As an example, you can say I want the followings to be my inverted index columns and range index columns:
Currently, the engine recommends the following configs:
Indexing Config:
Inverted Index Columns The columns to apply inverted (bitmap) indices on.
Primary Sorted Column The ONE column used to sort all the data during the segment generation.
We are planning to add more rules in the near future.
Input:
Output: