1 of 1

Configuration Recommendation Engine

This page describes the automated mechanisms we have for recommending a suitable configuration for your deployment.

Overview

Recommendation Engine is a rule based engine that recommends optimal configuration options for Pinot tables. The configuration options currently covered by the engine are mostly TableConfig related (e.g indexes, real-time config). Note that not all configuration options in TableConfig are currently covered. The following table shows the ones that are currently covered.

Rule

Config Entity

Config Name

Applicable Table Type

Recommendation Engine can be used to optimize the configuration parameters for both new and existing tables. Also since the recommendation engine tries to generate near-optimal configurations, users are strongly encouraged to provide the input information to the best of their knowledge. It is ok if the information is not fully accurate. However, random/arbitrary and incomplete information will not help the Recommendation Engine’s algorithms.

Also see the section on .

How to use the engine

The engine is currently accessible by a REST endpoint on Pinot Controller. It's a PUT endpoint under /tables/recommender path which takes a json as its input and produce a json as its output. You can try it out in Swagger REST API section of Pinot Web UI.

Input

The input needs to be provided in json format. Different fields of the input can be categorized into the following four groups.

1. Data Characteristics

Data characteristics is defined in "schema" field. The content of this field is an extended version of Pinot Schema which has some additional metadata that's inserted into the definition of each column:

Cardinality: Total number of unique values for this dimension. Provide cardinality to the best of your knowledge to help generate the best possible recommendations.
numValuesPerEntry: For multi-value columns only, this is the average number of values per column value. Note that:
- For multi-value columns, other than numValuesPerEntry, singleValueField has to be defined as false.

2. Table Characteristics

For real-time/hybrid tables,
- Kafka partitions [Optional] Fill this field with the partition count suggested by Kafka team. If this field is not provided, the engine tries to recommend the optimal number of partitions for the Kafka topic.
- Kafka messages per second [Required] Average number of messages go to the Kafka topic per second.

3. Rules

In this section, lets describe the rules which generate recommendations for different configurations:

Segment Size - This rule recommends the following parameters for offline tables: 1) number of segments, 2) number of records in each segment, and 3) size of each segment. For new tables, the rule uses the provided data characteristics to find the optimal values for these parameters. If your table already exists in production, you can obtain the following information from existing segments: number of rows in segment and segment size. Then add them as actualSegmentSize and numRowsInActualSegment parameters of segment size rule. You'll see an example for this in Overrides section.
Kafka Partitions - If the number of Kafka partitions is not already determined/provided, this rules recommends a value for it. It requires topic aggregate message rate (number of messages per seconds across all partitions of the topic) to drive the optimal number of partitions.

All rules run by default. You have the option to select the ones you want to run.

Note that the following rules only apply to real-time or hybrid tables:

Kafka Partitions
Real-time Provisioning
Aggregate Metrics

And the following rule only applies to offline or hybrid tables:

Segment Size

4. Overrides

Each rule may have some parameters with default values. You can change these default values and consequently change the behavior of the rules. As an example, in the Recommend Table Partitioning rule, we do not recommend partitioning for low QPS tables. To change the default behavior, you can use the followings:

As a second example, let’s look at Segment Size rule. This rule generates a segment based on the provided data characteristics. Parameter numRowsInGeneratedSegment controls the size of the generated segment. Parameter desiredSegmentSizeMB specifies the segment size you want. The default values are 50,000 rows and 500MB:

Or in case your table is already in production and you know the actual segment size and number of rows in an actual segment, you can do:

The detailed list and usages of the parameters are documented in detail in and this page.

One last item to explain for the input json is the overridden configs. You can instruct the recommendation engine to not recommend some config values and instead use your provided values for those configs. This way, the Rule Engine tries to find the optimal value for other parameter with respsect to the fixed parameters that you have provided. As an example, you can say I want the followings to be my inverted index columns and range index columns:

Output

Currently, the engine recommends the following configs:

Indexing Config:
- Inverted Index Columns The columns to apply inverted (bitmap) indices on.
- Primary Sorted Column The ONE column used to sort all the data during the segment generation.

We are planning to add more rules in the near future.

Sample run

Input:

Output:

Configuration Recommendation Engine

This page describes the automated mechanisms we have for recommending a suitable configuration for your deployment.

Overview

Rule

Config Entity

Config Name

Applicable Table Type

Also see the section on .

How to use the engine

Input

The input needs to be provided in json format. Different fields of the input can be categorized into the following four groups.

1. Data Characteristics

Cardinality: Total number of unique values for this dimension. Provide cardinality to the best of your knowledge to help generate the best possible recommendations.
numValuesPerEntry: For multi-value columns only, this is the average number of values per column value. Note that:
- For multi-value columns, other than numValuesPerEntry, singleValueField has to be defined as false.

2. Table Characteristics

For real-time/hybrid tables,
- Kafka partitions [Optional] Fill this field with the partition count suggested by Kafka team. If this field is not provided, the engine tries to recommend the optimal number of partitions for the Kafka topic.
- Kafka messages per second [Required] Average number of messages go to the Kafka topic per second.

3. Rules

In this section, lets describe the rules which generate recommendations for different configurations:

Segment Size - This rule recommends the following parameters for offline tables: 1) number of segments, 2) number of records in each segment, and 3) size of each segment. For new tables, the rule uses the provided data characteristics to find the optimal values for these parameters. If your table already exists in production, you can obtain the following information from existing segments: number of rows in segment and segment size. Then add them as actualSegmentSize and numRowsInActualSegment parameters of segment size rule. You'll see an example for this in Overrides section.
Kafka Partitions - If the number of Kafka partitions is not already determined/provided, this rules recommends a value for it. It requires topic aggregate message rate (number of messages per seconds across all partitions of the topic) to drive the optimal number of partitions.

All rules run by default. You have the option to select the ones you want to run.

Note that the following rules only apply to real-time or hybrid tables:

Kafka Partitions
Real-time Provisioning
Aggregate Metrics

And the following rule only applies to offline or hybrid tables:

Segment Size

4. Overrides

Or in case your table is already in production and you know the actual segment size and number of rows in an actual segment, you can do:

The detailed list and usages of the parameters are documented in detail in and this page.

Output

Currently, the engine recommends the following configs:

Indexing Config:
- Inverted Index Columns The columns to apply inverted (bitmap) indices on.
- Primary Sorted Column The ONE column used to sort all the data during the segment generation.

We are planning to add more rules in the near future.

Sample run

Input:

Output:

Configuration Recommendation Engine

hashtagOverview

hashtagHow to use the engine

hashtagInput

hashtag1. Data Characteristics

hashtag2. Table Characteristics

hashtag3. Rules

hashtag4. Overrides

hashtagOutput

hashtagSample run

Configuration Recommendation Engine

hashtagOverview

hashtagHow to use the engine

hashtagInput

hashtag1. Data Characteristics

hashtag2. Table Characteristics

hashtag3. Rules

hashtag4. Overrides

hashtagOutput

hashtagSample run

Overview

How to use the engine

Input

1. Data Characteristics

2. Table Characteristics

3. Rules

4. Overrides

Output

Sample run

Overview

How to use the engine

Input

1. Data Characteristics

2. Table Characteristics

3. Rules

4. Overrides

Output

Sample run