Rebalance Servers

Rebalance operation is used to recompute assignment of brokers or servers in the cluster. This is not a single command, but more of a series of steps that need to be taken.

In case of servers, rebalance operation is used to balance the distribution of the segments amongst the servers being used by a Pinot table. This is typically done after capacity changes, or config changes such as replication or segment assignment strategies.

Changes that need to be followed by a rebalance

Here's some common scenarios where the changes need to be followed by a rebalance.

  1. Capacity changes

  2. Increasing/decreasing replication for a table

  3. Changing segment assignment for a table

Capacity changes

These are typically done when downsizing/uplifting a cluster, or replacing nodes of a cluster.

Tenants and tags

Every server added to the Pinot cluster, has tags associated with it. A group of servers with the same tag forms a Server Tenant. By default, a server in the cluster gets added to the DefaultTenant i.e. gets tagged as DefaultTenant_OFFLINE and DefaultTenant_REALTIME. Below is an example of how this looks in the znode, as seen in ZooInspector.

A Pinot table config has a tenants section, to define the tenant to be used by the table. The Pinot table will use all the servers which belong to the tenant as described in this config. More details about this in the Tenants section.

 {   
    "tableName": "myTable_OFFLINE",
    "tenants" : {
      "broker":"DefaultTenant",
      "server":"DefaultTenant"
    }
  }

Updating tags

Using master or 0.6.0 onwards

In order to change the server tags, the following API can be used.

PUT /instances/{instanceName}/updateTags?tags=<comma separated tags>

0.5.0 and prior

UpdateTags API is not available in 0.5.0 and prior. Instead use this API to update the Instance.

PUT /instances/{instanceName}

For example,

curl -X PUT "http://localhost:9000/instances/Server_10.1.10.51_7000" 
    -H "accept: application/json" 
    -H "Content-Type: application/json" 
    -d "{ \"host\": \"10.1.10.51\", \"port\": \"7000\", \"type\": \"SERVER\", \"tags\": [ \"newName_OFFLINE\", \"DefaultTenant_REALTIME\" ]}"

NOTE

The output of GET and input of PUT don't match for this API. Please make sure to use the right payload as shown in example above. Particularly, notice that instance name "Server_host_port" gets split up into their own fields in this PUT API.

Replication changes

In order to make change to the replication factor of a table, update the table config as follows

OFFLINE table - update the replication field

REALTIME table - update the replicasPerPartition field

Segment Assignment changes

The most common segment assignment change would be to move from the default segment assignment to replica group segment assignment. Discussing the details of the segment assignment is beyond the scope of this page. More details can be found in Routing and in this FAQ question.

Running a Rebalance

After any of the above described changes are done, a rebalance is needed to make those changes take effect.

To run a rebalance, use the following API.

POST /tables/{tableName}/rebalance?type=<OFFLINE/REALTIME>

This API has a lot of knobs to control various behaviors. Make sure to go over them and change the defaults as needed.

Note

Typically, the flags that need to be changed from defaults are

includeConsuming=true for REALTIME

downtime=true if you have only 1 replica, or prefer faster rebalance at the cost of a momentary downtime

Query param

Default value

Description

dryRun

false

If set to true, rebalance is run as a dry-run so that you can see the expected changes to the ideal state and instance partition assignment.

includeConsuming

false

Applicable for REALTIME tables. CONSUMING segments are rebalanced only if this is set to true. Moving a CONSUMING segment involves dropping the data consumed so far on old server, and re-consuming on the new server. If an application is sensitive to increased memory utilization due to re-consumption or to a momentary data staleness, they may choose to not include consuming in the rebalance. Whenever the CONSUMING segment completes, the completed segment will be assigned to the right instances, and the new CONSUMING segment will also be started on the correct instances. If you choose to includeConsuming=false and let the segments move later on, any downsized nodes need to remain untagged in the cluster, until the segment completion happens.

downtime

false

This controls whether Pinot allows downtime while rebalancing. If downtime = true, all replicas of a segment can be moved around in one go, which could result in a momentary downtime for that segment (time gap between ideal state updated to new servers and new servers downloading the segments). If downtime = false, Pinot will make sure to keep certain number of replicas (config in next row) always up. The rebalance will be done in multiple iterations under the hood, in order to fulfill this constraint.

Note: If you have only 1 replica for your table, rebalance with downtime=false is not possible.

minAvailableReplicas

1

Applicable for rebalance with downtime=false.

This is the minimum number of replicas that are expected to stay alive through the rebalance.

bestEfforts

false

Applicable for rebalance with downtime=false.

If a no-downtime rebalance cannot be performed successfully, this flag controls whether to fail the rebalance or do a best-effort rebalance.

reassignInstances

false

Applicable to tables where the instance assignment has been persisted to zookeeper. Setting this to true will make the rebalance first update the instance assignment, and then rebalance the segments.

bootstrap

false

Rebalances all segments again, as if adding segments to an empty table. If this is false, then the rebalance will try to minimize segment movements.

You can check the status of the rebalance by

  1. Checking the controller logs

  2. Running rebalance again after a while, you should receive status "status": "NO_OP"

  3. Checking the External View of the table, to see the changes in capacity/replicas have taken effect.

Last updated