HDFS Balancers

HDFS data might not always be distributed uniformly across DataNodes. One common reason is the addition of new DataNodes to an existing cluster. HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage. The balancer does not balance between individual volumes on a single DataNode.
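The balance criterion described above can be written as a small check. The following is a minimal sketch, not the balancer's actual implementation; the helper name is_balanced and the sample figures are hypothetical and chosen only for illustration.

```shell
# Sketch of the balance criterion (illustrative only; is_balanced is a
# hypothetical helper, not part of HDFS).
# Args: node_used node_capacity cluster_used cluster_capacity threshold_percent
is_balanced() {
  awk -v nu="$1" -v nc="$2" -v cu="$3" -v cc="$4" -v t="$5" 'BEGIN {
    node_util    = 100 * nu / nc   # utilization of this DataNode, in percent
    cluster_util = 100 * cu / cc   # utilization of the whole cluster, in percent
    diff = node_util - cluster_util
    if (diff < 0) diff = -diff     # absolute difference
    if (diff <= t) exit 0          # exit status 0 means "balanced"
    exit 1
  }'
}

# Example: a node at 55% utilization in a cluster at 40%, with a 10% threshold.
if is_balanced 55 100 400 1000 10; then
  echo "balanced"
else
  echo "needs rebalancing"
fi
```

With these sample values the node is 15 percentage points above the cluster average, so it falls outside a 10% threshold.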

Continue reading:

Configuring and Running the HDFS Balancer Using Cloudera Manager
Configuring and Running the HDFS Balancer Using the Command Line

Configuring and Running the HDFS Balancer Using Cloudera Manager

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

In Cloudera Manager, the HDFS balancer utility is implemented by the Balancer role. The Balancer role usually shows a health of None on the HDFS Instances tab because it does not run continuously.

The Balancer role is normally added (by default) when the HDFS service is installed. If it has not been added, you must add a Balancer role to rebalance HDFS and to see the Rebalance action.

Configuring the Balancer Threshold

The Balancer has a default threshold of 10%, which ensures that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10%. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the Balancer ensures that each DataNode's disk usage is between 30% and 50% of that DataNode's disk-storage capacity. To change the threshold:

Go to the HDFS service.
Click the Configuration tab.
Select Scope > Balancer.
Select Category > Main.
Set the Rebalancing Threshold property.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
Click Save Changes to commit the changes.
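The threshold arithmetic from the 40% example above can be sketched as follows. This is only an illustration of how the acceptable range follows from the threshold; the variable names are assumptions and do not correspond to any Cloudera Manager or HDFS setting names.

```shell
# Sketch: acceptable DataNode utilization range for a given threshold.
# The values come from the example above; the variable names are illustrative.
cluster_util=40   # overall cluster utilization, in percent
threshold=10      # Rebalancing Threshold, in percent

lower=$(( cluster_util - threshold ))
upper=$(( cluster_util + threshold ))

echo "DataNode utilization should fall between ${lower}% and ${upper}%"
```

Lowering the threshold narrows this range, which generally means more blocks must move before the cluster counts as balanced.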

Configuring Concurrent Moves

The property dfs.datanode.balance.max.concurrent.moves sets the maximum number of threads used by the DataNode balancer for pending moves. It is a throttling mechanism that prevents the balancer from taking too many resources from the DataNode and interfering with normal cluster operations. Increasing the value allows the balancing process to complete more quickly. Decreasing the value makes rebalancing slower but less likely to compete for resources with other tasks on the DataNode. To use this property, you must set the value on both the DataNode and the Balancer.

To configure the DataNode:
- Go to the HDFS service.
- Click the Configuration tab.
- Search for DataNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
- Add the following code to the configuration field, for example, setting the value to 50.
```
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>50</value>
</property>
```
- Restart the DataNode.

To configure the Balancer:
1. Go to the HDFS service.
2. Click the Configuration tab.
3. Search for Balancer Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
4. Add the following code to the configuration field, for example, setting the value to 50.
```
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>50</value>
</property>
```

Running the Balancer

Go to the HDFS service.
Ensure the service has a Balancer role.
Select Actions > Rebalance.
Click Rebalance to confirm. If you see a Finished status, the Balancer ran successfully.

Configuring and Running the HDFS Balancer Using the Command Line

The HDFS balancer rebalances data across the DataNodes, moving blocks from overutilized to underutilized nodes. As the system administrator, you can run the balancer from the command line as necessary, for example, after adding new DataNodes to the cluster.

Points to note:

The balancer requires the capabilities of an HDFS superuser (for example, the hdfs user) to run.
The balancer does not balance between individual volumes on a single DataNode.
You can run the balancer without parameters, as follows:
```
sudo -u hdfs hdfs balancer
```
Note: If Kerberos is enabled, do not use commands of the form sudo -u <user> hadoop <command>; they fail with a security error. Instead, first run kinit <user> (if you are using a password) or kinit -kt <keytab> <principal> (if you are using a keytab), and then run each <command> directly as that user.
Running the balancer without parameters uses the default threshold of 10%, meaning that the balancer ensures that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10%. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the balancer ensures that each DataNode's disk usage is between 30% and 50% of that DataNode's disk-storage capacity.
You can run the script with a different threshold; for example:
```
sudo -u hdfs hdfs balancer -threshold 5
```
This specifies that each DataNode's disk usage must be (or will be adjusted to be) within 5% of the cluster's overall usage.
You can adjust the network bandwidth used by the balancer by running the dfsadmin -setBalancerBandwidth command before you run the balancer; for example:
```
sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth newbandwidth
```
where newbandwidth is the maximum amount of network bandwidth, in bytes per second, that each DataNode can use during the balancing operation. For more information about setBalancerBandwidth and other HDFS command-line options, see the dfsadmin documentation.
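Because the value is given in bytes per second, it can help to derive it from a more familiar unit. The following sketch assumes a limit expressed in MB/s (taken here as 1024 * 1024 bytes); the 100 MB/s figure and the variable names are arbitrary examples, and the command is only printed, not run.

```shell
# Sketch: convert a bandwidth limit in MB/s to the bytes-per-second value
# that -setBalancerBandwidth expects. The 100 MB/s figure is an arbitrary example.
limit_mb_per_sec=100
newbandwidth=$(( limit_mb_per_sec * 1024 * 1024 ))

# Print the command instead of running it, so the sketch is safe to try anywhere.
echo "sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth ${newbandwidth}"
```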
The property dfs.datanode.balance.max.concurrent.moves sets the maximum number of threads used by the DataNode balancer for pending moves. It is a throttling mechanism that prevents the balancer from taking too many resources from the DataNode and interfering with normal cluster operations. Increasing the value allows the balancing process to complete more quickly. Decreasing the value makes rebalancing slower but less likely to compete for resources with other tasks on the DataNode. Adjust the value of this property in the /etc/hadoop/[service name]/hdfs-site.xml configuration file.
```
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>50</value>
</property>
```
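To confirm which value a given hdfs-site.xml file actually contains, a small lookup can be handy. The sketch below uses a hypothetical helper, get_hdfs_property, and assumes that the <name> and <value> elements each sit on their own line, as in the snippet above; it runs against a temporary demo file rather than a live configuration.

```shell
# Sketch: print the value of a property from an hdfs-site.xml file.
# get_hdfs_property is a hypothetical helper; it assumes <name> and <value>
# each appear on their own line, with <value> following <name>.
get_hdfs_property() {
  # $1 = path to hdfs-site.xml, $2 = property name
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }
  ' "$1"
}

# Demo against a temporary file rather than a live configuration.
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
<configuration>
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>50</value>
</property>
</configuration>
EOF
moves=$(get_hdfs_property "$demo_conf" dfs.datanode.balance.max.concurrent.moves)
echo "max concurrent moves: ${moves}"
rm -f "$demo_conf"
```

On a host with the HDFS client installed, hdfs getconf -confKey dfs.datanode.balance.max.concurrent.moves reports the value from the configuration that the client loads, which may differ from the file a particular DataNode reads.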
The balancer can take a long time to run, especially if you are running it for the first time or do not run it regularly.
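The commands discussed in this section can be assembled in one place. The sketch below is a dry run only: build_balancer_commands is a hypothetical helper that prints the commands instead of executing them, and the keytab path and principal are placeholder values. When a keytab and principal are supplied, it follows the Kerberos note above by printing a kinit command and dropping the sudo -u prefix.

```shell
# Sketch: assemble the balancer commands for review (dry run, nothing is executed).
# build_balancer_commands is a hypothetical helper.
build_balancer_commands() {
  # $1 = threshold (percent), $2 = bandwidth (bytes per second)
  # $3 = keytab path, $4 = principal (both optional; supply them when Kerberos is enabled)
  if [ -n "${3:-}" ] && [ -n "${4:-}" ]; then
    prefix=""
    echo "kinit -kt $3 $4"
  else
    prefix="sudo -u hdfs "
  fi
  echo "${prefix}hdfs dfsadmin -setBalancerBandwidth $2"
  echo "${prefix}hdfs balancer -threshold $1"
}

# Without Kerberos: 5% threshold, 104857600 bytes per second.
build_balancer_commands 5 104857600

# With Kerberos (placeholder keytab and principal).
build_balancer_commands 5 104857600 /path/to/hdfs.keytab hdfs@EXAMPLE.COM
```

Printing the commands first makes it easy to review the threshold and bandwidth before starting a run that can take a long time.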
