Setting Up MapReduce v2 with YARN Using the Command Line
This section describes configuration tasks for YARN clusters only, and is specifically tailored for administrators who have installed YARN from packages.
Continue reading:
- About MapReduce v2 (YARN)
- Step 1: Configure Properties for YARN Clusters
- Step 2: Configure YARN daemons
- Step 3: Configure the JobHistory Server
- Step 4: Configure the Staging Directory
- Step 5: If Necessary, Deploy your Custom Configuration to your Entire Cluster
- Step 6: If Necessary, Start HDFS on Every Host in the Cluster
- Step 7: If Necessary, Create the HDFS /tmp Directory
- Step 8: Create the history Directory and Set Permissions
- Step 9: Start YARN and the MapReduce JobHistory Server
- Step 10: Create a Home Directory for each MapReduce User
- Step 11: Configure the Hadoop Daemons to Run at Startup
About MapReduce v2 (YARN)
The default installation in CDH 5 is MapReduce 2.x (MRv2), built on the YARN framework. In this document we usually refer to this new version as YARN. The fundamental idea of MRv2's YARN architecture is to split the two primary responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager (RM) and per-application ApplicationMasters (AM). With MRv2, the ResourceManager and per-host NodeManagers (NM) form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on worker hosts instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManagers to run and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).
Step 1: Configure Properties for YARN Clusters
| Property | Configuration File | Description |
|---|---|---|
| mapreduce.framework.name | mapred-site.xml | If you plan on running YARN, you must set this property to the value of yarn. |
Sample Configuration:
mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
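After editing mapred-site.xml, it can help to confirm that the property reads back as expected. The following is a minimal sketch, not a Hadoop tool: `get_property` is a hypothetical helper that assumes each `<name>` element is directly followed by its `<value>`, and it runs against a throwaway file rather than your real configuration.

```shell
# Print the <value> of a named property from a Hadoop *-site.xml file.
# Sketch only: assumes <value> directly follows <name> inside the property
# and does not handle XML comments or includes.
get_property() {
  file=$1
  # Escape dots so they match literally in the sed pattern.
  name=$(printf '%s' "$2" | sed 's/\./\\./g')
  tr -d '\n' < "$file" |
    sed -n "s|.*<name>[[:space:]]*${name}[[:space:]]*</name>[[:space:]]*<value>[[:space:]]*\([^<]*\)</value>.*|\1|p"
}

# Demo against a temporary copy of the sample configuration.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

if [ "$(get_property "$demo" mapreduce.framework.name)" = "yarn" ]; then
  echo "mapreduce.framework.name is set to yarn"
else
  echo "mapreduce.framework.name is missing or not yarn" >&2
fi
rm -f "$demo"
```

To check a real host, pass the path of your own mapred-site.xml instead of the temporary file.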
Step 2: Configure YARN daemons
Configure the following services: ResourceManager (on a dedicated host) and NodeManager (on every host where you plan to run MapReduce v2 jobs).
The following table shows the most important properties that you must configure for your cluster in yarn-site.xml:

| Property | Recommended value | Description |
|---|---|---|
| yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for MapReduce applications. |
| yarn.resourcemanager.hostname | resourcemanager.company.com | The following properties will be set to their default ports on this host: yarn.resourcemanager.address, yarn.resourcemanager.admin.address, yarn.resourcemanager.scheduler.address, yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.webapp.address |
| yarn.application.classpath | $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*, $HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*, $HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*, $HADOOP_MAPRED_HOME/lib/*, $HADOOP_YARN_HOME/*, $HADOOP_YARN_HOME/lib/* | Classpath for typical applications. |
| yarn.log.aggregation-enable | true | |
Next, you need to specify, create, and assign the correct permissions to the local directories where you want the YARN daemons to store data.
You specify the directories by configuring the following properties in the yarn-site.xml file on all cluster hosts:

| Property | Description |
|---|---|
| yarn.nodemanager.local-dirs | Specifies the URIs of the directories where the NodeManager stores its localized files. All of the files required for running a particular YARN application are placed here for the duration of the application run. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/local through file:///data/N/yarn/local. |
| yarn.nodemanager.log-dirs | Specifies the URIs of the directories where the NodeManager stores container log files. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/logs through file:///data/N/yarn/logs. |
| yarn.nodemanager.remote-app-log-dir | Specifies the URI of the directory where logs are aggregated. Set the value to either hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps, using the fully qualified domain name of your NameNode host, or hdfs:/var/log/hadoop-yarn/apps. |
Here is an example configuration:
yarn-site.xml:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager.company.com</value>
</property>
<property>
  <description>Classpath for typical applications.</description>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
    $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
    $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
    $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
  </value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
<property>
  <name>yarn.log.aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <description>Where to aggregate logs</description>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps</value>
</property>
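The local-dirs and log-dirs values in the example follow a regular pattern, one file:// URI per JBOD mount point. As an illustration only, the hypothetical helper below builds such a comma-separated list for N mount points, assuming they are named /data/1 through /data/N as in the example.

```shell
# Build the comma-separated file:// URI list for <count> mount points.
# Usage: yarn_dir_list <count> <subdir>    e.g. yarn_dir_list 3 local
# Assumes mount points named /data/1 ... /data/<count>, as in the example.
yarn_dir_list() {
  count=$1
  sub=$2
  list=""
  i=1
  while [ "$i" -le "$count" ]; do
    # Append a comma only when the list already has an entry.
    list="${list:+$list,}file:///data/$i/yarn/$sub"
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

yarn_dir_list 3 local
yarn_dir_list 3 logs
```

With a count of 3, the two calls print the same values used for yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in the example configuration.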
After specifying these directories in the yarn-site.xml file, you must create the directories and assign the correct file permissions to them on each host in your cluster.
In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration.
To configure local storage directories for use by YARN:
- Create the yarn.nodemanager.local-dirs local directories:
$ sudo mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
- Create the yarn.nodemanager.log-dirs local directories:
$ sudo mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
- Set the owner of the yarn.nodemanager.local-dirs directories to the yarn user:
$ sudo chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
- Set the owner of the yarn.nodemanager.log-dirs directories to the yarn user:
$ sudo chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
Here is a summary of the correct owner and permissions of the local directories:
| Directory | Owner | Permissions |
|---|---|---|
| yarn.nodemanager.local-dirs | yarn:yarn | drwxr-xr-x |
| yarn.nodemanager.log-dirs | yarn:yarn | drwxr-xr-x |
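The directory steps above have to be repeated on every host, so a small script may be convenient. This is a sketch under stated assumptions: `setup_yarn_dirs` and the `RUN` variable are illustrative names, and by default the script only prints the commands it would run.

```shell
# Create the NodeManager local and log directories under each mount point
# and give them to the yarn user.
# Dry run by default: each command is only printed. To execute for real,
# set RUN to an empty string, for example: RUN= sh setup-yarn-dirs.sh
setup_yarn_dirs() {
  for mount in "$@"; do
    for sub in local logs; do
      ${RUN-echo} sudo mkdir -p "$mount/yarn/$sub"
      ${RUN-echo} sudo chown -R yarn:yarn "$mount/yarn/$sub"
    done
  done
}

# Mount points from the example above; change them to match your hosts.
setup_yarn_dirs /data/1 /data/2 /data/3 /data/4
```

After a real run, `ls -ld /data/1/yarn/local` should show the owner and permissions listed in the summary table.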
Step 3: Configure the JobHistory Server
| Property | Recommended value | Description |
|---|---|---|
| mapreduce.jobhistory.address | historyserver.company.com:10020 | The address of the JobHistory Server host:port |
| mapreduce.jobhistory.webapp.address | historyserver.company.com:19888 | The address of the JobHistory Server web application host:port |
In addition, make sure proxying is enabled for the mapred user; configure the following properties in core-site.xml:
| Property | Recommended value | Description |
|---|---|---|
| hadoop.proxyuser.mapred.groups | * | Allows the mapred user to move files belonging to users in these groups |
| hadoop.proxyuser.mapred.hosts | * | Allows the mapred user to move files belonging to users on these hosts |
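For reference, the Step 3 properties can be written out as XML fragments in the same form as the earlier sample configurations. The sketch below writes them to scratch files for review rather than editing any live configuration; `OUT_DIR` and `JHS_HOST` are illustrative variables, and placing the JobHistory properties in mapred-site.xml is an assumption based on where the other mapreduce.jobhistory.* properties are configured in Step 4.

```shell
# Write the Step 3 properties as XML fragments for review before merging
# them into the real files. Nothing here touches the live configuration.
OUT_DIR=${OUT_DIR:-$(mktemp -d)}
JHS_HOST=${JHS_HOST:-historyserver.company.com}

# JobHistory Server addresses (assumed to belong in mapred-site.xml).
cat > "$OUT_DIR/mapred-site.fragment.xml" <<EOF
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>${JHS_HOST}:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>${JHS_HOST}:19888</value>
</property>
EOF

# Proxy-user settings for the mapred user (core-site.xml).
cat > "$OUT_DIR/core-site.fragment.xml" <<'EOF'
<property>
  <name>hadoop.proxyuser.mapred.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapred.hosts</name>
  <value>*</value>
</property>
EOF

echo "Fragments written to $OUT_DIR"
```

Each fragment is meant to be pasted inside the `<configuration>` element of the corresponding file.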
Step 4: Configure the Staging Directory
YARN requires a staging directory for temporary files created by running jobs. By default it creates /tmp/hadoop-yarn/staging with restrictive permissions that may prevent your users from running jobs. To forestall this, you should configure and create the staging directory yourself; in the example that follows we use /user:
- Configure yarn.app.mapreduce.am.staging-dir in mapred-site.xml:
<property> <name>yarn.app.mapreduce.am.staging-dir</name> <value>/user</value> </property>
- Once HDFS is up and running, you will create this directory and a history subdirectory under it (see Step 8).
Alternatively, you can do the following:
- Configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in mapred-site.xml.
- Create these two directories.
- Set permissions on mapreduce.jobhistory.intermediate-done-dir to 1777.
- Set permissions on mapreduce.jobhistory.done-dir to 750.
If you configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir as above, you can skip Step 8.
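The alternative above could be scripted roughly as follows. The two paths are hypothetical placeholders and must match whatever you set for mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir. The chown to mapred:hadoop is an assumption that mirrors the ownership used in Step 8. The script prints the commands unless `RUN` is set to an empty string.

```shell
# Hypothetical HDFS paths; use the values from your mapred-site.xml.
INTERMEDIATE_DIR=${INTERMEDIATE_DIR:-/mr-history/tmp}
DONE_DIR=${DONE_DIR:-/mr-history/done}

# Dry run by default: commands are printed, not executed.
# Set RUN to an empty string to execute (HDFS must already be running).
create_history_dirs() {
  ${RUN-echo} sudo -u hdfs hadoop fs -mkdir -p "$INTERMEDIATE_DIR" "$DONE_DIR"
  ${RUN-echo} sudo -u hdfs hadoop fs -chmod 1777 "$INTERMEDIATE_DIR"
  ${RUN-echo} sudo -u hdfs hadoop fs -chmod 750 "$DONE_DIR"
  # Assumption: same owner and group as the history directory in Step 8.
  ${RUN-echo} sudo -u hdfs hadoop fs -chown mapred:hadoop "$INTERMEDIATE_DIR" "$DONE_DIR"
}

create_history_dirs
```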
Step 5: If Necessary, Deploy your Custom Configuration to your Entire Cluster
Deploy the configuration if you have not already done so.
Step 6: If Necessary, Start HDFS on Every Host in the Cluster
Start HDFS if you have not already done so.
Step 7: If Necessary, Create the HDFS /tmp Directory
Create the /tmp directory if you have not already done so.
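This section does not list the commands for this step. A minimal sketch, assuming the conventional world-writable, sticky-bit mode (1777) for the HDFS /tmp directory, is shown below. `ensure_hdfs_tmp` is an illustrative name, and the commands are only printed unless `RUN` is set to an empty string.

```shell
# Create /tmp in HDFS and open it up to all users.
# Assumption: mode 1777 (world-writable with sticky bit), by analogy with
# a local /tmp. -mkdir -p does not fail if the directory already exists.
# Dry run by default; set RUN to an empty string to execute.
ensure_hdfs_tmp() {
  ${RUN-echo} sudo -u hdfs hadoop fs -mkdir -p /tmp
  ${RUN-echo} sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
}

ensure_hdfs_tmp
```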
Step 8: Create the history Directory and Set Permissions
This is a subdirectory of the staging directory you configured in Step 4. In this example we're using /user/history. Create it and set permissions as follows:
$ sudo -u hdfs hadoop fs -mkdir -p /user/history
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
Step 9: Start YARN and the MapReduce JobHistory Server
To start YARN, start the ResourceManager and NodeManager services:
On the ResourceManager system:
$ sudo service hadoop-yarn-resourcemanager start
On each NodeManager system (typically the same hosts where the DataNode service runs):
$ sudo service hadoop-yarn-nodemanager start
To start the MapReduce JobHistory Server:
On the MapReduce JobHistory Server system:
$ sudo service hadoop-mapreduce-historyserver start
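Because each host runs a different subset of these daemons, one way to keep the start commands straight is to map a role name to its service. The sketch below uses illustrative names (`service_for_role`, `start_role`, `ROLES`) and only the three service names shown above. It prints the commands unless `RUN` is set to an empty string.

```shell
# Map a role name to the packaged service name used in this step.
service_for_role() {
  case $1 in
    resourcemanager) echo hadoop-yarn-resourcemanager ;;
    nodemanager)     echo hadoop-yarn-nodemanager ;;
    historyserver)   echo hadoop-mapreduce-historyserver ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Start the service for one role.
# Dry run by default; set RUN to an empty string to execute.
start_role() {
  svc=$(service_for_role "$1") || return 1
  ${RUN-echo} sudo service "$svc" start
}

# List only the roles that this host actually runs, for example:
#   ROLES="nodemanager" sh start-yarn.sh
for role in ${ROLES:-resourcemanager nodemanager historyserver}; do
  start_role "$role"
done
```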
Step 10: Create a Home Directory for each MapReduce User
Create a home directory on the NameNode for each MapReduce user. For example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
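For more than a handful of users, the same two commands can be wrapped in a loop. This is a sketch only: `create_home` is an illustrative helper, `alice` and `bob` are placeholder usernames, and `-mkdir -p` is used instead of plain `-mkdir` so that rerunning the script does not fail on directories that already exist. Commands are printed unless `RUN` is set to an empty string.

```shell
# Create an HDFS home directory for each named user and make them its owner.
# Dry run by default; set RUN to an empty string to execute.
create_home() {
  for user in "$@"; do
    ${RUN-echo} sudo -u hdfs hadoop fs -mkdir -p "/user/$user"
    ${RUN-echo} sudo -u hdfs hadoop fs -chown "$user" "/user/$user"
  done
}

# Placeholder usernames; replace with the Linux usernames of your users.
create_home alice bob
```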