Setting Up MapReduce v1 (MRv1) Using the Command Line
This topic describes configuration and startup tasks for MRv1 clusters only.
- Make sure you have configured and deployed HDFS.
- Configure the JobTracker's RPC server.
- Open the mapred-site.xml file in the custom directory you created when you copied the Hadoop configuration.
- Specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. The default value is local. With the default value, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. If you specify a host other than local, use the hostname (for example, mynamenode), not the IP address.
For example:
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host.company.com:8021</value>
</property>
- Configure local storage directories for use by MRv1 daemons.
- Open the mapred-site.xml file in the custom directory you created when you copied the Hadoop configuration.
- Edit the mapred.local.dir property to specify the directories where the TaskTracker will store temporary data and intermediate map output files while
running MapReduce jobs. Cloudera recommends that you specify a directory on each of the JBOD mount points: /data/1/mapred/local through /data/N/mapred/local. For example:
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
</property>
- Create the mapred.local.dir local directories:
$ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
- Configure the owner of the mapred.local.dir directory to be the mapred user:
$ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
- Set the permissions to drwxr-xr-x.
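For example, assuming the four local directories created above, a chmod such as the following sets drwxr-xr-x:
$ sudo chmod -R 755 /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local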
- Repeat for each TaskTracker.
- Configure a health check script for DataNode processes.
A TaskTracker with few functioning local directories will not perform well, so Cloudera recommends configuring a health script that checks whether the DataNode process is running. (If the DataNode is configured as described under Configuring DataNodes to Tolerate Local Storage Directory Failure, it shuts down after the configured number of directory failures.) The following example health script prints an ERROR line if the DataNode process is not running:
#!/bin/bash
if ! jps | grep -q DataNode ; then
  echo ERROR: datanode not up
fi
In practice, dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk failure results in the failure of both a dfs.data.dir and a mapred.local.dir.
For more information, see the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation.
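For reference, a minimal sketch of how such a script is typically registered for MRv1 in mapred-site.xml, using the node health-check properties (the script path below is illustrative):
<property>
  <name>mapred.healthChecker.script.path</name>
  <!-- Illustrative location for the script shown above -->
  <value>/usr/lib/hadoop/bin/check_datanode_health.sh</value>
</property>
<property>
  <name>mapred.healthChecker.interval</name>
  <!-- How often the TaskTracker runs the script, in milliseconds -->
  <value>60000</value>
</property>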
- Configure JobTracker recovery.
Set the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml, as shown in the example after this list.
This ensures that jobs that were running when the JobTracker failed because of a system crash or hardware failure are re-run when the JobTracker restarts. A recovered job has the following properties:
- It will have the same job ID as when it was submitted.
- It will run under the same user as the original job.
- It will write to the same output directory as the original job, overwriting any previous output.
- It will show as RUNNING on the JobTracker web page after you restart the JobTracker.
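The corresponding entry in mapred-site.xml:
<property>
  <name>mapreduce.jobtracker.restart.recover</name>
  <value>true</value>
</property>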
- Create MapReduce /var directories:
$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
- Verify the HDFS file structure:
$ sudo -u hdfs hadoop fs -ls -R /
You should see:
drwxrwxrwt   - hdfs   supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
- Create and configure the mapred.system.dir directory in HDFS. Create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.
To create the directory in its default location:
$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.
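If you create the directory in a location other than the default, also set mapred.system.dir in mapred-site.xml to that path; a minimal sketch, using the path from the example above:
<property>
  <name>mapred.system.dir</name>
  <value>/tmp/mapred/system</value>
</property>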
- Start MapReduce by starting the TaskTracker and JobTracker services.
- On each TaskTracker system:
$ sudo service hadoop-0.20-mapreduce-tasktracker start
- On the JobTracker system:
$ sudo service hadoop-0.20-mapreduce-jobtracker start
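To confirm that both daemons came up, you can list the JVMs running as the mapred user (a quick check, assuming the daemons run under the mapred account as in the package defaults):
$ sudo -u mapred jps | grep -E 'JobTracker|TaskTracker'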
- Create a home directory for each MapReduce user. On the NameNode, enter:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
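If you have many users, the script mentioned above can be as simple as the following sketch (the usernames are placeholders; substitute your own list):
#!/bin/bash
# Create an HDFS home directory for each listed Linux user.
for user in alice bob carol; do
  sudo -u hdfs hadoop fs -mkdir /user/$user
  sudo -u hdfs hadoop fs -chown $user /user/$user
done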
- Set HADOOP_MAPRED_HOME.
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
Set this environment variable for each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation.
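To make the setting persistent, one option (assuming a Bourne-compatible login shell) is to append it to each user's ~/.bashrc, or to a file under /etc/profile.d for all users:
$ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce' >> ~/.bashrc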
- Configure the Hadoop daemons to start automatically. For more information, see Step 5: Configure CDH Daemons to Run at Startup.
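For reference, on RHEL-compatible systems this is typically done with chkconfig (on Debian and Ubuntu, update-rc.d plays the same role); a sketch for the two MRv1 services started above:
$ sudo chkconfig hadoop-0.20-mapreduce-jobtracker on
$ sudo chkconfig hadoop-0.20-mapreduce-tasktracker on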