Managing YARN (MRv2) and MapReduce (MRv1)
CDH supports two versions of the MapReduce computation framework: MRv1 and MRv2, which are implemented by the MapReduce (MRv1) and YARN (MRv2) services. YARN is backwards-compatible with MapReduce: all jobs that run against MapReduce also run in a YARN cluster.
The MapReduce v2 (MRv2), or YARN, architecture splits the two primary responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager and per-application ApplicationMasters. With YARN, the ResourceManager and per-host NodeManagers form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on worker hosts instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor the tasks. For details of this architecture, see Apache Hadoop NextGen MapReduce (YARN).
- The Cloudera Manager Admin Console has different methods for displaying MapReduce and YARN job history. See Monitoring MapReduce Jobs and Monitoring YARN Applications.
- For information on configuring the MapReduce and YARN services for high availability, see MapReduce (MRv1) and YARN (MRv2) High Availability.
- For information on configuring MapReduce and YARN resource management features, see Resource Management.
Defaults and Recommendations
- In a Cloudera Manager deployment of a CDH 5 cluster, the YARN service is the default MapReduce computation framework. In CDH 5, the MapReduce service has been deprecated. However, the MapReduce service is fully supported for backward compatibility through the CDH 5 lifecycle.
- In a Cloudera Manager deployment of a CDH 4 cluster, the MapReduce service is the default MapReduce computation framework. You can create a YARN service in a CDH 4 cluster, but it is not considered production ready.
- For production use, Cloudera recommends running only one MapReduce framework at any given time. If development needs or other use cases require switching between MapReduce and YARN, both services can be configured at the same time, but only one should be running, so that the available hardware resources are fully used.
Migrating from MapReduce to YARN
Cloudera Manager provides a wizard described in Importing MapReduce Configurations to YARN to easily migrate MapReduce configurations to YARN. The wizard performs all the steps (Switching Between MapReduce and YARN Services, Updating Services Dependent on MapReduce, and Configuring Alternatives Priority for Services Dependent on MapReduce) on this page.
- Do one of the following:
- Select .
- In the Cloudera Management Service table, click the Cloudera Management Service link.
- Click the Instances tab.
- Select the checkbox for Activity Monitor, select the stop action, and click Stop to confirm.
- Select the checkbox for Activity Monitor, select the delete action, and click Delete to confirm.
- Manage the Activity Monitor database. The example below is for a MySQL backend database:
- Verify the Activity Monitor database:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| amon               |
+--------------------+
- Back up the database:
$ mysqldump -uroot -pcloudera amon > /safe_backup_directory/amon.sql
- Drop the database:
mysql> drop database amon;
Once you have migrated to YARN and deleted the MapReduce service, you can remove local data from each TaskTracker host. The mapred.local.dir parameter is a directory on the local filesystem of each TaskTracker that contains temporary data for MapReduce. Once the service is stopped, you can remove this directory to free disk space on each host.
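The cleanup described above can be sketched as a small Python helper to run on each former TaskTracker host. This is only an illustration under stated assumptions: the function names are hypothetical, and the sketch assumes that the mapred.local.dir value may list several directories separated by commas. It defaults to a dry run so that nothing is deleted until the list of paths has been reviewed.

```python
import shutil
import tempfile
from pathlib import Path


def split_local_dirs(value):
    """Split a mapred.local.dir value into paths.

    Assumption: the value may hold several comma-separated directories.
    """
    return [Path(item.strip()) for item in value.split(",") if item.strip()]


def remove_local_dirs(value, dry_run=True):
    """Remove each listed directory that exists; return the paths handled.

    With dry_run=True (the default) nothing is deleted; the return value
    only shows which directories would be removed.
    """
    handled = []
    for path in split_local_dirs(value):
        if path.is_dir():
            if not dry_run:
                shutil.rmtree(path)
            handled.append(path)
    return handled


if __name__ == "__main__":
    # Demonstrate on throwaway directories rather than real TaskTracker data.
    scratch = Path(tempfile.mkdtemp())
    first, second = scratch / "disk1", scratch / "disk2"
    first.mkdir()
    second.mkdir()
    value = "{}, {}".format(first, second)
    print("would remove:", remove_local_dirs(value))
    print("removed:", remove_local_dirs(value, dry_run=False))
    shutil.rmtree(scratch)
```

Run the real cleanup only after the MapReduce service has been stopped, and pass the value of mapred.local.dir exactly as it is configured for that host.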
For detailed information on migrating from MapReduce to YARN, see Migrating from MapReduce (MRv1) to MapReduce (MRv2).
Switching Between MapReduce and YARN Services
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
- (Optional) Configure the new MapReduce or YARN service.
- Update dependent services to use the chosen framework.
- Configure the alternatives priority.
- Redeploy the Oozie ShareLib.
- Redeploy the client configuration.
- Start the framework service that you are switching to.
- (Optional) Stop the unused framework service to free up the resources it uses.
Updating Services Dependent on MapReduce
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
- Hive
- Sqoop 2
- Oozie
- Go to the service.
- Click the Configuration tab.
- Select .
- Select .
- Locate the MapReduce Service property and select the YARN or MapReduce service.
- Click Save Changes to commit the changes.
- Select .
- Go to the Hue service.
- Select .
Configuring Alternatives Priority for Services Dependent on MapReduce
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
The alternatives priority property determines which service—MapReduce or YARN—is used by clients to run MapReduce jobs. The service with a higher value of the property is used. In CDH 4, the MapReduce service alternatives priority is set to 92 and the YARN service is set to 91. In CDH 5, the values are reversed; the MapReduce service alternatives priority is set to 91 and the YARN service is set to 92.
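As a rough illustration of the rule above, the following Python sketch picks the service that clients would use for a given pair of priorities. The default values are the ones quoted in this section; the dictionary and function names are hypothetical and are not part of Cloudera Manager. The sketch does not attempt to model what happens when both services have the same priority.

```python
# Default alternatives priorities as quoted in the text above.
DEFAULT_PRIORITIES = {
    "CDH 4": {"MapReduce": 92, "YARN": 91},
    "CDH 5": {"MapReduce": 91, "YARN": 92},
}


def active_framework(priorities):
    """Return the name of the service with the highest priority value."""
    return max(priorities, key=priorities.get)


if __name__ == "__main__":
    for release, priorities in DEFAULT_PRIORITIES.items():
        print(release, "->", active_framework(priorities))
```

For example, raising the MapReduce priority above 92 in a CDH 5 cluster would make MapReduce the framework that clients use, which is the effect the procedure below aims for when switching.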
- Go to the MapReduce or YARN service.
- Click the Configuration tab.
- Select .
- Select .
- Type Alternatives in the Search box.
- In the Alternatives Priority property, set the priority value.
- Click Save Changes to commit the changes.
- Redeploy the client configuration.
Configuring MapReduce To Read/Write With Amazon Web Services
These are the steps required to configure MapReduce to read and write with AWS.
- Save your AWS access key in a .jceks file in HDFS.
hadoop credential create fs.s3a.access.key -provider \
jceks://hdfs/<hdfs directory>/<file name>.jceks -value <AWS access key id>
- Put the AWS secret in the same .jceks file created in the previous step.
hadoop credential create fs.s3a.secret.key -provider \
jceks://hdfs/<hdfs directory>/<file name>.jceks -value <AWS secret access key>
- Set hadoop.security.credential.provider.path to the path of the .jceks file in the job configuration so that the MapReduce framework loads AWS credentials from the .jceks file in HDFS. The property name and its value must form a single argument with no space after the equals sign. The following example shows a Teragen MapReduce job that writes to an S3 bucket.
hadoop jar <path to the Hadoop MapReduce example jar file> teragen \
-Dhadoop.security.credential.provider.path=jceks://hdfs/<hdfs directory>/<file name>.jceks \
100 s3a://<bucket name>/teragen1
Replace the placeholders <hdfs directory>, <file name>, <AWS access key id>, and <AWS secret access key> with your own values. <hdfs directory> is the HDFS directory where you store the .jceks file. <file name> is the name of the .jceks file in HDFS.
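To show how the placeholders fit together, here is a minimal Python sketch that assembles the argument lists for the commands above. The function names and the sample values in the demo are hypothetical; only the command layout (hadoop credential create with -provider and -value, and the -Dhadoop.security.credential.provider.path property) comes from this section. The sketch only builds the arguments and does not run anything.

```python
def credential_provider_path(hdfs_directory, file_name):
    """Build the jceks:// URI for a .jceks file stored in HDFS."""
    return "jceks://hdfs/{}/{}.jceks".format(hdfs_directory.strip("/"), file_name)


def credential_create_command(alias, hdfs_directory, file_name, value):
    """Arguments for storing one credential, such as fs.s3a.access.key."""
    return [
        "hadoop", "credential", "create", alias,
        "-provider", credential_provider_path(hdfs_directory, file_name),
        "-value", value,
    ]


def teragen_command(examples_jar, hdfs_directory, file_name, rows, bucket, output_dir):
    """Arguments for a Teragen job that writes to an S3 bucket."""
    provider = credential_provider_path(hdfs_directory, file_name)
    return [
        "hadoop", "jar", examples_jar, "teragen",
        # Property name and value are joined into one argument.
        "-Dhadoop.security.credential.provider.path=" + provider,
        str(rows),
        "s3a://{}/{}".format(bucket, output_dir),
    ]


if __name__ == "__main__":
    # Sample values are placeholders for illustration only.
    print(" ".join(teragen_command(
        "hadoop-mapreduce-examples.jar", "user/alice", "aws", 100,
        "example-bucket", "teragen1")))
```

Building the command as a list keeps the provider path attached to its property name, which avoids the problem of a stray space or line break after the equals sign.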
To configure Oozie to submit S3 MapReduce jobs, see Configuring Oozie to Enable MapReduce Jobs To Read/Write from Amazon S3.