Installing CDH 5 with YARN on a Single Linux Host in Pseudo-distributed mode
Before you start, uninstall MRv1 if necessary
If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.
- Stop the daemons:
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done
$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done
- Remove hadoop-0.20-conf-pseudo:
- On Red Hat-compatible systems:
$ sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
- On SLES systems:
$ sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
- On Ubuntu or Debian systems:
$ sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.
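The three removal commands above differ only in the package manager. As a minimal sketch, the helpers below pick whichever of the three managers is installed and print the matching command so it can be reviewed before it is run. The function names detect_pkg_manager and remove_mrv1_cmd are hypothetical; the package names are the ones listed above.

```shell
# Hypothetical helper: print the first supported package manager found on PATH.
detect_pkg_manager() {
  for pm in yum zypper apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: print the MRv1 removal command for the package manager
# named in $1. It only prints the command; nothing is uninstalled.
remove_mrv1_cmd() {
  pkgs='hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*'
  case "$1" in
    yum|zypper|apt-get) echo "sudo $1 remove $pkgs" ;;
    *) echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}
```

For example, remove_mrv1_cmd "$(detect_pkg_manager)" prints the command for the current host.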
On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:
Download the CDH 5 Package
- Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
OS Version             Link to CDH 5 RPM
RHEL/CentOS/Oracle 5   RHEL/CentOS/Oracle 5 link
RHEL/CentOS/Oracle 6   RHEL/CentOS/Oracle 6 link
RHEL/CentOS/Oracle 7   RHEL/CentOS/Oracle 7 link
- Install the RPM.
For Red Hat/CentOS/Oracle 5:
$ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
For Red Hat/CentOS/Oracle 6 (64-bit):
$ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.
Install CDH 5
- (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by running the following command:
- For Red Hat/CentOS/Oracle 5 systems:
$ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
- For Red Hat/CentOS/Oracle 6 systems:
$ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
- Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
$ sudo yum install hadoop-conf-pseudo
On SLES systems, do the following:
Download and install the CDH 5 package
- Download the CDH 5 "1-click Install" package.
Download the RPM file, choose Save File, and save it to a directory to which you have write access (for example, your home directory).
- Install the RPM:
$ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm
For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.
Install CDH 5
- (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by running the following command:
- For all SLES systems:
$ sudo rpm --import https://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
- Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
$ sudo zypper install hadoop-conf-pseudo
On Ubuntu and other Debian systems, do the following:
Download and install the package
- Download the CDH 5 "1-click Install" package:
OS Version   Package Link
Jessie       Jessie package
Wheezy       Wheezy package
Precise      Precise package
Trusty       Trusty package
- Install the package by doing one of the following:
- Choose Open with in the download window to use the package manager.
- Choose Save File, save the package to a directory to which you have write access (for example, your home directory), and install it from the command line.
For example:
$ sudo dpkg -i cdh5-repository_1.0_all.deb
Install CDH 5
- (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by running the following command:
- For Ubuntu Lucid systems:
$ curl -s https://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
- For Ubuntu Precise systems:
$ curl -s https://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
- For Debian Squeeze systems:
$ curl -s https://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
- Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
$ sudo apt-get update
$ sudo apt-get install hadoop-conf-pseudo
Starting Hadoop and Verifying it is Working Properly
For YARN, a pseudo-distributed Hadoop installation consists of one host running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.
- To view the files installed by the hadoop-conf-pseudo package on Red Hat or SLES systems:
$ rpm -ql hadoop-conf-pseudo
- To view the files installed by the hadoop-conf-pseudo package on Ubuntu or Debian systems:
$ dpkg -L hadoop-conf-pseudo
The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.
The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.
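Because /etc/hadoop/conf is managed this way, it is normally a chain of symbolic links rather than a real directory. The following sketch resolves that chain to show which configuration directory is active. The function name active_hadoop_conf is hypothetical, and the sketch assumes a readlink that supports the -f option (as GNU readlink does).

```shell
# Hypothetical helper: resolve the symlink chain behind a configuration path.
# Defaults to /etc/hadoop/conf; relies on readlink -f to follow every link.
active_hadoop_conf() {
  readlink -f "${1:-/etc/hadoop/conf}"
}
```

On a host set up as described here, calling active_hadoop_conf with no argument should print /etc/hadoop/conf.pseudo.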
To start Hadoop, proceed as follows.
Step 1: Format the NameNode.
Before starting the NameNode for the first time you must format the file system.
$ sudo -u hdfs hdfs namenode -format
Make sure you format the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.
Step 2: Start HDFS
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
To verify that the services have started, you can check the web console. The NameNode provides a web console at http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.
Step 3: Create the directories needed for Hadoop processes.
$ sudo /usr/lib/hadoop/libexec/init-hdfs.sh
Step 4: Verify the HDFS File Structure:
Run the following command:
$ sudo -u hdfs hadoop fs -ls -R /
You should see output similar to the following excerpt:
...
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
...
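Comparing a long listing by eye is error-prone. As a rough sketch, the function below reads hadoop fs -ls -R output on standard input and reports, through its exit status, whether a given path appears in the last column. The name has_hdfs_path is hypothetical, and the check assumes that paths contain no spaces.

```shell
# Hypothetical helper: succeed if the path given as $1 appears as the last
# field of any line read from standard input (assumes no spaces in paths).
has_hdfs_path() {
  awk -v want="$1" '$NF == want { found = 1 } END { exit !found }'
}
```

For example, sudo -u hdfs hadoop fs -ls -R / | has_hdfs_path /var/log/hadoop-yarn && echo present prints "present" only when that directory is listed.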
Step 5: Start YARN
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
$ sudo service hadoop-mapreduce-historyserver start
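The same three services can also be started from a small loop that stops at the first failure. This is a sketch only: the function name start_yarn_services and the DRY_RUN variable are hypothetical conveniences, while the service names are the ones used in this step.

```shell
# Hypothetical helper: start the YARN-related services in the order used in
# this step, stopping at the first failure. When DRY_RUN is set to a
# non-empty value, the commands are printed instead of executed.
start_yarn_services() {
  for svc in hadoop-yarn-resourcemanager hadoop-yarn-nodemanager \
             hadoop-mapreduce-historyserver; do
    if [ -n "${DRY_RUN:-}" ]; then
      echo "sudo service $svc start"
    else
      sudo service "$svc" start || return 1
    fi
  done
}
```

Setting DRY_RUN=1 before calling start_yarn_services is a safe way to check what would be run.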
Step 6: Create User Directories
Create a home directory on the NameNode for each MapReduce user. For example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
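For more than a handful of users, the two commands can be generated in a loop. The sketch below prints the command pair for each username passed as an argument, so the result can be reviewed first or piped to sh. The function name print_home_dir_cmds is hypothetical.

```shell
# Hypothetical helper: print the mkdir/chown command pair for each username
# given as an argument. Nothing is executed; pipe the output to sh to run it.
print_home_dir_cmds() {
  for user in "$@"; do
    echo "sudo -u hdfs hadoop fs -mkdir /user/$user"
    echo "sudo -u hdfs hadoop fs -chown $user /user/$user"
  done
}
```

For example, print_home_dir_cmds joe | sh creates and assigns the home directory for joe.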
Running an example application with YARN
- Create a home directory on HDFS for the user who will be running the job (for example, joe):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe
Do the following steps as the user joe.
- Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
- Set HADOOP_MAPRED_HOME for user joe:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
- Run an example Hadoop job to grep with a regular expression in your input data.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
- After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23
You can see that there is a new directory called output23.
- List the output files.
$ hadoop fs -ls output23
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
- Read the results in the output file.
$ hadoop fs -cat output23/part-r-00000 | head
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.permissions.enabled
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.datanode.data.dir
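To sanity-check the job output, the same regular expression can be applied to local copies of the input files with standard tools. This is only an approximation of the example job, not a description of how it is implemented: it counts every match of the pattern on standard input and lists the most frequent first. The function name count_matches is hypothetical, and the sketch assumes a grep that supports the -o and -E options.

```shell
# Hypothetical helper: count each distinct match of the extended regular
# expression in $1 across standard input, printing "count match" lines with
# the highest counts first.
count_matches() {
  grep -oE "$1" | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

For example, cat /etc/hadoop/conf/*.xml | count_matches 'dfs[a-z.]+' gives a local list to compare against part-r-00000.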