Setting Up Apache Whirr Using the Command Line
Apache Whirr is a set of libraries for running cloud services. You can use Whirr to run CDH 5 clusters on cloud infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2). There is no need to install the CDH 5 RPMs or do any configuration; a working cluster starts immediately with one command. Whirr is ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs. When you are finished, you can destroy the cluster and all of its data with one command.
Use the following sections to install and deploy Whirr:
Installing Whirr
To install Whirr on an Ubuntu or other Debian system:
$ sudo apt-get install whirr
To install Whirr on a RHEL-compatible system:
$ sudo yum install whirr
To install Whirr on a SLES system:
$ sudo zypper install whirr
To install Whirr on another system: Download a Whirr tarball from the Apache Whirr website.
To verify Whirr is properly installed:
$ whirr version
Generating an SSH Key Pair for Whirr
After installing Whirr, generate a password-less SSH key pair to enable secure communication with the Whirr cluster.
$ ssh-keygen -t rsa -P ''
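If you accept the default file location when prompted, ssh-keygen writes the key pair to ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub, which is where the example properties files below expect to find it. To confirm that both files exist:
$ ls -l ~/.ssh/id_rsa ~/.ssh/id_rsa.pub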
Defining a Whirr Cluster
After generating an SSH key pair, the only task left to do before using Whirr is to define a cluster by creating a properties file. You can name the properties file whatever you like. The example properties file used in these instructions is named hadoop.properties. Save the properties file in your home directory. After defining a cluster in the properties file, you will be ready to launch a cluster and run MapReduce jobs.
MRv1 Cluster
The following file defines a cluster with a single machine for the NameNode and JobTracker, and another machine for a DataNode and TaskTracker.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.env.repo=cdh5
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-ccb35ea5
whirr.location-id=us-east-1
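Storing cloud credentials directly in the properties file is convenient but easy to leak. Because Whirr interpolates variables in this file (as the ${sys:user.home} entries show), one approach, assuming your Whirr version also supports ${env:...} references, is to keep the credentials in environment variables:
$ export AWS_ACCESS_KEY_ID=<cloud-provider-identity>
$ export AWS_SECRET_ACCESS_KEY=<cloud-provider-credential>
and reference them from hadoop.properties:
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}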
YARN Cluster
The following configuration provides the essentials for a YARN cluster. If you need more worker capacity, increase the number of hadoop-datanode+yarn-nodemanager instances from 2 to a larger number, as shown in the example after the listing.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
whirr.provider=aws-ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.env.mapreduce_version=2
whirr.env.repo=cdh5
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
whirr.yarn.configure-function=configure_cdh_yarn
whirr.yarn.start-function=start_cdh_yarn
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-ccb35ea5
whirr.location-id=us-east-1
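For example, to run four DataNode/NodeManager workers instead of two, change only the instance-templates line; the rest of the file stays the same:
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,4 hadoop-datanode+yarn-nodemanager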
Managing a Cluster with Whirr
To launch a cluster:
$ whirr launch-cluster --config hadoop.properties
As the cluster starts up, messages are displayed in the console. You can see debug-level log messages in a file named whirr.log in the directory where you ran the whirr command. After the cluster has started, a message appears in the console showing the URL you can use to access the web UI for Whirr.
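You can also confirm which instances Whirr started for this configuration with its list-cluster command, which lists the running instances for the cluster defined in the properties file:
$ whirr list-cluster --config hadoop.properties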
Running a Whirr Proxy
For security reasons, traffic from the network where your client is running is proxied through the master node of the cluster using an SSH tunnel (a SOCKS proxy on port 6666). A script to launch the proxy is created when you launch the cluster, and may be found in ~/.whirr/<cluster-name>.
To launch the Whirr proxy:
- Run the following command in a new terminal window:
$ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
- To stop the proxy, kill the process by pressing Ctrl-C.
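The proxy script keeps its terminal busy for as long as the tunnel is open. As a minimal alternative sketch (assuming the script path shown above), you can run it in the background and stop it by job or PID instead of Ctrl-C:
$ nohup bash ~/.whirr/myhadoopcluster/hadoop-proxy.sh > proxy.log 2>&1 &
$ kill %1   # from the same shell, stops the background proxy; otherwise kill its PID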
Running a MapReduce job
After you launch a cluster, a hadoop-site.xml file is automatically created in the directory ~/.whirr/<cluster-name>. You need to update the local Hadoop configuration to use this file.
To update the local Hadoop configuration to use hadoop-site.xml:
- On all systems, type the following commands:
$ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.whirr
$ sudo rm -f /etc/hadoop/conf.whirr/*-site.xml
$ sudo cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop/conf.whirr
- If you are using an Ubuntu, Debian, or SLES system, type these commands:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.whirr 50
$ update-alternatives --display hadoop-conf
- If you are using a Red Hat system, type these commands:
$ sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.whirr 50
$ alternatives --display hadoop-conf
- You can now browse HDFS:
$ hadoop fs -ls /
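If you prefer not to change the system-wide alternatives, you can also point an individual command at the Whirr-generated configuration with the generic --config option; for example:
$ hadoop --config /etc/hadoop/conf.whirr fs -ls /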
To run a MapReduce job, use the following commands:
- For MRv1:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_MAPRED_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
$ hadoop fs -cat output/part-* | head
- For YARN:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_MAPRED_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-mapreduce-examples.jar wordcount input output
$ hadoop fs -cat output/part-* | head
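The examples above write their results to the output directory in HDFS, and a second run fails if that directory still exists. To inspect and then remove it between runs (the -rm -r form assumes a Hadoop 2 client; older clients use -rmr instead):
$ hadoop fs -ls output
$ hadoop fs -rm -r output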
Destroying a cluster
When you are finished using a cluster, you can terminate the instances and clean up the resources using the commands shown in this section.
WARNING: All data will be deleted when you destroy the cluster.
To destroy a cluster:
- Run the following command to destroy a cluster:
$ whirr destroy-cluster --config hadoop.properties
- Shut down the SSH proxy to the cluster if you started one earlier.
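Destroying the cluster does not remove the artifacts that Whirr and these instructions created on the client machine. If you want to clean those up as well, a minimal sketch (paths as in the examples above) is:
$ rm -rf ~/.whirr/myhadoopcluster
$ sudo update-alternatives --remove hadoop-conf /etc/hadoop/conf.whirr   # use 'alternatives' on Red Hat systems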
Viewing the Whirr Documentation
For additional documentation, see the Apache Whirr documentation.