How to Set Up a Gateway Node to Restrict Access to the Cluster
The steps below configure a firewall-protected Hadoop cluster that allows access only through the node configured as the gateway. Clients reach the cluster through REST APIs exposed on the gateway: HttpFS, which provides REST access to HDFS, and Oozie, which provides REST access for submitting and monitoring jobs.
Installing and Configuring the Firewall and Gateway
Follow these steps:
- Choose a cluster node to be the gateway machine.
- Install and configure HttpFS and Oozie by following the standard directions starting here: Step 4: Install CDH Packages.
- Start the Oozie server:
$ sudo service oozie start
- Start the HttpFS server:
$ sudo service hadoop-httpfs start
- Configure firewalls.
- Block all access from outside the cluster.
- The gateway node should have ports 11000 (oozie) and 14000 (hadoop-httpfs) open.
- Optionally, to maintain access to the Web UIs for the cluster's JobTrackers and NameNode, open their HTTP ports: see Ports Used by Components of CDH.
- Optionally configure authentication in simple mode (default) or using Kerberos. See HttpFS Authentication to configure Kerberos for HttpFS and Configuring Oozie Authentication to configure Kerberos for Oozie.
- Optionally encrypt communication using HTTPS for Oozie by following these directions.
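The firewall step above could be sketched roughly as follows for the gateway node. This is only an illustration under stated assumptions: it assumes iptables is the firewall in use, and CLUSTER_NET is a hypothetical subnet containing the cluster nodes. By default the script only prints the rules so they can be reviewed. Note that rules like these would also block SSH from outside the cluster unless you add a rule for it.

```shell
#!/bin/sh
# Sketch only: print (or apply) iptables rules for the gateway node.
# Assumptions: iptables is the firewall; CLUSTER_NET is a hypothetical
# subnet that contains all cluster nodes.
CLUSTER_NET="${CLUSTER_NET:-10.0.0.0/24}"
APPLY="${APPLY:-0}"   # set APPLY=1 (as root) to run the rules instead of printing them

gateway_rules() {
    # Keep loopback traffic and already-established connections working.
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT"
    # Nodes inside the cluster may talk to the gateway freely.
    echo "iptables -A INPUT -s $CLUSTER_NET -j ACCEPT"
    # Outside clients may reach only Oozie (11000) and HttpFS (14000).
    for port in 11000 14000; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    # Everything else arriving from outside is dropped.
    echo "iptables -A INPUT -j DROP"
}

if [ "$APPLY" = "1" ]; then
    gateway_rules | sh
else
    gateway_rules
fi
```

The other cluster nodes would use the same rules without the two port rules, so that they accept traffic only from inside the cluster.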
Accessing HDFS
With the Hadoop client:
All of the standard hadoop fs commands work; just make sure to specify -fs webhdfs://HOSTNAME:14000. For example (where GATEWAYHOST is the hostname of the gateway machine):
$ hadoop fs -fs webhdfs://GATEWAYHOST:14000 -cat /user/me/myfile.txt
Hello World!
Without the Hadoop client:
You can run the equivalent of the standard hadoop fs commands by using the WebHDFS REST API and any program that can send GET, PUT, POST, and DELETE requests; for example:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/user/me/myfile.txt?op=OPEN&user.name=me"
Hello World!
In general, the command will look like this:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/PATH?[user.name=USER&]op=…"
You can find a full explanation of the commands in the WebHDFS REST API documentation.
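As a minimal sketch of the general form above, the following script builds request URLs for a few common operations. The helper name webhdfs_url is hypothetical, GATEWAYHOST is the same placeholder used in the text, and the helper assumes simple authentication (user.name) and paths that need no URL encoding.

```shell
#!/bin/sh
# Sketch: build HttpFS (WebHDFS REST) URLs for the gateway host.
GATEWAYHOST="${GATEWAYHOST:-GATEWAYHOST}"   # placeholder hostname, as in the text
HTTPFS_PORT=14000

# webhdfs_url HDFS_PATH OP USER [EXTRA_QUERY]
# HDFS_PATH must be absolute (start with "/"). This hypothetical helper
# does not URL-encode its arguments.
webhdfs_url() {
    url="http://${GATEWAYHOST}:${HTTPFS_PORT}/webhdfs/v1$1?op=$2&user.name=$3"
    if [ -n "${4:-}" ]; then
        url="${url}&$4"
    fi
    echo "$url"
}

# Print the URLs for three common operations.
webhdfs_url /user/me LISTSTATUS me          # like hadoop fs -ls    (send with GET)
webhdfs_url /user/me/newdir MKDIRS me       # like hadoop fs -mkdir (send with PUT)
webhdfs_url /user/me/myfile.txt DELETE me   # like hadoop fs -rm    (send with DELETE)
```

To send a request, pass the URL to curl with the matching HTTP method, for example: curl -X PUT "$(webhdfs_url /user/me/newdir MKDIRS me)".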
Submitting and Monitoring Jobs
The Oozie REST API supports the direct submission of jobs for MapReduce, Pig, and Hive; Oozie automatically creates a workflow with a single action. For any other action types, or to execute anything more complicated than a single job, you must create an actual workflow. Required files (JAR files, input data) must already exist on HDFS; if they do not, you can use HttpFS to upload the files.
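Uploading a required file through HttpFS might look like the sketch below. It rests on assumptions that you should verify against your HttpFS version: that the upload is a PUT with op=CREATE, that the data=true parameter tells HttpFS the request body carries the file, and that the Content-Type must be application/octet-stream. The helper name httpfs_upload_cmd and the file names are hypothetical. The script only prints the curl command so you can review it before running it.

```shell
#!/bin/sh
# Sketch: print a curl command that uploads a local file to HDFS via HttpFS.
GATEWAYHOST="${GATEWAYHOST:-GATEWAYHOST}"   # placeholder hostname, as in the text

# httpfs_upload_cmd LOCAL_FILE HDFS_PATH USER
# Assumes simple authentication and an absolute HDFS path without
# characters that need URL encoding.
httpfs_upload_cmd() {
    url="http://${GATEWAYHOST}:14000/webhdfs/v1$2?op=CREATE&data=true&user.name=$3"
    printf "curl -X PUT -T '%s' -H 'Content-Type: application/octet-stream' '%s'\n" "$1" "$url"
}

# Example: upload a (hypothetical) job JAR into the user's lib directory.
httpfs_upload_cmd myapp.jar /user/me/lib/myapp.jar me
```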
With the Oozie client:
All of the standard Oozie commands will work. You can find a full explanation of the commands in the documentation for the command-line utilities.
Without the Oozie client:
You can run all of the standard Oozie commands by using the REST API and any program that can do GET, PUT, and POST requests. You can find a full explanation of the commands in the Oozie Web Services API documentation.
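For illustration, the sketch below builds a few Oozie REST URLs against the gateway. It assumes the v1 version of the Web Services API and Oozie's default /oozie context path on port 11000. The helper name oozie_rest_url is hypothetical, and JOB_ID stands in for a real Oozie job ID.

```shell
#!/bin/sh
# Sketch: build Oozie Web Services API URLs for the gateway host.
OOZIE_URL="${OOZIE_URL:-http://GATEWAYHOST:11000/oozie}"   # GATEWAYHOST is a placeholder

# oozie_rest_url RESOURCE
# RESOURCE is the part of the URL after /v1/ (hypothetical helper).
oozie_rest_url() {
    echo "${OOZIE_URL}/v1/$1"
}

oozie_rest_url "admin/status"             # server status      (send with GET)
oozie_rest_url "job/JOB_ID?show=info"     # job information    (send with GET)
oozie_rest_url "job/JOB_ID?action=kill"   # kill a running job (send with PUT)
```

For example, curl "$(oozie_rest_url admin/status)" would ask the Oozie server on the gateway for its status.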