Configuring ADLS Connectivity

Microsoft Azure Data Lake Store (ADLS) is a massively scalable distributed file system that can be accessed through an HDFS-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and POSIX-compliant ACLs. See the ADLS documentation for conceptual details.

CDH 5.11 and higher supports using ADLS as a storage layer for MapReduce2 (MRv2 or YARN), Hive, Hive-on-Spark, Spark 2.1, and Spark 1.6. Comparable HBase support was added in CDH 5.12. Other applications are not supported and may not work, even if they use MapReduce or Spark as their execution engine. Use the steps in this topic to set up a data store to use with these CDH components.

Note the following limitations:

ADLS is not supported as the default filesystem. Do not set the default filesystem property (fs.defaultFS) to an adl:// URI. You can still use ADLS as a secondary filesystem while HDFS remains the primary filesystem.
Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
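
The first limitation can be checked from a script before you go further. The following is a minimal sketch, not part of the documented procedure; the function name `check_default_fs` is hypothetical. It takes the value of fs.defaultFS as its argument and fails if that value is an adl:// URI. On a cluster host, the live value is read with `hdfs getconf -confKey fs.defaultFS`.

```shell
# Hypothetical helper: succeed only if the given fs.defaultFS value
# does not point at ADLS.
check_default_fs() {
  case "$1" in
    adl://*)
      echo "unsupported: fs.defaultFS is an adl:// URI: $1" >&2
      return 1
      ;;
    *)
      echo "ok: $1"
      ;;
  esac
}

# On a cluster host, check the live value (skipped if hdfs is not installed).
if command -v hdfs >/dev/null 2>&1; then
  check_default_fs "$(hdfs getconf -confKey fs.defaultFS)"
fi
```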

Setting up ADLS to Use with CDH

1. Create your ADLS account. See the Microsoft documentation.
2. Create the service principal in the Azure portal. See the Microsoft documentation on creating a service principal.
   Important: While you are creating the service principal, write down the following values, which you will need in step 4:
   - The client ID.
   - The client secret.
   - The refresh URL. To get this value, in the Azure portal, go to Azure Active Directory > App registrations > Endpoints. In the Endpoints region, copy the OAUTH 2.0 TOKEN ENDPOINT. This is the value you need for the refresh URL in step 4.
3. Grant the service principal permission to access the ADLS account. See the Microsoft documentation on Authorization and access control. Review the section "Using ACLs for operations on file systems" for information about granting the service principal permission to access the account.
   You can skip the section on RBAC (role-based access control) because RBAC is used for management and you only need data access.
4. Configure your CDH cluster to access your ADLS account. To access ADLS storage from a CDH cluster, you provide values for the following properties when submitting jobs:

ADLS Access Properties
Property Description	Property Name
Provider Type	`dfs.adls.oauth2.access.token.provider.type` (set the value to `ClientCredential`)
Client ID	`dfs.adls.oauth2.client.id`
Client Secret	`dfs.adls.oauth2.credential`
Refresh URL	`dfs.adls.oauth2.refresh.url`

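There are several methods you can use to provide these properties to your jobs, each with its own security and other considerations. Select one of the following methods to access data in ADLS; each is described in its own section later in this topic:
- User-Supplied Key for Each Job
- Single Master Key for Cluster-Wide Access
- User-Supplied Key Stored in a Hadoop Credential Provider
- Create a Hadoop Credential Provider and reference it in a customized copy of the `core-site.xml` file for the service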

Testing and Using ADLS Access

After configuring access, test your configuration by running the following command that lists files in your ADLS account:
```
hadoop fs -ls adl://your_account.azuredatalakestore.net/
```
If your configuration is correct, this command lists the files in your account.
After successfully testing your configuration, you can access the ADLS account from MRv2, Hive, Hive-on-Spark, Spark 1.6, Spark 2.1, or HBase by using the following URI:
```
adl://your_account.azuredatalakestore.net
```
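
Scripts that touch several paths can build this URI from the account name instead of repeating it. The following is a minimal sketch; `adls_uri` is a hypothetical helper name, and `your_account` is the same placeholder used above.

```shell
# Hypothetical helper: build an adl:// URI from an account name and an
# optional absolute path inside the store (defaults to the root).
adls_uri() {
  account=$1
  store_path=${2:-/}
  printf 'adl://%s.azuredatalakestore.net%s\n' "$account" "$store_path"
}

adls_uri your_account /user/data

# On a configured cluster host, the result can be passed to hadoop fs:
#   hadoop fs -ls "$(adls_uri your_account /)"
```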

For additional information and examples of using ADLS access with Hadoop components:

Spark: See Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark
HBase: See Using Azure Data Lake Store with HBase
distcp: See Using DistCp with Microsoft Azure (ADLS).

TeraGen:

export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000 adl://your_account.azuredatalakestore.net/tg

User-Supplied Key for Each Job

You can pass the ADLS properties on the command line when submitting jobs.

Advantages: No additional configuration is required.
Disadvantages: Credentials will appear in log files, command history and other artifacts, which can be a serious security issue in some deployments.

Use the following syntax to run your jobs:

hadoop command \
     -Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
     -Ddfs.adls.oauth2.client.id='CLIENT ID' \
     -Ddfs.adls.oauth2.credential='CLIENT SECRET' \
     -Ddfs.adls.oauth2.refresh.url='REFRESH URL' \
     adl://<store>.azuredatalakestore.net/src hdfs://nn/tgt
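
One way to avoid typing the secret on every command line is to read the values from environment variables and assemble the options in a function. This is a sketch only: the variable names `ADLS_CLIENT_ID`, `ADLS_CLIENT_SECRET`, and `ADLS_REFRESH_URL` and the function `adls_opts` are hypothetical. The expanded values can still show up in process listings and job configuration, so the disadvantages listed above still apply.

```shell
# Hypothetical helper: print one -D option per line, built from
# environment variables. Fails if any of the three variables is empty.
adls_opts() {
  if [ -z "${ADLS_CLIENT_ID:-}" ] || [ -z "${ADLS_CLIENT_SECRET:-}" ] ||
     [ -z "${ADLS_REFRESH_URL:-}" ]; then
    echo "set ADLS_CLIENT_ID, ADLS_CLIENT_SECRET and ADLS_REFRESH_URL" >&2
    return 1
  fi
  printf '%s\n' \
    "-Ddfs.adls.oauth2.access.token.provider.type=ClientCredential" \
    "-Ddfs.adls.oauth2.client.id=$ADLS_CLIENT_ID" \
    "-Ddfs.adls.oauth2.credential=$ADLS_CLIENT_SECRET" \
    "-Ddfs.adls.oauth2.refresh.url=$ADLS_REFRESH_URL"
}

# Usage (assumes the values contain no whitespace, because the unquoted
# substitution is split into separate arguments):
#   hadoop command $(adls_opts) adl://<store>.azuredatalakestore.net/src hdfs://nn/tgt
```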

Single Master Key for Cluster-Wide Access

Use Cloudera Manager to save the values in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.

Advantages: All users can access the ADLS storage.
Disadvantages: This is a highly insecure means of providing access to ADLS for the following reasons:
- The credentials will appear in all Cloudera Manager-managed configuration files for all services in the cluster.
- The credentials will appear in the Job History server.

Open the Cloudera Manager Admin Console and go to Cluster Name > Configuration > Advanced Configuration Snippets.

Enter the following in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:

<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>CLIENT ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>CLIENT SECRET</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>REFRESH URL</value>
</property>

Click Save Changes.
Click Restart Stale Services so the cluster can read the new configuration information.

User-Supplied Key Stored in a Hadoop Credential Provider

Advantages: Credentials are securely stored in the credential provider.
Disadvantages: Works with MapReduce2 and Spark only (Hive, Impala, and HBase are not supported).

Create a Credential Provider.

Create a password for the Hadoop Credential Provider and export it to the environment:
```
export HADOOP_CREDSTORE_PASSWORD=password
```

Provision the credentials by running the following commands:

hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value 'client ID'
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value 'client secret'
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value 'refresh URL'

You can omit the -value option and its value; the command then prompts you to enter the value.

For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).

Export the password to the environment:

export HADOOP_CREDSTORE_PASSWORD=password

Reference the Credential Provider on the command line when submitting jobs:

hadoop command \
     -Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
     -Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls-cred.jceks \
     adl://<store>.azuredatalakestore.net/
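
Before submitting a job, it can help to confirm that the store holds all three aliases. The following is a minimal sketch that assumes the store path used above; `adls_provider` and `has_adls_aliases` are hypothetical helper names. The `hadoop credential list -provider` command prints the aliases in a store, and the check reads that listing on standard input.

```shell
# Hypothetical helper: provider URI for a given user's credential store.
adls_provider() {
  printf 'jceks://hdfs/user/%s/adls-cred.jceks\n' "$1"
}

# Hypothetical helper: succeed only if all three ADLS aliases appear in
# the listing read from standard input.
has_adls_aliases() {
  listing=$(cat)
  for alias_name in dfs.adls.oauth2.client.id \
                    dfs.adls.oauth2.credential \
                    dfs.adls.oauth2.refresh.url; do
    printf '%s\n' "$listing" | grep -qF "$alias_name" || {
      echo "missing alias: $alias_name" >&2
      return 1
    }
  done
}

# On a cluster host, check the real store (skipped if hadoop is not installed).
if command -v hadoop >/dev/null 2>&1; then
  hadoop credential list -provider "$(adls_provider "$USER")" | has_adls_aliases
fi
```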

Create a Hadoop Credential Provider and reference it in a customized copy of the `core-site.xml` file for the service

Advantages: All users can access the ADLS storage.
Disadvantages: You must pass the path to the credential store on the command line.

Create a Credential Provider:

Create a password for the Hadoop Credential Provider and export it to the environment:
```
export HADOOP_CREDSTORE_PASSWORD=password
```

Provision the credentials by running the following commands:

hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value 'client ID'
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value 'client secret'
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value 'refresh URL'

You can omit the -value option and its value; the command then prompts you to enter the value.

For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).

Export the password to the environment:

export HADOOP_CREDSTORE_PASSWORD=password

Copy the contents of the /etc/service/conf directory to a working directory. The service can be one of the following:
- yarn
- spark
- spark2
Use the --dereference option when copying the directory so that symlinks are correctly resolved. For example:
```
cp -r --dereference /etc/spark/conf ~/my_custom_config_directory
```
Change the ownership so that you can edit the files:
```
sudo chown --recursive $USER ~/my_custom_config_directory/*
```

Add the following to the core-site.xml file in the working directory:

<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/path_to_credential_store_file</value>
</property>
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>

The value of path_to_credential_store_file should match the path given to the -provider option in the hadoop credential create commands described in step 1.

Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
```
export HADOOP_CONF_DIR=path_to_working_directory
```
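
A quick check that the working copy of core-site.xml contains both properties can catch a missed edit before a job fails. The following is a minimal sketch; `conf_has_property` is a hypothetical helper name, and the check assumes each `<name>` element sits on a single line, as in the snippet above.

```shell
# Hypothetical helper: succeed if FILE ($1) contains a <name> element
# for PROPERTY ($2).
conf_has_property() {
  grep -qF "<name>$2</name>" "$1"
}

# Check the working copy (skipped if HADOOP_CONF_DIR is unset or has no
# core-site.xml).
if [ -n "${HADOOP_CONF_DIR:-}" ] && [ -f "$HADOOP_CONF_DIR/core-site.xml" ]; then
  for prop in hadoop.security.credential.provider.path \
              dfs.adls.oauth2.access.token.provider.type; do
    conf_has_property "$HADOOP_CONF_DIR/core-site.xml" "$prop" ||
      echo "missing property: $prop" >&2
  done
fi
```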

Creating a Credential Provider for ADLS

You can use a Hadoop Credential Provider to specify ADLS credentials, which allows you to run jobs without having to enter the client ID, client secret, and refresh URL on the command line. This prevents these credentials from being exposed in console output, log files, configuration files, and other artifacts. Running the command in this way requires that you provision a credential store to securely store these values. The credential store file is saved in HDFS.

To create a credential provider, run the following commands:

Create a password for the Hadoop Credential Provider and export it to the environment:
```
export HADOOP_CREDSTORE_PASSWORD=password
```

Provision the credentials by running the following commands:

hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value 'client ID'
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value 'client secret'
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value 'refresh URL'

You can omit the -value option and its value; the command then prompts you to enter the value.

For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
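
Because the three commands differ only in the alias, they can be generated from one list. The following is a dry-run sketch that prints the commands instead of running them; `print_credential_commands` is a hypothetical helper name. The printed commands leave out -value, so each one prompts for its value when you run it.

```shell
# Hypothetical helper: print (do not run) one "hadoop credential create"
# command per ADLS alias for the given user's store. -value is omitted
# so that each command prompts for the value.
print_credential_commands() {
  provider="jceks://hdfs/user/$1/adlskeyfile.jceks"
  for alias_name in dfs.adls.oauth2.client.id \
                    dfs.adls.oauth2.credential \
                    dfs.adls.oauth2.refresh.url; do
    printf 'hadoop credential create %s -provider %s\n' "$alias_name" "$provider"
  done
}

print_credential_commands USER_NAME
```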

ADLS Configuration Notes

ADLS Trash Folder Behavior

If the fs.trash.interval property is set to a value other than zero on your cluster and you do not specify the -skipTrash flag with your rm command when you remove files, the deleted files are moved to the trash folder in your ADLS account. The trash folder in your ADLS account is located at adl://your_account.azuredatalakestore.net/user/user_name/.Trash/current/. For more information about HDFS trash, see Configuring HDFS Trash.

User and Group Names Displayed as GUIDs

By default, ADLS user and group names are displayed as GUIDs. For example, you receive the following output for these Hadoop commands:

$ hadoop fs -put /etc/hosts adl://your_account.azuredatalakestore.net/one_file
$ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
-rw-r--r--  1 94c1b91f-56e8-4527-b107-b52b6352320e cdd5b9e6-b49e-4956-be4b-7bd3ca314b18   273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file

To display user-friendly names, set the property adl.feature.ownerandgroup.enableupn to true in the core-site.xml file or at the command line. When this property is set to true, the -ls command returns the following output:

$ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
-rw-r--r--  1 YourADLSApp your_login_app    273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file
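
To set the property at the command line, you can pass it as a -D generic option ahead of the -ls subcommand. The following is a minimal sketch; the variable name `ADLS_UPN_OPT` is hypothetical, and `your_account` and `one_file` are the placeholders used above.

```shell
# Generic option that turns on user-friendly owner and group names.
ADLS_UPN_OPT="-Dadl.feature.ownerandgroup.enableupn=true"

# On a configured cluster host, list the file with friendly names
# (skipped if hadoop is not installed).
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs "$ADLS_UPN_OPT" -ls adl://your_account.azuredatalakestore.net/one_file
fi
```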
