Configuring ADLS Connectivity
Microsoft Azure Data Lake Store (ADLS) is a massively scalable distributed file system that can be accessed through an HDFS-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and POSIX-compliant ACLs. See the ADLS documentation for conceptual details.
CDH 5.11 and higher supports using ADLS as a storage layer for MapReduce2 (MRv2 or YARN), Hive, Hive-on-Spark, Spark 2.1, and Spark 1.6. Comparable HBase support was added in CDH 5.12. Other applications are not supported and may not work, even if they use MapReduce or Spark as their execution engine. Use the steps in this topic to set up a data store to use with these CDH components.
- ADLS is not supported as the default filesystem. Do not set the default filesystem property (fs.defaultFS) to an adl:// URI. You can still use ADLS as a secondary filesystem while HDFS remains the primary filesystem.
- Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
Setting up ADLS to Use with CDH
- To create your ADLS account, see the Microsoft documentation.
- Create the service principal in the Azure portal. See the Microsoft documentation on creating a service principal.
- Grant the service principal permission to access the ADLS account. See the Microsoft documentation on Authorization and access control. Review the section "Using ACLs for operations on file systems" for information about granting the service principal permission to access the account. You can skip the section on RBAC (role-based access control) because RBAC is used for management and you only need data access.
- Configure your CDH cluster to access your ADLS account. To access ADLS storage from a CDH cluster, you provide values for the following properties when submitting jobs:
ADLS Access Properties
Property Description    Property Name
Provider Type           dfs.adls.oauth2.access.token.provider.type (the value of this property should be ClientCredential)
Client ID               dfs.adls.oauth2.client.id
Client Secret           dfs.adls.oauth2.credential
Refresh URL             dfs.adls.oauth2.refresh.url
There are several methods you can use to provide these properties to your jobs, each with its own security and other considerations. Select one of the following methods to access data in ADLS:
- User-Supplied Key for Each Job
- Single Master Key for Cluster-Wide Access
- User-Supplied Key stored in a Hadoop Credential Provider
- Create a Hadoop Credential Provider and reference it in a customized copy of the core-site.xml file for the service
Testing and Using ADLS Access
- After configuring access, test your configuration by running the following command that lists files in your ADLS account:
hadoop fs -ls adl://your_account.azuredatalakestore.net/
If your configuration is correct, this command lists the files in your account.
- After successfully testing your configuration, you can access the ADLS account from MRv2, Hive, Hive-on-Spark, Spark 1.6, Spark 2.1, or HBase by using the following URI:
adl://your_account.azuredatalakestore.net
- Spark: See Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark
- HBase: See Using Azure Data Lake Store with HBase
- distcp: See Using DistCp with Microsoft Azure (ADLS).
- TeraGen:
export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000 adl://jzhugeadls.azuredatalakestore.net/tg
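The listing test described in this section can be wrapped in a small function so that it is easy to repeat after each configuration change. The following is a minimal sketch, not part of CDH: the function name check_adls_access and the HADOOP_BIN override variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: report whether the root of an ADLS account can be listed.
# HADOOP_BIN is an assumed override for this sketch (not a Hadoop setting);
# it defaults to the hadoop command on the PATH.
check_adls_access() {
  local account="$1"
  local uri="adl://${account}.azuredatalakestore.net/"
  # Run the same listing command used in the manual test, discarding its output.
  if "${HADOOP_BIN:-hadoop}" fs -ls "$uri" >/dev/null 2>&1; then
    echo "ok: $uri"
  else
    echo "failed: $uri"
    return 1
  fi
}
```

For example, check_adls_access your_account prints an ok line and returns zero when the listing succeeds, and prints a failed line and returns non-zero otherwise.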
User-Supplied Key for Each Job
- Advantages: No additional configuration is required.
- Disadvantages: Credentials will appear in log files, command history and other artifacts, which can be a serious security issue in some deployments.
hadoop command -Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
  -Ddfs.adls.oauth2.client.id=CLIENT ID \
  -Ddfs.adls.oauth2.credential='CLIENT SECRET' \
  -Ddfs.adls.oauth2.refresh.url=REFRESH URL \
  adl://<store>.azuredatalakestore.net/src hdfs://nn/tgt
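If you use this method often, one option is to assemble the four -D options from environment variables instead of retyping them. The sketch below assumes the variable names ADLS_CLIENT_ID, ADLS_CLIENT_SECRET, and ADLS_REFRESH_URL and the function name adls_job_opts; all of these are hypothetical and are not defined by Hadoop or CDH.

```shell
# Hypothetical helper: print the four ADLS -D options, one per line.
# ADLS_CLIENT_ID, ADLS_CLIENT_SECRET, and ADLS_REFRESH_URL are assumed
# environment variable names chosen for this sketch.
adls_job_opts() {
  if [ -z "${ADLS_CLIENT_ID:-}" ] || [ -z "${ADLS_CLIENT_SECRET:-}" ] || [ -z "${ADLS_REFRESH_URL:-}" ]; then
    echo "ADLS_CLIENT_ID, ADLS_CLIENT_SECRET, and ADLS_REFRESH_URL must be set" >&2
    return 1
  fi
  # The property names are the ones listed in the ADLS Access Properties table.
  printf '%s\n' \
    "-Ddfs.adls.oauth2.access.token.provider.type=ClientCredential" \
    "-Ddfs.adls.oauth2.client.id=${ADLS_CLIENT_ID}" \
    "-Ddfs.adls.oauth2.credential=${ADLS_CLIENT_SECRET}" \
    "-Ddfs.adls.oauth2.refresh.url=${ADLS_REFRESH_URL}"
}
```

In bash, the output can be read into an array with mapfile -t opts < <(adls_job_opts) and passed as "${opts[@]}" in place of the literal options. The values are still passed on the command line, so they remain visible in process listings and the disadvantages listed for this method still apply.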
Single Master Key for Cluster-Wide Access
Use Cloudera Manager to save the values in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
- Advantages: All users can access the ADLS storage.
- Disadvantages: This is a highly insecure means of providing access to ADLS for the following reasons:
- The credentials will appear in all Cloudera Manager-managed configuration files for all services in the cluster.
- The credentials will appear in the Job History server.
- Open the Cloudera Manager Admin Console and go to .
- Enter the following in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>CLIENT ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>CLIENT SECRET</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>REFRESH URL</value>
</property>
- Click Save Changes.
- Click Restart Stale Services so the cluster can read the new configuration information.
User-Supplied Key stored in a Hadoop Credential Provider
- Advantages: Credentials are securely stored in the credential provider.
- Disadvantages: Works with MapReduce2 and Spark only (Hive, Impala, and HBase are not supported).
- Create a Credential Provider.
- Create a password for the Hadoop Credential Provider and export it to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
- Provision the credentials by running the following commands:
hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value client ID
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value client secret
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value refresh URL
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
- Export the password to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
- Reference the Credential Provider on the command line when submitting jobs:
hadoop command -Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls-cred.jceks \
  adl://<store>.azuredatalakestore.net/
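Before submitting jobs, it can help to confirm that the credential store holds all three aliases created above. The sketch below reads the output of hadoop credential list from standard input and prints any alias that is missing. The function name missing_adls_aliases is hypothetical, and the check assumes only that each stored alias name appears somewhere in the listing text.

```shell
# Hypothetical helper: read the output of `hadoop credential list` on stdin
# and print each of the three ADLS aliases that does not appear in it.
# Returns non-zero if any alias is missing.
missing_adls_aliases() {
  local listing name status=0
  listing=$(cat)
  for name in dfs.adls.oauth2.client.id dfs.adls.oauth2.credential dfs.adls.oauth2.refresh.url; do
    case "$listing" in
      *"$name"*) ;;                      # alias found in the listing
      *) echo "$name"; status=1 ;;       # alias not found
    esac
  done
  return "$status"
}
```

A possible invocation is hadoop credential list -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks | missing_adls_aliases, which prints nothing when all three aliases are present.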
Create a Hadoop Credential Provider and reference it in a customized copy of the core-site.xml file for the service
- Advantages: All users can access the ADLS storage.
- Disadvantages: You must pass the path to the credential store on the command line.
- Create a Credential Provider:
- Create a password for the Hadoop Credential Provider and export it to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
- Provision the credentials by running the following commands:
hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value client ID
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value client secret
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value refresh URL
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
- Export the password to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
- Copy the contents of the /etc/service/conf directory to a working directory. The service can be one of the following:
- yarn
- spark
- spark2
Use the --dereference option when copying the directory so that symlinks are correctly resolved. For example:
cp -r --dereference /etc/spark/conf ~/my_custom_config_directory
Change the ownership so that you can edit the files:
sudo chown --recursive $USER ~/my_custom_config_directory/*
- Add the following to the core-site.xml file in the working directory:
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/path_to_credential_store_file</value>
</property>
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
The value of path_to_credential_store_file should match the path given to the -provider option in the hadoop credential create commands described in step 1.
- Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
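After editing the working copy, a quick text check can catch a missing property before you submit a job. The following is a rough sketch only: the function name check_adls_conf_dir is hypothetical, and it uses plain text matching rather than XML parsing, so it assumes each name and value element is written on a single line without extra whitespace.

```shell
# Hypothetical helper: confirm that a working configuration directory contains
# a core-site.xml naming a credential provider path and the ClientCredential
# provider type. This is a text match, not an XML parse.
check_adls_conf_dir() {
  local site="$1/core-site.xml"
  [ -f "$site" ] || { echo "missing: $site"; return 1; }
  grep -q '<name>hadoop.security.credential.provider.path</name>' "$site" \
    || { echo "no credential provider path in $site"; return 1; }
  grep -q '<value>ClientCredential</value>' "$site" \
    || { echo "provider type is not ClientCredential in $site"; return 1; }
  echo "ok: $site"
}
```

For example, check_adls_conf_dir "$HADOOP_CONF_DIR" prints an ok line when both entries are found in the working directory's core-site.xml.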
Creating a Credential Provider for ADLS
You can use a Hadoop Credential Provider to specify ADLS credentials, which allows you to run jobs without having to enter the client ID, client secret, and refresh URL on the command line. This prevents these credentials from being exposed in console output, log files, configuration files, and other artifacts. Running the command in this way requires that you provision a credential store to securely store these values. The credential store file is saved in HDFS.
- Create a password for the Hadoop Credential Provider and export it to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
- Provision the credentials by running the following commands:
hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value client ID
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value client secret
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adlskeyfile.jceks -value refresh URL
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
ADLS Configuration Notes
ADLS Trash Folder Behavior
If the fs.trash.interval property is set to a value other than zero on your cluster and you do not specify the -skipTrash flag with your rm command when you remove files, the deleted files are moved to the trash folder in your ADLS account. The trash folder in your ADLS account is located at adl://your_account.azuredatalakestore.net/user/user_name/.Trash/current/. For more information about HDFS trash, see Configuring HDFS Trash.
User and Group Names Displayed as GUIDs
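By default, ADLS user and group names are displayed as GUIDs. For example:
$ hadoop fs -put /etc/hosts adl://your_account.azuredatalakestore.net/one_file
$ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
-rw-r--r-- 1 94c1b91f-56e8-4527-b107-b52b6352320e cdd5b9e6-b49e-4956-be4b-7bd3ca314b18 273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file
To display user-friendly names instead, set the adl.feature.ownerandgroup.enableupn property to true in the core-site.xml file or on the command line. With this property set to true, the -ls command returns output like the following:
$ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
-rw-r--r-- 1 YourADLSApp your_login_app 273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file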