Copying Cluster Data Using DistCp
The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters. You can also use distcp to copy data to and from an Amazon S3 bucket. The distcp command submits a regular MapReduce job that performs a file-by-file copy.
$ hadoop distcp
Continue reading:
- DistCp Syntax and Examples
- Using DistCp with Highly Available Remote Clusters
- Using DistCp with Amazon S3
- Using DistCp with Microsoft Azure (ADLS)
- Using DistCp with Microsoft Azure (WASB)
- Kerberos Setup Guidelines for Distcp between Secure Clusters
- Distcp between Secure Clusters in Different Kerberos Realms
- Enabling Fallback Configuration
- Protocol Support for Distcp
DistCp Syntax and Examples
You can use distcp to copy files between compatible CDH clusters. The basic form of the distcp command only requires a source cluster and a destination cluster:
$ hadoop distcp <source> <destination>
Between the Same CDH Version
Use the following syntax for copying data between clusters that run the same CDH version:
hadoop distcp hdfs://<namenode>:<port> hdfs://<namenode>
For example, the following command copies data from the example-source cluster to the example-dest cluster:
hadoop distcp hdfs://example-source.cloudera.com:50070 hdfs://example-dest.cloudera.com
Port 50070 is the default NameNode port for HDFS.
Between Different CDH Versions
You can use distcp to copy data between different CDH versions. When you do this, adhere to the following guidelines:
- The CDH versions must be compatible.
- The source cluster must run a lower version of CDH than the destination cluster.
- Run the distcp command on the cluster that runs the higher version of CDH, which should be the destination cluster.
- Use the webhdfs protocol for the remote cluster.
- Use the following syntax:
hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>
For example, when using distcp between a CDH 5.7.0 cluster and a cluster that runs a higher version of CDH, use the webhdfs protocol for the CDH 5.7.0 cluster and designate it as the source cluster by listing it as the first cluster in the distcp command.
The following command copies data from a lower versioned source cluster named example-source to a higher versioned destination cluster named example-dest:
hadoop distcp webhdfs://example57-source.cloudera.com:50070 hdfs://example512-dest.cloudera.com
For a Specific Path
You can specify a path, such as /hbase, to move HBase data:
hadoop distcp webhdfs://example-source.cloudera.com:50070/hbase hdfs://example-dest.cloudera.com/hbase
To/from Amazon S3
You can copy data to or from Amazon S3 with the following syntax:
#Copying from S3 hadoop distcp s3a://<bucket>/<data> hdfs://<namenode>/<directory>/ #Copying to S3 hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>
This is a basic example of using distcp with S3. For more information, see Using DistCp with Amazon S3.
Using DistCp with Highly Available Remote Clusters
- Create a new directory and copy the contents of the /etc/hadoop/conf directory on the local cluster to this directory. The local cluster is the cluster
where you plan to run the distcp command.
Specify this directory for the --config parameter when you run the distcp command in step 5.
The following steps use distcpConf as the directory name. Substitute the name of the directory you created for distcpConf.
- In the hdfs-site.xml file in the distcpConf directory, add the nameservice ID for the remote cluster to the dfs.nameservices property.
- On the remote cluster, find the hdfs-site.xml file and copy the properties that refers to the nameservice ID to the end of the hdfs-site.xml file in the distcpConf directory you created in step 1:
- dfs.ha.namenodes.<nameserviceID>
- dfs.client.failover.proxy.provider.<remote nameserviceID>
- dfs.ha.automatic-failover.enabled.<remote nameserviceID>
- dfs.namenode.rpc-address.<nameserviceID>.<namenode1>
- dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>
- dfs.namenode.http-address.<nameserviceID>.<namenode1>
- dfs.namenode.https-address.<nameserviceID>.<namenode1>
- dfs.namenode.rpc-address.<nameserviceID>.<namenode2>
- dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>
- dfs.namenode.http-address.<nameserviceID>.<namenode2>
- dfs.namenode.https-address.<nameserviceID>.<namenode2>
- If you changed the nameservice ID for the remote cluster in step 2, update the nameservice ID used in the properties you copied in step 3 with the new nameservice ID, accordingly.
The following example shows the properties copied from the remote cluster with the following values:
- A remote nameservice called externalnameservice
- NameNodes called namenode1 and namenode2
- A host named remotecluster.com
<property> <name>dfs.ha.namenodes.externalnameservice</name> <value>namenode1,namenode2</value> </property> <property> <name>dfs.client.failover.proxy.provider.externalnameservice</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.ha.automatic-failover.enabled.externalnameservice</name> <value>true</value> </property> <property> <name>dfs.namenode.rpc-address.externalnameservice.namenode1</name> <value>remotecluster.com:8020</value> </property> <property> <name>dfs.namenode.servicerpc-address.externalnameservice.namenode1</name> <value>remotecluster.com:8022</value> </property> <property> <name>dfs.namenode.http-address.externalnameservice.namenode1</name> <value>remotecluster.com:20101</value> </property> <property> <name>dfs.namenode.https-address.externalnameservice.namenode1</name> <value>remotecluster.com:20102</value> </property> <property> <name>dfs.namenode.rpc-address.externalnameservice.namenode2</name> <value>remotecluster.com:8020</value> </property> <property> <name>dfs.namenode.servicerpc-address.externalnameservice.namenode2</name> <value>remotecluster.com:8022</value> </property> <property> <name>dfs.namenode.http-address.externalnameservice.namenode2</name> <value>remotecluster.com:20101</value> </property> <property> <name>dfs.namenode.https-address.externalnameservice.namenode2</name> <value>remotecluster.com:20102</value> </property>
At this point, the hdfs-site.xml file in the distcpConf directory should have both clusters and 4 NameNode IDs.
- Depending on the use case, the options specified when you run the distcp may differ. Here are some examples:
To copy data from an insecure cluster , run the following command:To copy data from a secure cluster, run the following command:
hadoop --config distcpConf distcp hdfs://<nameservice>/<source_directory> <target directory>
hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=<nameservice> hdfs://<nameservice>/<source_directory> <target directory>
For example:hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1 hdfs://ns1/xyz /tmp/test
If the distcp source or target are in encryption zones, include the following distcp options: -skipcrccheck -update. The distcp command may fail if you do not include these options when the source or target are in encryption zones because the CRC for the files may differ.
For CDH 5.12.0 and later, distcp between clusters that both use HDFS Transparent Encryption, you must include the exclude parameter.
Using DistCp with Amazon S3
You can copy HDFS files to and from an Amazon S3 instance. You must provision an S3 bucket using Amazon Web Services and obtain the access key and secret key. You can pass these credentials on the distcp command line, or you can reference a credential store to "hide" sensitive credentials so that they do not appear in the console output, configuration files, or log files.
Amazon S3 block and native filesystems are supported with the s3a:// protocol.
Example of an Amazon S3 Block Filesystem URI: s3a://bucket_name/path/to/file
<property> <name>fs.s3a.access.key</name> <value>...</value> </property> <property> <name>fs.s3a.secret.key</name> <value>...</value> </property>You can also enter the configurations in the Advanced Configuration Snippet for core-site.xml, which allows Cloudera Manager to manage this configuration. See Custom Configuration.
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... s3a://
hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey /user/hdfs/mydata s3a://myBucket/mydata_backup
Using a Credential Provider to Secure S3 Credentials
- Provision the credentials by running the following commands:
hadoop credential create fs.s3a.access.key -value access_key -provider jceks://hdfs/path_to_credential_store_file hadoop credential create fs.s3a.secret.key -value secret_key -provider jceks://hdfs/path_to_credential_store_file
For example:hadoop credential create fs.s3a.access.key -value foobar -provider jceks://hdfs/user/alice/home/keystores/aws.jceks hadoop credential create fs.s3a.secret.key -value barfoo -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
- Copy the contents of the /etc/hadoop/conf directory to a working directory.
- Add the following to the core-site.xml file in the working directory:
<property> <name>hadoop.security.credential.provider.path</name> <value>jceks://hdfs/path_to_credential_store_file</value> </property>
- Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
After completing these steps, you can run the distcp command using the following syntax:
hadoop distcp source_path s3a://destination_path
hadoop distcp source_path s3a://bucket_name/destination_path -Dhadoop.security.credential.provider.path=jceks://hdfspath_to_credential_store_file
There are additional options for the distcp command. See DistCp Guide (Apache Software Foundation).
Examples of DistCP Commands Using the S3 Protocol and Hidden Credentials
- Copying files to Amazon S3
-
hadoop distcp /user/hdfs/mydata s3a://myBucket/mydata_backup
- Copying files from Amazon S3
-
hadoop distcp s3a://myBucket/mydata_backup //user/hdfs/mydata
- Copying files to Amazon S3 using the -filters option to exclude specified source files
- You specify a file name with the -filters option. The referenced file contains regular expressions, one per line, that define file name patterns to
exclude from the distcp job. The pattern specified in the regular expression should match the fully-qualified path of the intended files, including the scheme
(hdfs, webhdfs, s3a, etc.). For example, the following are valid expressions for excluding files:
hdfs://x.y.z:8020/a/b/c webhdfs://x.y.z:50070/a/b/c s3a://bucket/a/b/c
Reference the file containing the filter expressions using -filters option. For example:hadoop distcp -filters /user/joe/myFilters /user/hdfs/mydata s3a://myBucket/mydata_backup
Contents of the sample myFilters file:.*foo.* .*/bar/.* hdfs://x.y.z:8020/tmp/.* hdfs://x.y.z:8020/tmp1/file1
The regular expressions in the myFilters exclude the following files:- .*foo.* – excludes paths that contain the string "foo".
- .*/bar/.* – excludes paths that include a directory named bar.
- hdfs://x.y.z:8020/tmp/.* – excludes all files in the /tmp directory.
- hdfs://x.y.z:8020/tmp1/file1 – excludes the file /tmp1/file1.
- Copying files to Amazon S3 with the -overwrite option.
- The -overwrite option overwrites destination files that already exist.
hadoop distcp -overwrite /user/hdfs/mydata s3a://user/mydata_backup
For more information about the -filters, -overwrite, and other options, see DIstCp Guide: Command Line Options (Apache Software Foundation).
Using DistCp with Microsoft Azure (ADLS)
- Configure connectivity to ADLS using one of the methods described in Configuring ADLS Connectivity for CDH.
- If you are copying data to or from Amazon S3, also configure connectivity to S3 as described above. See Using DistCp with Amazon S3
- Use the following syntax to define the Hadoop Credstore:
export HADOOP_CONF_DIR=path_to_working_directory export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
- Run distcp jobs using the following syntax:
ADLS to local cluster:
hadoop distcp adl://store.azuredatalakestore.net/src hdfs://hdfs_destination_path
Local cluster to ADLS:hadoop distcp hdfs://hdfs_destination_path adl://store.azuredatalakestore.net/src
You can also use distcp to copy data between Amazon S3 and Microsoft ADLS.
S3 to ADLS:hadoop distcp s3a://user/my_data adl://Account_Name.azuredatalakestore.net/my_data_backup/
ADL to S3:hadoop distcp s3a://user/my_data adl://Account_Name.azuredatalakestore.net/my_data_backup/
Note that when copying data between these remote filesystems, the data is first copied form the source filesystem to the local cluster before being copied to the destination filesystem.
Using DistCp with Microsoft Azure (WASB)
- Configure connectivity to Azure by setting the following property in core-site.xml.
<property> <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> <value>your_access_key</value> </property>
Note that in practice, you should never store your Azure access key in cleartext. Protect your Azure credentials using one of the methods described at Configuring Azure Blob Storage Credentials.
- Run your distcp jobs using the following syntax:
hadoop distcp wasb://<sample_container>@<sample_account>.blob.core.windows.net/ /hdfs_destination_path
- Upstream Hadoop documentation on Hadoop Support for Azure
Kerberos Setup Guidelines for Distcp between Secure Clusters
- You have two clusters, each in a different Kerberos realm (SOURCE and DESTINATION in this example)
- You have data that needs to be copied from SOURCE to DESTINATION
- A Kerberos realm trust exists, either between SOURCE and DESTINATION (in either direction), or between both SOURCE and DESTINATION and a common third realm (such as an Active Directory domain).
- Both SOURCE and DESTINATION clusters are running CDH 5.3.4 or higher
Environment Type | Kerberos Delegation Token Setting | |
---|---|---|
SOURCE trusts DESTINATION | Distcp job runs on the DESTINATION cluster | You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property. |
Distcp job runs on the SOURCE cluster | Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the DESTINATION cluster. | |
DESTINATION trusts SOURCE | Distcp job runs on the DESTINATION cluster | Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the SOURCE cluster. |
Distcp job runs on the SOURCE cluster | You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property. | |
Both SOURCE and DESTINATION trust each other | Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the DESTINATION cluster. | |
Neither SOURCE nor DESTINATION trusts the other | If a common realm is usable (such as Active Directory), set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of hostnames of the NameNodes of the cluster that is not running the
distcp job. For example, if you are running the job on the DESTINATION cluster:
|
Distcp between Secure Clusters in Different Kerberos Realms
This section explains how to copy data between two secure clusters in different Kerberos realms:
Configure Source and Destination Realms in krb5.conf
[realms] QA.EXAMPLE.COM = { kdc = kdc01.qa.example.com:88 admin_server = kdc01.qa.example.com:749 } DEV.EXAMPLE.COM = { kdc = kdc01.dev.example.com:88 admin_server = kdc01.dev.example.com:749 } [domain_realm] .qa.example.com = QA.EXAMPLE.COM qa.example.com = QA.EXAMPLE.COM .dev.example.com = DEV.EXAMPLE.COM dev.example.com = DEV.EXAMPLE.COM
Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns
- Open the Cloudera Manager Admin Console.
- Go to the HDFS service.
- Click the Configuration tab.
- Select .
- Select .
- Locate the Hadoop RPC Protection property and select authentication.
- Click Save Changes to commit the changes.
The following steps are not required if the two realms are already set up to trust each other, or have the same principal pattern. However, this isn't usually the case.
- Open the Cloudera Manager Admin Console.
- Go to the HDFS service.
- Click the Configuration tab.
- Select .
- Select .
- Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property to add:
<property> <name>dfs.namenode.kerberos.principal.pattern</name> <value>*</value> </property>
- Click Save Changes to commit the changes.
(If TLS/SSL is enabled) Specify Truststore Properties
<property> <name>ssl.client.truststore.location</name> <value>path_to_truststore</value> </property> <property> <name>ssl.client.truststore.password</name> <value>XXXXXX</value> </property> <property> <name>ssl.client.truststore.type</name> <value>jks</value> </property>
Set HADOOP_CONF to the Destination Cluster
Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.
Launch Distcp
hadoop distcp hdfs://xyz01.dev.example.com:8020/user/alice hdfs://abc01.qa.example.com:8020/user/alice
[libdefaults] udp_preference_limit = 1
Enabling Fallback Configuration
<property> <name>ipc.client.fallback-to-simple-auth-allowed</name> <value>true</value> </property>
Protocol Support for Distcp
The following table lists the protocols supported with the distcp command on different versions of CDH. "Secure" means that the cluster is configured to use Kerberos.
Source | Destination | Where to Issue distcp Command | Source Protocol | Source Config | Destination Protocol | Destination Config | Fallback Config Required |
---|---|---|---|---|---|---|---|
CDH 4 | CDH 4 | Destination | webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 4 | CDH 4 | Source or Destination | hdfs or webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 4 | CDH 4 | Source or Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 4 | CDH 4 | Destination | Insecure | hdfs or webhdfs | Insecure | ||
CDH 4 | CDH 5 | Destination | webhdfs | Secure | webhdfs or hdfs | Secure | |
CDH 4 | CDH 5.1.3+ | Destination | webhdfs | Insecure | webhdfs | Secure | Yes |
CDH 4 | CDH 5 | Destination | webhdfs | Insecure | webhdfs or hdfs | Insecure | |
CDH 4 | CDH 5 | Source | hdfs or webhdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Source or Destination | webhdfs | Secure | webhdfs | Secure | |
CDH 5 | CDH 4 | Source | hdfs | Secure | webhdfs | Secure | |
CDH 5.1.3+ | CDH 4 | Source | hdfs or webhdfs | Secure | webhdfs | Insecure | Yes |
CDH 5 | CDH 4 | Source or Destination | webhdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Destination | webhdfs | Insecure | hdfs | Insecure | |
CDH 5 | CDH 4 | Source | hdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Destination | webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 5 | CDH 5 | Source or Destination | hdfs or webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 5 | CDH 5 | Destination | webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 5.1.3+ | CDH 5 | Source | hdfs or webhdfs | Secure | hdfs or webhdfs | Insecure | Yes |
CDH 5 | CDH 5.1.3+ | Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Secure | Yes |
CDH 5 | CDH 5 | Source or Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 5 | CDH 5 | Destination | webhdfs | Insecure | hdfs or webhdfs | Insecure |