BDR Automation Examples
You can use the Cloudera Manager API to automate BDR tasks, such as creating a schedule for a replication. This page describes an automated solution for creating, running, and managing HDFS replication schedules in order to minimize Recovery Point Objectives (RPOs) for late-arriving data or to automate recovery after a disaster.
Automating HDFS Replication Schedules
Automating HDFS replication with the API is a multi-step process that involves the following tasks:
Step 1. Create a Peer
Before you can create or run a replication schedule, you need a peer Cloudera Manager instance. This peer acts as the source Cloudera Manager instance from which data is pulled. See Designating a Replication Source for more information.
The following code sample shows you how to create a peer:
#!/usr/bin/env python

from cm_api.api_client import ApiResource
from cm_api.endpoints.types import *

TARGET_CM_HOST = "<destination_cluster>"
SOURCE_CM_URL = "<source_cluster>:7180/"

api_root = ApiResource(TARGET_CM_HOST, username="<username>", password="<password>")
cm = api_root.get_cloudera_manager()
cm.create_peer("peer1", SOURCE_CM_URL, '<username>', '<password>')
- Replace <destination_cluster> with the domain name of the destination, for example target.cm.cloudera.com.
- Replace <source_cluster> with the domain name of the source, for example src.cm.cloudera.com. The sample already appends the port, :7180/.
- The user you specify must possess a role that is capable of creating a peer, such as the Cluster Administrator role.
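Running the sample a second time calls create_peer again for a peer that may already exist. The following is a minimal sketch of a guard around that call, assuming the object returned by get_cloudera_manager() also exposes get_peers() and that each returned peer has a name attribute. ensure_peer is a hypothetical helper name and is not part of cm_api.

```python
def ensure_peer(cm, name, url, username, password):
    """Create the peer only if no peer with this name is registered yet.

    `cm` is expected to behave like the object returned by
    api_root.get_cloudera_manager(). get_peers() and the `name` attribute
    are assumptions; verify them against your cm_api version.
    """
    for peer in cm.get_peers():
        if peer.name == name:
            return peer  # already registered, nothing to create
    return cm.create_peer(name, url, username, password)
```

With the variables from the sample above, the call would be ensure_peer(cm, "peer1", SOURCE_CM_URL, "<username>", "<password>").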
Step 2. Create the HDFS Replication Schedule
After you have added a peer Cloudera Manager instance that functions as the source, you can create a replication schedule:
import datetime

PEER_NAME = 'peer1'
SOURCE_CLUSTER_NAME = 'Cluster-src-1'
SOURCE_HDFS_NAME = 'HDFS-src-1'
TARGET_CLUSTER_NAME = 'Cluster-tgt-1'
TARGET_HDFS_NAME = 'HDFS-tgt-1'
TARGET_YARN_SERVICE = 'YARN-1'

hdfs = api_root.get_cluster(TARGET_CLUSTER_NAME).get_service(TARGET_HDFS_NAME)

hdfs_args = ApiHdfsReplicationArguments(None)
hdfs_args.sourceService = ApiServiceRef(None, peerName=PEER_NAME, clusterName=SOURCE_CLUSTER_NAME, serviceName=SOURCE_HDFS_NAME)
hdfs_args.sourcePath = '/src/path/'
hdfs_args.destinationPath = '/target/path'
hdfs_args.mapreduceServiceName = TARGET_YARN_SERVICE

# Create a schedule with daily frequency
start = datetime.datetime.now()  # The time at which the scheduled activity is triggered for the first time.
end = start + datetime.timedelta(days=365)  # The time after which the scheduled activity will no longer be triggered.

schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True, hdfs_args)
The example creates ApiHdfsReplicationArguments and populates attributes such as the source path, the destination path, and the MapReduce service to use. For the source service, you need to provide the HDFS service name and cluster name on the source Cloudera Manager instance. See the API documentation for the complete list of attributes for ApiHdfsReplicationArguments.
At the end of the example, hdfs_args is used to create an HDFS replication schedule.
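The start, end, "DAY", and 1 arguments describe a time window and a daily interval. As a rough illustration of what that window implies, the sketch below lists the first few trigger times for such a schedule using only the standard library. planned_runs is a hypothetical helper, and treating the end time as inclusive is an assumption; this models the timing only and is not how Cloudera Manager computes it internally.

```python
import datetime

def planned_runs(start, end, interval_days=1, limit=5):
    """Return up to `limit` trigger times, one every `interval_days` days,
    beginning at `start` and not going past `end`."""
    runs = []
    current = start
    step = datetime.timedelta(days=interval_days)
    while current <= end and len(runs) < limit:
        runs.append(current)
        current += step
    return runs

# Example window: one year starting at a fixed, illustrative timestamp.
start = datetime.datetime(2024, 1, 1, 2, 30)
end = start + datetime.timedelta(days=365)
for run in planned_runs(start, end, limit=3):
    print(run.isoformat())
```

Because the sample schedule uses datetime.datetime.now() as its start, the daily trigger time is whatever time of day the script happened to run. Passing an explicit start time gives more control over when the daily run occurs.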
Step 3. Run the Replication Schedule
The replication schedule created in step 2 has a frequency of 1 DAY, so the schedule will run at the initial start time every day. You can also manually run the schedule using the following:
cmd = hdfs.trigger_replication_schedule(schedule.id)
Step 4. Monitor the Schedule
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult
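The monitoring snippet waits for the command to finish and then reads the HDFS result of the most recent history entry. A minimal sketch of turning those two objects into a status line follows. It assumes the command exposes success and resultMessage and that the result exposes numFilesCopied and numBytesCopied; these attribute names may differ across cm_api versions, so the code reads them defensively. summarize_run is a hypothetical helper, and the stand-in objects exist only so the sketch runs without a Cloudera Manager instance.

```python
from types import SimpleNamespace

def summarize_run(cmd, result):
    """Build a one-line status string from a finished command and its result.

    Attribute names are assumptions about the cm_api objects; getattr()
    with defaults keeps the helper from failing if one is absent.
    """
    if not getattr(cmd, "success", False):
        return "FAILED: %s" % getattr(cmd, "resultMessage", "no message")
    files = getattr(result, "numFilesCopied", None)
    size = getattr(result, "numBytesCopied", None)
    return "OK: files=%s bytes=%s" % (files, size)

# Stand-in objects in place of the real cmd and result.
fake_cmd = SimpleNamespace(success=True, resultMessage="")
fake_result = SimpleNamespace(numFilesCopied=12, numBytesCopied=4096)
print(summarize_run(fake_cmd, fake_result))
```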
Configuring Replication to/from Cloud Providers
Step 1. Add a Cloud Account
Unlike cluster-to-cluster replication, which requires a peer Cloudera Manager instance, replicating to or from a cloud provider requires an account for that provider.
The following example shows how to add an S3 account:
ACCESS_KEY = "...."
SECRET_KEY = "...."
TYPE_NAME = 'AWS_ACCESS_KEY_AUTH'
account_configs = {'aws_access_key': ACCESS_KEY,
                   'aws_secret_key': SECRET_KEY}
cm.api.create_external_account("cloudAccount1",
                               "cloudAccount1",
                               TYPE_NAME,
                               account_configs=account_configs)
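Hard-coding keys in an automation script is easy to get wrong. One option, sketched below under stated assumptions, is to build the account_configs dictionary from environment variables. The variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY follow the common AWS convention and are an assumption here, and account_configs_from_env is a hypothetical helper that is not part of cm_api.

```python
import os

def account_configs_from_env(env=None):
    """Build the account_configs dict for an S3 account from environment
    variables instead of literals in the script.

    `env` defaults to os.environ; pass a plain dict to try the helper
    without touching the real environment.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "aws_access_key": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The returned dictionary can be passed as the account_configs argument in the sample above.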
Step 2. Create the Replication Schedule
import datetime

CLUSTER_NAME = 'Cluster-tgt-1'
HDFS_NAME = 'HDFS-tgt-1'
CLOUD_ACCOUNT = 'cloudAccount1'
YARN_SERVICE = 'YARN-1'

hdfs = api_root.get_cluster(CLUSTER_NAME).get_service(HDFS_NAME)

hdfs_cloud_args = ApiHdfsCloudReplicationArguments(None)
hdfs_cloud_args.sourceService = ApiServiceRef(None, peerName=None, clusterName=CLUSTER_NAME, serviceName=HDFS_NAME)
hdfs_cloud_args.sourcePath = '/src/path'
hdfs_cloud_args.destinationPath = 's3a://bucket/target/path/'
hdfs_cloud_args.destinationAccount = CLOUD_ACCOUNT
hdfs_cloud_args.mapreduceServiceName = YARN_SERVICE

# Create a schedule with daily frequency
start = datetime.datetime.now()  # The time at which the scheduled activity is triggered for the first time.
end = start + datetime.timedelta(days=365)  # The time after which the scheduled activity will no longer be triggered.

# Pass the cloud arguments built above (not hdfs_args from the cluster-to-cluster example)
schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True, hdfs_cloud_args)
The example creates ApiHdfsCloudReplicationArguments, populates it, and creates an HDFS to S3 backup schedule. In addition to specifying attributes such as the source path and destination path, the example provides destinationAccount as CLOUD_ACCOUNT and peerName as None in sourceService. The peerName is None since there is no peer for cloud replication schedules.
hdfs_cloud_args is then used to create an HDFS-to-S3 replication schedule with a frequency of 1 day.
Step 3. Run the Replication Schedule
The replication schedule created in step 2 has a frequency of 1 DAY, so the schedule will run at the initial start time every day. You can also manually run the schedule using the following:
cmd = hdfs.trigger_replication_schedule(schedule.id)
Step 4. Monitor the Schedule
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult
Maintaining Replication Schedules
The following actions can be performed on replication schedules that are cluster-to-cluster or cluster to/from a cloud provider:
- Get all replication schedules for a given service:
  schs = hdfs.get_replication_schedules()
- Get a given replication schedule by schedule ID for a given service:
  sch = hdfs.get_replication_schedule(schedule_id)
- Delete a given replication schedule by schedule ID for a given service:
  sch = hdfs.delete_replication_schedule(schedule_id)
- Update a given replication schedule by schedule ID for a given service:
  sch.hdfsArguments.removeMissingFiles = True
  sch = hdfs.update_replication_schedule(sch.id, sch)
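When a service has many schedules, it can help to narrow the list returned by get_replication_schedules() before updating or deleting anything. The sketch below filters schedules by destination path prefix. It assumes each schedule carries an hdfsArguments object with a destinationPath attribute, as in the cluster-to-cluster example; cloud schedules may store their arguments under a different attribute, in which case they are simply skipped. schedules_for_destination is a hypothetical helper.

```python
from types import SimpleNamespace

def schedules_for_destination(schedules, destination_prefix):
    """Return the schedules whose HDFS destination path starts with the
    given prefix. Schedules without hdfsArguments are ignored."""
    matches = []
    for sch in schedules:
        args = getattr(sch, "hdfsArguments", None)
        dest = getattr(args, "destinationPath", None) if args is not None else None
        if dest and dest.startswith(destination_prefix):
            matches.append(sch)
    return matches

# Stand-in schedule objects so the sketch runs without Cloudera Manager.
demo = [
    SimpleNamespace(id=1, hdfsArguments=SimpleNamespace(destinationPath="/target/path")),
    SimpleNamespace(id=2, hdfsArguments=SimpleNamespace(destinationPath="/other")),
    SimpleNamespace(id=3, hdfsArguments=None),
]
print([sch.id for sch in schedules_for_destination(demo, "/target")])
```

Against a live service, the first argument would be the result of hdfs.get_replication_schedules().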
Debugging Failures During Replication
If a replication job fails, you can download replication diagnostic data for the replication command to troubleshoot and diagnose any issues.
The diagnostic data includes all the logs generated, including the MapReduce logs. You can also upload the logs to a support case for further analysis. Collecting a replication diagnostic bundle is available for API v11+ and Cloudera Manager version 5.5+.
import os
import tempfile

args = {}
# Positional arguments: a positional argument cannot follow a keyword argument in Python
resp = hdfs.collect_replication_diagnostic_data(schedule.id, args)

# Download replication diagnostic bundle to a temp directory
tmpdir = tempfile.mkdtemp(prefix="support-bundle-replication")
support_bundle_path = os.path.join(tmpdir, "support-bundle.zip")
cm.download_from_url(resp.resultDataUrl, support_bundle_path)