BDR Automation Examples

You can use the Cloudera Manager API to automate BDR tasks, such as creating a schedule for a replication. This page describes an automated solution for creating, running, and managing HDFS replication schedules in order to minimize Recovery Point Objectives (RPOs) for late arriving data or to automate recovery after disaster recovery.

For more information about the Cloudera Manager API and how to install the API client, see the following:

Automating HDFS Replication Schedules

Automating HDFS replication with the API is a multi-step process that involves the following tasks:

Step 1. Create a Peer

Before you can create or run a replication schedule, you need a peer Cloudera Manager instance. This peer acts as the source Cloudera Manager instance where data is pulled from. See Designating a Replication Source for more information.

The following code sample shows you how to create a peer:

#!/usr/bin/env python
from cm_api.api_client import ApiResource
from cm_api.endpoints.types import *
TARGET_CM_HOST = "<destination_cluster>"
SOURCE_CM_URL = "<source_cluster>:7180/"
api_root = ApiResource(TARGET_CM_HOST, username="<username>", password="<password>")
cm = api_root.get_cloudera_manager()
cm.create_peer("peer1", SOURCE_CM_URL, '<username>', '<password>')
The above sample creates an API root handle and gets a Cloudera Manager instance from it before creating the peer. To implement a similar solution to the example, keep the following guidelines in mind:
  • Replace <destination_cluster> with the domain name of the destination, for example target.cm.cloudera.com.
  • Replace <source_cluster> with the domain name of the source, for example src.cm.cloudera.com:7180/.
  • The user you specify must possess a role that is capable of creating a peer, such as the Cluster Administrator role.

Step 2. Create the HDFS Replication Schedule

After you have add a peer Cloudera Manager instance that functions as the source, you can create a replication schedule:

PEER_NAME='peer1'
SOURCE_CLUSTER_NAME='Cluster-src-1'
SOURCE_HDFS_NAME='HDFS-src-1'
TARGET_CLUSTER_NAME='Cluster-tgt-1'
TARGET_HDFS_NAME='HDFS-tgt-1'
TARGET_YARN_SERVICE='YARN-1'
hdfs = api_root.get_cluster(TARGET_CLUSTER_NAME).get_service(TARGET_HDFS_NAME)
hdfs_args = ApiHdfsReplicationArguments(None)
hdfs_args.sourceService = ApiServiceRef(None,
                                peerName=PEER_NAME,
                                clusterName=SOURCE_CLUSTER_NAME,
                                serviceName=SOURCE_HDFS_NAME)
hdfs_args.sourcePath = '/src/path/'
hdfs_args.destinationPath = '/target/path'
hdfs_args.mapreduceServiceName = TARGET_YARN_SERVICE
# creating a schedule with daily frequency
start = datetime.datetime.now() # The time at which the scheduled activity is triggered for the first time.
end = start + datetime.timedelta(days=365) # The time after which the scheduled activity will no longer be triggered.
schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True, hdfs_args)

The example creates ApiHdfsReplicationArguments and populate attributes such as source path, destination name, MapReduce service to use, and others. For the source service, you will need to provide the HDFS service name and cluster name on the source Cloudera Manager instance. See the API documentation for the complete list of attributes for ApiHdfsReplicationArguments.

At the end of the example, hdfs_args is used to create an HDFS replication schedule.

Step 3. Run the Replication Schedule

The replication schedule created in step 2 has a frequency of 1 DAY, so the schedule will run at the initial start time every day. You can also manually run the schedule using the following:

cmd = hdfs.trigger_replication_schedule(schedule.id)

Step 4. Monitor the Schedule

Once you get a command (cmd), you can wait for the command to finish and then get the results:
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult

Configuring Replication to/from Cloud Providers

Step 1. Add a Cloud Account

Instead of adding a peer Cloudera Manager instance like a cluster-to-cluster replication, replicating to or from a cloud provider requires an account for that provider.

The following example shows how to add an S3 account:

ACCESS_KEY="...."
SECRET_KEY="...."
TYPE_NAME = 'AWS_ACCESS_KEY_AUTH'
account_configs ={'aws_access_key': ACCESS_KEY,
   'aws_secret_key': SECRET_KEY}
cm.api.create_external_account("cloudAccount1",
    "cloudAccount1",
    TYPE_NAME,
    account_configs=account_configs)

Step 2. Create the Replication Schedule

CLUSTER_NAME='Cluster-tgt-1'
HDFS_NAME='HDFS-tgt-1'
CLOUD_ACCOUNT='cloudAccount1'
YARN_SERVICE='YARN-1'
hdfs = api_root.get_cluster(CLUSTER_NAME).get_service(HDFS_NAME)
hdfs_cloud_args = ApiHdfsCloudReplicationArguments(None)
hdfs_cloud_args.sourceService = ApiServiceRef(None,
                                peerName=None,
                                clusterName=CLUSTER_NAME,
                                serviceName=HDFS_NAME)
hdfs_cloud_args.sourcePath = '/src/path'
hdfs_cloud_args.destinationPath = 's3a://bucket/target/path/'
hdfs_cloud_args.destinationAccount = CLOUD_ACCOUNT
hdfs_cloud_args.mapreduceServiceName = YARN_SERVICE

# creating a schedule with daily frequency
start = datetime.datetime.now() # The time at which the scheduled activity is triggered for the first time.
end = start + datetime.timedelta(days=365) # The time after which the scheduled activity will no longer be triggered.

schedule = hdfs.create_replication_schedule(start, end, "DAY", 1, True, hdfs_args)

The example creates ApiHdfsCloudReplicationArguments, populates it, and creates an HDFS to S3 backup schedule. In addition to specifying attributes such as the source path and destination path, the example provides destinationAccount as CLOUD_ACCOUNT and peerName as None in sourceService. The peerName is None since there is no peer for cloud replication schedules.

hdfs_cloud_args is then used to create a HDFS-S3 replication schedule with a frequency of 1 day.

Step 3. Run the Replication Schedule

The replication schedule created in step 2 has a frequency of 1 DAY, so the schedule will run at the initial start time every day. You can also manually run the schedule using the following:

cmd = hdfs.trigger_replication_schedule(schedule.id)

Step 4. Monitor the Schedule

Once you get a command (cmd), you can wait for the command to finish and then get the results:
cmd = cmd.wait()
result = hdfs.get_replication_schedule(schedule.id).history[0].hdfsResult

Maintaining Replication Schedules

The following actions can be performed on replication schedules that are cluster-to-cluster or cluster to/from a cloud provider:

Get all replication schedules for a given service:
schs = hdfs.get_replication_schedules()
Get a given replication schedule by schedule id for a given service:
sch = hdfs.get_replication_schedule(schedule_id)
Delete a given replication schedule by schedule id for a given service:
sch = hdfs.delete_replication_schedule(schedule_id)
Update a given replication schedule by schedule id for a given service:
sch.hdfsArguments.removeMissingFiles = True
sch = hdfs.update_replication_schedule(sch.id, sch)
Debugging failures during replication
If a replication job fails, you can download replication diagnostic data for the replication command to troubleshoot and diagnose any issues.

The diagnostic data includes all the logs generated, including the MapReduce logs. You can also upload the logs to a support case for further analysis. Collecting a replication diagnostic bundle is available for API v11+ and Cloudera Manager version 5.5+.

args = {}
resp = hdfs.collect_replication_diagnostic_data(schedule_id=schedule.id, args)

# Download replication diagnostic bundle to a temp directory
tmpdir = tempfile.mkdtemp(prefix="support-bundle-replication")
support_bundle_path = os.path.join(tmpdir, "support-bundle.zip")
cm.download_from_url(resp.resultDataUrl, support_bundle_path)