Preparing to create Iceberg replication policies

Before you create an Iceberg replication policy, review the guidelines and limitations that apply to Iceberg replication policies, and then complete the prerequisites. Iceberg replication policies can replicate Iceberg V1 and V2 tables created using Spark (read-only with Impala) between Cloudera Private Cloud Base 7.1.9 or higher clusters that use Cloudera Manager 7.11.3 or higher. In Cloudera Private Cloud Base 7.3.1 and higher, Replication Manager can also replicate V1 and V2 Iceberg tables created using Hive.

  • Consider the following guidelines and limitations before you prepare to create Iceberg replication policies:
    • If you already have a replication policy that replicates a database from the source cluster to the target cluster, Cloudera highly recommends that you do not replicate this database from the target cluster back to the source cluster. Bidirectional replication is not supported, and reverse replication creates issues.
    • You can use one or more Iceberg replication policies to replicate a database from the source cluster to the target cluster. You must ensure that you replicate the database only from the source cluster to the target cluster to maintain a single source of truth for the database.
    • If you want to use HDFS High Availability (HA) in your environment for Iceberg tables, Cloudera recommends that you enable HDFS HA first, and then create the Iceberg tables. For more information, see Create table feature using Apache Iceberg.
  • Ensure that the source and target clusters run Cloudera Private Cloud Base 7.1.9 or higher and are managed by Cloudera Manager 7.11.3 or higher.
  • Activate the Iceberg Replication parcel. The parcel might be included in your Cloudera Runtime distribution or in a separate distribution. For more information, contact your Cloudera account team.
  • Add the Iceberg Replication service on both clusters.
    To add a service, go to the Cloudera Manager > Clusters > [***CLUSTER NAME***] page, and select Actions > Add Service. For more information, see Adding a Service.
  • Ensure that you have the Atlas user credentials in addition to the Replication Administrator or Full Administrator roles to replicate Atlas metadata. The atlas user must also have relevant read and write permissions to the staging locations.
  • Ensure that Cloudera Lakehouse Optimizer is disabled and is not available in your target cluster if you have enabled the service in your AWS or Azure environment. This service is available in Cloudera on cloud 7.3.1.500 and in higher versions.
    If the Cloudera Lakehouse Optimizer service is available in the target cluster and if a compaction maintenance task is scheduled to run on the replicated tables, the Cloudera Lakehouse Optimizer policy runs a compaction maintenance task on the replicated Iceberg tables. By default, if the metadata.json file of the target cluster is absent on the source cluster, Replication Manager initiates a bootstrap replication in the subsequent Iceberg replication policy job. During the bootstrap replication, Replication Manager copies the already replicated small files from the source cluster to the target cluster, and the Cloudera Lakehouse Optimizer policy detects these small files and triggers a compaction maintenance task. This leads to a repetitive cycle and negates the benefit of the compaction task.

    For more information about Cloudera Lakehouse Optimizer, see Lakehouse Optimizer.
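
The version requirements above can be checked with a small script before you create a policy. The following is a minimal sketch, not part of Replication Manager or Cloudera Manager: the function names and the returned keys are hypothetical, and the version strings are assumed to be typed in by hand (for example, copied from the Cloudera Manager UI). It compares versions numerically, because a plain string comparison would rank 7.9.5 above 7.11.3.

```python
def parse_version(text):
    """Turn a dotted version string such as '7.11.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum versions taken from the prerequisites in this topic.
MIN_RUNTIME = parse_version("7.1.9")        # Cloudera Private Cloud Base
MIN_CM = parse_version("7.11.3")            # Cloudera Manager
MIN_RUNTIME_HIVE = parse_version("7.3.1")   # Hive-created Iceberg tables


def iceberg_replication_support(runtime_version, cm_version):
    """Report what one cluster's versions allow (hypothetical helper).

    Run it once for the source cluster and once for the target cluster.
    """
    runtime = parse_version(runtime_version)
    cm = parse_version(cm_version)
    supported = runtime >= MIN_RUNTIME and cm >= MIN_CM
    return {
        # Iceberg tables created using Spark
        "spark_tables": supported,
        # Iceberg tables created using Hive need 7.3.1 or higher
        "hive_tables": supported and runtime >= MIN_RUNTIME_HIVE,
    }


if __name__ == "__main__":
    print(iceberg_replication_support("7.1.9", "7.11.3"))
    print(iceberg_replication_support("7.3.1", "7.13.1"))
```

This sketch checks only the version numbers. It does not verify the remaining prerequisites, such as the Iceberg Replication parcel, the Iceberg Replication service, or the Atlas user credentials.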