Using the HashTable/SyncTable Tool
HashTable/SyncTable tool overview
HashTable/SyncTable is a two-step tool for synchronizing table data without copying all cells in a specified row key/time period range.
The HashTable/SyncTable tool can be used for partial or entire table data synchronization, within the same cluster or between remote clusters. Both the HashTable and the SyncTable steps are implemented as MapReduce jobs.
The first step, HashTable, creates hashed indexes for batches of cells on the source table and outputs them as results. The source table is the table whose state is copied to its counterpart.
The second step, SyncTable, scans the target table and calculates hash indexes for its cells, then compares these hashes to the outputs of the HashTable step. For batches with diverging hashes, SyncTable rescans and compares the individual cells, updating only the mismatching ones.
This results in less network traffic and data transfer than other methods, such as CopyTable, whose performance can suffer when large tables are synchronized between remote clusters.
Remote clusters are often deployed in different Kerberos realms. SyncTable supports cross-realm authentication, allowing a SyncTable process running on the target cluster to connect to the source cluster and read both the HashTable output files and the given HBase table when performing the required comparisons.
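In outline, the two steps are invoked as MapReduce driver classes. The following is a minimal sketch using placeholder names that the later examples in this section reuse; the first command runs on the source cluster, the second on the target cluster:
$ hbase org.apache.hadoop.hbase.mapreduce.HashTable TestTableA /hashes/testTable
$ hbase org.apache.hadoop.hbase.mapreduce.SyncTable hdfs://nn:8020/hashes/testTable TestTableA TestTableB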
HashTable/SyncTable tool configuration
You can configure the HashTable/SyncTable tool for your specific needs.
Using the batchsize option
You can define the amount of cell data for a given region that is hashed together in a single hash value using the batchsize option, which sets the batchsize property. Sizing this property has a direct impact on synchronization efficiency. If the batch size is increased, larger chunks are hashed.
If only a few differences are expected between the two tables, using a slightly larger batch size can be beneficial, as fewer scans are executed by the mapper tasks of SyncTable.
However, if relatively frequent differences are expected between the tables, using a large batch size can cause frequent mismatches of hash values, as the probability of finding at least one mismatch in a batch is increased, and every mismatching batch must then be rescanned cell by cell.
For example, the following HashTable run uses a 32 kB batch size:
$ hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTableA /hashes/testTable
Creating a read-only report
You can use the dryrun option in the second, SyncTable, step to create a read-only report. It produces only COUNTERS indicating the differences between the two tables, but does not perform any actual changes. It can be used as an alternative to the VerifyReplication tool.
$ hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:8020/hashes/testTable TestTableA TestTableB
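The dry run reports its findings through the job counters. The counter names below are the ones emitted by the SyncTable job; the values shown are purely illustrative:
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
        BATCHES=97148
        HASHES_MATCHED=97146
        HASHES_NOT_MATCHED=2
        MATCHINGCELLS=17
        MATCHINGROWS=2
        RANGESNOTMATCHED=2
        ROWSWITHDIFFS=2
        SOURCEMISSINGCELLS=1
        TARGETMISSINGCELLS=1
HASHES_NOT_MATCHED shows how many hashed batches diverged, while ROWSWITHDIFFS, SOURCEMISSINGCELLS, and TARGETMISSINGCELLS break the differences down to the row and cell level.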
Synchronize table data using HashTable/SyncTable
The HashTable/SyncTable tool can be used for partial or entire table data synchronization, within the same cluster or between remote clusters.
Prerequisites
- Ensure that all RegionServers/DataNodes on the source cluster are accessible by the NodeManagers on the target cluster where the SyncTable job tasks will be running.
- In the case of secured clusters, the user on the target cluster who executes the SyncTable job must be able to do the following on the HDFS and HBase services of the source cluster:
- Authenticate: for example, using centralized authentication.
- Be authorized: have at least read permission; see the example after this list.
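As an illustration, assuming a hypothetical user syncuser submitting the SyncTable job from the target cluster, read permission on the source table can be granted in the HBase shell of the source cluster:
hbase> grant 'syncuser', 'R', 'TestTableA'
The user also needs HDFS read access to the HashTable output path on the source cluster.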
Steps
- Run HashTable on the source cluster: HashTable [options] <tablename> <outputpath>.
The following example hashes the TestTableA table in 32 kB batches for a 1-hour window into 50 files:
$ hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTableA /hashes/testTable
For more detailed information regarding HashTable options, use hbase org.apache.hadoop.hbase.mapreduce.HashTable --help.
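HashTable writes its hashes into the given output path, together with a manifest file and a partitions file that SyncTable reads in the next step. A listing along the following lines is expected; the entries shown are illustrative and the exact layout can vary between HBase versions:
$ hdfs dfs -ls /hashes/testTable
drwxr-xr-x   - hbase hbase      0 ... /hashes/testTable/hashes
-rw-r--r--   3 hbase hbase    ... ... /hashes/testTable/manifest
-rw-r--r--   3 hbase hbase    ... ... /hashes/testTable/partitions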
- Run SyncTable on the target cluster: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>.
The following example performs a dry run of SyncTable, comparing TestTableA on a remote source cluster with the local TestTableB on the target cluster:
$ hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:8020/hashes/testTable TestTableA TestTableB
For more detailed information regarding SyncTable options, use hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help.
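Once the dry run counters look as expected, running the same command without the dryrun option applies the reported changes to the target table. The doDeletes and doPuts options can further restrict the kind of change that is applied; for example, the following sketch only adds and updates cells on the target, keeping cells that exist there but not on the source:
$ hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:8020/hashes/testTable TestTableA TestTableB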