Writing Data to HBase
To write data to HBase, you use methods of the HTableInterface class. You can use the Java API directly, or use the HBase Shell, the REST API, the Thrift API, , or another client which uses the Java API indirectly. When you issue a Put, the coordinates of the data are the row, the column, and the timestamp. The timestamp is unique per version of the cell, and can be generated automatically or specified programmatically by your application, and must be a long integer.
Variations on Put
- A Put operation writes data into HBase.
- A Delete operation deletes data from HBase. What actually happens during a Delete depends upon several factors.
- A CheckAndPut operation performs a Scan before attempting the Put, and only does the Put if a value matches what is expected, and provides row-level atomicity.
- A CheckAndDelete operation performs a Scan before attempting the Delete, and only does the Delete if a value matches what is expected.
- An Increment operation increments values of one or more columns within a single row, and provides row-level atomicity.
Refer to the API documentation for a full list of methods provided for writing data to HBase.Different methods require different access levels and have other differences.
Versions
When you put data into HBase, a timestamp is required. The timestamp can be generated automatically by the RegionServer or can be supplied by you. The timestamp must be unique per version of a given cell, because the timestamp identifies the version. To modify a previous version of a cell, for instance, you would issue a Put with a different value for the data itself, but the same timestamp.
HBase's behavior regarding versions is highly configurable. The maximum number of versions defaults to 1 in CDH 5, and 3 in previous versions. You can change the default value for HBase by configuring hbase.column.max.version in hbase-site.xml, either using an advanced configuration snippet if you use Cloudera Manager, or by editing the file directly otherwise.
hbase> alter ‘t1′, NAME => ‘f1′, VERSIONS => 5 hbase> alter ‘t1′, NAME => ‘f1′, MIN_VERSIONS => 2 hbase> alter ‘t1′, NAME => ‘f1′, TTL => 15
HBase sorts the versions of a cell from newest to oldest, by sorting the timestamps lexicographically. When a version needs to be deleted because a threshold has been reached, HBase always chooses the "oldest" version, even if it is in fact the most recent version to be inserted. Keep this in mind when designing your timestamps. Consider using the default generated timestamps and storing other version-specific data elsewhere in the row, such as in the row key. If MIN_VERSIONS and TTL conflict, MIN_VERSIONS takes precedence.
Deletion
When you request for HBase to delete data, either explicitly using a Delete method or implicitly using a threshold such as the maximum number of versions or the TTL, HBase does not delete the data immediately. Instead, it writes a deletion marker, called a tombstone, to the HFile, which is the physical file where a given RegionServer stores its region of a column family. The tombstone markers are processed during major compaction operations, when HFiles are rewritten without the deleted data included.
Even after major compactions, "deleted" data may not actually be deleted. You can specify the KEEP_DELETED_CELLS option for a given column family, and the tombstones will be preserved in the HFile even after major compaction. One scenario where this approach might be useful is for data retention policies.
Another reason deleted data may not actually be deleted is if the data would be required to restore a table from a snapshot which has not been deleted. In this case, the data is moved to an archive during a major compaction, and only deleted when the snapshot is deleted. This is a good reason to monitor the number of snapshots saved in HBase.
Examples
hbase> put 'test', 'row1', 'cf:a', 'value1' 0 row(s) in 0.1770 seconds hbase> put 'test', 'row2', 'cf:b', 'value2' 0 row(s) in 0.0160 seconds hbase> put 'test', 'row3', 'cf:c', 'value3' 0 row(s) in 0.0260 seconds hbase> scan 'test' ROW COLUMN+CELL row1 column=cf:a, timestamp=1403759475114, value=value1 row2 column=cf:b, timestamp=1403759492807, value=value2 row3 column=cf:c, timestamp=1403759503155, value=value3 3 row(s) in 0.0440 seconds
publicstaticfinalbyte[] CF = "cf".getBytes(); publicstaticfinalbyte[] ATTR = "attr".getBytes(); ... Put put = new Put(Bytes.toBytes(row)); put.add(CF, ATTR, Bytes.toBytes( data)); htable.put(put);
publicstaticfinalbyte[] CF = "cf".getBytes(); publicstaticfinalbyte[] ATTR = "attr".getBytes(); ... Put put = new Put( Bytes.toBytes(row)); long explicitTimeInMs = 555; // just an example put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data)); htable.put(put);
Further Reading
- Refer to the HTableInterface and HColumnDescriptor API documentation for more details about configuring tables and columns, as well as reading and writing to HBase.
- Refer to the Apache HBase Reference Guide for more in-depth information about HBase, including details about versions and deletions not covered here.