Storing Medium Objects (MOBs) in HBase
Data comes in many sizes, and saving all of your data in HBase, including binary data such as images and documents, is convenient. HBase can technically handle binary objects with cells that are up to 10 MB in size. However, HBase's normal read and write paths are optimized for values smaller than 100 KB. When HBase handles large numbers of values up to 10 MB (medium objects, or MOBs), performance degrades because of write amplification caused by splits and compactions.
One way to solve this problem is to store objects larger than 100 KB directly in HDFS and store references to their locations in HBase. CDH 5.4 and higher includes optimizations for storing MOBs directly in HBase, based on HBASE-11339.
To use MOB, you must use HFile version 3. Optionally, you can configure the MOB file reader's cache settings Service-Wide and for each RegionServer, and then configure specific columns to hold MOB data. No change to client code is required for HBase MOB support.
Enabling HFile Version 3 Using Cloudera Manager
Minimum Required Role: Full Administrator
- Go to the HBase service.
- Click the Configuration tab.
- Search for the property HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml.
- Paste the following XML into the Value field and save your changes.
<property>
  <name>hfile.format.version</name>
  <value>3</value>
</property>
Configuring Columns to Store MOBs
- IS_MOB is a Boolean option, which specifies whether or not the column can store MOBs.
- MOB_THRESHOLD configures the number of bytes at which an object is considered to be a MOB. If you do not specify a value for MOB_THRESHOLD, the default is 100 KB. If you write a value larger than this threshold, it is treated as a MOB.
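The threshold rule described above can be sketched in a few lines of Java. This is an illustration only, not HBase source code: the class name MobThresholdSketch and the method isMob are hypothetical, and treating a value of exactly the threshold size as a non-MOB is an assumption based on the wording "larger than this threshold".

```java
// Illustrative sketch only; this is not HBase source code.
final class MobThresholdSketch {
    // Default threshold stated in the text: 100 KB, expressed in bytes.
    static final long DEFAULT_MOB_THRESHOLD = 100L * 1024L;

    // Assumption: a value counts as a MOB only when it is strictly
    // larger than the threshold, following the wording of the text.
    static boolean isMob(long valueLengthBytes, long thresholdBytes) {
        return valueLengthBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        long[] sizes = {512L, 102_400L, 102_401L, 5L * 1024L * 1024L};
        for (long size : sizes) {
            System.out.println(size + " bytes -> MOB? "
                    + isMob(size, DEFAULT_MOB_THRESHOLD));
        }
    }
}
```

The value 102400 used for MOB_THRESHOLD in the examples that follow is the same 100 KB default written out in bytes.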
You can configure a column to store MOBs using the HBase Shell or the Java API.
Using HBase Shell:
hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
Using the Java API:
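import org.apache.hadoop.hbase.HColumnDescriptor;

HColumnDescriptor hcd = new HColumnDescriptor("f");
hcd.setMobEnabled(true);       // allow this column family to store MOBs
hcd.setMobThreshold(102400L);  // threshold in bytes (100 KB)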
HBase MOB Cache Properties
Because there can be a large number of MOB files at any time, as compared to the number of HFiles, MOB files are not always kept open. The MOB file reader cache is an LRU cache that keeps the most recently used MOB files open.
Property | Default | Description |
---|---|---|
hbase.mob.file.cache.size | 1000 | The number of opened file handlers to cache. A larger value benefits reads by providing more file handlers per MOB file cache and reduces how often files are opened and closed. However, if the value is too high, errors such as "Too many opened file handlers" may be logged. |
hbase.mob.cache.evict.period | 3600 | The amount of time in seconds after a file is opened before the MOB cache evicts cached files. The default value is 3600 seconds. |
hbase.mob.cache.evict.remain.ratio | 0.5f | The ratio, expressed as a float between 0.0 and 1.0, that controls how many files remain cached after an eviction is triggered because the number of cached files exceeds hbase.mob.file.cache.size. The default value is 0.5f. |
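To make the interaction between the cache size and the remain ratio concrete, here is a minimal sketch of an LRU cache that trims itself in the way the table describes. It is not the HBase MOB file cache implementation: the class MobFileCacheSketch, its methods, and the string "handles" are hypothetical stand-ins, and eviction is triggered by an explicit call rather than by a timer such as hbase.mob.cache.evict.period.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the HBase MOB file cache implementation.
final class MobFileCacheSketch {
    private final int maxSize;        // plays the role of hbase.mob.file.cache.size
    private final float remainRatio;  // plays the role of hbase.mob.cache.evict.remain.ratio
    // accessOrder=true makes iteration run from least to most recently used.
    private final LinkedHashMap<String, String> openFiles =
            new LinkedHashMap<>(16, 0.75f, true);

    MobFileCacheSketch(int maxSize, float remainRatio) {
        this.maxSize = maxSize;
        this.remainRatio = remainRatio;
    }

    // Returns the cached handle, "opening" the file on a cache miss.
    String open(String fileName) {
        String handle = openFiles.get(fileName); // get() refreshes recency
        if (handle == null) {
            handle = "handle:" + fileName;       // stand-in for a real file reader
            openFiles.put(fileName, handle);
        }
        return handle;
    }

    // If the cache holds more than maxSize entries, drop least recently
    // used entries until only maxSize * remainRatio remain.
    int evict() {
        if (openFiles.size() <= maxSize) {
            return 0;
        }
        int target = (int) (maxSize * remainRatio);
        int evicted = 0;
        Iterator<Map.Entry<String, String>> it = openFiles.entrySet().iterator();
        while (openFiles.size() > target && it.hasNext()) {
            it.next();
            it.remove();
            evicted++;
        }
        return evicted;
    }

    int size() {
        return openFiles.size();
    }

    boolean isOpen(String fileName) {
        return openFiles.containsKey(fileName); // does not affect recency
    }

    public static void main(String[] args) {
        MobFileCacheSketch cache = new MobFileCacheSketch(4, 0.5f);
        for (int i = 1; i <= 5; i++) {
            cache.open("f" + i);
        }
        cache.open("f1"); // f1 becomes the most recently used file
        System.out.println("evicted: " + cache.evict());
        System.out.println("remaining: " + cache.size());
    }
}
```

In this sketch, a cache size of 4 and a ratio of 0.5 mean that once a fifth file is open, an eviction pass keeps only the two most recently used files. With the documented defaults (1000 and 0.5f), the same arithmetic would leave 500 files cached.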
Configuring the MOB Cache Using Cloudera Manager
- Go to the HBase service.
- Click the Configuration tab.
- Search for the property HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml.
- Paste your configuration into the Value field and save your changes. The following example sets the hbase.mob.cache.evict.period property to 5000 seconds. See HBase MOB Cache Properties for a full list of configurable properties for HBase MOB.
<property>
  <name>hbase.mob.cache.evict.period</name>
  <value>5000</value>
</property>
- Restart your cluster for the changes to take effect.
Configuring the MOB Cache Using the Command Line
<property>
  <name>hbase.mob.cache.evict.period</name>
  <value>5000</value>
</property>
Testing MOB Storage and Retrieval Performance
$ sudo -u hbase hbase org.apache.hadoop.hbase.IntegrationTestIngestMOB \
    -threshold 102400 \
    -minMobDataSize 512 \
    -maxMobDataSize 5120
- threshold is the threshold at which cells are considered to be MOBs. The default is 1 KB, expressed in bytes.
- minMobDataSize is the minimum value for the size of MOB data. The default is 512 B, expressed in bytes.
- maxMobDataSize is the maximum value for the size of MOB data. The default is 5 KB, expressed in bytes.
Compacting MOB Files Manually
hbase> compact_mob 't1'
hbase> compact_mob 't1', 'f1'
hbase> major_compact_mob 't1'
hbase> major_compact_mob 't1', 'f1'
This functionality is also available through the API, using the Admin.compact and Admin.majorCompact methods.