Configuring the Blocksize for HBase
- The blocksize for a given column family determines the smallest unit of data HBase can read from the column family's HFiles.
- It is also the basic unit of data that a RegionServer caches in the BlockCache.
The default blocksize is 64 KB. The appropriate blocksize depends on your data and usage patterns. Use the following guidelines to tune the blocksize, in combination with testing and benchmarking as appropriate.
- Consider the average key/value size for the column family when tuning the blocksize. You can find the average key/value size using the HFile utility:
$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /path/to/HFILE -m -v
...
Block index size as per heapsize: 296
reader=hdfs://srv1.example.com:9000/path/to/HFILE, \
compression=none, inMemory=false, \
firstKey=US6683275_20040127/mimetype:/1251853756871/Put, \
lastKey=US6684814_20040203/mimetype:/1251864683374/Put, \
avgKeyLen=37, avgValueLen=8, \
entries=1554, length=84447
...
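As a rough illustration of how the reported averages relate to the blocksize, the sketch below estimates how many key/value pairs fit in one block. The helper names `cells_per_block` and `parse_avg_lengths` are hypothetical, and the 8 bytes of per-cell overhead (two 4-byte length fields) is an assumption about the uncompressed key/value layout. Tags, data block encoding, and compression are ignored, so treat the result as a starting point for testing, not a prediction.

```shell
#!/bin/sh
# Estimate how many key/value pairs fit in one block, given the averages
# reported by the HFile utility. The 8-byte per-cell overhead is an
# assumption (two 4-byte length fields); encoding and compression are ignored.
cells_per_block() {
    avg_key_len=$1
    avg_value_len=$2
    blocksize=${3:-65536}   # defaults to the 64 KB default blocksize
    echo $(( blocksize / (avg_key_len + avg_value_len + 8) ))
}

# Read saved HFile utility output on stdin and print "<avgKeyLen> <avgValueLen>".
parse_avg_lengths() {
    sed -n 's/.*avgKeyLen=\([0-9][0-9]*\), *avgValueLen=\([0-9][0-9]*\).*/\1 \2/p'
}

# Values taken from the sample output above (avgKeyLen=37, avgValueLen=8).
cells_per_block 37 8 65536
```

With the sample values, each cell is estimated at 53 bytes, so a 64 KB block would hold on the order of 1,200 cells under these assumptions.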
- Consider the pattern of reads to the table or column family. For instance, if it is common to scan 500 rows at a time in various parts of the table, performance might improve if the blocksize is large enough to hold 500-1000 rows, so that a single read operation on the HFile is often sufficient. If your typical scan covers only 3 rows, reading a block that holds 500-1000 rows would be overkill.
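To compare a typical scan against a candidate blocksize, a back-of-the-envelope estimate of the uncompressed scan size can help. The sketch below is one way to do that arithmetic. The function name `scan_bytes` and the cells-per-row input are hypothetical, and the 8 bytes of per-cell overhead is an assumption, not a value reported by HBase.

```shell
#!/bin/sh
# Rough uncompressed size, in bytes, of the data covered by a typical scan.
# All inputs are illustrative; the 8-byte per-cell overhead is an assumption.
scan_bytes() {
    rows=$1            # rows returned by a typical scan
    cells_per_row=$2   # key/value pairs per row in this column family
    avg_key_len=$3     # avgKeyLen from the HFile utility
    avg_value_len=$4   # avgValueLen from the HFile utility
    echo $(( rows * cells_per_row * (avg_key_len + avg_value_len + 8) ))
}

scan_bytes 500 1 37 8     # 500-row scan, one cell per row
scan_bytes 1000 1 37 8    # 1000-row scan, one cell per row
```

If the estimate is well below the current blocksize, a larger block is unlikely to reduce the number of reads for that scan pattern. If it is well above, a larger blocksize may be worth benchmarking.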
It is difficult to predict the size of a row before it is written, because the data will be compressed when it is written to the HFile. Perform testing to determine the correct blocksize for your data.
Configuring the Blocksize for a Column Family
You can configure the blocksize of a column family at table creation or by disabling and altering an existing table. These instructions are valid whether or not you use Cloudera Manager to manage your cluster.
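hbase> create 'test_table', {NAME => 'test_cf', BLOCKSIZE => '262144'}
hbase> disable 'test_table'
hbase> alter 'test_table', {NAME => 'test_cf', BLOCKSIZE => '524288'}
hbase> enable 'test_table'
After you change the blocksize, existing HFiles keep their old blocksize until they are rewritten. To rewrite them, trigger a major compaction:
hbase> major_compact 'test_table'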
Depending on the size of the table, the major compaction can take some time and have a performance impact while it is running.
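To confirm that the column family carries the new setting, you can inspect the table descriptor with the HBase Shell `describe` command. The sketch below pulls the value out of that output. The helper name `extract_blocksize` is hypothetical, and it assumes the attribute is printed as `BLOCKSIZE => '<bytes>'`, which may vary between HBase versions.

```shell
#!/bin/sh
# Print the BLOCKSIZE value found on each line of `describe` output (stdin).
# Assumes the attribute appears as: BLOCKSIZE => '<bytes>'
extract_blocksize() {
    sed -n "s/.*BLOCKSIZE => '\([0-9][0-9]*\)'.*/\1/p"
}

# Against a live cluster (not run here), assuming your HBase Shell supports
# the non-interactive -n option:
#   echo "describe 'test_table'" | hbase shell -n | extract_blocksize
```

The descriptor shows only the configured value. HFiles written before the change still use the old blocksize until a compaction rewrites them.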
Monitoring Blocksize Metrics
You can monitor the effect of the blocksize indirectly through the metrics exposed for the BlockCache. See the block_cache* entries in RegionServer Metrics.