Saturday, 20 June 2015

Apache HBase – Support for Medium Objects (MOBs)



Use case:

It is often useful to store binary data such as images and documents in HBase. Traditional databases, Oracle Database for example, can store such objects.
Apache HBase can technically handle binary objects of up to 10MB. However, it is optimized for low-latency reads and writes of small values, under 10K per cell. Performance can degrade with moderately sized objects (medium objects) of 100K – 10MB, because compactions repeatedly rewrite these large values and so increase the I/O pressure.


After Effects:

The increased I/O pressure slows compactions down, which eventually blocks memstore flushes and therefore blocks updates. It also increases the frequency of region splits, which reduces the availability of the affected regions.


Characteristics of MOBs:

  •        Write intensive
  •        Data size is quite big
  •        Seldom deleted or updated
  •        Infrequently read (MOB data is accessed much less often than the corresponding metadata)
  •        Stored along with metadata


MOB Design:


The key idea is to keep MOBs in a separate region of their own. This isolates the MOBs from normal region splits and compactions, which decreases the I/O pressure.

The design combines HBase and HDFS to manage the data. The memstore caches the MOB data before it is flushed to disk. The MOBs are written to an HFile called a ‘MOB’ file, which may contain multiple MOB objects. The metadata is stored in HBase, together with a reference column that links to the MOB file. The metadata and the MOBs are stored in different column families.
To take advantage of HBase's consistency guarantees, MOBs go through the normal memstore flush. Saving the MOBs directly into a sequence file would make compaction difficult and add load to HBase when updating pointers. The MOB data does not take part in HBase splits and compactions.

The actions that take place while writing the MOB data

The MOB data is written into the KeyValue of the MOB column. When the memstore is full, the MOB data is flushed to MOB files, which use the HFile format, and the metadata is flushed to StoreFiles. In the StoreFile, the value of each MOB KeyValue is replaced by the path of its MOB file, and the KeyValue carries a tag that marks it as a reference to that MOB file.

The file path is /rootPath/tableName/.mob/columnFamilyName/${filename}
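
The write path described above can be sketched as a toy model. This is only an illustration in Python, not HBase code: the function names, the dictionary layout, the threshold value and the "strictly larger than the threshold" rule are all assumptions made for the example.

```python
import posixpath

# Hypothetical threshold in bytes, chosen only for this illustration.
MOB_THRESHOLD = 10500


def mob_file_path(root_path, table_name, column_family, filename):
    """Build a path following the layout /rootPath/tableName/.mob/columnFamilyName/filename."""
    return posixpath.join(root_path, table_name, ".mob", column_family, filename)


def flush(memstore, mob_path, threshold=MOB_THRESHOLD):
    """Toy flush of a memstore into a store file and a MOB file.

    memstore maps (rowkey, column, ts) -> bytes value.
    Values larger than the threshold (an assumed rule) go to the MOB file;
    the store file keeps only a tagged reference holding the MOB file path.
    """
    store_file = {}
    mob_file = {}
    for key, value in sorted(memstore.items()):
        if len(value) > threshold:
            mob_file[key] = value
            store_file[key] = {"is_mob_ref": True, "value": mob_path}
        else:
            store_file[key] = {"is_mob_ref": False, "value": value}
    return store_file, mob_file
```

For example, flushing a memstore that holds a 20000-byte image and a short file name would leave the image in the MOB file and only its path in the store file, while the file name stays inline.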


The chain of the random read against the MOB data
       1.       Find the metadata, and the path of the HFile in the metadata
       2.       Find the HFile by the path
       3.       Seek the KeyValue with the key (rowkey, columnFamily:column, ts) in this HFile, and retrieve it
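
The three steps above can be mimicked with plain dictionaries. This is a minimal sketch and not how HBase is implemented: `read_mob_cell`, the `is_mob_ref` flag and the sample path and key are hypothetical names used only for illustration.

```python
def read_mob_cell(store_file, mob_files, key):
    """Follow the read chain for one cell.

    store_file maps key -> {"is_mob_ref": bool, "value": bytes or path}.
    mob_files maps a MOB file path -> {key: bytes value}.
    """
    # Step 1: find the metadata entry; for a MOB it holds the MOB file path.
    entry = store_file.get(key)
    if entry is None:
        return None
    if not entry["is_mob_ref"]:
        return entry["value"]  # small value stored inline, no second lookup
    # Step 2: find the MOB file by its path.
    mob_file = mob_files[entry["value"]]
    # Step 3: seek the same key inside the MOB file and retrieve the value.
    return mob_file.get(key)


key = ("row1", "m:image", 1)  # (rowkey, columnFamily:column, ts)
path = "/rootPath/sample/.mob/m/mobfile-0001"
store_file = {key: {"is_mob_ref": True, "value": path}}
mob_files = {path: {key: b"image bytes"}}
print(read_mob_cell(store_file, mob_files, key))
```

The sketch makes the cost of a MOB read visible: it needs two lookups instead of one, which fits the assumption that MOB data is read far less often than its metadata.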




MOB Files Cleaner and Sweep Tool
·         The MOB file cleaner cleans the expired MOB files.
·         The sweep tool uses a MapReduce job to clean up unused MOB data. It also merges small MOB files into larger ones.


How to take advantage of MOBs?

To take advantage of the MOB feature, we need to use HFile version 3. Edit ‘hbase-site.xml’ as follows and restart the RegionServers. The change takes effect after the next major compaction.
<property>
  <name>hfile.format.version</name>
  <value>3</value>
</property>


Configuring Columns to Store MOBs
The properties to set for handling MOB data are ‘IS_MOB’ and ‘MOB_THRESHOLD’. ‘IS_MOB’ specifies whether a column family can store MOB data. ‘MOB_THRESHOLD’ specifies, in bytes, the value size above which a cell is treated as a MOB. The default is 100 KB (102400 bytes).
HBase Shell commands:
hbase> create 'sample', {NAME => 'm', IS_MOB => true, MOB_THRESHOLD => 10500}
hbase> alter 'sample', {NAME => 'm', IS_MOB => true, MOB_THRESHOLD => 10500}
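
Because ‘MOB_THRESHOLD’ is given in bytes, it is easy to get the unit wrong. The small Python helper below is a hypothetical convenience, not part of HBase: it converts a size in KB to bytes and renders the matching shell command as a string.

```python
def mob_create_command(table, family, threshold_kb):
    """Render an HBase shell 'create' command with MOB_THRESHOLD in bytes."""
    threshold_bytes = threshold_kb * 1024  # the shell expects bytes, not KB
    return (
        f"create '{table}', "
        f"{{NAME => '{family}', IS_MOB => true, MOB_THRESHOLD => {threshold_bytes}}}"
    )


print(mob_create_command("sample", "m", 100))
```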


