Thursday 13 July 2017


HBase Compaction - Part 1 - Why do we need them?


HBase has two types of compaction: minor compaction and major compaction.


Why do we need minor compaction?


When you insert data into HBase through the normal write path (the other option, bulk load, bypasses it), the data is first written to the WAL (Write Ahead Log).
The RegionServer then adds the data to its memstore. When the data in the memstore reaches a threshold, it is flushed to disk as a new storefile.
The memstore threshold is normally measured in MB (the default flush size is 128 MB), so repeated flushes generate a large number of small storefiles.
This in turn hurts read performance, because a read may have to check many files. Hence we have minor compactions, which merge several small storefiles into fewer, larger ones.
Ideally, minor compaction deals with small files. It kicks in automatically when certain conditions are met (discussed in HBase Compaction - Part 2).
Turning off minor compaction is not recommended.
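
The flush-then-compact cycle described above can be illustrated with a toy model. This is only a sketch, not HBase code: the ToyStore class, the cell-count threshold, and the "merge the oldest N files" selection rule are all assumptions made for illustration (real HBase flushes on size in bytes and picks files using the conditions covered in Part 2).

```python
from typing import Dict, List

FLUSH_THRESHOLD = 3  # hypothetical: flush after 3 cells (real thresholds are in MB)


class ToyStore:
    """Toy model of one store: a memstore plus a list of flushed storefiles."""

    def __init__(self) -> None:
        self.memstore: Dict[str, str] = {}
        self.storefiles: List[Dict[str, str]] = []  # oldest file first

    def put(self, key: str, value: str) -> None:
        self.memstore[key] = value
        if len(self.memstore) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        # Each flush writes the memstore out as one new, sorted storefile.
        if self.memstore:
            self.storefiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def minor_compact(self, count: int) -> None:
        # Merge the `count` oldest storefiles into a single file.
        selected, rest = self.storefiles[:count], self.storefiles[count:]
        if len(selected) < 2:
            return  # nothing worth merging
        merged: Dict[str, str] = {}
        for storefile in selected:  # oldest first, so newer values win
            merged.update(storefile)
        self.storefiles = [dict(sorted(merged.items()))] + rest


store = ToyStore()
for i in range(12):
    store.put(f"row{i:02d}", f"v{i}")

files_before = len(store.storefiles)  # four small files, one per flush
store.minor_compact(3)
files_after = len(store.storefiles)   # one merged file plus one untouched file
print(files_before, files_after)
```

In this toy run, twelve puts produce four small storefiles, and one minor compaction over three of them leaves two files for a read to check.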

A StoreFile carries several pieces of metadata. Some of the important ones are:
[1] MAX_SEQ_ID_KEY – maximum sequence ID in FileInfo
[2] MAJOR_COMPACTION_KEY – flag indicating whether the file was produced by a major compaction
[3] EXCLUDE_FROM_MINOR_COMPACTION_KEY – flag indicating whether the file is excluded from minor compaction
[4] HFILE_NAME_REGEX - HFile names are UUID-like strings ([0-9a-z]+). Bulk-loaded HFiles have (_SeqId_[0-9]+_) as a suffix.
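
To make the naming rule in [4] concrete, here is a small sketch that classifies file names. The pattern is assembled from the two fragments quoted above and is an assumption; it is not copied from the HBase source, and describe_hfile is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumption: built from the fragments quoted in the text, not from HBase itself.
HFILE_NAME = re.compile(r"^([0-9a-z]+)(_SeqId_([0-9]+)_)?$")


def describe_hfile(name: str) -> Optional[dict]:
    """Return the parsed parts of an HFile name, or None if it does not match."""
    match = HFILE_NAME.match(name)
    if match is None:
        return None
    return {
        "id": match.group(1),
        "bulk_loaded": match.group(2) is not None,
        "seq_id": int(match.group(3)) if match.group(3) else None,
    }


print(describe_hfile("0f3a9c"))            # plain flushed/compacted file
print(describe_hfile("0f3a9c_SeqId_42_"))  # bulk-loaded file with a sequence ID
print(describe_hfile("not a file"))        # does not match the pattern
```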


Why do we need major compaction?


Minor compaction deals with small files, while major compaction deals with larger files.
Major compaction combines all the storefiles in a store and generates one big HFile.
Also, when we delete a row in an HBase table, HBase just writes a deletion marker and masks the data from further reads; an update likewise writes a new version and leaves the old one on disk.
During major compaction, HBase reads each file and removes all the data marked for deletion.
This is important: if it is not performed, the stale data eats up disk space and reduces performance.
Hence major compactions. As of HBase 1.1.8, major compaction automatically happens once every 7 days unless turned off manually.
On busy clusters it is recommended to turn off automatic major compaction and trigger it manually during off-peak hours.
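
The effect of deletion markers can be sketched with a toy model. This is an illustration under simplifying assumptions, not HBase's implementation: each storefile is a plain dict, TOMBSTONE is a hypothetical stand-in for a delete marker, and only the latest version of each key is kept (real HBase retention also depends on version and TTL settings).

```python
from typing import Dict, List, Optional

TOMBSTONE = None  # hypothetical stand-in for a delete marker

StoreFile = Dict[str, Optional[str]]


def read(storefiles: List[StoreFile], key: str) -> Optional[str]:
    """Look up a key, newest file first; a tombstone masks older values."""
    for storefile in reversed(storefiles):
        if key in storefile:
            return storefile[key]  # None when the newest entry is a tombstone
    return None


def major_compact(storefiles: List[StoreFile]) -> List[StoreFile]:
    """Merge every storefile into one and drop deleted or overwritten data."""
    merged: StoreFile = {}
    for storefile in storefiles:  # oldest first, so newer entries win
        merged.update(storefile)
    live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
    return [live]


files: List[StoreFile] = [
    {"row1": "a", "row2": "b"},         # oldest flush
    {"row2": TOMBSTONE, "row3": "c"},   # row2 deleted: only a marker is written
    {"row1": "a2"},                     # row1 updated: old value still on disk
]

cells_before = sum(len(f) for f in files)
compacted = major_compact(files)
cells_after = sum(len(f) for f in compacted)
print(cells_before, cells_after, compacted)
```

Before compaction the deleted row2 is already invisible to reads, but its old value and the marker still occupy space; only the compacted file is free of them.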

Major compaction reads the complete table data and writes it out again as new files.
As a result, it requires at least one full read and one full write of the table data.
The two main benefits of major compaction are:
[1] Saving disk space by removing expired and deleted data
[2] Improving read performance by reducing the number of files, and hence disk seeks, per read
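
A rough back-of-the-envelope estimate of this cost and benefit can be sketched as follows. The function name, the storefile sizes, and the fraction of dead data are all hypothetical inputs chosen for illustration, not measurements.

```python
from typing import Dict, List


def major_compaction_cost(storefile_sizes_mb: List[float],
                          dead_fraction: float) -> Dict[str, float]:
    """Estimate I/O and reclaimed space for one store's major compaction.

    dead_fraction is the assumed share of data that is expired or deleted.
    """
    read_mb = float(sum(storefile_sizes_mb))    # every storefile is read once
    written_mb = read_mb * (1.0 - dead_fraction)  # only live data is rewritten
    return {
        "read_mb": read_mb,
        "written_mb": written_mb,
        "reclaimed_mb": read_mb - written_mb,
        "files_before": float(len(storefile_sizes_mb)),
        "files_after": 1.0,
    }


# Hypothetical store: four storefiles, a quarter of the data is dead.
cost = major_compaction_cost([128, 128, 512, 1024], dead_fraction=0.25)
print(cost)
```

The estimate shows why the operation is expensive (all data is read and most of it rewritten) and where the two benefits come from: reclaimed space and a single file per store.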
