Wednesday 5 April 2017


HBase Archive Folder (/hbase/archive)


AIM: To understand the concept of HBase Archive folder.

Essentially, whenever an hfile would be deleted, it is moved to the archive directory instead. HFiles are removed during compaction, region merges, catalog janitor runs and table deletes. The sections below show how this happens during compaction and table deletion.


Case 1 – Compaction:

Screenshot showing the store files inside HBase region ‘fba74ad028a1532723da6e4cd1d9f4a6’ of table ‘cfs’ with column family ‘c’. At this point, the data is in the ‘/hbase/data’ folder and the ‘/hbase/archive’ folder is empty.




After minor compaction of table:
The original store files are moved to the ‘/hbase/archive’ folder, and the ‘/hbase/data’ folder contains the new compacted hfile.
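The move-instead-of-delete behaviour can be sketched on a local filesystem. This is only an illustration and not HBase's own code: the function name archive_instead_of_delete, the file names hfile_a, hfile_b and hfile_compacted, and the simplified directory layout are all assumptions made for the example, and a temporary local directory stands in for HDFS.

```python
import shutil
import tempfile
from pathlib import Path


def archive_instead_of_delete(store_file: Path, data_root: Path, archive_root: Path) -> Path:
    """Move a store file to the same relative path under the archive root."""
    relative = store_file.relative_to(data_root)
    target = archive_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(store_file), str(target))
    return target


# Toy layout (simplified, hypothetical): <root>/data/<table>/<region>/<family>/<hfile>
root = Path(tempfile.mkdtemp())
data_root = root / "data"
archive_root = root / "archive"
store_dir = data_root / "cfs" / "fba74ad028a1532723da6e4cd1d9f4a6" / "c"
store_dir.mkdir(parents=True)

old_files = [store_dir / "hfile_a", store_dir / "hfile_b"]
for f in old_files:
    f.write_text("cells")

# "Compaction": write the merged file first, then archive the input files.
(store_dir / "hfile_compacted").write_text("cells" * 2)
archived = [archive_instead_of_delete(f, data_root, archive_root) for f in old_files]

remaining = sorted(p.name for p in store_dir.iterdir())
archived_names = sorted(p.name for p in archived)
print(remaining)       # ['hfile_compacted']
print(archived_names)  # ['hfile_a', 'hfile_b']
```

The archived copies keep the same table/region/family path under the archive root, which is what lets a later reader (for example a snapshot reference) still locate them.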



The files inside ‘/hbase/archive’ are cleaned up at specific intervals.


Case 2 - Deletion of a table

When a table is deleted, its files are not removed immediately. They are moved to the ‘/hbase/archive’ folder. One design consideration behind this approach is that there may be snapshots associated with the deleted table. These snapshots may in turn be used to create a new table. The new table uses references to the old hfiles until compaction completes for the table and recreates the necessary files in the ‘/hbase/data’ folder.




HFileCleaner HBase chore

The archive folder is cleaned by the HFileCleaner HBase chore (org.apache.hadoop.hbase.master.cleaner.HFileCleaner). Every time it runs, this chore clears the HFiles in the archive folder that can be deleted. The HBase master will clean up the FS for that table backup: it first writes an 'invalid' marker for that table directory, then recursively deletes the regions, and then deletes the table directory in the archive.

(HBase chores: a chore is a task performed periodically in HBase. Each chore runs in its own thread. The base abstract class provides the while loop and the sleeping facility.)

It is a single-threaded activity: it collects the list of files in the archive directory, checks whether each one can be deleted, and then deletes it. Once a cycle completes, the chore waits for the specified period and then runs again. One argument to this chore is ‘period’, the time to sleep between runs.

If HBase snapshots have been taken, the snapshots hold references to files in the archive folder. In that case, the HFileCleaner will not remove the referenced files from the ‘archive’ folder. (Manually removing them in this case means losing snapshot data.)
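A minimal sketch of one cleaner cycle, assuming the only reason to retain a file is a snapshot reference. It is written in Python purely for illustration and is not the HFileCleaner implementation: run_cleaner_cycle, run_chore and the file names are hypothetical, and plain strings stand in for archived hfiles.

```python
import time
from typing import Iterable, List, Set, Tuple


def run_cleaner_cycle(archived_files: Iterable[str],
                      snapshot_refs: Set[str]) -> Tuple[List[str], List[str]]:
    """One pass over the archive: delete unreferenced files, retain the rest."""
    deleted, retained = [], []
    for name in archived_files:
        if name in snapshot_refs:
            retained.append(name)   # still referenced by a snapshot
        else:
            deleted.append(name)    # safe to remove
    return deleted, retained


def run_chore(archived_files: Iterable[str], snapshot_refs: Set[str],
              period_seconds: float, cycles: int) -> List[str]:
    """Single-threaded loop: run a cycle, then sleep for 'period' before the next."""
    remaining = list(archived_files)
    for _ in range(cycles):
        _, remaining = run_cleaner_cycle(remaining, snapshot_refs)
        time.sleep(period_seconds)
    return remaining


archive = ["hfile_a", "hfile_b", "hfile_c"]
referenced_by_snapshot = {"hfile_b"}

deleted, retained = run_cleaner_cycle(archive, referenced_by_snapshot)
print(deleted)   # ['hfile_a', 'hfile_c']
print(retained)  # ['hfile_b']

left_over = run_chore(archive, referenced_by_snapshot, period_seconds=0.01, cycles=2)
print(left_over)  # ['hfile_b']
```

However many cycles run, the referenced file stays in the archive until the snapshot that points to it is gone.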

Let’s assume the ‘period’ given for the activity is 10 minutes. Once the chore finishes one cycle, it waits another 10 minutes before starting the next. So there is a chance that the HBase HFileCleaner cannot keep up with the required deletion rate, and the ‘/hbase/archive’ folder keeps growing.

Related open bug for the same:
https://issues.apache.org/jira/browse/HBASE-5547 (the issue in which hbase/archive was introduced)


A simple example illustrating how the archive folder can grow:

Let’s assume the following:
Deletion rate of files per minute in archive folder: 50
Incoming rate of files per minute into archive folder: 100


(Assumption: the cleaner starts its next cycle as soon as it completes the previous one, and none of the files in the archive folder are retained.)


Cleaning cycle   Minutes   Files to be cleaned at start of cycle   Time taken
1                1-2       100 (when the cleaner first kicks in)   2 minutes
2                3-6       200 (at the 3rd minute)                 4 minutes
3                7-14      400 (at the 7th minute)                 8 minutes
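The arithmetic behind these figures can be reproduced with a short simulation. It is a sketch under the same assumptions as the example (100 files arriving per minute, 50 deleted per minute, 100 files present when the cleaner first runs, no pause between cycles, nothing retained). The function name simulate_cleaner is hypothetical. Each cycle only clears the files that were present when it started, so everything that arrives while it runs becomes the next cycle's workload.

```python
import math
from typing import List, Tuple


def simulate_cleaner(initial_files: int, incoming_per_min: int,
                     deleted_per_min: int, cycles: int) -> List[Tuple[int, int, int, int]]:
    """Return (cycle, start_minute, files_at_start, duration_minutes) per cycle."""
    rows = []
    start_minute = 1
    backlog = initial_files
    for cycle in range(1, cycles + 1):
        # Minutes needed to delete everything present at the start of the cycle.
        duration = math.ceil(backlog / deleted_per_min)
        rows.append((cycle, start_minute, backlog, duration))
        # Files that arrive during this cycle form the next cycle's backlog.
        backlog = incoming_per_min * duration
        start_minute += duration
    return rows


rows = simulate_cleaner(initial_files=100, incoming_per_min=100,
                        deleted_per_min=50, cycles=3)
for cycle, start, files, duration in rows:
    print(f"cycle {cycle}: starts at minute {start}, {files} files, takes {duration} min")
# cycle 1: starts at minute 1, 100 files, takes 2 min
# cycle 2: starts at minute 3, 200 files, takes 4 min
# cycle 3: starts at minute 7, 400 files, takes 8 min
```

Because the incoming rate is twice the deletion rate, the workload doubles with every cycle, so the cleaner never catches up.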






