HBase Archive Folder (/hbase/archive)
AIM: To understand the concept of the HBase archive folder.
Essentially, whenever an HFile would be deleted, it is instead moved
to the archive directory. HFiles are removed during compaction, region merges,
catalog janitor runs and table deletes. In the sections below, we see how this
happens when compaction and table deletion take place.
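The "delete means archive" behaviour can be sketched in a few lines of Python. This is an illustrative toy, not HBase code: a temp directory stands in for HDFS, and all file and directory names below are made up for the example.

```python
import pathlib
import shutil
import tempfile

# Illustrative sketch: instead of removing an HFile, it is moved under
# the archive tree, preserving its relative path. A temp directory
# stands in for HDFS here; all paths are illustrative.
root = pathlib.Path(tempfile.mkdtemp())
data = root / "hbase" / "data" / "default" / "cfs" / "region-1" / "c"
archive = root / "hbase" / "archive" / "data" / "default" / "cfs" / "region-1" / "c"
data.mkdir(parents=True)
archive.mkdir(parents=True)

hfile = data / "hfile-0001"
hfile.write_text("key-value data")

# "Deleting" the HFile (e.g. after compaction): move it to the archive tree,
# so the data directory no longer contains it but nothing is lost yet.
shutil.move(str(hfile), str(archive / hfile.name))
```

After the move, the file is gone from the data directory but still present, byte for byte, under the archive directory until a cleaner removes it.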
Case 1 – Compaction:
[Screenshot: store files inside HBase region 'fba74ad028a1532723da6e4cd1d9f4a6'
for table 'cfs' with column family 'c'.] At this point, we have data inside the
'/hbase/data' folder and the '/hbase/archive' folder is empty.
After a minor compaction of the table, the original store files are moved to the
'/hbase/archive' folder and the '/hbase/data' folder contains the new compacted
HFile. The files inside '/hbase/archive' are cleaned up at specific intervals.
Case 2 - Deletion of a table
When a table is deleted, its files are not removed immediately; they are moved
to the '/hbase/archive' folder. One design consideration behind this approach
is that there may be snapshots associated with the deleted table. These
snapshots may in turn be used to create a new table. The new table uses
references to the old HFiles until compaction completes for the table and
recreates the necessary files in the '/hbase/data' folder.
HFileCleaner HBase chore
Cleaning of the archive folder is handled by the HFileCleaner chore
(org.apache.hadoop.hbase.master.cleaner.HFileCleaner). Every time it runs,
this chore clears the HFiles in the archive folder that can be deleted. The
HBase master cleans up the filesystem for a table backup by first writing an
'invalid' marker for that table directory and then recursively deleting the
regions and then the table directory in the archive.
(HBase chores: a Chore is a task performed on a period in HBase. Each chore
runs in its own thread; the base abstract class provides the while loop and
sleeping facility.)
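The chore pattern can be sketched in Python. HBase implements this in Java; the class below and its parameter names are illustrative stand-ins, not the real API.

```python
import time

class Chore:
    """Minimal sketch of HBase's chore pattern (illustrative, not the
    real Java API): a task run repeatedly on a fixed period, with the
    loop sleeping between runs."""

    def __init__(self, period_seconds, task):
        self.period = period_seconds  # time to sleep between runs
        self.task = task              # callable invoked once per cycle

    def run(self, max_cycles):
        for cycle in range(max_cycles):
            self.task()               # one full cleaning cycle
            if cycle < max_cycles - 1:
                time.sleep(self.period)  # wait 'period' before the next run

# Toy "cleaner" task: each cycle deletes whatever is currently queued.
archive = ["hfile-1", "hfile-2"]
deleted = []

def clean_cycle():
    while archive:
        deleted.append(archive.pop(0))  # validate + delete each file

Chore(period_seconds=0, task=clean_cycle).run(max_cycles=1)
```

The key design point mirrored here is that the sleep happens between cycles: a slow cycle delays every subsequent one.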
It is a single-threaded activity: it collects the list of files in the archive
directory, validates whether each can be deleted, and then deletes it. Once one
cycle completes, it waits for the specified time period and then kicks in
again. One argument to this chore is 'period', the length of time to sleep
between runs.
If HBase snapshots have been taken, the snapshots hold references to files in
the archive folder. In that case, the HFileCleaner will not remove that data
from the 'archive' folder. (Manually removing such data would mean losing the
snapshot data.)
Let's assume the 'period' given for the activity is 10 minutes: once the
cleaner finishes one cycle, it waits another 10 minutes before kicking off the
next. There is therefore a chance that the HFileCleaner cannot keep up with
the required deletion rate, causing the '/hbase/archive' folder to grow.
Related open bug for the same:
A simple example explaining the issue caused by the above bug:
Let's assume the following:

Deletion rate of files per minute in the archive folder: 50
Incoming rate of files per minute into the archive folder: 100

(Assumption: the cleaner kicks in for the next cycle as soon as it completes
one cycle, and none of the files in the archive folder is retained.)
Minutes | Files to be cleaned                           | Cleaning cycle
1-2     | 100 (pending when the cleaner first kicks in) | Cycle 1 (time taken: 2 minutes)
3-6     | 200 (pending at the 3rd minute)               | Cycle 2 (time taken: 4 minutes)
7-14    | 400 (pending at the 7th minute)               | Cycle 3 (time taken: 8 minutes)
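The backlog arithmetic illustrated above can be sketched in Python. The rates and the helper function come from this example's assumptions, not from HBase itself.

```python
DELETE_RATE = 50     # files the cleaner removes per minute (assumed)
INCOMING_RATE = 100  # files arriving into /hbase/archive per minute (assumed)

def simulate(initial_backlog, cycles):
    """Return (start_minute, backlog_at_start, cycle_duration) per cycle,
    assuming the next cycle starts as soon as the previous one finishes."""
    backlog, minute, history = initial_backlog, 0, []
    for _ in range(cycles):
        duration = backlog / DELETE_RATE    # minutes to clear the backlog
        history.append((minute, backlog, duration))
        minute += duration
        backlog = INCOMING_RATE * duration  # files that arrived meanwhile
    return history

history = simulate(initial_backlog=100, cycles=3)
```

Because files arrive twice as fast as they are deleted, each cycle starts with double the previous backlog and takes twice as long, so the cleaner falls further and further behind.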