Thursday 13 April 2017


Connect to DRILL from JBOSS


AIM:  

Integrate DRILL and JBOSS.

STEPS:

1. Download JBOSS from http://jbossas.jboss.org/downloads (I used the following) 
http://download.jboss.org/jbossas/7.0/jboss-as-7.0.1.Final/jboss-as-web-7.0.1.Final.zip


2. Extract the zip file (I installed it on my local Windows machine)
Go to <JBOSS_HOME>\bin and run standalone.bat

3. Open the JBOSS administrator console in a web browser:
http://localhost:9990/console/App.html#server/datasources





4. Our DRILL instance is up and running on the following node:
http://<drillbit-node>:8047/
DRILL Server Version - 1.9r1 (MapR)
Make sure that we are using a DRILL Server version equal to or newer than the above. (Earlier packages have a bug which throws a 'NullPointerException' on DrillConnectionImpl.isReadOnly().)



5. Add the Drill JDBC driver to this folder: $JBOSS_HOME/modules/org/apache/drill/main/
(Create the folder 'org/apache/drill/main/' if it does not exist)
Add the following jar to the specified location: drill-jdbc-all-1.9.0.jar
(For a MapR installation of DRILL Server 1.9, the jar is available with the DRILL installation under '/opt/mapr/drill/drill-1.9.0/jars/jdbc-driver')
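
Before wiring the driver into JBOSS, it can help to verify the jar on its own. Below is a minimal standalone sketch, assuming drill-jdbc-all-1.9.0.jar is on the classpath; the host name and credentials are placeholders for your environment. It runs the same 'select * from sys.version' query used for validation in step 7 and calls isReadOnly() to check for the bug mentioned in step 4.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials - replace with your own values.
        String url = "jdbc:drill:drillbit=drillbit-node:31010";
        try (Connection conn = DriverManager.getConnection(url, "username", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from sys.version")) {
            // Throws NullPointerException on the buggy driver versions noted above.
            System.out.println("read-only: " + conn.isReadOnly());
            while (rs.next()) {
                System.out.println("version: " + rs.getString("version"));
            }
        }
    }
}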

6. Add module.xml file to '$JBOSS_HOME/modules/org/apache/drill/main/' folder with the following content:

<module xmlns="urn:jboss:module:1.0" name="org.apache.drill">
    <resources>
         <resource-root path="drill-jdbc-all-1.9.0.jar"/> 
    </resources>
    <dependencies>
        <module name="javax.api"/>
        <module name="javax.transaction.api"/>
        <module name="sun.jdk"/>
    </dependencies>
</module>

7. Change the datasources section in the $JBOSS_HOME/standalone/configuration/standalone.xml file:

<datasource jta="true" jndi-name="java:jboss/datasources/DRILL-DS1" pool-name="DRILL-DS1" enabled="true" use-java-context="true" use-ccm="true">
    <connection-url>jdbc:drill:drillbit=<drillbit-node>:31010</connection-url>
    <driver>drill</driver>
    <security>
        <user-name><username></user-name>
        <password><Password></password>
    </security>
    <validation>
        <check-valid-connection-sql>select * from sys.version</check-valid-connection-sql>
        <validate-on-match>false</validate-on-match>
        <background-validation>false</background-validation>
        <use-fast-fail>false</use-fast-fail>
    </validation>
</datasource>
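
Once the datasource is registered, application code deployed on JBOSS can obtain connections through the JNDI name defined above. A minimal sketch; the class is illustrative and must run inside the container:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class DrillDataSourceClient {
    public void printDrillVersion() throws Exception {
        // Look up the pooled datasource by the jndi-name from standalone.xml.
        DataSource ds = (DataSource) new InitialContext()
                .lookup("java:jboss/datasources/DRILL-DS1");
        try (Connection conn = ds.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from sys.version")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}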

8. Add the following to the 'drivers' section inside standalone.xml (the module attribute must match the module name defined in module.xml):
        <drivers>
          <driver name="drill" module="org.apache.drill" />
        </drivers>


9. Restart the JBOSS server. Once it restarts, you will see DRILL-DS1 under 'Registered Datasources'.




10. If you are using Red Hat JBoss Enterprise Application Platform, then you should see something like below:


11. You can test the connection by clicking on the particular datasource (DRILL-DS1) and then on 'Test Connection'. 




12. If the connection is successful, then we will see the following screen:






Wednesday 5 April 2017


HBase Archive Folder (/hbase/archive)


AIM: To understand the concept of HBase Archive folder.

Essentially, whenever an hfile would be deleted, it is moved to the archive directory instead. Removal of hfiles occurs during compaction, merges, catalog janitor runs and table deletes. In the sections below, we see how this happens when compaction and table deletion take place.


Case 1 – Compaction:

Screenshot showing the store files inside HBase region ‘fba74ad028a1532723da6e4cd1d9f4a6’ for table ‘cfs’ with column family ‘c’. At this point, we have data inside the ‘/hbase/data’ folder and the ‘/hbase/archive’ folder is empty.




After a minor compaction of the table:
The original store files are moved to the ‘/hbase/archive’ folder, and the ‘/hbase/data’ folder contains the new compacted hfile.



The files inside ‘/hbase/archive’ are cleaned up at specific intervals.
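
To inspect what is sitting in the archive at any point, the folder can be listed with the Hadoop FileSystem API. A minimal sketch, assuming the client configuration can reach the cluster file system; the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveLister {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The archive layout mirrors the table layout under /hbase/data.
        printTree(fs, new Path("/hbase/archive"));
    }

    static void printTree(FileSystem fs, Path dir) throws Exception {
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath());
            if (status.isDirectory()) {
                printTree(fs, status.getPath());
            }
        }
    }
}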


Case 2 – Deletion of a table

When a table is deleted, the files are not instantly removed; they are moved to the ‘/hbase/archive’ folder. One of the design considerations for this approach is that there may be snapshots associated with the deleted table. These snapshots may in turn be used to create a new table. The new table uses references to the old hfiles until compaction is completed for the table and recreates the necessary files in the ‘/hbase/data’ folder.
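
To make this concrete, the sketch below uses the HBase 1.x client API to snapshot, delete and clone back the ‘cfs’ table from the earlier example; the snapshot and clone names are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotCloneExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName cfs = TableName.valueOf("cfs");
            admin.snapshot("cfs_snap", cfs);   // snapshot references the current hfiles
            admin.disableTable(cfs);
            admin.deleteTable(cfs);            // hfiles move to /hbase/archive
            // The clone initially reads the archived hfiles; compaction later
            // recreates the necessary files under /hbase/data for the new table.
            admin.cloneSnapshot("cfs_snap", TableName.valueOf("cfs_restored"));
        }
    }
}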




HFileCleaner HBase chore

The cleaning of the archive folder is taken care of by the HFileCleaner HBase chore (org.apache.hadoop.hbase.master.cleaner.HFileCleaner). Every time it runs, this chore clears the HFiles in the hfile archive folder that can be deleted. The HBase master will then clean up the FS for that table backup; it first writes an 'invalid' marker for that table directory and then recursively deletes the regions and then the table directory in the archive.

(HBase chores: a chore is a task performed on a period in HBase. The chore is run in its own thread. The base abstract class provides the while loop and sleeping facility.)

It is a single-threaded activity: it collects the list of files in the archive directory, validates whether each can be deleted and then deletes it. Once one cycle is completed, it waits for the specified time period and then kicks in again. One argument to this chore is ‘period’ - the period of time to sleep between each run.
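
The behaviour described above boils down to a simple run-then-sleep loop. A minimal sketch of the chore pattern, not the actual HBase implementation:

public class ChoreSketch implements Runnable {
    private final long periodMillis; // the 'period' argument described above
    private volatile boolean stopped = false;

    ChoreSketch(long periodMillis) {
        this.periodMillis = periodMillis;
    }

    @Override
    public void run() {
        while (!stopped) {
            chore(); // one cleaning cycle: collect, validate, delete
            try {
                Thread.sleep(periodMillis); // wait before the next cycle
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void chore() {
        // HFileCleaner's cycle would scan the archive directory here.
    }
}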

If HBase snapshots are taken, then the snapshots hold references to files in the archive folder. In that case, the HFileCleaner will not remove that data from the ‘archive’ folder. (Manually removing such data in this case means losing snapshot data.)

Let’s assume the ‘period’ given for the activity is 10 minutes: once it finishes one cycle, it will wait another 10 minutes before kicking off the next cycle. So there is a chance that the HBase HFileCleaner cannot cope with the required deletion rate, which keeps increasing the size of the ‘/hbase/archive’ folder.

Related open bug for the same:
https://issues.apache.org/jira/browse/HBASE-5547 → hbase/archive was introduced


A simple example explaining the issue caused by the above bug:

Let’s assume the following:

Deletion rate of files per minute in the archive folder   - 50
Incoming rate of files per minute into the archive folder - 100


(Assumption: the cleaner kicks in for the next cycle as soon as it completes one cycle, and none of the files in the archive folder is retained)


Minutes    Total number of files to be cleaned       Cleaning cycle
1 - 2      100 (when the cleaner first kicks in)     Cycle 1 (time taken - 2 minutes)
3 - 6      200 (at the 3rd minute)                   Cycle 2 (time taken - 4 minutes)
7 - 14     400 (at the 7th minute)                   Cycle 3 (time taken - 8 minutes)






Tuesday 4 April 2017


Enable DRILL ODBC debug in Windows client


AIM: 

Steps to set the log to DEBUG level for the DRILL ODBC driver on a Windows machine


STEPS:

If the client machine is Windows, then we can enable debug tracing for the Drill ODBC driver with the following steps:
- Go to the ODBC Data Source Administrator
- Navigate to System DSN
- Select the specific driver; this pops up the Driver DSN Setup window
- Click on 'Logging Options'
- Change the 'Log Level' to 'LOG_DEBUG'
- Specify the path where the log files should be written