Monday, 13 March 2017


Accessing data from a pre-existing HBase table through Hive in MapR cluster


AIM

Convert a pre-existing table to Hive-Hbase table and access data making use of 'group by' operation.


Make the following changes in all the nodes in hive-site.xml:


<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/mapr/hive/hive-<version>/lib/hive-hbase-handler-<version>-mapr.jar,
file:///opt/mapr/hbase/hbase-<version>/lib/hbase-client-<version>-mapr.jar, file:///opt/mapr/hbase/hbase-
<version>/lib/hbase-server-<version>-mapr.jar,file:///opt/mapr/hbase/hbase-<version>/lib/hbase-protocol-<version>-
mapr.jar,file:///opt/mapr/zookeeper/zookeeper-<version>/zookeeper-<version>.jar</value>
  <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
</property>

 <property>
  <name>hbase.zookeeper.quorum</name>
  <value>xx.xx.x.xxx,xx.xx.x.xxx,xx.xx.x.xxx</value>
  <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the 
cluster.</description>
</property>

 <property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
  <description>The Zookeeper client port. The MapR default clientPort is 5181.</description>
</property>


Once above steps are done, let's concentrate on the Hive-Hbase integration

HBase Side:

[1] Create Hbase table:
hbase(main):006:0> create 'myhbase','cfg1'
0 row(s) in 1.2330 seconds

=> Hbase::Table - myhbase
hbase(main):007:0> put 'myhbase','200','cfg1:val','right'
0 row(s) in 0.1260 seconds

[2] Insert data to HBase table
hbase(main):008:0> put 'myhbase','230','cfg1:val','left'
0 row(s) in 0.0100 seconds

hbase(main):009:0> scan 'myhbase'
ROW                                      COLUMN+CELL
 200                                     column=cfg1:val, timestamp=1489427368644, value=right
 230                                     column=cfg1:val, timestamp=1489427381590, value=left
2 row(s) in 0.0480 seconds

hbase(main):010:0> put 'myhbase','2300','cfg1:val','left'
0 row(s) in 0.0240 seconds

hbase(main):011:0> scan 'myhbase'
ROW                                      COLUMN+CELL
 200                                     column=cfg1:val, timestamp=1489427368644, value=right
 230                                     column=cfg1:val, timestamp=1489427381590, value=left
 2300                                    column=cfg1:val, timestamp=1489427494185, value=left
3 row(s) in 0.0170 seconds



Hive Side:

[1] Create Hive-Hbase table:
hive> CREATE EXTERNAL TABLE hbase_table_4(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cfg1:val")
    > TBLPROPERTIES("hbase.table.name" = "myhbase");
OK

[2] Check for data in Hive table
hive> select * from hbase_table_4;
OK
200     right
230     left
2300    left
Time taken: 0.172 seconds, Fetched: 3 row(s)

[3] Example with 'group by':
hive> select value,count(value) from hbase_table_4 group by value;
Query ID = mapr_20170313135543_574ef721-612f-4705-ab6c-1ea56f808f47
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489190722556_0018, Tracking URL = http://vm51-154:8088/proxy/application_1489190722556_0018/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job  -kill job_1489190722556_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-03-13 13:55:55,743 Stage-1 map = 0%,  reduce = 0%
2017-03-13 13:56:09,321 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.84 sec
2017-03-13 13:56:17,702 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.4 sec
MapReduce Total cumulative CPU time: 5 seconds 400 msec
Ended Job = job_1489190722556_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.4 sec   MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 400 msec
OK
left    2
right   1
Time taken: 35.143 seconds, Fetched: 2 row(s)

No comments:

Post a Comment