Wednesday, 29 March 2017


Monitoring HBase using MapR SpyGlass

MapR 5.2 comes with SpyGlass, which can be used to gather a lot of useful monitoring information. The following explains how to set up a sample dashboard for HBase monitoring in Grafana.

Steps to set up an HBase monitoring dashboard in Grafana

Step 1: Log in to Grafana




Step 2: Go to the node dashboard (I have a single node)



Step 3: Add Panel -> Add Graph



Step 4: Click on ‘General’ and add a name for the graph. Here I used ‘HBaseRegionServerTracker’.



Step 5: Click on ‘Metrics’, search for mapr.hbase* metrics in the ‘Metric’ column, and add the metrics you need.



Step 6: I selected ‘mapr.hbase_master.region_servers’ and then saved the dashboard.
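
SpyGlass stores the collected metrics in OpenTSDB, so the series plotted in the panel can also be fetched over HTTP. The sketch below only builds a query URL for OpenTSDB's /api/query endpoint. The host name, the port 4242 and the 'sum' aggregator are assumptions for illustration, so replace them with the values used in your cluster.

```python
from urllib.parse import urlencode

def opentsdb_query_url(host, metric, start="1h-ago", aggregator="sum", port=4242):
    """Build an OpenTSDB /api/query URL for a single metric."""
    params = {"start": start, "m": f"{aggregator}:{metric}"}
    return f"http://{host}:{port}/api/query?{urlencode(params)}"

# 'opentsdb-host.example.com' is a placeholder, not a real node name.
url = opentsdb_query_url("opentsdb-host.example.com",
                         "mapr.hbase_master.region_servers")
print(url)
```

Opening the printed URL in a browser or with curl should return the same data points Grafana draws, which helps to tell a dashboard problem from a collection problem.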








Monday, 13 March 2017


Accessing data from a pre-existing HBase table through Hive in MapR cluster


AIM

Map a pre-existing HBase table to a Hive-HBase table and query the data using a 'group by' operation.


Make the following changes to hive-site.xml on all nodes:


<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/mapr/hive/hive-<version>/lib/hive-hbase-handler-<version>-mapr.jar,file:///opt/mapr/hbase/hbase-<version>/lib/hbase-client-<version>-mapr.jar,file:///opt/mapr/hbase/hbase-<version>/lib/hbase-server-<version>-mapr.jar,file:///opt/mapr/hbase/hbase-<version>/lib/hbase-protocol-<version>-mapr.jar,file:///opt/mapr/zookeeper/zookeeper-<version>/zookeeper-<version>.jar</value>
  <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
</property>

 <property>
  <name>hbase.zookeeper.quorum</name>
  <value>xx.xx.x.xxx,xx.xx.x.xxx,xx.xx.x.xxx</value>
  <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
</property>

 <property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
  <description>The Zookeeper client port. The MapR default clientPort is 5181.</description>
</property>
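
Because the hive.aux.jars.path value must be one comma-separated string with no spaces, it is easy to get wrong when typed by hand. The sketch below assembles the value from the same path pattern used above. The version strings passed in are placeholders, and the actual jar file names on your nodes may carry a different suffix, so check them with ls before using the result.

```python
def aux_jars_path(hive_ver, hbase_ver, zk_ver):
    """Return the comma-separated jar list for hive.aux.jars.path (no spaces)."""
    jars = [
        f"/opt/mapr/hive/hive-{hive_ver}/lib/hive-hbase-handler-{hive_ver}-mapr.jar",
        f"/opt/mapr/hbase/hbase-{hbase_ver}/lib/hbase-client-{hbase_ver}-mapr.jar",
        f"/opt/mapr/hbase/hbase-{hbase_ver}/lib/hbase-server-{hbase_ver}-mapr.jar",
        f"/opt/mapr/hbase/hbase-{hbase_ver}/lib/hbase-protocol-{hbase_ver}-mapr.jar",
        f"/opt/mapr/zookeeper/zookeeper-{zk_ver}/zookeeper-{zk_ver}.jar",
    ]
    return ",".join("file://" + jar for jar in jars)

# Placeholder versions for illustration only.
value = aux_jars_path("1.2", "1.1.1", "3.4.5")
print(value)
```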


Once the above changes are made, let's concentrate on the Hive-HBase integration.

HBase Side:

[1] Create HBase table:
hbase(main):006:0> create 'myhbase','cfg1'
0 row(s) in 1.2330 seconds

=> Hbase::Table - myhbase

[2] Insert data into the HBase table:
hbase(main):007:0> put 'myhbase','200','cfg1:val','right'
0 row(s) in 0.1260 seconds
hbase(main):008:0> put 'myhbase','230','cfg1:val','left'
0 row(s) in 0.0100 seconds

hbase(main):009:0> scan 'myhbase'
ROW                                      COLUMN+CELL
 200                                     column=cfg1:val, timestamp=1489427368644, value=right
 230                                     column=cfg1:val, timestamp=1489427381590, value=left
2 row(s) in 0.0480 seconds

hbase(main):010:0> put 'myhbase','2300','cfg1:val','left'
0 row(s) in 0.0240 seconds

hbase(main):011:0> scan 'myhbase'
ROW                                      COLUMN+CELL
 200                                     column=cfg1:val, timestamp=1489427368644, value=right
 230                                     column=cfg1:val, timestamp=1489427381590, value=left
 2300                                    column=cfg1:val, timestamp=1489427494185, value=left
3 row(s) in 0.0170 seconds



Hive Side:

[1] Create Hive-HBase table:
hive> CREATE EXTERNAL TABLE hbase_table_4(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cfg1:val")
    > TBLPROPERTIES("hbase.table.name" = "myhbase");
OK

[2] Check for data in Hive table
hive> select * from hbase_table_4;
OK
200     right
230     left
2300    left
Time taken: 0.172 seconds, Fetched: 3 row(s)

[3] Example with 'group by':
hive> select value,count(value) from hbase_table_4 group by value;
Query ID = mapr_20170313135543_574ef721-612f-4705-ab6c-1ea56f808f47
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489190722556_0018, Tracking URL = http://vm51-154:8088/proxy/application_1489190722556_0018/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job  -kill job_1489190722556_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-03-13 13:55:55,743 Stage-1 map = 0%,  reduce = 0%
2017-03-13 13:56:09,321 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.84 sec
2017-03-13 13:56:17,702 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.4 sec
MapReduce Total cumulative CPU time: 5 seconds 400 msec
Ended Job = job_1489190722556_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.4 sec   MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 400 msec
OK
left    2
right   1
Time taken: 35.143 seconds, Fetched: 2 row(s)
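
As a sanity check on the 'group by' result, the same aggregation can be reproduced outside Hive. The sketch below hard-codes the three rows returned by the scan of 'myhbase' and counts them per value. It only illustrates what the query computes and does not talk to HBase or Hive.

```python
from collections import Counter

# (row key, cfg1:val) pairs as returned by the scan of 'myhbase' above
rows = [(200, "right"), (230, "left"), (2300, "left")]

# Equivalent of: select value, count(value) from hbase_table_4 group by value
counts = Counter(value for _key, value in rows)
for value, n in sorted(counts.items()):
    print(value, n)
# prints:
# left 2
# right 1
```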

Thursday, 9 March 2017


Hue unable to retrieve logs when executing a Hive query from the Hue editor


Issue

Hue is unable to retrieve logs when a Hive query is executed from the Hue editor. The Hue version is 3.9 and the Hive version is 1.2. The Hue UI shows the error ‘Invalid method name: 'GetLog'’.

Error in runcpserver.log


TApplicationException: Invalid method name: 'GetLog'
[09/Mar/2017 14:39:04 -0800] thrift_util  INFO     Thrift saw an application exception: Invalid method name: 'GetLog'
[09/Mar/2017 14:39:04 -0800] hive_server2_lib ERROR    server does not support GetLog
Traceback (most recent call last):
  File "/opt/mapr/hue/hue-3.9.0/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 750, in get_log
    res = self.call(self._client.GetLog, req)
  File "/opt/mapr/hue/hue-3.9.0/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 555, in call
    res = fn(req)
  File "/opt/mapr/hue/hue-3.9.0/desktop/core/src/desktop/lib/thrift_util.py", line 377, in wrapper
    raise StructuredException('THRIFTAPPLICATION', str(e), data=None, error_code=502

Solution


Check whether ‘use_get_log_api’ is set to ‘true’ in the hue.ini file. If it is, comment it out or set it to ‘false’, then restart Hue.
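
The check can be scripted. The sketch below scans the text of a hue.ini file for an uncommented use_get_log_api line and returns its value. It uses a plain line scan rather than an INI parser because hue.ini has nested and indented sections. The sample content is made up for illustration.

```python
def get_log_api_setting(ini_text):
    """Return the last uncommented use_get_log_api value, or None if it is not set."""
    value = None
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#", ";")) or "=" not in stripped:
            continue  # skip comments, section headers and blank lines
        key, _, val = stripped.partition("=")
        if key.strip() == "use_get_log_api":
            value = val.strip().lower()
    return value

# Hypothetical excerpt of a hue.ini file.
sample = """
[beeswax]
  hive_server_host=localhost
  use_get_log_api=true
"""
print(get_log_api_setting(sample))  # prints: true
```

A return value of "true" means the setting described above is active and should be commented out or changed to false.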

Wednesday, 8 March 2017


Configuring HBase Thrift HA in MapR Clusters


AIM

Make HBase Thrift highly available (HA), with multiple Thrift servers in the ACTIVE state.

Default behavior

If HBase Thrift is installed on two nodes, one is shown as active and the other as standby.

Steps


Make the changes in the following file on all the HBase Thrift server nodes:

/opt/mapr/conf/conf.d/warden.hbasethrift.conf
#
# sed -i s/\1.1.1/`cat /opt/mapr/hbase/hbaseversion`/g /opt/mapr/conf/conf.d/warden.hbasethrift.conf
#
services=hbasethrift:all
service.displayname=HBaseThriftServer
service.command.start=/opt/mapr/hbase/hbase-1.1.1/bin/hbase-daemon.sh start thrift
service.command.stop=/opt/mapr/hbase/hbase-1.1.1/bin/hbase-daemon.sh stop thrift
service.command.type=BACKGROUND
service.command.monitorcommand=/opt/mapr/hbase/hbase-1.1.1/bin/hbase-daemon.sh status thrift
service.port=9090
service.ui.port=9095
service.logs.location=/opt/mapr/hbase/hbase-1.1.1/logs
service.process.type=JAVA
service.alarm.tersename=hbasethrift
service.alarm.label=HbaseThriftServiceDown
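
Since the same file has to be edited on every Thrift node, a small parser helps to confirm that each copy carries the expected values. The sketch below reads key=value lines and splits the services entry. It assumes that the services=hbasethrift:all line in the listing is the setting that matters for running every instance as active, which the listing suggests but does not state. The sample text is a shortened copy of the file above.

```python
def parse_warden_conf(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            conf[key.strip()] = value.strip()
    return conf

# Shortened copy of warden.hbasethrift.conf for illustration.
sample = """\
# comment line
services=hbasethrift:all
service.displayname=HBaseThriftServer
service.port=9090
"""
conf = parse_warden_conf(sample)
name, _, instances = conf["services"].partition(":")
print(name, instances)  # prints: hbasethrift all
```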


Once the changes are made, restart warden on all nodes:

service mapr-warden restart

Once warden comes up, you will see all the Thrift servers in the active state.

I have installed HBase Thrift on 2 nodes. The MCS screenshot below shows that HBase Thrift is up and running on both nodes.





Wednesday, 1 March 2017


Oozie workflow shows as running even after the job has completed successfully


Issue:


An Oozie workflow is shown as running even after the job has completed successfully. oozie job -info <workflow_ID> shows that the job has completed successfully, but in the web UI it still appears as running.
If we try to kill the workflow using oozie job -kill <workflow_ID>, it throws the following error:

Error: E0607 : E0607: Other error in operation [kill], java.io.EOFException

In oozie.log you can find the following exception (captured while trying to suspend the job):

2017-03-01 15:10:59,753  WARN V2JobServlet:523 - SERVER[phpvcoredev03.chicago.local] USER[vcoredevuser] GROUP[-] TOKEN[] APP[Lab-EdwardLabResults] JOB[0000237-160624170341227-oozie-mapr-W] ACTION[] URL[PUT http://phpvcoredev03:11000/oozie/v2/job/0000237-160624170341227-oozie-mapr-W?action=suspend] error[E0607], E0607: Other error in operation [suspend], java.io.EOFException
org.apache.oozie.servlet.XServletException: E0607: Other error in operation [suspend], java.io.EOFException
        at org.apache.oozie.servlet.V1JobServlet.suspendWorkflowJob(V1JobServlet.java:430)
        at org.apache.oozie.servlet.V1JobServlet.suspendJob(V1JobServlet.java:127)
        at org.apache.oozie.servlet.BaseJobServlet.doPut(BaseJobServlet.java:92)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
        at org.apache.oozie.servlet.JsonRestServlet.service(JsonRestServlet.java:304)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.oozie.servlet.AuthFilter$2.doFilter(AuthFilter.java:171)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:604)
        at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:567)
        at org.apache.oozie.servlet.AuthFilter.doFilter(AuthFilter.java:176)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.oozie.servlet.HostnameFilter.doFilter(HostnameFilter.java:86)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)

Resolution:


Delete the job from the Oozie database.
The entry must be deleted from both the WF_JOBS and WF_ACTIONS tables.
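
The deletion can be sketched as two DELETE statements run in one transaction. The example below uses an in-memory SQLite database as a stand-in with a heavily simplified schema. The real Oozie database is usually MySQL, Oracle or Derby, and the column names used here (id in WF_JOBS, wf_id in WF_ACTIONS) are assumptions that should be verified against your schema. Backing up the database before deleting anything is prudent.

```python
import sqlite3

def delete_workflow(conn, wf_id):
    """Delete one workflow's rows from WF_ACTIONS and WF_JOBS in a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM WF_ACTIONS WHERE wf_id = ?", (wf_id,))
        conn.execute("DELETE FROM WF_JOBS WHERE id = ?", (wf_id,))

# Stand-in database with a simplified schema (assumption, not Oozie's real DDL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WF_JOBS (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE WF_ACTIONS (id TEXT PRIMARY KEY, wf_id TEXT)")
wf = "0000237-160624170341227-oozie-mapr-W"
conn.execute("INSERT INTO WF_JOBS VALUES (?, 'RUNNING')", (wf,))
conn.execute("INSERT INTO WF_ACTIONS VALUES (?, ?)", (wf + "@start", wf))
conn.commit()

delete_workflow(conn, wf)
remaining_jobs = conn.execute("SELECT COUNT(*) FROM WF_JOBS").fetchone()[0]
remaining_actions = conn.execute("SELECT COUNT(*) FROM WF_ACTIONS").fetchone()[0]
print(remaining_jobs, remaining_actions)  # prints: 0 0
```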

HBase replication stops abruptly


Issue: 

HBase replication between the source and DR clusters stops abruptly. Replication does not happen for existing tables, but works as expected for newly created tables. All the necessary replication configuration is specified correctly, and the HBase services are up and running.


Root cause:

When a regionserver crashes, a different regionserver tries to take over the hlogs queue of the crashed regionserver to finish the replication activity. To do so it creates a persistent ZooKeeper node named "lock" under the crashed server's replication znode, which keeps other regionservers from taking over the same queue at the same time.

public boolean lockOtherRS(String znode) {
    try {
      String parent = ZKUtil.joinZNode(this.rsZNode, znode);
      if (parent.equals(rsServerNameZnode)) {
        LOG.warn("Won't lock because this is us, we're dead!");
        return false;
      }
      String p = ZKUtil.joinZNode(parent, RS_LOCK_ZNODE);
      ZKUtil.createAndWatch(this.zookeeper, p, Bytes.toBytes(rsServerNameZnode));
    } catch (KeeperException e) {
      ...
      return false;
    }
    return true;
  }


If 'hbase.zookeeper.useMulti' in hbase-site.xml is set to 'false' and the regionserver crashes after creating the lock but before copying the replication queue of the previously crashed server into its own queue, the "lock" is never deleted and no other regionserver can take over the replication queue.
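
The failure mode can be illustrated with a toy model in which a dictionary stands in for ZooKeeper. This is a simplified sketch, not HBase code: the function names and znode paths are invented, and the atomic variant only assumes that a multi operation moves the queue in one all-or-nothing step without needing a lock.

```python
def takeover_with_lock(zk, dead_rs, me, crash_after_lock=False):
    """Non-atomic takeover (useMulti=false): create the lock, then copy the queue."""
    lock = f"{dead_rs}/lock"
    if lock in zk:
        return False          # another regionserver already created the lock
    zk[lock] = me             # persistent znode: it survives a crash of `me`
    if crash_after_lock:
        return False          # simulated crash: lock stays, queue is never copied
    zk[f"{me}/queue"] = zk.pop(f"{dead_rs}/queue")
    del zk[lock]
    return True

def takeover_with_multi(zk, dead_rs, me):
    """Atomic takeover (useMulti=true), modeled as a single all-or-nothing move."""
    if f"{dead_rs}/queue" not in zk:
        return False
    zk[f"{me}/queue"] = zk.pop(f"{dead_rs}/queue")
    return True

zk = {"rs1/queue": ["hlog-1", "hlog-2"]}
takeover_with_lock(zk, "rs1", "rs2", crash_after_lock=True)  # rs2 dies holding the lock
stuck = not takeover_with_lock(zk, "rs1", "rs3")             # rs3 is now blocked
print(stuck, "rs1/lock" in zk)  # prints: True True

zk_multi = {"rs1/queue": ["hlog-1", "hlog-2"]}
moved = takeover_with_multi(zk_multi, "rs1", "rs3")
print(moved)  # prints: True
```

In the first run the orphaned lock blocks every later attempt, which matches the NodeExists symptom seen in the regionserver log.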


Symptoms in HBase regionserver logs: (DR cluster)

2017-02-14 14:37:36,109 INFO  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Won't transfer the queue, another RS took care of it because of: KeeperErrorCode = NodeExists for /hbase/replication/rs/xxxx.com,60020,1469048347755/lock
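
To find out which replication queues are affected, the regionserver logs can be searched for this message. The sketch below extracts the lock znode path from matching lines. The regular expression is written against the single log line shown above, so treat it as an assumption that other HBase versions format the message the same way.

```python
import re

LOCK_RE = re.compile(r"KeeperErrorCode = NodeExists for (\S+/lock)\b")

def reported_lock_znodes(log_lines):
    """Return the distinct lock znode paths named in NodeExists messages."""
    found = []
    for line in log_lines:
        match = LOCK_RE.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found

line = ("2017-02-14 14:37:36,109 INFO  [ReplicationExecutor-0] "
        "replication.ReplicationQueuesZKImpl: Won't transfer the queue, "
        "another RS took care of it because of: KeeperErrorCode = NodeExists for "
        "/hbase/replication/rs/xxxx.com,60020,1469048347755/lock")
print(reported_lock_znodes([line, line]))
# prints: ['/hbase/replication/rs/xxxx.com,60020,1469048347755/lock']
```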


Resolution:

Set hbase.zookeeper.useMulti=true in hbase-site.xml.
Remove the /hbase/replication/rs znode from the DR cluster.


Installing Ganglia on MapR to monitor HBase


Node details:

Node1 – 10.10.YY.X1
Node2 - 10.10.YY.X2
Node3 - 10.10.YY.X3

The basic packages required for Ganglia to run are:

[1] ganglia-gmond – required on all nodes from which metrics need to be collected
[2] ganglia-gmetad – required on the node that performs aggregation (typically one node)
[3] ganglia-web – required on the node running the Ganglia web UI

My setup is as follows (concentrating on Ganglia and HBase):

Node1 – 10.10.YY.X1 – HBase regionserver, gmond
Node2 - 10.10.YY.X2 – HBase regionserver, gmond
Node3 - 10.10.YY.X3 – HBase regionserver, HBase master, gmond, gmetad, ganglia-web

To install the packages on CentOS, use the following commands:

yum install ganglia-gmond -y
yum install ganglia-gmetad -y
yum install ganglia-web -y

Edit ‘/etc/ganglia/gmond.conf’ on all nodes running the gmond service (sample is shown below)

Part 1:
cluster {
  name = "ThreeNodeClusterAJames"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

Part 2:
udp_send_channel {
  bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  #mcast_join = 239.2.11.71
  host = 10.10.YY.XX
  port = 8649
  ttl = 1
}

NOTE:
· We are using unicast instead of multicast. Comment out ‘mcast_join = 239.2.11.71’ and add “host = <IP of the local host running the gmond service>”.
· If a different port is used, change the ‘port’ property accordingly. Here I am using the default port, 8649.

Part 3:
udp_recv_channel {
  #mcast_join = 239.2.11.71
  port = 8649
  #bind = 239.2.11.71
  #bind = 10.10.72.154
  #retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

Edit ‘/etc/ganglia/gmetad.conf’ on the gmetad server and add the following:

data_source "ThreeNodeClusterAJames" 10.10.YY.X1:8649 10.10.YY.X2:8649 10.10.YY.X3:8649

The cluster name ‘ThreeNodeClusterAJames’ must be the same on all the gmond servers and on the gmetad server.
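
For larger clusters the data_source entry is easier to generate than to type. The sketch below builds the line from a cluster name and a list of gmond hosts, using the masked addresses from this post as input. The function name is invented for illustration.

```python
def data_source_line(cluster_name, hosts, port=8649):
    """Build a gmetad data_source entry for the given gmond hosts."""
    targets = " ".join(f"{host}:{port}" for host in hosts)
    return f'data_source "{cluster_name}" {targets}'

nodes = ["10.10.YY.X1", "10.10.YY.X2", "10.10.YY.X3"]
entry = data_source_line("ThreeNodeClusterAJames", nodes)
print(entry)
# prints: data_source "ThreeNodeClusterAJames" 10.10.YY.X1:8649 10.10.YY.X2:8649 10.10.YY.X3:8649
```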

Edit the ‘/opt/mapr/conf/hadoop-metrics.properties’ file and add the following:

Part 1:
# Configuration of the "cldb" context for ganglia
cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=10.10.YY.X1:8649,10.10.YY.X2:8649,10.10.YY.X3:8649
cldb.spoof=1

Part 2:
# Configuration of the "fileserver" context for ganglia
fileserver.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
fileserver.period=37
fileserver.servers=10.10.YY.X1:8649,10.10.YY.X2:8649,10.10.YY.X3:8649
fileserver.spoof=1

Execute the following on any one node, then restart the CLDB service on all CLDB nodes:

maprcli config save -values {"cldb.ganglia.cldb.metrics":"1"}
maprcli config save -values {"cldb.ganglia.fileserver.metrics":"1"}

Execute the following on all nodes:

sudo setenforce 0

Execute the following on the gmetad server:

chown -R apache:apache /usr/share/ganglia
chown -R ganglia:ganglia /var/lib/ganglia/rrd*
chcon -R -t httpd_sys_content_t *

Edit ‘/etc/httpd/conf.d/ganglia.conf’ file:

<Location /ganglia>
  Order deny,allow
#  Deny from all
  Allow from 127.0.0.1
  Allow from ::1
  # Allow from .example.com
</Location>

Restart httpd, gmond and gmetad:

/etc/init.d/httpd restart
/etc/init.d/gmond restart
/etc/init.d/gmetad restart

Edit ‘/opt/mapr/hbase/hbase-<version>/conf/hadoop-metrics2-hbase.properties’ file

hbase.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
hbase.sink.ganglia.servers=<ganglia-server>:8649
hbase.sink.ganglia.period=10

hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=<server-running-hbase-regionserver-1>:8649, <server-running-hbase-regionserver-2>:8649

# Configuration of the "jvm" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=<server-running-hbase-regionserver-1>:8649, <server-running-hbase-regionserver-2>:8649
...
# Configuration of the "rpc" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=<server-running-hbase-regionserver-1>:8649, <server-running-hbase-regionserver-2>:8649
...
# Configuration of the "rest" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rest.period=10
rest.servers=<server-running-hbase-regionserver-1>:8649, <server-running-hbase-regionserver-2>:8649

Restart the HBase master and regionservers on all nodes.
You should then be able to see the HBase metrics in the Ganglia web UI.
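
If the metrics do not show up in the web UI, it helps to check whether gmond has received them at all. gmond serves an XML dump of its current state to any client that connects to its TCP accept channel. The sketch below reads that dump and lists metric names per host. It assumes the default tcp_accept_channel on port 8649 is still enabled in gmond.conf, and the sample XML (host name, IP and metric names) is made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond serves on its TCP accept channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def metric_names(xml_data, prefix=""):
    """Return sorted metric names per host, optionally filtered by a name prefix."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = sorted(
            m.get("NAME") for m in host.iter("METRIC")
            if m.get("NAME", "").startswith(prefix))
    return result

# Made-up sample in the shape of a gmond dump.
sample = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="ThreeNodeClusterAJames">
<HOST NAME="node1" IP="10.0.0.1">
<METRIC NAME="cpu_user" VAL="1.0" TYPE="float"/>
<METRIC NAME="regionserver.Server.regionCount" VAL="3" TYPE="int32"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""
print(metric_names(sample, prefix="regionserver"))
# prints: {'node1': ['regionserver.Server.regionCount']}
```

Against a live node, metric_names(fetch_gmond_xml("<gmond-host>")) lists everything gmond currently knows about that host.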


Ganglia UI listing the metrics collected:



Ganglia UI showing HBase metrics: