Friday, 18 August 2017

How to enable DEBUG logging for hbase shell?


The test is performed on a MapR cluster.

To enable DEBUG in 'hbase shell', make the following change:

In the '/opt/mapr/hbase/hbase-<version>/bin/hbase' file, inside the 'if [ "$COMMAND" = "shell" ] ; then' section, change:
export HBASE_ROOT_LOGGER="INFO,DRFA"
to:
export HBASE_ROOT_LOGGER="DEBUG,console"

Sample is given below:


# figure out which class to run
if [ "$COMMAND" = "shell" ] ; then
  # send hbase shell log messages to the log file
  export HBASE_LOGFILE="hbase-${HBASE_IDENT_STRING}-shell-${HOSTNAME}.log"
  export HBASE_ROOT_LOGGER="DEBUG,console"
  # eg export JRUBY_HOME=/usr/local/share/jruby
  if [ "$JRUBY_HOME" != "" ] ; then
    CLASSPATH="$JRUBY_HOME/lib/jruby.jar:$CLASSPATH"
    HBASE_OPTS="$HBASE_OPTS -Djruby.home=$JRUBY_HOME -Djruby.lib=$JRUBY_HOME/lib"
  fi
  # find the hbase ruby sources
  if [ -d "$HBASE_HOME/lib/ruby" ]; then
    HBASE_OPTS="$HBASE_OPTS -Dhbase.ruby.sources=$HBASE_HOME/lib/ruby"
  else
    HBASE_OPTS="$HBASE_OPTS -Dhbase.ruby.sources=$HBASE_HOME/hbase-shell/src/main/ruby"
  fi
  HBASE_OPTS="$HBASE_OPTS $HBASE_SHELL_OPTS"
  CLASS="log4j.logger.org.jruby.Main -X+O ${JRUBY_OPTS} ${HBASE_HOME}/bin/hirb.rb"

Thursday, 17 August 2017

Can Classic Drill and Drill-On-Yarn co-exist? 


This combination is not fully tested, but it can be made to work.

The following notes will help to configure Drill-on-YARN and classic Drill to co-exist on the same cluster.
The activity is performed on a MapR cluster.

[root@vm78-218 ~]# maprcli node list -columns svc 
service hostname ip 
historyserver,hbmaster,webserver,nodemanager,spark-historyserver,drill-bits,cldb,fileserver,hoststats vm78-217 IP1 
hbmaster,hbregionserver,webserver,cldb,fileserver,hoststats,hue vm78-210 IP2 
hivemeta,httpfs,webserver,drill-bits,fileserver,resourcemanager,hue vm78-218 IP3 

We have two classic drill-bits running on vm78-217 and vm78-218.
We will install Drill-on-YARN (1.10) on vm78-210. (Note: this machine does not have any Drill installed.)

yum install mapr-drill-yarn -y 

Now, configure Drill-YARN to co-exist with Classic Drill. 

[1] Change the Drill log location in 'drill-env.sh'. Add the following:
export DRILL_LOG_DIR="/opt/mapr/drill/drill-1.1.0-YARN/logs"
Make sure that the directory is present, with the necessary permissions, on all the nodes running a NodeManager; a sketch is given below.
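
For example, the directory can be created on every node with clush (a sketch; clush availability and mapr:mapr ownership are assumptions, adjust to your environment):

# create the log directory on all nodes and give it to the cluster user
clush -a mkdir -p /opt/mapr/drill/drill-1.1.0-YARN/logs
clush -a chown mapr:mapr /opt/mapr/drill/drill-1.1.0-YARN/logs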

[2] Make necessary changes in 'drill-on-yarn.conf' based on http://maprdocs.mapr.com/home/Drill/configure_drill_to_run_under_yarn.html 

[3] We need to change the following properties in 'drill-override.conf' 
• drill.exec.cluster-id 
• drill.exec.zk.root 
• drill.exec.rpc.user.server.port 
• drill.exec.rpc.bit.server.port 
• drill.exec.http.port 
(The idea is to give the above properties values different from the defaults, on the assumption that classic Drill is using the default values.) 

Sample is given below: 
drill.exec: {
  cluster-id: "ajames-drillbitsonyarn",
  zk.root: "drillonyarn",
  rpc.user.server.port: 21010,
  rpc.bit.server.port: 21011,
  http.port: 7047,
  zk.connect: "<zk-quorum>"
}
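
Once the YARN drillbits are up, they can be reached through their own zk.root and cluster-id from the sample above; a sketch (the sqlline path, relative to the Drill install, and the ZooKeeper quorum are illustrative):

# connect to the YARN drillbits only, not the classic ones
bin/sqlline -u "jdbc:drill:zk=<zk-quorum>/drillonyarn/ajames-drillbitsonyarn"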

[4] Follow the steps in http://maprdocs.mapr.com/home/Drill/step_4_launch_drill_under_yarn.html to launch Drill-on-YARN. 

You can also refer to https://community.mapr.com/docs/DOC-1188 

Issue : Hive on MR and Hive on TEZ query output difference

Issue:



If we see different outputs for the same Hive query when executed in MR mode and TEZ mode, there could be multiple reasons. In this particular case we discuss a scenario where Hive on MR gives correct result while Hive on TEZ returns wrong result.

Analysis:


It was found that the issue was with the location of the actual data files inside the Hive table's HDFS location. (For this particular query, the issue was with one table.)
Previously (before the transactional feature was introduced in Hive), all files for a partition (or for a table, if the table is not partitioned) lived in a single directory. With these changes, any partition (or table) written with an ACID-aware writer will have a directory for the base files and a directory for each set of delta files. (https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-BaseandDeltaDirectories). From the Hive debug logs, it was seen that TEZ was reading the data multiple times, which resulted in wrong results.
 

The following is part of the file and directory listing inside the table directory.

-rwxr-x--- 1 prodxx prodxx 813 Jun  12 00:04 000000_0_copy_1
-rwxr-x--- 1 prodxx prodxx 832 Jun  12 00:04 000000_0_copy_2
-rwxr-x--- 1 prodxx prodxx 822 Jun  12 00:04 000000_0_copy_3
-rwxr-x--- 1 prodxx prodxx 826 Jun  12 00:05 000000_0_copy_4
-rwxr-x--- 1 prodxx prodxx 836 Jun  12 00:05 000000_0_copy_5
drwxr-x--- 2 prodxx prodxx   5 Jul 16 05:50 delta_0000116_0000116
drwxr-x--- 2 prodxx prodxx   1 Jul 16 05:50 delta_0000117_0000117
drwxr-x--- 2 prodxx prodxx   1 Jul 16 05:50 delta_0000129_0000129
drwxr-x--- 2 prodxx prodxx   5 Jul 16 05:50 delta_0000118_0000118
drwxr-x--- 2 prodxx prodxx   5 Jul 16 05:50 delta_0000130_0000130



Here there are data files (for example 000000_0_copy_4, 000000_0_copy_5, etc.) outside the ‘delta_*’ folders. Also, we do not see any ‘base_*’ folder.
Since the files were lying outside of the ‘delta_*’ or ‘base_*’ folders, the TEZ reader was giving wrong results.
One possible reason why the data files were residing outside the designated directories could be the use of the LOAD command to load data.
However, LOAD is not supported for transactional tables. (https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Limitations)
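
A quick way to spot such stray files is to list the immediate children of the table directory and filter out the base/delta subdirectories; a minimal sketch (the warehouse path and table name are hypothetical):

# list entries directly under the table directory, hiding base_*/delta_* subdirectories
hadoop fs -ls /user/hive/warehouse/mydb.db/mytable | grep -v -e 'base_' -e 'delta_'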


Another finding in this particular case was that there was no ‘base_*’ folder inside any of the Hive transactional tables.
The compactor is the process responsible for combining multiple delta files into a single delta per bucket during minor compaction, and for combining delta and base files into a single base per bucket during major compaction.
The Hive metastore is responsible for running the compactor; it is a background process inside the Hive metastore. So, we need the hivemetastore package installed.

In this specific case, there was no hivemetastore package installed.
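
Once the hivemetastore package (and with it the compactor) is in place, a compaction can be requested and monitored from the Hive CLI; a minimal sketch (the table name is hypothetical):

# request a major compaction for the affected table, then watch its progress
hive -e "ALTER TABLE mydb.mytable COMPACT 'major';"
hive -e "SHOW COMPACTIONS;"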

Tuesday, 8 August 2017


Issue with Hue Hive editor not showing verbose logs


Issue:

Hue Hive editor does not show logs for the jobs executed. The issue is seen from Hive 2.1 onwards.

Reason:

From Hive 2.1.0 onwards (with HIVE-13027), Hive uses Log4j2's asynchronous logger by default. Setting hive.async.log.enabled to false will disable asynchronous logging and fallback to synchronous logging. Asynchronous logging can give significant performance improvement as logging will be handled in a separate thread that uses the LMAX disruptor queue for buffering log messages. Refer to https://logging.apache.org/log4j/2.x/manual/async.html for benefits and drawbacks. (https://cwiki.apache.org/confluence/display/Hive/GettingStarted)

There is a bug where beeline does not show the verbose log when the asynchronous logger is enabled. (https://issues.apache.org/jira/browse/HIVE-14183)

Workaround:

Set the following property (in hive-site.xml) on all HiveServer2 nodes and restart the service.

<property>
  <name>hive.async.log.enabled</name>
  <value>false</value>
</property>
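
On a MapR cluster the restart can be scripted with maprcli; a sketch, assuming HiveServer2 is registered under the service name 'hs2':

# restart HiveServer2 on every node where it is configured
maprcli node services -action restart -name hs2 -filter '["csvc==hs2"]'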

Monday, 7 August 2017

Connect Apache Phoenix to MapR HBase

Aim:

Integrate Apache Phoenix and MapR HBase.

Environment details:

MapR Core Version : 5.2.1
Phoenix Version         : 4.8.2
MapR HBase Version : 1.1.8 

Pre-requisites:

MapR cluster (unsecured) up and running with HBase Master and RegionServer. The following example is performed on a single-node cluster. The services running on the cluster are shown below:

historyserver,hbmaster,hbregionserver,webserver,nodemanager,cldb,fileserver,resourcemanager,hoststats

Steps:

[1] Download the required Phoenix from http://phoenix.apache.org/download.html
[2] Extract the tar file
[3] Copy the phoenix-4.8.2-HBase-1.1-server.jar to the ‘/opt/mapr/hbase/hbase-1.1.8/lib/’ folder on all nodes that have the HBase Master or RegionServer installed; see the sketch below.
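
One way to push the jar to every HBase node is with clush (assuming clush is configured for the cluster; adjust paths as needed):

# copy the Phoenix server jar into the HBase lib directory on all nodes
clush -a --copy phoenix-4.8.2-HBase-1.1-server.jar --dest /opt/mapr/hbase/hbase-1.1.8/lib/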

[4] Restart the HBase services on all nodes.
maprcli node services -action restart -name hbmaster -filter '["csvc==hbmaster"]'
maprcli node services -action restart -name hbregionserver -filter '["csvc==hbregionserver"]'

[5] Go to the Phoenix bin directory and execute the following command:
./sqlline.py <zookeeper_node>:5181
[root@ip-10-X-Y-Z bin]# ./sqlline.py `hostname`:5181
Setting property: [incremental, false]
Setting property: [isolation, TRANSACTION_READ_COMMITTED]
issuing: !connect jdbc:phoenix:ip-10-X-Y-Z:5181 none none org.apache.phoenix.jdbc.PhoenixDriver
Connecting to jdbc:phoenix:ip-10-X-Y-Z:5181
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/mapr/phoenix/apache-phoenix-4.8.2-HBase-1.1-bin/phoenix-4.8.2-HBase-1.1-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/mapr/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
17/08/07 18:02:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connected to: Phoenix (version 4.8)
Driver: PhoenixEmbeddedDriver (version 4.8)
Autocommit status: true
Transaction isolation: TRANSACTION_READ_COMMITTED
Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
88/88 (100%) Done
Done
sqlline version 1.1.9
0: jdbc:phoenix:ip-10-X-Y-Z:5181>

[6] Listing tables in Phoenix:
0: jdbc:phoenix:ip-10-9-0-9:5181> !tables
+------------+--------------+-------------+---------------+----------+------------+----------------------------+-----------------+--------------+-----------------+---+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  |  TABLE_TYPE   | REMARKS  | TYPE_NAME  | SELF_REFERENCING_COL_NAME  | REF_GENERATION  | INDEX_STATE  | IMMUTABLE_ROWS  | S |
+------------+--------------+-------------+---------------+----------+------------+----------------------------+-----------------+--------------+-----------------+---+
|            | SYSTEM       | CATALOG     | SYSTEM TABLE  |          |            |                            |                 |              | false           | n |
|            | SYSTEM       | FUNCTION    | SYSTEM TABLE  |          |            |                            |                 |              | false           | n |
|            | SYSTEM       | SEQUENCE    | SYSTEM TABLE  |          |            |                            |                 |              | false           | n |
|            | SYSTEM       | STATS       | SYSTEM TABLE  |          |            |                            |                 |              | false           | n |
+------------+--------------+-------------+---------------+----------+------------+----------------------------+-----------------+--------------+-----------------+---+
0: jdbc:phoenix:ip-10-9-0-9:5181>


[7] At this point you will see the following tables in HBase:
hbase(main):006:0> list
Listing HBase tables. Specify a path or configure namespace mappings to list M7 tables.
TABLE
SYSTEM.CATALOG
SYSTEM.FUNCTION
SYSTEM.SEQUENCE
SYSTEM.STATS
4 row(s) in 0.0090 seconds

=> ["SYSTEM.CATALOG", "SYSTEM.FUNCTION", "SYSTEM.SEQUENCE", "SYSTEM.STATS"]