Friday 30 March 2018

DRILL query on Hive transactional table getting wrong stats


Aim:

This post discusses an interesting DRILL behavior when querying Hive transactional tables.

Details:

Starting with version 1.12, DRILL supports querying Hive transactional tables.
This was added as part of https://issues.apache.org/jira/browse/DRILL-5978. DRILL 1.12 ships with the Hive 2.1 client libraries, which support operations on Hive transactional tables.

Environment setup:
DRILL 1.12/ Hive 2.1 

1. Create a Hive transactional table. Let's call it txn_test2.

hive> show create table txn_test2;
OK
CREATE TABLE `txn_test2`(
  `a` int,
  `b` string)
PARTITIONED BY (
  `c` string)
CLUSTERED BY (
  a)
INTO 3 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/txn_test'
TBLPROPERTIES (
  'transactional'='true',
  'transient_lastDdlTime'='1522439951')
Time taken: 0.151 seconds, Fetched: 19 row(s)
2. Insert some data into the table and verify the contents.

hive> select * from txn_test2;
OK
3       James   1
1       Alwin   1
4       James   1
2       Alwin   1
3       James   2
1       Alwin   2
4       James   2
2       Alwin   2
Time taken: 0.247 seconds, Fetched: 8 row(s)

3. Query the Hive table using DRILL. 

0: jdbc:drill:> use hive;
+-------+-----------------------------------+
|  ok   |              summary              |
+-------+-----------------------------------+
| true  | Default schema changed to [hive]  |
+-------+-----------------------------------+
1 row selected (0.111 seconds)
0: jdbc:drill:> select count(1) from txn_test2;
+---------+
| EXPR$0  |
+---------+
| 8       |
+---------+
1 row selected (1.258 seconds)
0: jdbc:drill:>


4. Now check the physical plan for the corresponding query. 

00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {3.1 rows, 17.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 646
00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {3.0 rows, 17.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 645
00-02        StreamAgg(group=[{}], EXPR$0=[COUNT($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {2.0 rows, 16.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 644
00-03          Project($f0=[1]) : rowType = RecordType(INTEGER $f0): rowcount = 1.0, cumulative cost = {1.0 rows, 4.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 643
00-04            Scan(table=[[hive, txn_test2]], groupscan=[HiveScan [table=Table(dbName:default, tableName:txn_test2), columns=[], numPartitions=2, partitions= [Partition(values:[1]), Partition(values:[2])], inputDirectories=[maprfs:/user/hive/warehouse/txn_test/c=1, maprfs:/user/hive/warehouse/txn_test/c=2]]]) : rowType = RecordType(): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 642

Did you observe anything interesting? If not, take a closer look at the cumulative cost of the table scan operation. You can see that the cumulative cost shows the total number of rows to be 0. Now, that's weird! It should show me 8 since the Hive table has 8 rows. (Well, if not exactly 8, something nearby at least!)

If you are familiar with DRILL, you will know that the number of minor fragments (roughly, the number of threads) is driven by the slice_target property: DRILL compares the estimated cost against slice_target to decide the degree of parallelism. If the estimated cost is 0, it will spin up only one minor fragment. This becomes a problem when you are dealing with millions of records.
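To make the effect concrete, here is a back-of-the-envelope sketch (an assumption for illustration, not DRILL's exact planner logic): parallelism is roughly ceil(estimated_rows / slice_target), never fewer than one fragment, where 100000 is the default value of planner.slice_target.

```shell
# Simplified sketch (assumption): minor fragments ~ ceil(rows / slice_target),
# with a minimum of one fragment.
slice_target=100000                      # DRILL default for planner.slice_target

fragments_for() {
  local rows=$1
  local f=$(( (rows + slice_target - 1) / slice_target ))
  [ "$f" -lt 1 ] && f=1
  echo "$f"
}

fragments_for 0         # stats say 0 rows -> a single fragment
fragments_for 1000000   # 1M rows -> 10 fragments
```

With a row estimate of 0, the scan collapses to a single fragment no matter how large the table really is.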

Here, the issue is with the statistics returned from Hive.

hive> describe formatted txn_test2 partition (c=1);
OK
# col_name              data_type               comment

a                       int
b                       string

# Partition Information
# col_name              data_type               comment

c                       string

# Detailed Partition Information
Partition Value:        [1]
Database:               default
Table:                  txn_test2
CreateTime:             Fri Mar 30 13:01:30 PDT 2018
LastAccessTime:         UNKNOWN
Location:               maprfs:/user/hive/warehouse/txn_test/c=1
Partition Parameters:
        numFiles                3
        numRows                 0
        rawDataSize             0
        totalSize               1940
        transient_lastDdlTime   1522440090

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            3
Bucket Columns:         [a]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.278 seconds, Fetched: 34 row(s)

You can see that the number of rows (numRows) is 0. 


Solution:


Compute statistics in Hive (see https://cwiki.apache.org/confluence/display/Hive/StatsDev) using the ANALYZE TABLE command:

hive> analyze table txn_test2 partition(c) compute statistics;
Partition default.txn_test2{c=2} stats: [numFiles=3, numRows=4, totalSize=1940, rawDataSize=500]
Partition default.txn_test2{c=1} stats: [numFiles=3, numRows=4, totalSize=1940, rawDataSize=500]
OK
Time taken: 0.677 seconds
hive>
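The ANALYZE command above gathers only basic statistics (numFiles, numRows, totalSize, rawDataSize), which is all the row-count estimate needs. Per the StatsDev page, Hive can additionally gather column-level statistics; exact support for the dynamic partition form depends on your Hive version.

```sql
-- Optionally gather column-level statistics for all partitions as well:
analyze table txn_test2 partition(c) compute statistics for columns;
```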

Now verify the statistics again:

hive> describe formatted txn_test2 partition (c=1);
OK
# col_name              data_type               comment

a                       int
b                       string

# Partition Information
# col_name              data_type               comment

c                       string

# Detailed Partition Information
Partition Value:        [1]
Database:               default
Table:                  txn_test2
CreateTime:             Fri Mar 30 13:01:30 PDT 2018
LastAccessTime:         UNKNOWN
Location:               maprfs:/user/hive/warehouse/txn_test/c=1
Partition Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                3
        numRows                 4
        rawDataSize             500
        totalSize               1940
        transient_lastDdlTime   1522440837

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            3
Bucket Columns:         [a]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.305 seconds, Fetched: 35 row(s)

You can see that numRows has changed to 4. (We are looking at the stats for partition c=1, which holds 4 of the 8 rows.)

Fire the query from DRILL again and check the physical plan.

00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {25.1 rows, 289.1 cpu, 3880.0 io, 0.0 network, 0.0 memory}, id = 808
00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {25.0 rows, 289.0 cpu, 3880.0 io, 0.0 network, 0.0 memory}, id = 807
00-02        StreamAgg(group=[{}], EXPR$0=[COUNT($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {24.0 rows, 288.0 cpu, 3880.0 io, 0.0 network, 0.0 memory}, id = 806
00-03          Project($f0=[1]) : rowType = RecordType(INTEGER $f0): rowcount = 8.0, cumulative cost = {16.0 rows, 192.0 cpu, 3880.0 io, 0.0 network, 0.0 memory}, id = 805
00-04            Scan(table=[[hive, txn_test2]], groupscan=[HiveScan [table=Table(dbName:default, tableName:txn_test2), columns=[], numPartitions=2, partitions= [Partition(values:[1]), Partition(values:[2])], inputDirectories=[maprfs:/user/hive/warehouse/txn_test/c=1, maprfs:/user/hive/warehouse/txn_test/c=2]]]) : rowType = RecordType(): rowcount = 8.0, cumulative cost = {8.0 rows, 160.0 cpu, 3880.0 io, 0.0 network, 0.0 memory}, id = 804

Now you can see that the scan's row-count estimate is 8, which is the expected behavior.

Happy Drilling!

How to control the number of DRILL minor fragments for a scan on a MapRDB JSON table?


Aim:

This post discusses the DRILL property that controls the number of minor fragments for a scan of a MapRDB JSON table.

Details:

Starting with version 1.11, DRILL supports secondary indexes on MapRDB JSON tables. This is a very cool feature that can reduce your query turnaround time considerably.

The team also introduced new logic where the number of minor fragments spawned for a DRILL scan of a MapRDB JSON table is based on the size of the data read. This is controlled by the 'format-maprdb.json.scanSizeMB' property.

The default value of 'format-maprdb.json.scanSizeMB' is 128MB. This means that if you have a MapRDB JSON table of size 200MB and query it using DRILL, DRILL will spin up 2 minor fragments to perform the table scan operation.
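The sizing rule described above can be sketched as a simple ceiling division (an illustration of the stated behavior, not the exact internal code):

```shell
# Sketch: scan minor fragments ~ ceil(table_size_mb / scanSizeMB).
scan_size_mb=128

scan_fragments() {
  local table_mb=$1
  echo $(( (table_mb + scan_size_mb - 1) / scan_size_mb ))
}

scan_fragments 200   # 200MB table with 128MB scan size -> 2 fragments
scan_fragments 600   # 600MB table -> 5 fragments
```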

The current value of this property can be checked using the query:
select * from sys.boot where name='format-maprdb.json.scanSizeMB';

0: jdbc:drill:> select * from sys.boot where name='format-maprdb.json.scanSizeMB';
+--------------------------------+-------+-------------------+--------------+---------+----------+-------------+-----------+------------+
|              name              | kind  | accessibleScopes  | optionScope  | status  | num_val  | string_val  | bool_val  | float_val  |
+--------------------------------+-------+-------------------+--------------+---------+----------+-------------+-----------+------------+
| format-maprdb.json.scanSizeMB  | LONG  | BOOT              | BOOT         | BOOT    | 128      | null        | null      | null       |
+--------------------------------+-------+-------------------+--------------+---------+----------+-------------+-----------+------------+

As you noticed, this is a BOOT property. Hence, you will not be able to change it at the session level. To change it, edit the '/opt/mapr/drill/drill-<version>/conf/drill-override.conf' file and add the following:

  format-maprdb: { json.scanSizeMB : <size in MB> }

Restart the drillbit processes and verify that the property has taken effect.
In the following example, the value has been changed to 256MB.

1: jdbc:drill:> select * from sys.boot where name='format-maprdb.json.scanSizeMB';
+--------------------------------+-------+-------------------+--------------+---------+----------+-------------+-----------+------------+
|              name              | kind  | accessibleScopes  | optionScope  | status  | num_val  | string_val  | bool_val  | float_val  |
+--------------------------------+-------+-------------------+--------------+---------+----------+-------------+-----------+------------+
| format-maprdb.json.scanSizeMB  | LONG  | BOOT              | BOOT         | BOOT    | 256      | null        | null      | null       |
+--------------------------------+-------+-------------------+--------------+---------+----------+-------------+-----------+------------+
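For context, the added line sits alongside the usual drill.exec block in drill-override.conf. A complete file might look like the sketch below (the cluster-id and zk.connect values are placeholders; use your cluster's actual values):

```
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zkhost1:5181,zkhost2:5181,zkhost3:5181"
}
format-maprdb: { json.scanSizeMB : 256 }
```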

Tuesday 13 March 2018

Login to Kubernetes dashboard using token


Aim:

Steps to log in to the Kubernetes dashboard using a token.

Steps:

The Kubernetes dashboard login page gives you the option to log in using either a 'Kubeconfig' file or a 'Token'.

If you want to log in using 'Token', get the required token value by executing the following commands:

kubectl get secrets

To get the secrets in a specific namespace, say 'kube-system', execute the following:

kubectl get secrets -n kube-system

This will display all the secrets in namespace 'kube-system'.

[root@vmdocc7176 ~]# kubectl get secrets -n kube-system
NAME                                             TYPE                                  DATA      AGE
attachdetach-controller-token-db5cj              kubernetes.io/service-account-token   3         1h
bootstrap-signer-token-xg79w                     kubernetes.io/service-account-token   3         1h
bootstrap-token-446d7b                           bootstrap.kubernetes.io/token         7         1h
calico-cni-plugin-token-ql769                    kubernetes.io/service-account-token   3         1h
calico-kube-controllers-token-2gz6p              kubernetes.io/service-account-token   3         1h
certificate-controller-token-rcfh2               kubernetes.io/service-account-token   3         1h
clusterrole-aggregation-controller-token-9bn58   kubernetes.io/service-account-token   3         1h
cronjob-controller-token-trcnj                   kubernetes.io/service-account-token   3         1h
daemon-set-controller-token-6vjm8                kubernetes.io/service-account-token   3         1h
default-token-74gbh                              kubernetes.io/service-account-token   3         1h
deployment-controller-token-5cxg6                kubernetes.io/service-account-token   3         1h
disruption-controller-token-fvm5w                kubernetes.io/service-account-token   3         1h
endpoint-controller-token-t78ds                  kubernetes.io/service-account-token   3         1h

To print the token for a specific secret, execute the following command:

kubectl describe secret <secret_name>

For example, to print the token for the 'default-token-74gbh' secret in the 'kube-system' namespace, execute the following command:

kubectl -n kube-system describe secret default-token-74gbh
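If you need the token non-interactively, note that data in a Secret is stored base64-encoded; 'kubectl describe secret' decodes it for you, but 'kubectl get -o jsonpath' does not. A sketch (the kubectl line assumes the secret name from the example above; the decode step is demonstrated with a stand-in value):

```shell
# Scripted alternative to 'kubectl describe secret' (requires a cluster):
#   kubectl -n kube-system get secret default-token-74gbh \
#     -o jsonpath='{.data.token}' | base64 --decode
#
# The decode step itself, demonstrated locally with a stand-in value:
encoded=$(printf 'sample-token-value' | base64)
token=$(printf '%s' "$encoded" | base64 --decode)
echo "$token"
```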

Copy the token value from the above output and paste it into the Kubernetes dashboard login page.


You will have privileges according to the token used.

Saturday 10 March 2018

Deploy drill-sqlline docker image in a Kubernetes cluster


Aim:

Deploy, in a Kubernetes cluster, the drill-sqlline image built in the earlier post 'Build Docker image for DRILL 1.12 sqlline client'.

Steps:

[1] Create a namespace for DRILL. (Optional, but this is useful if you want to implement RBAC.)
a. Create a file 'drillsqlline-namespace.yaml' with following contents:
apiVersion: v1
kind: Namespace
metadata:
  name: drill-system
  labels:
    name: drill-system

It creates a namespace called 'drill-system'.

b. Execute the following command to create the namespace.
kubectl create -f drillsqlline-namespace.yaml

[2] Create pod drill-sqlline.
For this, we will use the Docker image created in the 'Build Docker image for DRILL 1.12 sqlline client' post.
a. Create a file 'drillsqlline-pod.yaml' with following contents:
apiVersion: v1
kind: Pod
metadata:
      name: drill-sqlline-pod
      namespace: drill-system
spec:
    containers:
    - name: base
      imagePullPolicy: Always
      image: jamealwi/drill-sqlline:latest
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
      command:
      - /bin/bash
      - -c
      - exec /sbin/init

It creates a pod in the 'drill-system' namespace using the image 'jamealwi/drill-sqlline:latest'.

b. Execute the following command to create the pod inside 'drill-system' namespace.
kubectl create -f drillsqlline-pod.yaml -n drill-system

At this point, your pod should be up and running.

You can verify it by running the following command:
kubectl get pods -n drill-system

[root@vmdocc7176 mapr]# kubectl get pods -n drill-system
NAME                                        READY     STATUS    RESTARTS   AGE
drill-sqlline-pod                           1/1       Running   0          1m
[root@vmdocc7176 mapr]#

Now you can get a shell inside the pod and start sqlline, as shown below:
kubectl exec -it drill-sqlline-pod  -n drill-system -- bash

[root@vmdocc7176 mapr]# kubectl exec -it drill-sqlline-pod  -n drill-system -- bash
[root@drill-sqlline-pod /]# ./sqlline
apache drill
"just drill it"
1: jdbc:drill:zk=10.10.XX.YYY:5181> !connect jdbc:drill:drillbit=10.10.XX.YYY:31010
Enter username for jdbc:drill:drillbit=10.10.XX.YYY:31010: username
Enter password for jdbc:drill:drillbit=10.10.XX.YYY:31010: password
2: jdbc:drill:drillbit=10.10.XX.YYY:31010>

[3] Creating a deployment
However, in a production setup you will mostly work with 'Deployments', which internally create the specified number of pods and maintain them.
Follow below steps to create 'Deployment' for drill-sqlline.

a. Create a file 'drillsqlline-deployment.yaml' with the following contents: 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drill-sqlline-deployment
  labels:
    app: drill-sqlline
spec:
  replicas: 2
  selector:
    matchLabels:
      app: drill-sqlline
  template:
    metadata:
      labels:
        app: drill-sqlline
    spec:
      containers:
      - name: drill-sqlline
        image: jamealwi/drill-sqlline:latest
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
        command:
        - /bin/bash
        - -c
        - exec /sbin/init

It creates 2 replica pods of 'drill-sqlline' using the Docker image 'jamealwi/drill-sqlline:latest'.

b. Execute the following command to create the deployment inside 'drill-system' namespace.
kubectl create -f drillsqlline-deployment.yaml -n drill-system

Verify that the deployment was successful using the commands below:

[root@vmdocc7176 mapr]# kubectl get deployments -n drill-system
NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
drill-sqlline-deployment   2         2         2            2           55m
[root@vmdocc7176 mapr]# kubectl get pods -n drill-system
NAME                                        READY     STATUS    RESTARTS   AGE
drill-sqlline-deployment-6748b795f7-7xdvv   1/1       Running   0          55m
drill-sqlline-deployment-6748b795f7-jczrk   1/1       Running   0          55m


Friday 9 March 2018

How to enable Kubernetes dashboard


Aim:

To enable Kubernetes cluster dashboard. 

Steps:


Please follow the steps below to enable the dashboard:

kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml

With the above command, the dashboard will be available, but only with very minimal privileges.
This is intentional, for security reasons.

However, if you want to grant full admin privileges, you can follow the steps below.
(Be aware that this is not recommended on a production cluster; it is just for learning purposes.)

Create a 'dashboard-admin.yaml' file with following content:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kubernetes-dashboard
  labels:
    k8s-app: kubernetes-dashboard
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: kubernetes-dashboard
  namespace: kube-system

Execute the below command:

kubectl create -f dashboard-admin.yaml

At this point, the Kubernetes dashboard is up and running with admin privileges.
However, it will not be accessible from outside the cluster.
To make it available to the outside world, execute the following:

[1] kubectl -n kube-system edit service kubernetes-dashboard

Change 'type: ClusterIP' to 'type: NodePort'.

[2] kubectl -n kube-system get service kubernetes-dashboard

[root@vmdocc7176 ~]# kubectl -n kube-system get service kubernetes-dashboard
NAME                   TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
kubernetes-dashboard   NodePort   10.115.144.121   <none>        443:32227/TCP   12m


Now you can access the kubernetes dashboard using the following URL:
https://<API Server IP>:32227
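The 32227 here is the NodePort, the second number in the PORT(S) column (443:32227/TCP). In a script you could ask kubectl for it directly with -o jsonpath='{.spec.ports[0].nodePort}', or parse the text form, as this sketch does with the output above:

```shell
# Extract the node port from the PORT(S) value shown by 'kubectl get service'.
ports='443:32227/TCP'
node_port=${ports#*:}        # drop the in-cluster port -> "32227/TCP"
node_port=${node_port%%/*}   # drop the protocol        -> "32227"
echo "$node_port"
```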




Thursday 8 March 2018

Build Docker image for DRILL 1.12 sqlline client

Aim:

To build a docker image for DRILL sqlline client.

Prerequisites:

[1] Docker installed, up and running.
[2] Download the following jars:
      [a] drill-jdbc-all-1.12.0.jar (http://central.maven.org/maven2/org/apache/drill/exec/drill-jdbc-all/1.12.0/drill-jdbc-all-1.12.0.jar)
      [b] jline-2.10.jar (http://central.maven.org/maven2/jline/jline/2.10/jline-2.10.jar)
      [c] sqlline-1.1.9-drill-r7.jar (http://repository.mapr.com/nexus/content/repositories/drill/sqlline/sqlline/1.1.9-drill-r7/sqlline-1.1.9-drill-r7.jar)

Creating Dockerfile:

[1] Create a folder. Let's call it drill-sqlline.
[2] Copy above 3 jars to drill-sqlline directory.
[3] Create an executable script 'sqlline' with the following content:
         #!/bin/bash
         java -cp "/jars/drill-jdbc-all-1.12.0.jar:/jars/sqlline-1.1.9-drill-r7.jar:/jars/jline-2.10.jar" sqlline.SqlLine -d org.apache.drill.jdbc.Driver

[4] Create a 'Dockerfile' inside drill-sqlline with the following content:

FROM centos:latest

RUN yum install java-devel -y
RUN mkdir /jars
ADD drill-jdbc-all-1.12.0.jar /jars/
ADD jline-2.10.jar /jars/
ADD sqlline-1.1.9-drill-r7.jar /jars/
ADD sqlline /
RUN chmod +x /sqlline
CMD ./sqlline

Directory 'drill-sqlline' will contain:

Dockerfile
drill-jdbc-all-1.12.0.jar
jline-2.10.jar
sqlline
sqlline-1.1.9-drill-r7.jar

[5] Execute the following command in 'drill-sqlline' directory:
         docker build -t jamealwi/drill-sqlline:latest .

(NOTE: Make sure you use your docker hub username instead of 'jamealwi')

[root@vmdocc7176 sqlline-docker]# docker build -t jamealwi/drill-sqlline:latest .
Sending build context to Docker daemon 28.12 MB
Step 1 : FROM centos:latest
 ---> 2d194b392dd1
Step 2 : RUN yum install java-devel -y
 ---> Using cache
 ---> 9928f31bf455
Step 3 : RUN mkdir /jars
 ---> Using cache
 ---> e0a46e253ebc
Step 4 : ADD drill-jdbc-all-1.12.0.jar /jars/
 ---> Using cache
 ---> d977165bcb13
Step 5 : ADD jline-2.10.jar /jars/
 ---> Using cache
 ---> 27e0cbb67ab7
Step 6 : ADD sqlline-1.1.9-drill-r7.jar /jars/
 ---> Using cache
 ---> 229c77630c35
Step 7 : ADD sqlline /
 ---> Using cache
 ---> c3545c155a61
Step 8 : RUN chmod +x /sqlline
 ---> Using cache
 ---> 62d1cbe78611
Step 9 : CMD ./sqlline
 ---> Using cache
 ---> 9f7c6cb3b3e9
Successfully built 9f7c6cb3b3e9
[root@vmdocc7176 sqlline-docker]#

(NOTE: You will see way more messages if you are building the image for the first time.)

Your docker image for DRILL sqlline is ready!

The docker image is available in my docker hub. You can pull the image using the following command:

docker pull jamealwi/drill-sqlline

Now you can start a container from the image with the following command:

docker run -it jamealwi/drill-sqlline bash

Once you are inside the container, execute './sqlline'.