Thursday, 27 October 2016

HBase Table Replication

How HBase Replication Works?

HBase replication is based on Source-Push methodology. This means master pushes the data asynchronously. This asynchronous method will result in eventual consistency of both the tables. In a busy cluster the slave might lag from the master in order of minutes.

The underlying principle of replication is replaying of WALEntries. WALEdit is an object representing one transaction and can have more than one mutation operations (puts/ deletes), but will have for only one row. Please note that writing to WAL is optional for normal HBase operations, however for replication to work WAL must be enabled.

As per https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/replication/package-summary.html

Before trying out replication, make sure to review the following requirements:

[1] Zookeeper should be handled by yourself, not by HBase, and should always be available during the deployment.

[2] All machines from both clusters should be able to reach every other machine since replication goes from any region server to any other one on the slave cluster. That also includes the Zookeeper clusters.

[3] Both clusters should have the same HBase and Hadoop major revision. For example, having 0.90.1 on the master and 0.90.0 on the slave is correct but not 0.90.1 and 0.89.20100725.

[4] Every table that contains families that are scoped for replication should exist on every cluster with the exact same name, same for those replicated families.

[5] For multiple slaves, Master/Master, or cyclic replication version 0.92 or greater is needed.

Also, if both source and destination cluster uses same zookeeper quorum, then make sure that they use a different 'zookeeper.znode.parent' znode.

Different modes of replication:

[1] Master - Slave (Single direction)
[2] Master - Master (Bi-directional)
[3] Cyclic - more than 2 cluster in picture. Can have various combination of above two.

Enabling replication:

MODE: Master - Slave

Changes to be made on Master: (Make sure HBase table to be replicated exists in both the cluster)

[1] Edit ${HBASE_HOME}/conf/hbase-site.xml on both clusters and add the following:

<property>
<name>hbase.replication</name>
<value>true</value>
</property>

[2] Push hbase-site.xml to all nodes.
[3] Restart hbase
[4] Run the following command in the HBase master's shell while it's running:
add_peer '<n>', "slave.zookeeper.quorum:zookeeper.clientport.:zookeeper.znode.parent"
[5] Once you have a peer, enable replication on your column families. One way to do this is to alter the table and set the scope like this:

disable 'your_table'
alter 'your_table', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'your_table'

MODE: Master – Master

Make the above changes in both the clusters.

Verifying whether data is replicated or not:

VerifyReplication MR job - runs on the master node and we need to specify the peer id.

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath`

"${HADOOP_HOME}/bin/hadoop" jar "${HBASE_HOME}/hbase-server-VERSION.jar"

verifyrep --starttime=<timestamp> --stoptime=<timestamp> --families=<myFam> <ID> <tableName>

The VerifyReplication command prints out GOODROWS and BADROWS counters to indicate rows that did and did not replicate correctly.

Example:

hadoop jar /opt/mapr/hbase/hbase-1.1.1/lib/hbase-server-1.1.1-mapr-1602.jar verifyrep --starttime=1477346187024 --stoptime 1478346187024 --families=c 1 hreplica

Also, the status of replication can be viewed from the HBase shell using 'status' command.

hbase(main):001:0> status 'replication'

version 1.1.1-mapr-1602

1 live servers

m51-d16-2:

SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=0, TimeStampsOfLastShippedOp=Mon Oct 24 22:53:04 EDT 2016, Replication Lag=0

SINK : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Wed Oct 26 19:16:57 EDT 2016

MapR cluster status can be viewed using the 'maprcli dashboard info' command or the UI.

Common errors and their reasons:

[A] ERROR when host details are not added in source cluster:

2016-10-24 17:56:32,632 WARN [main-EventThread.replicationSource,1] regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:

java.net.UnknownHostException: unknown host: m51-d16-2

at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.<init>(RpcClientImpl.java:301)

at org.apache.hadoop.hbase.ipc.RpcClientImpl.createConnection(RpcClientImpl.java:131)

at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1286)

at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1164)

at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)

at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)

at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:23209)

at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)

at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:694)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:406)

[B] ERROR while the destination cluster is down:

2016-10-24 17:59:53,040 WARN [main-EventThread.replicationSource,1] regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:

java.io.IOException: No replication sinks are available

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager.getReplicationSink(ReplicationSinkManager.java:115)

at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:155)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:694)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:406)

[C] ERROR when table does not exist in destination cluster as in source cluster:

2016-10-24 17:59:16,036 WARN [main-EventThread.replicationSource,1] regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of an error on the remote cluster:

org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: Table 'hreplica' was not found, got: hbase:namespace.: 1 time,

at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:229)

at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:209)

at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1595)

at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:1185)

at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:1202)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:236)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:160)

at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:198)

at org.apache.hadoop.hbase.regionserver.RSRpcServices.replicateWALEntry(RSRpcServices.java:1708)

at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22253)

at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)

at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)

at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)

at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)

at java.lang.Thread.run(Thread.java:745)

at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1208)

at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)

at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)

at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:23209)

at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)

at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:694)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:406)

[D] ERROR when regionserver is down/ table is disabled:

2016-10-24 18:19:28,877 WARN [main-EventThread.replicationSource,1] regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of an error on the remote cluster: