Wednesday 1 March 2017


HBase replication stops abruptly


Issue: 

HBase replication between the source and DR clusters stops abruptly. Replication no longer happens for existing tables, but works as expected for newly created tables. All the necessary replication configuration is in place, and the HBase services are up and running.


Root cause:

When a regionserver crashes, another regionserver tries to take over the crashed server's HLog queue to finish the pending replication work. While claiming the queue, it creates a persistent ZooKeeper node named "lock", which signals to the other regionservers that this queue is already being taken over.

public boolean lockOtherRS(String znode) {
    try {
      String parent = ZKUtil.joinZNode(this.rsZNode, znode);
      if (parent.equals(rsServerNameZnode)) {
        LOG.warn("Won't lock because this is us, we're dead!");
        return false;
      }
      // Create the persistent "lock" znode under the dead regionserver's
      // queue; its data records which regionserver claimed the takeover.
      String p = ZKUtil.joinZNode(parent, RS_LOCK_ZNODE);
      ZKUtil.createAndWatch(this.zookeeper, p, Bytes.toBytes(rsServerNameZnode));
    } catch (KeeperException e) {
      ...
      return false;
    }
    return true;
}


If hbase.zookeeper.useMulti in hbase-site.xml is set to false, the takeover is not performed as a single atomic ZooKeeper operation. Should the regionserver that is taking over crash after creating the lock but before copying the replication queue of the previously crashed server into its own queue, the "lock" znode is never deleted, and no other regionserver can take over that replication queue.
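With useMulti set to true, HBase batches the znode operations of a queue takeover into a single atomic ZooKeeper "multi" transaction, so a crash cannot leave a half-transferred queue (or a stale lock) behind. The following is a minimal sketch of that idea written against the plain ZooKeeper client API, not the actual HBase implementation; the connection string and znode paths are illustrative, and the parent znodes are assumed to already exist.

import java.util.Arrays;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AtomicQueueClaim {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);

    // Hypothetical paths: one hlog entry in the dead regionserver's queue,
    // and its destination under the regionserver taking over that queue.
    String deadEntry =
        "/hbase/replication/rs/dead-rs,60020,1469048347755/1/hlog.1234";
    String claimedEntry =
        "/hbase/replication/rs/live-rs,60020,1486988256109/1-dead-rs,60020,1469048347755/hlog.1234";

    byte[] position = zk.getData(deadEntry, false, null);

    // Copy and delete in one transaction: either both operations succeed
    // or neither does, so no intermediate state can survive a crash.
    zk.multi(Arrays.asList(
        Op.create(claimedEntry, position, ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT),
        Op.delete(deadEntry, -1)));

    zk.close();
  }
}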


Symptoms in the HBase regionserver logs (DR cluster):

2017-02-14 14:37:36,109 INFO  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Won't transfer the queue, another RS took care of it because of: KeeperErrorCode = NodeExists for /hbase/replication/rs/xxxx.com,60020,1469048347755/lock
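
The stale lock can be confirmed from the HBase ZooKeeper shell by inspecting the znode path reported in the log (assuming the default /hbase root). Per the lockOtherRS() code above, the znode's data is the name of the regionserver that created the lock; if that regionserver is no longer alive, the lock is stale:

$ hbase zkcli
get /hbase/replication/rs/xxxx.com,60020,1469048347755/lock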


Resolution:

Set hbase.zookeeper.useMulti=true in hbase-site.xml so that future queue takeovers happen atomically.
Remove the stale /hbase/replication/rs znode from the DR cluster's ZooKeeper, as shown below.
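
For reference, the property in hbase-site.xml would look like this (the change has to be rolled out to the cluster nodes and the HBase services restarted):

<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>true</value>
</property>

The stale znode can then be removed from the HBase ZooKeeper shell. Note that deleting /hbase/replication/rs discards the replication queues stored under it:

$ hbase zkcli
rmr /hbase/replication/rs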
