Reposted from: http://iwinit.iteye.com/blog/1818897
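Suppose we create table t1 with column family c1, and the HBase root directory (hbase.rootdir) is /new. When an empty table is created, the system automatically generates one empty region. We follow the assignment of this region to see how a region is created across the HMaster and the region server (abbreviated rs below). The rough flow is:
1. The HMaster draws up an assignment plan. Each region is assigned to exactly one rs, and regions are spread evenly across the available rs.
2. The assignments for the different rs are executed concurrently.
3. A node for the region is first created under /hbase/unassigned in zk, with state 'offline'.
4. The master makes an RPC to the target rs, asking it to open the region.
5. The master starts waiting for all regions to be assigned; master and rs communicate through the state of the zk node.
6. The rs receives the request and performs the OpenRegion operation asynchronously.
7. The rs first changes the zk node state to 'opening'.
8. The rs opens and initializes the region, which mainly means creating the region's directory on HDFS and initializing its Stores.
9. The rs updates the region's record in the meta table.
10. The rs changes the zk node state to 'opened'.
11. The master receives the 'opened' notification and considers the region successfully assigned.
12. Once all regions have succeeded, the master considers the bulk region creation complete.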
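The znode states that appear in the steps above can be modelled as a small state machine. The following is only an illustrative sketch: the enum and class names are hypothetical and simplified, not HBase's own types, and the FAILED_OPEN state is included because the handler code later in this post calls tryTransitionToFailedOpen.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Illustrative model of the znode states a region passes through while being assigned. */
class RegionZnodeStateSketch {
    /** Hypothetical, simplified state names mirroring the steps in the text. */
    enum State { OFFLINE, OPENING, OPENED, FAILED_OPEN }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        // The master creates the node as OFFLINE; the rs moves it to OPENING.
        ALLOWED.put(State.OFFLINE, EnumSet.of(State.OPENING));
        // OPENING -> OPENING models the rs refreshing the node while it is still working.
        ALLOWED.put(State.OPENING, EnumSet.of(State.OPENING, State.OPENED, State.FAILED_OPEN));
        // OPENED and FAILED_OPEN are terminal in this sketch.
        ALLOWED.put(State.OPENED, EnumSet.noneOf(State.class));
        ALLOWED.put(State.FAILED_OPEN, EnumSet.noneOf(State.class));
    }

    /** Returns true if the sketch allows moving a znode from one state to another. */
    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

For example, `canTransition(State.OFFLINE, State.OPENED)` is false in this model, because the rs is expected to pass through OPENING first.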
[Figure: rough class diagram]
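On the HMaster side, BulkAssigner is provided for assigning regions in bulk. By default regions are distributed randomly and evenly, and the assignment to each rs is carried out through an RPC call.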
- public boolean bulkAssign(boolean sync) throws InterruptedException,
- IOException {
- boolean result = false;
- ThreadFactoryBuilder builder = new ThreadFactoryBuilder();
- builder.setDaemon(true);
- builder.setNameFormat(getThreadNamePrefix() + "-%1$d");
- builder.setUncaughtExceptionHandler(getUncaughtExceptionHandler());
- int threadCount = getThreadCount();
- java.util.concurrent.ExecutorService pool =
- Executors.newFixedThreadPool(threadCount, builder.build());
- try {
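- // Submit the assignment tasks to the thread pool.
- populatePool(pool);
- // Only a synchronous call blocks here until the wait completes or times out.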
- if (sync) result = waitUntilDone(getTimeoutOnRIT());
- } finally {
- // Stop accepting new tasks; tasks already submitted still run to completion.
- pool.shutdown();
- }
- return result;
- }
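To make "random and even" concrete, here is a minimal sketch of one way such a plan could be built: shuffle the servers, then deal the regions out round-robin. This is an assumption made for illustration, not HBase's actual balancer code; the class name, method name, and the use of plain strings for regions and servers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative random-but-even assignment plan; not HBase's balancer. */
class EvenAssignmentPlanSketch {
    /** Maps each server to the regions it should open; list sizes differ by at most one. */
    static Map<String, List<String>> plan(List<String> regions, List<String> servers, Random rnd) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        // Shuffle a copy so that which server receives the extra regions is random.
        List<String> shuffled = new ArrayList<>(servers);
        Collections.shuffle(shuffled, rnd);
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String server : shuffled) {
            plan.put(server, new ArrayList<>());
        }
        // Deal regions out round-robin; each region lands on exactly one server.
        for (int i = 0; i < regions.size(); i++) {
            plan.get(shuffled.get(i % shuffled.size())).add(regions.get(i));
        }
        return plan;
    }
}
```

With a plan of this shape, each map entry corresponds to one batch that could be handed to a worker thread, which matches the one-RPC-per-rs structure described above.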
The waiting step:
- boolean waitUntilNoRegionsInTransition(final long timeout, Set<HRegionInfo> regions)
- throws InterruptedException {
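- // Block until none of the given regions is still in transition, the master stops, or the timeout elapses.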
- long startTime = System.currentTimeMillis();
- long remaining = timeout;
- boolean stillInTransition = true;
- synchronized (regionsInTransition) {
- while (regionsInTransition.size() > 0 && !this.master.isStopped() &&
- remaining > 0 && stillInTransition) {
- int count = 0;
- for (RegionState rs : regionsInTransition.values()) {
- if (regions.contains(rs.getRegion())) {
- count++;
- break;
- }
- }
- if (count == 0) {
- stillInTransition = false;
- break;
- }
- regionsInTransition.wait(remaining);
- remaining = timeout - (System.currentTimeMillis() - startTime);
- }
- }
- return stillInTransition;
- }
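The listing above is an instance of the classic wait/notify pattern with a deadline: the waiter holds the lock on the shared collection, re-checks its condition after every wake-up, and recomputes the remaining time. A stripped-down sketch follows, with hypothetical names and string region ids. Note one deliberate difference: this sketch returns true when waiting succeeded, whereas the listing above returns whether regions are still in transition.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Illustrative wait/notify with timeout over a shared "in transition" set. */
class TransitionWaiterSketch {
    private final Set<String> inTransition = new HashSet<>();

    void add(String region) {
        synchronized (inTransition) {
            inTransition.add(region);
        }
    }

    /** Removes a region and wakes up waiters so they can re-check their condition. */
    void remove(String region) {
        synchronized (inTransition) {
            if (inTransition.remove(region)) {
                inTransition.notifyAll();
            }
        }
    }

    /** Returns true if none of the watched regions is in transition before the timeout. */
    boolean waitUntilDone(Set<String> watched, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (inTransition) {
            try {
                while (!Collections.disjoint(inTransition, watched)) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        return false;
                    }
                    // wait() releases the lock, so remove() can run in the meantime.
                    inTransition.wait(remaining);
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```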
AssignmentManager provides assign(final ServerName destination, final List<HRegionInfo> regions) to assign a batch of regions to one rs:
- void assign(final ServerName destination,
- final List<HRegionInfo> regions) {
- ....
-
- List<RegionState> states = new ArrayList<RegionState>(regions.size());
- synchronized (this.regionsInTransition) {
- for (HRegionInfo region: regions) {
- states.add(forceRegionStateToOffline(region));
- }
- }
- .....
- // Asynchronously create an OFFLINE znode for each region; counter tracks how many have been created so far.
- AtomicInteger counter = new AtomicInteger(0);
- CreateUnassignedAsyncCallback cb =
- new CreateUnassignedAsyncCallback(this.watcher, destination, counter);
- for (RegionState state: states) {
- if (!asyncSetOfflineInZooKeeper(state, destination, cb, state)) {
- return;
- }
- }
- // Poll until a znode has been created for every region.
- int total = regions.size();
- for (int oldCounter = 0; true;) {
- int count = counter.get();
- if (oldCounter != count) {
- LOG.info(destination.toString() + " unassigned znodes=" + count +
- " of total=" + total);
- oldCounter = count;
- }
- if (count == total) break;
- Threads.sleep(1);
- }
-
- try {
- // Send the open RPC; maxWaitTime bounds how long to wait for an rs that is still starting up.
- long maxWaitTime = System.currentTimeMillis() +
- this.master.getConfiguration().
- getLong("hbase.regionserver.rpc.startup.waittime", 60000);
- while (!this.master.isStopped()) {
- try {
- this.serverManager.sendRegionOpen(destination, regions);
- break;
- } catch (RemoteException e) {
- IOException decodedException = e.unwrapRemoteException();
- if (decodedException instanceof RegionServerStoppedException) {
- LOG.warn("The region server was shut down, ", decodedException);
- // No point retrying; the region server is gone.
- return;
- } else if (decodedException instanceof ServerNotRunningYetException) {
- // The server is not up yet: give up once maxWaitTime has passed, otherwise sleep for one second.
- long now = System.currentTimeMillis();
- if (now > maxWaitTime) throw e;
- LOG.debug("Server is not yet up; waiting up to " +
- (maxWaitTime - now) + "ms", e);
- Thread.sleep(1000);
- }
-
- throw decodedException;
- }
- }
- }
- .......
- }
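The RPC loop above is a retry-until-deadline pattern: one specific "not ready yet" failure is retried after a pause, and anything else fails immediately. As quoted, the trailing `throw decodedException` is also reached after the sleep, so the sketch below shows the retry behaviour the loop appears to be aiming for rather than reproducing the listing. All names are hypothetical, and unchecked exceptions are used only to keep the example short.

```java
import java.util.function.Supplier;

/** Illustrative retry loop bounded by a deadline; not the HBase implementation. */
class RetryUntilDeadlineSketch {
    /** Hypothetical marker for "the target is still starting up; try again later". */
    static class NotReadyException extends RuntimeException {
        NotReadyException(String message) {
            super(message);
        }
    }

    /** Calls the supplier until it succeeds, retrying only on NotReadyException. */
    static <T> T callWithRetry(Supplier<T> call, long maxWaitMs, long pauseMs) {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (true) {
            try {
                return call.get();
            } catch (NotReadyException e) {
                // Give up once the deadline has passed; any other exception propagates at once.
                if (System.currentTimeMillis() > deadline) {
                    throw e;
                }
                try {
                    Thread.sleep(pauseMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
    }
}
```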
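The RPC interface on the rs is HRegionInterface.openRegions(final List<HRegionInfo> regions). The rs initializes the region and tells the master whether it succeeded through the zk node state; this is an asynchronous process.
For user tables, opening a region is handled by OpenRegionHandler: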
- public void process() throws IOException {
- try {
- .....
- // Move the znode from OFFLINE to OPENING; if this fails, someone else has changed the node, so give up.
- if (!transitionZookeeperOfflineToOpening(encodedName,
- versionOfOfflineNode)) {
- LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
- encodedName);
- return;
- }
- // Open the region on this rs; a null result means the open failed.
- region = openRegion();
- if (region == null) {
- tryTransitionToFailedOpen(regionInfo);
- return;
- }
- boolean failed = true;
- // Refresh the OPENING znode, then record the region in the meta table.
- if (tickleOpening("post_region_open")) {
- if (updateMeta(region)) {
- failed = false;
- }
- }
-
- if (failed || this.server.isStopped() ||
- this.rsServices.isStopping()) {
- cleanupFailedOpen(region);
- tryTransitionToFailedOpen(regionInfo);
- return;
- }
-
- if (!transitionToOpened(region)) {
- // Moving the znode to OPENED failed, so undo the open on this rs.
- cleanupFailedOpen(region);
- return;
- }
- // The region is now open on this rs; add it to the online regions.
- this.rsServices.addToOnlineRegions(region);
-
- .....
- }
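The handler passes a node version (versionOfOfflineNode) into its first transition. ZooKeeper writes can carry an expected version and fail if the node has changed in the meantime, which is how a "hijacked" node can be detected. The sketch below imitates that compare-and-set behaviour with an in-memory object; it is a hypothetical stand-in for illustration, not the ZKAssign API.

```java
/** Illustrative in-memory node with a state and a version, updated by compare-and-set. */
class VersionedNodeSketch {
    private String state;
    private int version;

    VersionedNodeSketch(String initialState) {
        this.state = initialState;
        this.version = 0;
    }

    synchronized String state() {
        return state;
    }

    synchronized int version() {
        return version;
    }

    /**
     * Moves the node from one state to another only if both the current state and
     * the current version match what the caller last saw.
     * Returns the new version, or -1 if someone else changed the node first.
     */
    synchronized int transition(String from, String to, int expectedVersion) {
        if (!state.equals(from) || version != expectedVersion) {
            return -1;
        }
        state = to;
        version++;
        return version;
    }
}
```

A caller would keep the returned version and pass it to its next transition, so a stale caller that still holds an old version is rejected instead of overwriting newer state.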
Region initialization:
- private long initializeRegionInternals(final CancelableProgressable reporter,
- MonitoredTask status) throws IOException, UnsupportedEncodingException {
- .....
- // Make sure the .regioninfo file exists in the region directory.
- status.setStatus("Writing region info on filesystem");
-
- checkRegioninfoOnFilesystem();
-
-
- status.setStatus("Cleaning up temporary data from old regions");
-
- cleanupTmpDir();
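- // Open all Stores (one per column family) in parallel and track the largest sequence id and memstore timestamp seen.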
- Map<byte[], Long> maxSeqIdInStores = new TreeMap<byte[], Long>(
- Bytes.BYTES_COMPARATOR);
- long maxSeqId = -1;
-
- long maxMemstoreTS = -1;
-
- if (this.htableDescriptor != null &&
- !htableDescriptor.getFamilies().isEmpty()) {
-
- ThreadPoolExecutor storeOpenerThreadPool =
- getStoreOpenAndCloseThreadPool(
- "StoreOpenerThread-" + this.regionInfo.getRegionNameAsString());
- CompletionService<Store> completionService =
- new ExecutorCompletionService<Store>(storeOpenerThreadPool);
- // Submit one store-opening task per column family.
- for (final HColumnDescriptor family : htableDescriptor.getFamilies()) {
- status.setStatus("Instantiating store for column family " + family);
- completionService.submit(new Callable<Store>() {
- public Store call() throws IOException {
- return instantiateHStore(tableDir, family);
- }
- });
- }
- try {
- for (int i = 0; i < htableDescriptor.getFamilies().size(); i++) {
- Future<Store> future = completionService.take();
- Store store = future.get();
-
- this.stores.put(store.getColumnFamilyName().getBytes(), store);
- long storeSeqId = store.getMaxSequenceId();
- maxSeqIdInStores.put(store.getColumnFamilyName().getBytes(),
- storeSeqId);
- if (maxSeqId == -1 || storeSeqId > maxSeqId) {
- maxSeqId = storeSeqId;
- }
- long maxStoreMemstoreTS = store.getMaxMemstoreTS();
- if (maxStoreMemstoreTS > maxMemstoreTS) {
- maxMemstoreTS = maxStoreMemstoreTS;
- }
- }
- ......
- }
- mvcc.initialize(maxMemstoreTS + 1);
-
- maxSeqId = Math.max(maxSeqId, replayRecoveredEditsIfAny(
- this.regiondir, maxSeqIdInStores, reporter, status));
-
- .......
-
- this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
- // The next sequence id to hand out is one past the largest id seen.
- long nextSeqid = maxSeqId + 1;
- ......
- return nextSeqid;
- }
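The store-opening part of the listing follows a common pattern: submit one task per column family to an ExecutorCompletionService, take the results as they complete, and fold them into a running maximum. A self-contained sketch of that pattern is shown below. The class and method names are hypothetical, and each "store" is reduced to a precomputed maximum sequence id, so this only illustrates the concurrency structure.

```java
import java.util.Map;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative parallel "open" of per-family stores, reduced to their max sequence ids. */
class ParallelStoreOpenSketch {
    /** Returns one past the largest sequence id across all families (0 if there are none). */
    static long nextSequenceId(Map<String, Long> maxSeqIdByFamily, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<Long> completion = new ExecutorCompletionService<>(pool);
            // One task per column family; a real task would open the store's files here.
            for (Map.Entry<String, Long> family : maxSeqIdByFamily.entrySet()) {
                completion.submit(() -> family.getValue());
            }
            long maxSeqId = -1;
            // take() returns results in completion order, not submission order.
            for (int i = 0; i < maxSeqIdByFamily.size(); i++) {
                long storeSeqId = completion.take().get();
                maxSeqId = Math.max(maxSeqId, storeSeqId);
            }
            return maxSeqId + 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while opening stores", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a store failed to open", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```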
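That is all of the processing on the rs side. The master uses a zk watcher to listen for the region state changes made by the rs; AssignmentManager's nodeDataChanged method handles these events.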
- public void nodeDataChanged(String path) {
- if(path.startsWith(watcher.assignmentZNode)) {
- try {
- Stat stat = new Stat();
-
- RegionTransitionData data = ZKAssign.getDataAndWatch(watcher, path, stat);
- if (data == null) {
- return;
- }
- handleRegion(data, stat.getVersion());
- } catch (KeeperException e) {
- master.abort("Unexpected ZK exception reading unassigned node data", e);
- }
- }
- }
When the rs sets the region state to 'opening':
- case RS_ZK_REGION_OPENING:
- .....
- // Record the OPENING state in the master's in-memory region state.
- regionState.update(RegionState.State.OPENING,
- data.getStamp(), data.getOrigin());
- break;
When the rs sets the region state to 'opened':
- case RS_ZK_REGION_OPENED:
- ......
- // Record the OPEN state, then hand off to OpenedRegionHandler.
- regionState.update(RegionState.State.OPEN,
- data.getStamp(), data.getOrigin());
- this.executorService.submit(
- new OpenedRegionHandler(master, this, regionState.getRegion(),
- data.getOrigin(), expectedVersion));
- break;
OpenedRegionHandler mainly deletes the region node that was created earlier under /hbase/unassigned:
- public void process() {
- // Delete the OPENED znode if the region is still in transition and in the OPEN state.
- RegionState regionState = this.assignmentManager.isRegionInTransition(regionInfo);
- boolean openedNodeDeleted = false;
- if (regionState != null
- && regionState.getState().equals(RegionState.State.OPEN)) {
- openedNodeDeleted = deleteOpenedNode(expectedVersion);
- if (!openedNodeDeleted) {
- LOG.error("The znode of region " + regionInfo.getRegionNameAsString()
- + " could not be deleted.");
- }
- }
- .....
- }
After the node is deleted, zk sends another notification, which arrives at AssignmentManager's nodeDeleted method:
- public void nodeDeleted(final String path) {
- if (path.startsWith(this.watcher.assignmentZNode)) {
- String regionName = ZKAssign.getRegionName(this.master.getZooKeeper(), path);
- RegionState rs = this.regionsInTransition.get(regionName);
- if (rs != null) {
- HRegionInfo regionInfo = rs.getRegion();
- if (rs.isSplit()) {
- LOG.debug("Ephemeral node deleted, regionserver crashed?, " +
- "clearing from RIT; rs=" + rs);
- regionOffline(rs.getRegion());
- } else {
- LOG.debug("The znode of region " + regionInfo.getRegionNameAsString()
- + " has been deleted.");
- if (rs.isOpened()) {
- makeRegionOnline(rs, regionInfo);
- }
- }
- }
- }
- }
The region comes online: it is removed from the regions-in-transition list, and the servers and regions lists are updated:
- void regionOnline(HRegionInfo regionInfo, ServerName sn) {
- synchronized (this.regionsInTransition) {
- RegionState rs =
- this.regionsInTransition.remove(regionInfo.getEncodedName());
- if (rs != null) {
- this.regionsInTransition.notifyAll();
- }
- }
- synchronized (this.regions) {
-
- ServerName oldSn = this.regions.get(regionInfo);
- if (oldSn != null && serverManager.isServerOnline(oldSn)) {
- LOG.warn("Overwriting " + regionInfo.getEncodedName() + " on old:"
- + oldSn + " with new:" + sn);
-
- Set<HRegionInfo> hris = servers.get(oldSn);
- if (hris != null) {
- hris.remove(regionInfo);
- }
- }
-
- if (isServerOnline(sn)) {
- this.regions.put(regionInfo, sn);
- addToServers(sn, regionInfo);
- this.regions.notifyAll();
- } else {
- LOG.info("The server is not in online servers, ServerName=" +
- sn.getServerName() + ", region=" + regionInfo.getEncodedName());
- }
- }
-
- clearRegionPlan(regionInfo);
-
- addToServersInUpdatingTimer(sn);
- }
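The bookkeeping in regionOnline keeps two views of the same fact in sync: which server hosts a region, and which regions a server hosts. A minimal sketch of that two-map structure follows, using plain strings and hypothetical names. It omits the regions-in-transition handling and the "is the server online" checks from the listing above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative two-way index between regions and the servers hosting them. */
class OnlineRegionsSketch {
    private final Map<String, String> regionToServer = new HashMap<>();
    private final Map<String, Set<String>> serverToRegions = new HashMap<>();

    /** Marks a region as online on a server, removing it from any previous server. */
    synchronized void regionOnline(String region, String server) {
        String oldServer = regionToServer.put(region, server);
        if (oldServer != null && !oldServer.equals(server)) {
            Set<String> oldRegions = serverToRegions.get(oldServer);
            if (oldRegions != null) {
                oldRegions.remove(region);
            }
        }
        serverToRegions.computeIfAbsent(server, k -> new TreeSet<>()).add(region);
    }

    /** Returns the hosting server, or null if the region is not online. */
    synchronized String serverOf(String region) {
        return regionToServer.get(region);
    }

    /** Returns a snapshot of the regions hosted by a server. */
    synchronized Set<String> regionsOn(String server) {
        Set<String> regions = serverToRegions.get(server);
        return regions == null ? Collections.<String>emptySet() : new TreeSet<>(regions);
    }
}
```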
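Summary
Key points of region assignment:
1. Region load balancing: by default, regions are distributed randomly and evenly.
2. The master creates a region node under /hbase/unassigned, which is used for the subsequent interaction with the rs.
3. The rs initializes the region's file directory on HDFS, including the .regioninfo file and the family directories.
4. After the rs opens the region, it sets the state to 'opened'. The master then considers the region assignment successful, deletes the node, and adds the region to the online list.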