Source code version: HBase 0.98.1
HRegionServer starts the MemStoreFlusher thread:
private void initializeThreads() throws IOException {
  // Cache flushing thread.
  this.cacheFlusher = new MemStoreFlusher(conf, this);

  // Compaction thread
  this.compactSplitThread = new CompactSplitThread(this);
  .......
private void startServiceThreads() throws IOException {
  String n = Thread.currentThread().getName();
  ......
  this.cacheFlusher.start(uncaughtExceptionHandler);
  Threads.setDaemonThreadRunning(this.compactionChecker.getThread(),
      n + ".compactionChecker", uncaughtExceptionHandler);
  .....
/*
 * Run init. Sets up hlog and starts up all server threads.
 *
 * @param c Extra configuration.
 */
protected void handleReportForDutyResponse(final RegionServerStartupResponse c)
    throws IOException {
  ....
  startServiceThreads();
  .....
public void run() {
  try {
    // Do pre-registration initializations; zookeeper, lease threads, etc.
    preRegistrationInitialization();
  } catch (Throwable e) {
    abort("Fatal exception during initialization", e);
  }

  try {
    // Try and register with the Master; tell it we are here. Break if
    // server is stopped or the clusterup flag is down or hdfs went wacky.
    while (keepLooping()) {
      RegionServerStartupResponse w = reportForDuty();
      if (w == null) {
        LOG.warn("reportForDuty failed; sleeping and then retrying.");
        this.sleeper.sleep();
      } else {
        handleReportForDutyResponse(w); // starts all HRegionServer service threads
        break;
      }
    }
    ....
The key class and method: MemStoreFlusher.flushRegion
private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
  synchronized (this.regionsInQueue) {
    FlushRegionEntry fqe = this.regionsInQueue.remove(region);
    if (fqe != null && emergencyFlush) {
      // Need to remove from region from delay queue. When NOT an
      // emergencyFlush, then item was removed via a flushQueue.poll.
      flushQueue.remove(fqe);
    }
  }
  lock.readLock().lock();
  try {
    boolean shouldCompact = region.flushcache();
    // We just want to check the size
    boolean shouldSplit = region.checkSplit() != null;
    if (shouldSplit) {
      this.server.compactSplitThread.requestSplit(region);
    } else if (shouldCompact) {
      server.compactSplitThread.requestSystemCompaction(
          region, Thread.currentThread().getName());
    }
    ......
Take a FlushRegionEntry off the flushQueue and flush that region (a rough sketch of how entries get queued in the first place follows this list):
- Acquire the read lock
- Call HRegion to flush the region; the call returns whether a compaction is needed
- Ask HRegion whether a split is needed
- if (shouldSplit) request a split; else if (shouldCompact) request a compaction
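As a side note, here is a minimal sketch of the size check that typically puts a region on the flushQueue; conf, region and requestFlush are assumed stand-ins for illustration, not the actual MemStoreFlusher fields:

// Illustrative only: roughly when a region becomes a flushQueue candidate.
// 'conf', 'region' and 'requestFlush' are assumed to be in scope.
long flushSize = conf.getLong("hbase.hregion.memstore.flush.size", 128 * 1024 * 1024L);
if (region.getMemstoreSize().get() >= flushSize) {
  // MemStoreFlusher wraps the region in a FlushRegionEntry and enqueues it;
  // flushRegion() above later polls that entry off the flushQueue.
  requestFlush(region);
}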
The detailed steps follow:
--------------------------------------------------------------------------------------------------------------------
1.HRegion
protected boolean internalFlushcache(
    final HLog wal, final long myseqid, MonitoredTask status) throws IOException {
  if (this.rsServices != null && this.rsServices.isAborted()) {
    // Don't flush when server aborting, it's unsafe
    throw new IOException("Aborting flush because server is abortted...");
  }
  final long startTime = EnvironmentEdgeManager.currentTimeMillis();
  // Clear flush flag.
  // If nothing to flush, return and avoid logging start/stop flush.
  if (this.memstoreSize.get() <= 0) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Empty memstore size for the current region " + this);
    }
    return false;
  }
  if (LOG.isDebugEnabled()) {
    LOG.debug("Started memstore flush for " + this +
        ", current region memstore size " +
        StringUtils.humanReadableInt(this.memstoreSize.get()) +
        ((wal != null)? "": "; wal is null, using passed sequenceid=" + myseqid));
  }

  // Stop updates while we snapshot the memstore of all stores. We only have
  // to do this for a moment. Its quick. The subsequent sequence id that
  // goes into the HLog after we've flushed all these snapshots also goes
  // into the info file that sits beside the flushed files.
  // We also set the memstore size to zero here before we allow updates
  // again so its value will represent the size of the updates received
  // during the flush
  MultiVersionConsistencyControl.WriteEntry w = null;

  // We have to take a write lock during snapshot, or else a write could
  // end up in both snapshot and memstore (makes it difficult to do atomic
  // rows then)
  status.setStatus("Obtaining lock to block concurrent updates");
  // block waiting for the lock for internal flush
  this.updatesLock.writeLock().lock();
  long totalFlushableSize = 0;
  status.setStatus("Preparing to flush by snapshotting stores");
  List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
  long flushSeqId = -1L;
  try {
    // Record the mvcc for all transactions in progress.
    w = mvcc.beginMemstoreInsert();
    mvcc.advanceMemstore(w);
    // check if it is not closing.
    if (wal != null) {
      if (!wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes())) {
        status.setStatus("Flush will not be started for ["
            + this.getRegionInfo().getEncodedName() + "] - because the WAL is closing.");
        return false;
      }
      flushSeqId = this.sequenceId.incrementAndGet();
    } else {
      // use the provided sequence Id as WAL is not being used for this flush.
      flushSeqId = myseqid;
    }

    for (Store s : stores.values()) {
      totalFlushableSize += s.getFlushableSize();
      storeFlushCtxs.add(s.createFlushContext(flushSeqId));
    }

    // prepare flush (take a snapshot)
    for (StoreFlushContext flush : storeFlushCtxs) {
      // Step 1 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      flush.prepare();
    }
  } finally {
    this.updatesLock.writeLock().unlock();
  }
  String s = "Finished memstore snapshotting " + this +
      ", syncing WAL and waiting on mvcc, flushsize=" + totalFlushableSize;
  status.setStatus(s);
  if (LOG.isTraceEnabled()) LOG.trace(s);

  // sync unflushed WAL changes when deferred log sync is enabled
  // see HBASE-8208 for details
  if (wal != null && !shouldSyncLog()) {
    // Step 2 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    wal.sync();
  }

  // wait for all in-progress transactions to commit to HLog before
  // we can start the flush. This prevents
  // uncommitted transactions from being written into HFiles.
  // We have to block before we start the flush, otherwise keys that
  // were removed via a rollbackMemstore could be written to Hfiles.
  mvcc.waitForRead(w);

  s = "Flushing stores of " + this;
  status.setStatus(s);
  if (LOG.isTraceEnabled()) LOG.trace(s);

  // Any failure from here on out will be catastrophic requiring server
  // restart so hlog content can be replayed and put back into the memstore.
  // Otherwise, the snapshot content while backed up in the hlog, it will not
  // be part of the current running servers state.
  boolean compactionRequested = false;
  try {
    // A. Flush memstore to all the HStores.
    // Keep running vector of all store files that includes both old and the
    // just-made new flush store file. The new flushed file is still in the
    // tmp directory.
    for (StoreFlushContext flush : storeFlushCtxs) {
      // Step 3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      flush.flushCache(status);
    }

    // Switch snapshot (in memstore) -> new hfile (thus causing
    // all the store scanners to reset/reseek).
    for (StoreFlushContext flush : storeFlushCtxs) {
      // Step 4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      boolean needsCompaction = flush.commit(status);
      if (needsCompaction) {
        compactionRequested = true;
      }
    }
    storeFlushCtxs.clear();

    // Set down the memstore size by amount of flush.
    this.addAndGetGlobalMemstoreSize(-totalFlushableSize);
  } catch (Throwable t) {
    // An exception here means that the snapshot was not persisted.
    // The hlog needs to be replayed so its content is restored to memstore.
    // Currently, only a server restart will do this.
    // We used to only catch IOEs but its possible that we'd get other
    // exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
    // all and sundry.
    if (wal != null) {
      wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
    }
    DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
        Bytes.toStringBinary(getRegionName()));
    dse.initCause(t);
    status.abort("Flush failed: " + StringUtils.stringifyException(t));
    throw dse;
  }

  // If we get to here, the HStores have been written.
  if (wal != null) {
    wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
  }

  // Record latest flush time
  this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();

  // Update the last flushed sequence id for region
  completeSequenceId = flushSeqId;

  // C. Finally notify anyone waiting on memstore to clear:
  // e.g. checkResources().
  synchronized (this) {
    notifyAll(); // FindBugs NN_NAKED_NOTIFY
  }

  long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
  long memstoresize = this.memstoreSize.get();
  String msg = "Finished memstore flush of ~" +
      StringUtils.humanReadableInt(totalFlushableSize) + "/" + totalFlushableSize +
      ", currentsize=" +
      StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
      " for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
      ", compaction requested=" + compactionRequested +
      ((wal == null)? "; wal=null": "");
  LOG.info(msg);
  status.setStatus(msg);
  this.recentFlushes.add(new Pair<Long,Long>(time/1000, totalFlushableSize));

  return compactionRequested;
}
The flush calls HRegion's internalFlushcache method; notes on the numbered steps above:
1. HRegion 1661 / HStore 1941: prepare (under the write lock) has the MemStore copy its kvset into a snapshot, which serves as the in-memory data for this flush.
(Every flush flushes all stores in the region, so the smallest flush unit is the region, not the store; this is one reason multiple column families are generally discouraged.)
2. HRegion 1674: sync the WAL and wait for the sync to complete.
3. HRegion 1700: HStore.flushCache writes the snapshot to a tmpfile (one tmpfile per HStore, even though tmpfiles is a List).
4. HRegion 1706: HStore wraps the newly written tmpfiles as StoreFiles;
HStore then calls updateStorefiles, which takes the write lock, adds the files to the StoreFileManager's list so they can serve reads, and clears the snapshot.
HStore 951: needsCompaction calls RatioBasedCompactionPolicy.needsCompaction to decide whether the store needs a compaction
(the check: the number of HFiles in the store, minus any already being compacted, reaches the configured minimum, taken from hbase.hstore.compaction.min with the legacy hbase.hstore.compactionThreshold as its fallback, default 3; see the sketch after this list).
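A minimal, self-contained sketch of that trigger, assuming the caller supplies the store-file count and the count of files already being compacted; it mirrors the ratio-based policy's condition rather than copying the actual 0.98 class:

import org.apache.hadoop.conf.Configuration;

// Illustrative sketch of the compaction trigger, not the actual HBase policy class.
class CompactionTriggerSketch {
  private final int minFilesToCompact;

  CompactionTriggerSketch(Configuration conf) {
    // hbase.hstore.compaction.min falls back to the legacy
    // hbase.hstore.compactionThreshold; the effective default is 3.
    this.minFilesToCompact = Math.max(2,
        conf.getInt("hbase.hstore.compaction.min",
            conf.getInt("hbase.hstore.compactionThreshold", 3)));
  }

  // A store is a compaction candidate once the files not already being
  // compacted reach the configured minimum.
  boolean needsCompaction(int storeFileCount, int filesCompacting) {
    return (storeFileCount - filesCompacting) >= minFilesToCompact;
  }
}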
--------------------------------------------------------------------------------------------------------------------
2. HRegion checks whether to split; the implementing class is the split policy IncreasingToUpperBoundRegionSplitPolicy
@Override
protected boolean shouldSplit() {
  if (region.shouldForceSplit()) return true;
  boolean foundABigStore = false;
  // Get count of regions that have the same common table as this.region
  int tableRegionsCount = getCountOfCommonTableRegions();
  // Get size to check
  long sizeToCheck = getSizeToCheck(tableRegionsCount);

  for (Store store : region.getStores().values()) {
    // If any of the stores is unable to split (eg they contain reference files)
    // then don't split
    if ((!store.canSplit())) {
      return false;
    }

    // Mark if any store is big enough
    long size = store.getSize();
    if (size > sizeToCheck) {
      LOG.debug("ShouldSplit because " + store.getColumnFamilyName() +
          " size=" + size + ", sizeToCheck=" + sizeToCheck +
          ", regionsWithCommonTable=" + tableRegionsCount);
      foundABigStore = true;
    }
  }

  return foundABigStore;
}
IncreasingToUpperBoundRegionSplitPolicy 65: shouldSplit decides whether this region needs to be split.
(Again the decision is made per region, which is another reason multiple column families are a poor fit.)
(On init, initialSize = hbase.increasing.policy.initial.size (an explicitly configured initial size), or else hbase.hregion.memstore.flush.size (the memstore flush size).)
getCountOfCommonTableRegions returns the number of regions belonging to this.region's table (regioncount).
When regioncount is between 0 and 100, sizeToCheck is the minimum of hbase.hregion.max.filesize (default 10 GB) and initialSize * regioncount^3; otherwise it is simply hbase.hregion.max.filesize (default 10 GB).
For example, with a single region: 128 MB * 1^3 = 128 MB
128 MB * 2^3 = 1024 MB
128 MB * 3^3 = 3456 MB
128 MB * 4^3 = 8192 MB
128 MB * 5^3 = 16000 MB (~15 GB) => capped at 10 GB, so once there are 5 regions the configured max takes over (see the sketch below).
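The same calculation as a small sketch, with the parameters passed in explicitly for illustration (the real policy reads them from its own fields):

// Illustrative sketch of IncreasingToUpperBoundRegionSplitPolicy's size check.
long getSizeToCheck(long initialSize, long desiredMaxFileSize, int tableRegionsCount) {
  // Outside the 1..100 window the configured max file size applies directly.
  if (tableRegionsCount == 0 || tableRegionsCount > 100) {
    return desiredMaxFileSize;
  }
  // Otherwise grow cubically with the region count, capped at the configured max.
  return Math.min(desiredMaxFileSize,
      (long) tableRegionsCount * tableRegionsCount * tableRegionsCount * initialSize);
}

// E.g. initialSize = 128 MB, max = 10 GB:
//   1 region -> 128 MB, 2 -> 1 GB, 3 -> ~3.4 GB, 4 -> 8 GB, 5 -> capped at 10 GB.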
--------------------------------------------------------------------------------------------------------------------
3. if (shouldSplit) split, else if (shouldCompact) compact
http://blackproof.iteye.com/blog/2037159
I took notes on this before (see the link above) and had almost forgotten them, so here is another write-up of the region split:
public PairOfSameType<HRegion> stepsBeforePONR(final Server server,
    final RegionServerServices services, boolean testing) throws IOException {
  // Set ephemeral SPLITTING znode up in zk. Mocked servers sometimes don't
  // have zookeeper so don't do zk stuff if server or zookeeper is null
  if (server != null && server.getZooKeeper() != null) {
    try {
      // Step 1 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      createNodeSplitting(server.getZooKeeper(), parent.getRegionInfo(),
          server.getServerName(), hri_a, hri_b);
    } catch (KeeperException e) {
      throw new IOException("Failed creating PENDING_SPLIT znode on " +
          this.parent.getRegionNameAsString(), e);
    }
  }
  this.journal.add(JournalEntry.SET_SPLITTING_IN_ZK);
  if (server != null && server.getZooKeeper() != null) {
    // After creating the split node, wait for master to transition it
    // from PENDING_SPLIT to SPLITTING so that we can move on. We want master
    // knows about it and won't transition any region which is splitting.
    // Step 2 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    znodeVersion = getZKNode(server, services);
  }

  // Step 3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  this.parent.getRegionFileSystem().createSplitsDir();
  this.journal.add(JournalEntry.CREATE_SPLIT_DIR);

  Map<byte[], List<StoreFile>> hstoreFilesToSplit = null;
  Exception exceptionToThrow = null;
  try {
    // Step 4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    hstoreFilesToSplit = this.parent.close(false);
  } catch (Exception e) {
    exceptionToThrow = e;
  }
  if (exceptionToThrow == null && hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    exceptionToThrow = closedByOtherException;
  }
  if (exceptionToThrow != closedByOtherException) {
    this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
  }
  if (exceptionToThrow != null) {
    if (exceptionToThrow instanceof IOException) throw (IOException)exceptionToThrow;
    throw new IOException(exceptionToThrow);
  }
  if (!testing) {
    // Step 5 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    services.removeFromOnlineRegions(this.parent, null);
  }
  this.journal.add(JournalEntry.OFFLINED_PARENT);

  // TODO: If splitStoreFiles were multithreaded would we complete steps in
  // less elapsed time? St.Ack 20100920
  //
  // splitStoreFiles creates daughter region dirs under the parent splits dir
  // Nothing to unroll here if failure -- clean up of CREATE_SPLIT_DIR will
  // clean this up.
  // Step 6 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  splitStoreFiles(hstoreFilesToSplit);

  // Log to the journal that we are creating region A, the first daughter
  // region. We could fail halfway through. If we do, we could have left
  // stuff in fs that needs cleanup -- a storefile or two. Thats why we
  // add entry to journal BEFORE rather than AFTER the change.
  // Step 7 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  this.journal.add(JournalEntry.STARTED_REGION_A_CREATION);
  HRegion a = this.parent.createDaughterRegionFromSplits(this.hri_a);

  // Ditto
  this.journal.add(JournalEntry.STARTED_REGION_B_CREATION);
  HRegion b = this.parent.createDaughterRegionFromSplits(this.hri_b);

  return new PairOfSameType<HRegion>(a, b);
}
1. RegionSplitPolicy.getSplitPoint() obtains the split point for the region split; the midpoint of the largest store is taken as the split point.
2.SplitRequest.run()
Instantiate a SplitTransaction.
st.prepare(): pre-split checks: whether the region is closed and whether any of its HFiles are still referenced.
st.execute(): performs the split.
1. createDaughters creates the two daughter regions, taking the parent region's write lock:
1. Create an ephemeral splitting znode in ZooKeeper;
2. Wait until the master transitions this region to the SPLITTING state;
3. Then create the splits directory;
4. Wait for the region's flushes and compactions to finish, then close the region;
5. Remove it from the HRegionServer and add it to the offlined regions;
6. Perform the actual split: build a thread pool and use StoreFileSplitter to split every HFile (StoreFile) under the region
(the HFiles are not rewritten around the split row; they are merely referenced, and the reference files are written under the respective daughter regions);
7. Create the two daughter regions (left and right), delete the parent from meta, generate each daughter's regioninfo from the reference files, and write it to HDFS.
2. stepsAfterPONR uses the DaughterOpener threads to open the two daughter regions, calling initialize:
a) Write the .regioninfo file to HDFS so the region can be recovered even if meta is lost;
b) Initialize the HStores underneath it, mainly via the loadStoreFiles function: for each store it builds StoreFile objects, one per file/path fetched from HDFS; for each StoreFile it creates a HalfStoreFileReader that operates on the corresponding file of the parent region, because what the daughter region currently holds are reference files pointing at the parent region's files, so all reads and writes against the daughter resolve to the parent region's files.
Add the daughter regions to the regionserver's online-region list and to the meta table. (A simplified sketch of the reference-file idea follows.)
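A heavily simplified sketch of the "half file" idea behind those reference files; the class and fields here are hypothetical stand-ins, not HBase's actual Reference/HalfStoreFileReader API:

// Hypothetical illustration of a reference to half of a parent store file.
// HBase's real Reference/HalfStoreFileReader classes are more involved.
class HalfFileReferenceSketch {
  enum Range { TOP, BOTTOM } // above or below the split row

  private final byte[] splitRow;
  private final Range range;

  HalfFileReferenceSketch(byte[] splitRow, Range range) {
    this.splitRow = splitRow;
    this.range = range;
  }

  // A daughter region only "sees" the keys of the parent file that fall on
  // its side of the split row; nothing is rewritten until a later compaction.
  boolean contains(byte[] rowKey) {
    int cmp = org.apache.hadoop.hbase.util.Bytes.compareTo(rowKey, splitRow);
    return range == Range.TOP ? cmp >= 0 : cmp < 0;
  }
}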