A memstore flush is triggered from the following places:
a. When HRegionServer handles a multi update, it checks whether total memstore usage exceeds the configured global upper and lower limits; if so, a WakeupFlushThread flush request is issued, and if the global upper limit is exceeded the update must wait for the flush to complete.
b. When HRegionServer updates data through HRegion.batchMutate, if the region's memstore size exceeds the configured region memstore flush size, a FlushRegionEntry flush request is issued.
c. A client explicitly calls HRegionServer.flushRegion.
d. The HRegionServer.PeriodicMemstoreFlusher thread flushes periodically, controlled by
hbase.regionserver.optionalcacheflushinterval (default 3600000 ms).
The actual flushing is carried out by MemStoreFlusher: when a flush request is issued,
the request is added to the flushQueue and the region is also recorded in the regionsInQueue map.
When a MemStoreFlusher instance is created it starts MemStoreFlusher.FlushHandler threads;
the number of handler threads is configured by hbase.hstore.flusher.count, default 1.
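Before walking through the source, here is a minimal, self-contained sketch of the producer/consumer pattern just described: flush requests go into a delay queue and handler threads drain it. All names in the sketch (SimpleFlushRequest, SimpleFlusher, requestFlush) are illustrative stand-ins, not the actual HBase classes.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class SimpleFlushRequest implements Delayed {
  final String regionName;
  final long whenNanos;                 // time at which the entry becomes eligible
  SimpleFlushRequest(String regionName, long delayMs) {
    this.regionName = regionName;
    this.whenNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
  }
  public long getDelay(TimeUnit unit) {
    return unit.convert(whenNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
  }
  public int compareTo(Delayed o) {
    return Long.compare(getDelay(TimeUnit.NANOSECONDS), o.getDelay(TimeUnit.NANOSECONDS));
  }
}

class SimpleFlusher implements Runnable {
  private final BlockingQueue<SimpleFlushRequest> flushQueue = new DelayQueue<>();
  void requestFlush(String regionName) {
    flushQueue.add(new SimpleFlushRequest(regionName, 0));   // immediately eligible
  }
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // analogous to flushQueue.poll(threadWakeFrequency, ...) in FlushHandler.run
        SimpleFlushRequest req = flushQueue.poll(10_000, TimeUnit.MILLISECONDS);
        if (req != null) {
          System.out.println("would flush region " + req.regionName);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}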
private class FlushHandler extends HasThread {
@Override
public void run() {
while (!server.isStopped()) {
FlushQueueEntry fqe = null;
try {
wakeupPending.set(false); // allow someone to wake us up again
Take a flush request from the queue. flushQueue is a blocking queue; if it is empty,
the poll waits up to hbase.server.thread.wakefrequency ms (default 10*1000).
fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
if (fqe == null || fqe instanceof WakeupFlushThread) {
If there is no flush request, or the request is a global WakeupFlushThread request,
check whether total memstore usage exceeds hbase.regionserver.global.memstore.lowerLimit (default 0.35).
if (isAboveLowWaterMark()) {
LOG.debug("Flush thread woke up because memory above low water="
+ StringUtils.humanReadableInt(globalMemStoreLimitLowMark));
Above the configured lower memstore watermark: flush the region with the largest memstore.
See the MemStoreFlusher.flushOneForGlobalPressure analysis below for the details of this call.
if (!flushOneForGlobalPressure()) {
.................... part of the code is omitted here
Thread.sleep(1000);
No region could be flushed; wake up any waiting update threads.
The HRegionServer update paths wait on the flusher when total memstore usage exceeds the configured upper limit.
wakeUpIfBlocking();
}
//Enqueue another one of these tokens so we'll wake up again
Issue another global wake-up flush request by enqueuing a new WakeupFlushThread token.
wakeupFlushThread();
}
continue;
}
A normal flush request: a single region's memstore exceeded
hbase.hregion.memstore.flush.size (default 1024*1024*128L).
See the MemStoreFlusher.flushRegion analysis below for the details of this call.
FlushRegionEntry fre = (FlushRegionEntry) fqe;
if (!flushRegion(fre)) {
break;
}
} catch (InterruptedException ex) {
continue;
} catch (ConcurrentModificationException ex) {
continue;
} catch (Exception ex) {
LOG.error("Cache flusher failed for entry " + fqe, ex);
if (!server.checkFileSystem()) {
break;
}
}
}
The MemStoreFlusher thread is shutting down; this normally happens when the region server is stopping.
synchronized (regionsInQueue) {
regionsInQueue.clear();
flushQueue.clear();
}
// Signal anyone waiting, so they see the close flag
wakeUpIfBlocking();
LOG.info(getName() + " exiting");
}
}
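As a side note on the watermark checks above, the arithmetic behind isAboveLowWaterMark() can be sketched as follows. Only the 0.35 lower-limit default is stated in this analysis; the 0.4 upper-limit default (hbase.regionserver.global.memstore.upperLimit) is an assumption based on the 0.96-era configuration, and the variable names are illustrative.
public class MemstoreWatermarks {
  public static void main(String[] args) {
    long maxHeap = Runtime.getRuntime().maxMemory();
    float upperFraction = 0.40f;  // assumed hbase.regionserver.global.memstore.upperLimit default
    float lowerFraction = 0.35f;  // hbase.regionserver.global.memstore.lowerLimit (default per the text above)
    long globalMemStoreLimit = (long) (maxHeap * upperFraction);
    long globalMemStoreLimitLowMark = (long) (maxHeap * lowerFraction);

    long currentGlobalMemstoreSize = 900L * 1024 * 1024; // e.g. 900 MB across all regions

    // Above the low watermark: FlushHandler keeps picking regions to flush.
    boolean aboveLowWaterMark = currentGlobalMemstoreSize >= globalMemStoreLimitLowMark;
    // Above the high watermark: update threads block until flushes bring usage back down.
    boolean aboveHighWaterMark = currentGlobalMemstoreSize >= globalMemStoreLimit;

    System.out.println("low mark=" + globalMemStoreLimitLowMark
        + " high mark=" + globalMemStoreLimit
        + " aboveLow=" + aboveLowWaterMark + " aboveHigh=" + aboveHighWaterMark);
  }
}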
This method picks the region with the largest memstore among all online regions and flushes it.
private boolean flushOneForGlobalPressure() {
SortedMap<Long, HRegion> regionsBySize =
server.getCopyOfOnlineRegionsSortedBySize();
Set<HRegion> excludedRegions = new HashSet<HRegion>();
boolean flushedOne = false;
while (!flushedOne) {
//Find the biggest region that doesn't have too many storefiles
//(might be null!)
Pick the region with the largest memstore, requiring that:
a. the region's writestate.flushing == false and writestate.writesEnabled == true (not read-only);
b. every store in the region has fewer storefiles than hbase.hstore.blockingStoreFiles (default 7).
Regions are examined in descending order of memstore size and the largest one satisfying both conditions is returned;
if none qualifies, null is returned.
HRegion bestFlushableRegion = getBiggestMemstoreRegion(
regionsBySize, excludedRegions, true);
// Find the biggest region, total, even if it might have too many flushes.
Pick the region with the largest memstore, requiring only that:
a. the region's writestate.flushing == false and writestate.writesEnabled == true (not read-only).
Regions are again examined in descending order of memstore size and the largest qualifying one is returned;
if none qualifies, null is returned. The storefile-count limit is not checked here.
HRegion bestAnyRegion = getBiggestMemstoreRegion(
regionsBySize, excludedRegions, false);
If even this second lookup finds no region, there is nothing to flush; return without flushing.
if (bestAnyRegion == null) {
LOG.error("Above memory mark but there are no flushable regions!");
return false;
}
Decide which region most needs the flush:
if the memstore of the largest region overall is more than twice as large as that of the largest region
whose storefile counts are within the configured limit, flush the larger one first.
HRegion regionToFlush;
if (bestFlushableRegion != null &&
bestAnyRegion.memstoreSize.get() > 2 * bestFlushableRegion.memstoreSize.get()) {
.................... part of the code is omitted here
if (LOG.isDebugEnabled()) {
.................... part of the code is omitted here
}
regionToFlush = bestAnyRegion;
} else {
If no region keeps its storefile count under the configured limit
(every region has some store exceeding the configured maximum number of storefiles),
flush the region with the largest memstore anyway.
if (bestFlushableRegion == null) {
regionToFlush = bestAnyRegion;
} else {
Otherwise, if some region's stores are still below the configured storefile limit, flush that region first.
The intent is to avoid a few write-hot regions being flushed (and compacted) over and over while write-cold regions never get flushed.
regionToFlush = bestFlushableRegion;
}
}
Preconditions.checkState(regionToFlush.memstoreSize.get() > 0);
LOG.info("Flush of region " + regionToFlush + " due to global heap pressure");
Perform the flush with the global/emergency flag set to true (see the MemStoreFlusher.flushRegion flow below).
If the flush fails, add the region to the excludedRegions set so that this pass skips it
and tries the region with the next-largest memstore.
flushedOne = flushRegion(regionToFlush, true);
if (!flushedOne) {
LOG.info("Excluding unflushable region " + regionToFlush +
" - trying to find a different region to flush.");
excludedRegions.add(regionToFlush);
}
}
return true;
}
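The selection rule implemented above (prefer the largest region whose stores are under the storefile limit, unless some region holds more than twice as much memstore data) can be condensed into a small standalone sketch. The Candidate class and pickRegionToFlush method are illustrative, not HBase code; only the two lookups and the 2x factor mirror the logic above.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class Candidate {
  final String name;
  final long memstoreSize;
  final boolean tooManyStoreFiles;   // some store exceeds hbase.hstore.blockingStoreFiles
  Candidate(String name, long memstoreSize, boolean tooManyStoreFiles) {
    this.name = name;
    this.memstoreSize = memstoreSize;
    this.tooManyStoreFiles = tooManyStoreFiles;
  }
}

class GlobalPressureChooser {
  static Optional<Candidate> pickRegionToFlush(List<Candidate> regions) {
    Comparator<Candidate> bySize = Comparator.comparingLong(c -> c.memstoreSize);
    Optional<Candidate> bestAny = regions.stream().max(bySize);
    Optional<Candidate> bestFlushable =
        regions.stream().filter(c -> !c.tooManyStoreFiles).max(bySize);
    if (!bestAny.isPresent()) {
      return Optional.empty();                       // nothing to flush at all
    }
    if (!bestFlushable.isPresent()
        || bestAny.get().memstoreSize > 2 * bestFlushable.get().memstoreSize) {
      return bestAny;                                // memory pressure wins
    }
    return bestFlushable;                            // avoid piling files onto hot regions
  }
}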
The second parameter of this method: true means a global (emergency) flush, false means the region's memstore reached the configured size.
Returns true if the flush succeeded, false otherwise.
private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
synchronized (this.regionsInQueue) {
Remove this region from regionsInQueue and get its pending flush request, if any.
FlushRegionEntry fqe = this.regionsInQueue.remove(region);
If this is a global (emergency) flush, also remove the pending request from the flushQueue.
if (fqe != null && emergencyFlush) {
// Need to remove from region from delay queue. When NOT an
// emergencyFlush, then item was removed via a flushQueue.poll.
flushQueue.remove(fqe);
}
}
lock.readLock().lock();
try{
Call HRegion.flushcache; a return value of true means a compaction should follow, false means no compaction request is needed.
boolean shouldCompact = region.flushcache();
//We just want to check the size
Check whether a split is needed. No split is performed when:
a. the region belongs to the meta table;
b. distributedLogReplay is enabled for the region and the region has not finished replay since it was opened (isRecovering == true);
c. splitRequest is false (it is true only when a client has called regionServer.splitRegion);
d. c is false and no store in the region exceeds the split threshold, i.e. the smaller of
hbase.hregion.max.filesize (default 10 * 1024 * 1024 * 1024L, 10g) and
hbase.hregion.memstore.flush.size (default 1024*1024*128L, 128m) multiplied by
(number of this table's regions on the current rs * number of this table's regions on the current rs),
as shown in the sketch after this method;
e. c is false, or some store contains a storefile of type reference, i.e. a storefile that references another storefile;
f. the checks in c/d/e pass and a client did request a split with an explicit split row, but that row does not exist in this region;
g. only when all of the above checks come out the opposite way is a split performed.
boolean shouldSplit = region.checkSplit() != null;
if (shouldSplit) {
If a split is needed, issue a split request.
this.server.compactSplitThread.requestSplit(region);
} else if (shouldCompact) {
If a compaction is needed, issue a system compaction request.
server.compactSplitThread.requestSystemCompaction(
region, Thread.currentThread().getName());
}
} catch (DroppedSnapshotException ex) {
.................... part of the code is omitted here
server.abort("Replay of HLog required. Forcing server shutdown", ex);
return false;
} catch (IOException ex) {
.................... part of the code is omitted here
if (!server.checkFileSystem()) {
return false;
}
} finally {
lock.readLock().unlock();
Wake up all update threads waiting on this region so writes can proceed (a global flush makes updates wait).
wakeUpIfBlocking();
}
return true;
}
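The size threshold described in point d of the split check above can be illustrated with a short, hedged sketch (an IncreasingToUpperBoundRegionSplitPolicy-style bound): a store becomes split-eligible once it exceeds the smaller of hbase.hregion.max.filesize and the memstore flush size multiplied by the squared count of the table's regions on this rs. The class and method names here are illustrative.
public class SplitSizeBound {
  static long splitSizeBound(long maxFileSize, long memstoreFlushSize, int tableRegionsOnThisRs) {
    long increasingBound = memstoreFlushSize
        * (long) tableRegionsOnThisRs * (long) tableRegionsOnThisRs;
    return Math.min(maxFileSize, increasingBound);
  }

  public static void main(String[] args) {
    long maxFileSize = 10L * 1024 * 1024 * 1024;      // hbase.hregion.max.filesize, 10g default
    long flushSize = 128L * 1024 * 1024;              // hbase.hregion.memstore.flush.size, 128m default
    for (int regions = 1; regions <= 5; regions++) {
      System.out.println(regions + " region(s) of the table on this RS -> split at ~"
          + splitSizeBound(maxFileSize, flushSize, regions) / (1024 * 1024) + " MB per store");
    }
  }
}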
Analysis of the HRegion.flushcache execution flow
This method performs the flush, calling the coprocessor preFlush hook before the flush and the postFlush hook afterwards.
Before flushing, writestate.flushing is set to true to mark the region as flushing; it is set back to false when the flush completes.
public boolean flushcache() throws IOException {
// fail-fast instead of waiting on the lock
Check whether the region is closing; if so, skip the flush and return false (no compaction needed).
if (this.closing.get()) {
LOG.debug("Skipping flush on " + this + " because closing");
return false;
}
MonitoredTask status = TaskMonitor.get().createStatus("Flushing " + this);
status.setStatus("Acquiring readlock on region");
// block waiting for the lock for flushing cache
lock.readLock().lock();
try{
If the region has already been closed, skip the flush and return false (no compaction needed).
if (this.closed.get()) {
LOG.debug("Skipping flush on " + this + " because closed");
status.abort("Skipped: closed");
return false;
}
Run the coprocessor pre-flush hooks.
if (coprocessorHost != null) {
status.setStatus("Running coprocessor pre-flush hooks");
coprocessorHost.preFlush();
}
if (numMutationsWithoutWAL.get() > 0) {
numMutationsWithoutWAL.set(0);
dataInMemoryWithoutWAL.set(0);
}
synchronized (writestate) {
Mark the region as flushing.
if (!writestate.flushing && writestate.writesEnabled) {
this.writestate.flushing = true;
} else {
.................... part of the code is omitted here
If the region is already flushing, or it is read-only, skip the flush and return false (no compaction needed).
return false;
}
}
try {
Perform the flush: every store's memstore in the region is flushed.
Returns a boolean indicating whether a compaction is needed afterwards.
boolean result = internalFlushcache(status);
Run the coprocessor post-flush hooks.
if (coprocessorHost != null) {
status.setStatus("Running post-flush coprocessor hooks");
coprocessorHost.postFlush();
}
status.markComplete("Flush successful");
return result;
} finally {
synchronized (writestate) {
Clear the flushing flag to mark the end of the flush.
writestate.flushing = false;
Clear the region's pending flush-request flag.
this.writestate.flushRequested = false;
Wake up all update threads waiting on the write state.
writestate.notifyAll();
}
}
} finally {
lock.readLock().unlock();
status.cleanup();
}
}
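The writestate handling above follows a simple guard pattern: mark flushing under a lock, do the work, then clear the flag and notifyAll() so blocked writers can re-check. A minimal sketch of that pattern follows; the WriteStateGuard class is illustrative, not the real HRegion.WriteState implementation.
public class WriteStateGuard {
  private boolean flushing = false;
  private boolean writesEnabled = true;

  boolean tryFlush(Runnable flushWork) {
    synchronized (this) {
      if (flushing || !writesEnabled) {
        return false;            // someone else is flushing, or the region is read-only
      }
      flushing = true;
    }
    try {
      flushWork.run();           // the actual memstore flush (internalFlushcache)
      return true;
    } finally {
      synchronized (this) {
        flushing = false;
        notifyAll();             // wake writers blocked waiting for the flush to finish
      }
    }
  }
}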
flushcache calls this method, which in turn delegates to an overloaded version.
protected boolean internalFlushcache(MonitoredTask status)
throws IOException {
return internalFlushcache(this.log, -1, status);
}
Performs the flush (invoked from flushcache); returns whether a compaction is needed.
protected boolean internalFlushcache(
final HLog wal, final long myseqid, MonitoredTask status)
throws IOException {
if (this.rsServices != null && this.rsServices.isAborted()) {
// Don't flush when server aborting, it's unsafe
throw new IOException("Aborting flush because server is abortted...");
}
Record the flush start time (used to compute the flush duration).
final long startTime = EnvironmentEdgeManager.currentTimeMillis();
//Clear flush flag.
//If nothing to flush, return and avoid logging start/stop flush.
If the memstore holds no data, skip the flush and return false.
if (this.memstoreSize.get() <= 0) {
return false;
}
if (LOG.isDebugEnabled()) {
LOG.debug("Started memstore flush for " + this +
", current region memstore size " +
StringUtils.humanReadableInt(this.memstoreSize.get()) +
((wal != null) ? "" : "; wal is null, using passed sequenceid=" + myseqid));
}
// Stop updates while we snapshot the memstore of all stores. We only have
// to do this for a moment. Its quick. The subsequent sequence id that
// goes into the HLog after we've flushed all these snapshots also goes
// into the info file that sits beside the flushed files.
// We also set the memstore size to zero here before we allow updates
// again so its value will represent the size of the updates received
// during the flush
MultiVersionConsistencyControl.WriteEntry w = null;
// We have to take a write lock during snapshot, or else a write could
// end up in both snapshot and memstore (makes it difficult to do atomic
// rows then)
status.setStatus("Obtaining lock to block concurrent updates");
// block waiting for the lock for internal flush
this.updatesLock.writeLock().lock();
long flushsize = this.memstoreSize.get();
status.setStatus("Preparing to flush by snapshotting stores");
List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
long flushSeqId = -1L;
try{
// Record the mvcc for all transactions in progress.
Create a MultiVersionConsistencyControl.WriteEntry; its write number is mvcc's ++memstoreWrite,
and the WriteEntry is appended to the mvcc writeQueue.
w = mvcc.beginMemstoreInsert();
Drain the completed WriteEntry instances from the writeQueue, assign the largest (last) write number
to memstoreRead, and wake up readWaiters (mvcc.waitForRead(w) waits on this).
mvcc.advanceMemstore(w);
if (wal != null) {
Remove this region's entry from the WAL's oldestUnflushedSeqNums map (the largest seqid of edits appended but not yet flushed)
and move it into the oldestFlushingSeqNums map.
Obtain the seqid for this flush: the wal (FSHLog) logSeqNum incremented by one.
logSeqNum is initialized from the region seqid obtained when the region was opened (the largest seqid among all regions on the current rs),
and every hlog append increments logSeqNum and uses the new value as that entry's seqid.
Long startSeqId = wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
if (startSeqId == null) {
status.setStatus("Flush will not be started for [" + this.getRegionInfo().getEncodedName()
+ "] - WAL is going away");
return false;
}
flushSeqId = startSeqId.longValue();
} else {
flushSeqId = myseqid;
}
for (Store s : stores.values()) {
For every store in the region, create an HStore.StoreFlusherImpl instance.
storeFlushCtxs.add(s.createFlushContext(flushSeqId));
}
// prepare flush (take a snapshot)
for (StoreFlushContext flush : storeFlushCtxs) {
For every store, copy the memstore's kvset into the memstore snapshot and clear the kvset;
the memstore snapshot is then referenced by the HStore snapshot.
flush.prepare();
}
} finally {
this.updatesLock.writeLock().unlock();
}
String s = "Finished memstore snapshotting " + this +
", syncing WAL and waiting on mvcc, flushsize=" + flushsize;
status.setStatus(s);
if (LOG.isTraceEnabled()) LOG.trace(s);
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
Sync the outstanding WAL entries to HDFS.
wal.sync();
}
//wait for all in-progress transactions to commit to HLog before
//we can start the flush. This prevents
//uncommitted transactions from being written into HFiles.
//We have to block before we start the flush, otherwise keys that
//were removed via a rollbackMemstore could be written to Hfiles.
Wait for the mvcc writeQueue to be processed so that memstoreRead has advanced to the largest completed write number;
the waiting thread is woken once mvcc.advanceMemstore(w) has finished processing.
mvcc.waitForRead(w);
s= "Flushing stores of "+ this;
status.setStatus(s);
if(LOG.isTraceEnabled())LOG.trace(s);
// Any failure from here on out will be catastrophic requiring server
// restart so hlog content can be replayed and put back into the memstore.
// Otherwise, the snapshot content while backed up in the hlog, it will not
// be part of the current running servers state.
boolean compactionRequested = false;
try{
// A. Flush memstore to all the HStores.
// Keep running vector of all store files that includes both old and the
// just-made new flush store file. The new flushed file is still in the
// tmp directory.
for (StoreFlushContext flush : storeFlushCtxs) {
For every store in the region, call HStore.flushCache to flush the store's snapshot into an hfile,
using the flush seqid obtained from the wal.
The number of flush retries is configured by hbase.hstore.flush.retries.number (default 10),
and the retry interval by hbase.server.pause (default 1000 ms).
The flusher used for each store is DefaultStoreFlusher by default (hbase.hstore.defaultengine.storeflusher.class),
and each HStore's StoreEngine is configured via hbase.hstore.engine.class (default DefaultStoreEngine).
A StoreFile.Writer is created with a UUID file name under the region's .tmp directory;
the storeFlusher's flushSnapshot method writes the snapshot and returns the hfile path under .tmp,
the file is validated (creating a StoreFile reader without error means it is valid),
the memstore kvs are written into the file,
and the flush-time maximum seqid is written into the hfile's metadata (fileinfo).
The resulting temporary hfile is recorded in the HStore.StoreFlusherImpl tempFiles list,
waiting for HStore.StoreFlusherImpl.commit to be called.
flush.flushCache(status);
}
// Switch snapshot (in memstore) -> new hfile (thus causing
// all the store scanners to reset/reseek).
for (StoreFlushContext flush : storeFlushCtxs) {
HStore.StoreFlusherImpl.commit moves the freshly flushed hfile from the .tmp directory into the column family directory,
creates a StoreFile and Reader for it, adds the StoreFile to the HStore storefiles list,
and clears HStore.memstore.snapshot.
The compactionPolicy configured via hbase.hstore.defaultengine.compactionpolicy.class
(default ExploringCompactionPolicy) then checks whether a compaction is needed:
hbase.hstore.compaction.min (default 3) is the minimum number of files for a compaction
(older versions used hbase.hstore.compactionThreshold; the value cannot be lower than 2).
If the number of storefiles in the store, minus those currently being compacted, is greater than or equal to this value,
a compaction is needed.
boolean needsCompaction = flush.commit(status);
if (needsCompaction) {
compactionRequested = true;
}
}
storeFlushCtxs.clear();
// Set down the memstore size by amount of flush.
this.addAndGetGlobalMemstoreSize(-flushsize);
} catch (Throwable t) {
// An exception here means that the snapshot was not persisted.
// The hlog needs to be replayed so its content is restored to memstore.
// Currently, only a server restart will do this.
// We used to only catch IOEs but its possible that we'd get other
// exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
// all and sundry.
if (wal != null) {
wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
Bytes.toStringBinary(getRegionName()));
dse.initCause(t);
status.abort("Flush failed: " + StringUtils.stringifyException(t));
throw dse;
}
// If we get to here, the HStores have been written.
if (wal != null) {
Remove this region's seqid (recorded when this flush started) from FSHLog.oldestFlushingSeqNums.
wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
}
// Record latest flush time
Update the region's last flush time.
this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
// Update the last flushed sequence id for region
if (this.rsServices != null) {
Record the seqid used by this flush as the region server's completeSequenceId.
completeSequenceId = flushSeqId;
}
// C. Finally notify anyone waiting on memstore to clear:
// e.g. checkResources().
synchronized (this) {
notifyAll(); // FindBugs NN_NAKED_NOTIFY
}
long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
long memstoresize = this.memstoreSize.get();
String msg = "Finished memstore flush of ~" +
StringUtils.humanReadableInt(flushsize) + "/" + flushsize +
", currentsize=" +
StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
" for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
", compaction requested=" + compactionRequested +
((wal == null) ? "; wal=null" : "");
LOG.info(msg);
status.setStatus(msg);
this.recentFlushes.add(new Pair<Long, Long>(time / 1000, flushsize));
Return whether a compaction is needed.
return compactionRequested;
}
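The per-store flow above (snapshot under the update lock, write the snapshot to a temporary file, then commit the file into the store and decide on compaction) can be summarized in a small hedged sketch. The class and method names are illustrative stand-ins for HStore/StoreFlushContext, and the compactionMin constant reflects the hbase.hstore.compaction.min default of 3 mentioned above.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SketchStoreFlush {
  private final TreeMap<String, String> active = new TreeMap<>(); // stand-in for the kvset
  private TreeMap<String, String> snapshot = new TreeMap<>();
  private final List<String> storeFiles = new ArrayList<>();
  private final int compactionMin = 3;              // hbase.hstore.compaction.min default

  void put(String k, String v) { active.put(k, v); }

  // Phase 1: called while the region's updatesLock write lock is held.
  void prepare() {
    snapshot = new TreeMap<>(active);
    active.clear();
  }

  // Phase 2: write the snapshot to a temporary file (represented here only by a name).
  String flushCache(long flushSeqId) {
    return ".tmp/flush-" + flushSeqId + "-" + snapshot.size() + "kvs";
  }

  // Phase 3: move the file into the store and decide whether a compaction is needed.
  boolean commit(String tmpFile) {
    storeFiles.add(tmpFile.replace(".tmp/", "cf/"));
    snapshot.clear();
    return storeFiles.size() >= compactionMin;
  }

  public static void main(String[] args) {
    SketchStoreFlush store = new SketchStoreFlush();
    store.put("row1", "v1");
    store.prepare();
    String tmp = store.flushCache(42L);
    System.out.println("needs compaction: " + store.commit(tmp));
  }
}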
This kind of flush request is issued when a region's memstore size reaches the configured limit;
it is handled via MemStoreFlusher.FlushHandler.run --> flushRegion(final FlushRegionEntry fqe).
private boolean flushRegion(final FlushRegionEntry fqe) {
HRegion region = fqe.region;
If the region is not a meta region and some store in it has reached the configured storefile limit
(hbase.hstore.blockingStoreFiles, default 7):
if (!region.getRegionInfo().isMetaRegion() &&
isTooManyStoreFiles(region)) {
Check whether this flush request has already waited longer than the allowed maximum,
configured by hbase.hstore.blockingWaitTime (default 90000 ms); if so, log it and proceed with the flush anyway.
if (fqe.isMaximumWait(this.blockingWaitTime)) {
LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
"ms on a compaction to clean up 'too many store files'; waited " +
"long enough... proceeding with flush of " +
region.getRegionNameAsString());
} else {
The request has not yet reached the maximum acceptable wait time and has not been requeued before.
The flushQueue is ordered by each FlushRegionEntry's expiration time, which by default means first-in first-out
unless FlushRegionEntry.requeue has been called to set an explicit delay.
// If this is first time we've been put off, then emit a log message.
if (fqe.getRequeueCount() <= 0) {
// Note: We don't impose blockingStoreFiles constraint on meta regions
LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +
"store files; delaying flush up to " + this.blockingWaitTime + "ms");
Check whether a split should be requested; if so, request it, otherwise fall back to a compaction request.
if (!this.server.compactSplitThread.requestSplit(region)) {
try {
Request a system compaction, since the store currently has too many files.
(Whether compactions run at all can be controlled per table with the COMPACTION_ENABLED attribute, TRUE/FALSE, at table creation.)
this.server.compactSplitThread.requestSystemCompaction(
region, Thread.currentThread().getName());
} catch (IOException e) {
LOG.error(
"Cache flush failed for region " + Bytes.toStringBinary(region.getRegionName()),
RemoteExceptionHandler.checkIOException(e));
}
}
}
// Put back on the queue. Have it come back out of the queue
// after a delay of this.blockingWaitTime / 100 ms.
Requeue this flush request in the flushQueue so that it becomes eligible again after the delay (900 ms by default).
this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
// Tell a lie, it's not flushed but it's ok
return true;
}
}
Finally perform the flush with the global/emergency flag set to false, meaning the region's memstore reached the configured size limit.
The flow is not repeated here; see the MemStoreFlusher.flushRegion(HRegion, boolean) analysis above.
return flushRegion(region, false);
}
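To close, here is a hedged sketch of the "too many store files" decision in this method: if the request has waited less than hbase.hstore.blockingWaitTime, it is requeued with a delay of blockingWaitTime / 100 (900 ms with the 90000 ms default) instead of being flushed immediately. The enum and method below are illustrative, not HBase APIs.
public class BlockedFlushDecision {
  enum Action { FLUSH_NOW, REQUEUE }

  static Action decide(boolean tooManyStoreFiles, long waitedMs, long blockingWaitTimeMs) {
    if (!tooManyStoreFiles) {
      return Action.FLUSH_NOW;                      // normal case: flush right away
    }
    if (waitedMs >= blockingWaitTimeMs) {
      return Action.FLUSH_NOW;                      // waited long enough, flush anyway
    }
    return Action.REQUEUE;                          // give compaction/split time to catch up
  }

  public static void main(String[] args) {
    long blockingWaitTime = 90_000L;                // hbase.hstore.blockingWaitTime default
    System.out.println(decide(true, 10_000L, blockingWaitTime));   // REQUEUE
    System.out.println(decide(true, 95_000L, blockingWaitTime));   // FLUSH_NOW
    System.out.println("requeue delay = " + (blockingWaitTime / 100) + " ms");
  }
}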