Analysis of the memstore flush process


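A memstore flush is triggered from the following places:

a. When HRegionServer handles a multi update, it checks whether total memstore usage exceeds the configured global lower or upper limit. If it does, a WakeupFlushThread flush request is issued; if usage exceeds the global upper limit, the update also has to wait for a flush to complete.

b. When HRegionServer updates data and calls HRegion.batchMutate to update the stores, a FlushRegionEntry flush request is issued if the region's memstore size exceeds the configured per-region memstore size.

c. The client explicitly calls HRegionServer.flushRegion.

d. The periodic flush thread HRegionServer.PeriodicMemstoreFlusher, whose interval is set by hbase.regionserver.optionalcacheflushinterval (default 3600000 ms).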
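The two size-based triggers (a) and (b) can be summarised as a small decision function. This is only an illustrative sketch: the class, enum and parameter names are hypothetical and are not HBase code, the thresholds are passed in as plain byte counts, and the two checks (which HBase performs in different places) are folded into one function for brevity.

```java
// Hypothetical sketch of the size-based flush triggers; not HBase code.
final class FlushTriggerSketch {
    enum Trigger { NONE, REGION_FLUSH, GLOBAL_WAKEUP, GLOBAL_WAKEUP_AND_BLOCK }

    /**
     * Classifies the current memory situation.
     * globalLowMark / globalHighMark stand for the global lower / upper memstore limits,
     * regionFlushSize for the per-region memstore flush size (all in bytes).
     */
    static Trigger classify(long globalMemstoreBytes, long globalLowMark, long globalHighMark,
                            long regionMemstoreBytes, long regionFlushSize) {
        if (globalMemstoreBytes >= globalHighMark) {
            // Above the upper limit: request a global flush and make the update wait.
            return Trigger.GLOBAL_WAKEUP_AND_BLOCK;
        }
        if (globalMemstoreBytes >= globalLowMark) {
            // Above the lower limit only: request a global flush, do not block.
            return Trigger.GLOBAL_WAKEUP;
        }
        if (regionMemstoreBytes > regionFlushSize) {
            // A single region grew past its flush size.
            return Trigger.REGION_FLUSH;
        }
        return Trigger.NONE;
    }
}
```

For example, `FlushTriggerSketch.classify(50, 35, 40, 1, 128)` falls into the blocking branch because the global total is above the upper mark.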


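How a flush is executed

The flush itself is carried out by MemStoreFlusher. When a flush request is issued, it is added to the flushQueue queue and is also recorded in regionsInQueue.

When a MemStoreFlusher instance is created, it starts MemStoreFlusher.FlushHandler threads. The number of these threads is set by hbase.hstore.flusher.count and defaults to 1.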


private class FlushHandler extends HasThread {
  @Override
  public void run() {
    while (!server.isStopped()) {
      FlushQueueEntry fqe = null;
      try {
        wakeupPending.set(false); // allow someone to wake us up again
        // Take one flush request from the queue. The queue is a blocking queue: if
        // flushQueue is empty, wait for the number of ms configured by
        // hbase.server.thread.wakefrequency (default 10 * 1000).
        fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
        if (fqe == null || fqe instanceof WakeupFlushThread) {
          // There is no flush request, or the request is a global flush request.
          // Check whether the total of all memstores exceeds the value configured by
          // hbase.regionserver.global.memstore.lowerLimit (default 0.35).

          if (isAboveLowWaterMark()) {
            LOG.debug("Flush thread woke up because memory above low water="
              + StringUtils.humanReadableInt(globalMemStoreLimitLowMark));
            // Usage is above the configured lower limit: flush the region with the
            // largest memstore. For this method, see the analysis of
            // MemStoreFlusher.flushOneForGlobalPressure.
            if (!flushOneForGlobalPressure()) {

              // ... (part of the code is not shown here)

              Thread.sleep(1000);
              // No region needs flushing, so wake up the waiting update threads.
              // When the update methods in HRegionServer find that the memstore total
              // exceeds the configured upper limit, they make the update thread wait
              // for a flush.
              wakeUpIfBlocking();
            }
            // Enqueue another one of these tokens so we'll wake up again
            // Issue another wake-up (global) flush request by creating a
            // WakeupFlushThread request.
            wakeupFlushThread();
          }
          continue;
        }

        // A normal flush request: a single region's memstore size exceeded
        // hbase.hregion.memstore.flush.size (default 1024 * 1024 * 128L).
        // For this method, see the analysis of MemStoreFlusher.flushRegion.
        FlushRegionEntry fre = (FlushRegionEntry) fqe;
        if (!flushRegion(fre)) {
          break;
        }
      } catch (InterruptedException ex) {
        continue;
      } catch (ConcurrentModificationException ex) {
        continue;
      } catch (Exception ex) {
        LOG.error("Cache flusher failed for entry " + fqe, ex);
        if (!server.checkFileSystem()) {
          break;
        }
      }
    }

    // The MemStoreFlusher thread is ending, usually because the region server is stopping.
    synchronized (regionsInQueue) {
      regionsInQueue.clear();
      flushQueue.clear();
    }

    // Signal anyone waiting, so they see the close flag
    wakeUpIfBlocking();
    LOG.info(getName() + " exiting");
  }
}
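The dispatch at the top of the loop (poll with a timeout, then branch on "nothing or wake-up token" versus "region request") can be sketched in isolation. This is a minimal sketch under stated assumptions, not the HBase implementation: a LinkedBlockingQueue stands in for the real flushQueue, a plain Object stands in for WakeupFlushThread, and the class and method names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of one iteration of a flush handler loop; not HBase code.
final class FlushLoopSketch {
    /** Marker token standing in for WakeupFlushThread. */
    static final Object WAKEUP = new Object();

    /** Runs one loop iteration and returns a label for the branch that was taken. */
    static String step(BlockingQueue<Object> queue, long waitMs, boolean aboveLowWater) {
        Object entry;
        try {
            // Returns null if nothing arrives within waitMs.
            entry = queue.poll(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        if (entry == null || entry == WAKEUP) {
            // Timeout or global wake-up: only act if memory is above the lower limit.
            return aboveLowWater ? "flush-biggest-region" : "idle";
        }
        // Anything else is treated as a per-region flush request.
        return "flush-region:" + entry;
    }

    static BlockingQueue<Object> newQueue() {
        return new LinkedBlockingQueue<>();
    }
}
```

A timeout and a wake-up token deliberately share one branch: in both cases the handler has no specific region to flush and falls back to the global pressure check.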


Analysis of MemStoreFlusher.flushOneForGlobalPressure

This method picks the region with the largest memstore among all regions and flushes it.


private boolean flushOneForGlobalPressure() {
  SortedMap<Long, HRegion> regionsBySize =
      server.getCopyOfOnlineRegionsSortedBySize();

  Set<HRegion> excludedRegions = new HashSet<HRegion>();

  boolean flushedOne = false;
  while (!flushedOne) {
    // Find the biggest region that doesn't have too many storefiles
    // (might be null!)
    // Take the region with the largest memstore that meets these conditions:
    // a. the region's writestate.flushing == false and writestate.writesEnabled == true
    //    (the region is not read-only);
    // b. every store in the region has fewer storefiles than the value configured by
    //    hbase.hstore.blockingStoreFiles (default 7).
    // Regions are examined in descending order of memstore size and the largest one
    // that meets the conditions is returned. If none qualifies, null is returned.
    HRegion bestFlushableRegion = getBiggestMemstoreRegion(
        regionsBySize, excludedRegions, true);
    // Find the biggest region, total, even if it might have too many flushes.
    // Take the region with the largest memstore that meets this condition:
    // a. the region's writestate.flushing == false and writestate.writesEnabled == true
    //    (the region is not read-only).
    // Regions are again examined in descending order of memstore size. If none
    // qualifies, null is returned. This lookup does not check whether any store in
    // the region has more storefiles than the configured value.
    HRegion bestAnyRegion = getBiggestMemstoreRegion(
        regionsBySize, excludedRegions, false);
    // If the second lookup found no region, there is nothing to flush:
    // return without flushing.
    if (bestAnyRegion == null) {
      LOG.error("Above memory mark but there are no flushable regions!");
      return false;
    }

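    // Decide which region most needs flushing.
    // If the memstore of the largest region overall is more than 2 times the memstore
    // of the largest region that does not have too many storefiles, flush the largest
    // region overall first.
    HRegion regionToFlush;
    if (bestFlushableRegion != null &&
        bestAnyRegion.memstoreSize.get() > 2 * bestFlushableRegion.memstoreSize.get()) {

      // ... (part of the code is not shown here)

      if (LOG.isDebugEnabled()) {
        // ... (part of the code is not shown here)
      }
      regionToFlush = bestAnyRegion;
    } else {
      // If none of the candidate regions is under the storefile limit (every region
      // has a store whose storefile count exceeds the configured maximum), flush the
      // region with the largest memstore.
      if (bestFlushableRegion == null) {
        regionToFlush = bestAnyRegion;
      } else {
        // If some candidate region is still under the configured maximum storefile
        // count, flush that region first. The aim is to avoid a small set of
        // write-hot regions being flushed (and therefore compacted) too often while
        // regions with colder writes never get flushed.
        regionToFlush = bestFlushableRegion;
      }
    }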


    Preconditions.checkState(regionToFlush.memstoreSize.get() > 0);

    LOG.info("Flush of region " + regionToFlush + " due to global heap pressure");

    // Run the flush with the global-flush flag set to true; see the analysis of
    // MemStoreFlusher.flushRegion (global).
    // If the flush fails, the region is added to excludedRegions so that it is skipped
    // for the rest of this call, and the region with the next largest memstore is tried.
    flushedOne = flushRegion(regionToFlush, true);
    if (!flushedOne) {
      LOG.info("Excluding unflushable region " + regionToFlush +
        " - trying to find a different region to flush.");
      excludedRegions.add(regionToFlush);
    }
  }
  return true;
}
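The choice between the two candidates boils down to three cases. The sketch below restates that rule on plain sizes; the class and method names are hypothetical, and a null size simply means "no such region was found".

```java
// Hypothetical sketch of the region selection rule; not HBase code.
final class RegionPickSketch {
    /**
     * bestFlushableSize: memstore bytes of the biggest region without too many storefiles.
     * bestAnySize: memstore bytes of the biggest region overall.
     * Returns "any", "flushable", or null when there is nothing to flush.
     */
    static String pick(Long bestFlushableSize, Long bestAnySize) {
        if (bestAnySize == null) {
            return null; // no flushable region at all
        }
        if (bestFlushableSize == null) {
            return "any"; // every region has too many storefiles
        }
        if (bestAnySize > 2 * bestFlushableSize) {
            return "any"; // the biggest region dominates, flush it regardless
        }
        return "flushable"; // prefer the region that will not add compaction pressure
    }
}
```

With a flushable candidate of 60 MB and an overall biggest region of 100 MB, the flushable one is chosen because 100 is not more than twice 60; at 40 MB versus 100 MB the overall biggest region wins.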


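Analysis of MemStoreFlusher.flushRegion (global)

When the second argument is true, this is a global flush; otherwise the flush was requested because a region's memstore reached the configured size.

A return value of true means the flush succeeded; false means it failed.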

private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
  synchronized (this.regionsInQueue) {
    // Remove this region from regionsInQueue and get the region's flush request.
    FlushRegionEntry fqe = this.regionsInQueue.remove(region);
    // For a global flush request, also remove the request from flushQueue.
    if (fqe != null && emergencyFlush) {
      // Need to remove from region from delay queue. When NOT an
      // emergencyFlush, then item was removed via a flushQueue.poll.
      flushQueue.remove(fqe);
    }
  }

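  lock.readLock().lock();
  try {
    // Run HRegion.flushcache. A return value of true means a compaction is needed;
    // false means no compaction request has to be issued.
    boolean shouldCompact = region.flushcache();
    // We just want to check the size

    // Check whether a split is needed. No split is done in the following cases:
    // a. The region belongs to the meta table.
    // b. distributedLogReplay is configured for the region and the region has not
    //    finished replay since it was opened (isRecovering == true).
    // c. splitRequest is false (true means a client has called the split-region
    //    operation on the region server).
    // d. c is false and no store in the region is larger than the smaller of
    //    hbase.hregion.max.filesize (default 10 * 1024 * 1024 * 1024L, 10 GB) and
    //    hbase.hregion.memstore.flush.size (default 1024 * 1024 * 128L, 128 MB)
    //    * (number of regions of this table on the current RS
    //       * number of regions of this table on the current RS).
    // e. c is false, or some store contains a storefile of type reference, that is,
    //    a storefile that refers to another storefile.
    // f. The checks in c, d and e come out true and the client has requested a split,
    //    but the client specified a split row that does not exist in this region.
    // g. If all of the checks above come out the other way, a split is needed.
    boolean shouldSplit = region.checkSplit() != null;
    if (shouldSplit) {
      // The region needs to be split: issue a split request.
      this.server.compactSplitThread.requestSplit(region);
    } else if (shouldCompact) {
      // A compaction is needed: issue a system compaction request.
      server.compactSplitThread.requestSystemCompaction(
          region, Thread.currentThread().getName());
    }
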


  } catch (DroppedSnapshotException ex) {
    // ... (part of the code is not shown here)
    server.abort("Replay of HLog required. Forcing server shutdown", ex);
    return false;
  } catch (IOException ex) {
    // ... (part of the code is not shown here)
    if (!server.checkFileSystem()) {
      return false;
    }
  } finally {
    lock.readLock().unlock();
    // Wake up all request threads that are updating data in the region so the updates
    // can proceed (a global flush makes updates wait).
    wakeUpIfBlocking();
  }
  return true;
}
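The size threshold used by the split check in case d can be worked out numerically. The sketch below follows the description above (flush size times the square of the table's region count on this server, capped at the maximum file size); the class and method names are hypothetical, and falling back to the maximum file size when the count is zero is an assumption of this sketch.

```java
// Hypothetical sketch of the split size threshold described above; not HBase code.
final class SplitSizeSketch {
    static final long DEFAULT_MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024; // 10 GB
    static final long DEFAULT_FLUSH_SIZE = 128L * 1024 * 1024;          // 128 MB

    /** Store size (bytes) above which a split would be considered. */
    static long sizeToCheck(int regionsOfTableOnServer, long maxFileSize, long flushSize) {
        if (regionsOfTableOnServer <= 0) {
            return maxFileSize; // assumption: no regions counted, use the plain upper bound
        }
        long squared = (long) regionsOfTableOnServer * regionsOfTableOnServer;
        // Compare by division first so the multiplication below cannot overflow.
        if (squared > maxFileSize / flushSize) {
            return maxFileSize;
        }
        return flushSize * squared;
    }
}
```

With the default values the threshold is 128 MB for one region, 512 MB for two, 8 GB for eight, and it stays at the 10 GB cap from nine regions onward (128 MB * 81 is already larger than 10 GB).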



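Analysis of HRegion.flushcache

Runs the flush, calling the coprocessor (cp) preFlush method before it and the cp.postFlush method after it.

Before the flush, writestate.flushing is set to true to mark that the region is flushing; it is set back to false when the flush finishes.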


public boolean flushcache() throws IOException {
  // fail-fast instead of waiting on the lock
  // Check whether the region is being closed. Returning false means no compaction.
  if (this.closing.get()) {
    LOG.debug("Skipping flush on " + this + " because closing");
    return false;
  }
  MonitoredTask status = TaskMonitor.get().createStatus("Flushing " + this);
  status.setStatus("Acquiring readlock on region");
  // block waiting for the lock for flushing cache
  lock.readLock().lock();
  try {
    // If the region has already been closed, do not flush.
    // Returning false means no compaction.
    if (this.closed.get()) {
      LOG.debug("Skipping flush on " + this + " because closed");
      status.abort("Skipped: closed");
      return false;
    }

    // Run the coprocessor pre-flush hook.
    if (coprocessorHost != null) {
      status.setStatus("Running coprocessor pre-flush hooks");
      coprocessorHost.preFlush();
    }
    if (numMutationsWithoutWAL.get() > 0) {
      numMutationsWithoutWAL.set(0);
      dataInMemoryWithoutWAL.set(0);
    }
    synchronized (writestate) {
      // Mark the region as flushing.
      if (!writestate.flushing && writestate.writesEnabled) {
        this.writestate.flushing = true;
      } else {
        // ... (part of the code is not shown here)

        // If the region is already flushing, or the region is read-only, do not flush.
        // Returning false means no compaction.
        return false;
      }
    }

    try {
      // Run the flush over the memstores of all stores in the region.
      // The boolean result says whether a compaction is needed.
      boolean result = internalFlushcache(status);
      // Run the coprocessor post-flush hook.
      if (coprocessorHost != null) {
        status.setStatus("Running post-flush coprocessor hooks");
        coprocessorHost.postFlush();
      }

      status.markComplete("Flush successful");
      return result;
    } finally {
      synchronized (writestate) {
        // Set flushing back to false to mark that the flush has ended.
        writestate.flushing = false;
        // Clear the region's flush-requested flag.
        this.writestate.flushRequested = false;
        // Wake up all waiting update threads.
        writestate.notifyAll();
      }
    }
  } finally {
    lock.readLock().unlock();
    status.cleanup();
  }
}
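The guard around writestate.flushing follows a common pattern: set a flag under a monitor, do the work outside the monitor, then clear the flag and notify in a finally block. The sketch below shows only that pattern. The class and method names are hypothetical, and unlike flushcache (whose boolean means "compaction needed") the sketch returns whether the flush body ran.

```java
// Hypothetical sketch of a "one flush at a time" guard; not HBase code.
final class FlushGuardSketch {
    private boolean flushing = false;
    private boolean writesEnabled = true;

    /** Runs flushBody unless a flush is already running or writes are disabled. */
    boolean tryFlush(Runnable flushBody) {
        synchronized (this) {
            if (flushing || !writesEnabled) {
                return false; // someone else is flushing, or the region is read-only
            }
            flushing = true;
        }
        try {
            flushBody.run(); // the long-running work happens outside the monitor
            return true;
        } finally {
            synchronized (this) {
                flushing = false;
                notifyAll(); // wake up threads waiting for the flush to finish
            }
        }
    }

    synchronized boolean isFlushing() {
        return flushing;
    }

    synchronized void setWritesEnabled(boolean enabled) {
        writesEnabled = enabled;
    }
}
```

Resetting the flag in finally matters: if the flush body throws, later flush attempts would otherwise be rejected forever.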

flushcache calls this method, which in turn calls one of its own overloads.

protected boolean internalFlushcache(MonitoredTask status)
    throws IOException {
  return internalFlushcache(this.log, -1, status);
}


Performs the flush. It is reached through flushcache and returns whether a compaction is needed.

protected boolean internalFlushcache(
    final HLog wal, final long myseqid, MonitoredTask status)
    throws IOException {
  if (this.rsServices != null && this.rsServices.isAborted()) {
    // Don't flush when server aborting, it's unsafe
    throw new IOException("Aborting flush because server is abortted...");
  }
  // Record the current system time as the flush start time; it is used to compute
  // how long the flush took.
  final long startTime = EnvironmentEdgeManager.currentTimeMillis();
  // Clear flush flag.
  // If nothing to flush, return and avoid logging start/stop flush.
  // If the memstore is empty, return false without flushing.
  if (this.memstoreSize.get() <= 0) {
    return false;
  }
  if (LOG.isDebugEnabled()) {
    LOG.debug("Started memstore flush for " + this +
      ", current region memstore size " +
      StringUtils.humanReadableInt(this.memstoreSize.get()) +
      ((wal != null) ? "" : "; wal is null, using passed sequenceid=" + myseqid));
  }


  // Stop updates while we snapshot the memstore of all stores. We only have
  // to do this for a moment. Its quick. The subsequent sequence id that
  // goes into the HLog after we've flushed all these snapshots also goes
  // into the info file that sits beside the flushed files.
  // We also set the memstore size to zero here before we allow updates
  // again so its value will represent the size of the updates received
  // during the flush
  MultiVersionConsistencyControl.WriteEntry w = null;

  // We have to take a write lock during snapshot, or else a write could
  // end up in both snapshot and memstore (makes it difficult to do atomic
  // rows then)
  status.setStatus("Obtaining lock to block concurrent updates");
  // block waiting for the lock for internal flush
  this.updatesLock.writeLock().lock();
  long flushsize = this.memstoreSize.get();
  status.setStatus("Preparing to flush by snapshotting stores");
  List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
  long flushSeqId = -1L;

  try {
    // Record the mvcc for all transactions in progress.

    // Create a MultiVersionConsistencyControl.WriteEntry whose write number is the
    // mvcc's incremented memstoreWrite (++memstoreWrite), and add the WriteEntry to
    // the mvcc writeQueue.
    w = mvcc.beginMemstoreInsert();
    // Take and remove the WriteEntry instances from writeQueue, read their write
    // numbers, and assign the largest (last) write number to memstoreRead.
    // Then wake up readWaiters (mvcc.waitForRead(w) waits for this wake-up).
    mvcc.advanceMemstore(w);

    if (wal != null) {

      // Remove this region's flush seqid (the largest seqid after edits were appended
      // to the log) from the WAL's oldestUnflushedSeqNums, and add that seqid to
      // oldestFlushingSeqNums.
      // The seqid used for this flush is the WAL's (FSHLog's) logSeqNum plus one.
      // logSeqNum starts from the region seqid obtained when openRegion is called,
      // which is the largest seqid among all regions on the current RS.
      // Each append to the hlog also increments logSeqNum by one and uses the new
      // value as the seqid of the hlog entry.
      Long startSeqId = wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
      if (startSeqId == null) {
        status.setStatus("Flush will not be started for [" + this.getRegionInfo().getEncodedName()
          + "] - WAL is going away");
        return false;
      }
      flushSeqId = startSeqId.longValue();
    } else {
      flushSeqId = myseqid;
    }


    for (Store s : stores.values()) {
      // For each store in the region, create an HStore.StoreFlusherImpl instance.
      storeFlushCtxs.add(s.createFlushContext(flushSeqId));
    }

    // prepare flush (take a snapshot)
    for (StoreFlushContext flush : storeFlushCtxs) {
      // For each store in the region, copy the memstore's kvset into the memstore's
      // snapshot and clear kvset, then copy the memstore's snapshot into the
      // HStore's snapshot.
      flush.prepare();
    }
  } finally {
    this.updatesLock.writeLock().unlock();
  }
  String s = "Finished memstore snapshotting " + this +
    ", syncing WAL and waiting on mvcc, flushsize=" + flushsize;
  status.setStatus(s);
  if (LOG.isTraceEnabled()) LOG.trace(s);


  // sync unflushed WAL changes when deferred log sync is enabled
  // see HBASE-8208 for details
  if (wal != null && !shouldSyncLog()) {
    // Write the log entries held in the WAL to HDFS.
    wal.sync();
  }

  // wait for all in-progress transactions to commit to HLog before
  // we can start the flush. This prevents
  // uncommitted transactions from being written into HFiles.
  // We have to block before we start the flush, otherwise keys that
  // were removed via a rollbackMemstore could be written to Hfiles.

  // Wait until the mvcc writeQueue has been processed and memstoreRead has reached
  // the largest value. The thread waits here until mvcc.advanceMemstore(w) has
  // finished and wakes it up.
  mvcc.waitForRead(w);

  s = "Flushing stores of " + this;
  status.setStatus(s);
  if (LOG.isTraceEnabled()) LOG.trace(s);


  // Any failure from here on out will be catastrophic requiring server
  // restart so hlog content can be replayed and put back into the memstore.
  // Otherwise, the snapshot content while backed up in the hlog, it will not
  // be part of the current running servers state.
  boolean compactionRequested = false;
  try {
    // A. Flush memstore to all the HStores.
    // Keep running vector of all store files that includes both old and the
    // just-made new flush store file. The new flushed file is still in the
    // tmp directory.

    for (StoreFlushContext flush : storeFlushCtxs) {

      // For each store in the region, call HStore.flushCache to flush the data in the
      // store's snapshot into an hfile, using the latest seqid obtained from the WAL.
      // - hbase.hstore.flush.retries.number sets the number of retries when a flush
      //   fails (default 10).
      // - hbase.server.pause sets the interval between retries (default 1000 ms).
      // - The flusher for each Store is configured through
      //   hbase.hstore.defaultengine.storeflusher.class (default DefaultStoreFlusher).
      // - The StoreEngine of each HStore is configured through hbase.hstore.engine.class
      //   (default DefaultStoreEngine).
      // Steps:
      // 1. Create a StoreFile.Writer whose path is a file with a UUID name under the
      //    region's .tmp directory.
      // 2. Call the storeFlusher's flushSnapshot method and get the path of the hfile
      //    flushed into the .tmp directory.
      // 3. Check that the file is valid (it is valid if StoreFile.createReader does
      //    not fail).
      // 4. Write the kvs in the memstore into this file.
      // 5. Write the largest seqid at flush time into the hfile's metadata (fileinfo).
      // 6. Put the generated temporary hfile into the tempFiles list of the
      //    HStore.StoreFlusherImpl instance, where it waits for
      //    HStore.StoreFlusherImpl.commit to be called.
      flush.flushCache(status);
    }


    // Switch snapshot (in memstore) -> new hfile (thus causing
    // all the store scanners to reset/reseek).
    for (StoreFlushContext flush : storeFlushCtxs) {

      // HStore.StoreFlusherImpl.commit moves the freshly flushed hfile from the .tmp
      // directory into the target cf directory.
      // It creates a StoreFile and Reader for the hfile and adds the StoreFile to the
      // HStore's storefiles list, then clears HStore.memstore.snapshot.
      // The compactionPolicy configured through
      // hbase.hstore.defaultengine.compactionpolicy.class (default
      // ExploringCompactionPolicy) then checks whether a compaction is needed.
      // hbase.hstore.compaction.min sets the minimum number of files for a compaction
      // (default 3); older versions used hbase.hstore.compactionThreshold. The minimum
      // cannot be less than 2.
      // A compaction is needed if the number of storefiles in the Store, minus the
      // number currently being compacted, is greater than or equal to that value.
      boolean needsCompaction = flush.commit(status);
      if (needsCompaction) {
        compactionRequested = true;
      }
    }
    storeFlushCtxs.clear();

    // Set down the memstore size by amount of flush.
    this.addAndGetGlobalMemstoreSize(-flushsize);

  } catch (Throwable t) {
    // An exception here means that the snapshot was not persisted.
    // The hlog needs to be replayed so its content is restored to memstore.
    // Currently, only a server restart will do this.
    // We used to only catch IOEs but its possible that we'd get other
    // exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
    // all and sundry.
    if (wal != null) {
      wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
    }
    DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
      Bytes.toStringBinary(getRegionName()));
    dse.initCause(t);
    status.abort("Flush failed: " + StringUtils.stringifyException(t));
    throw dse;
  }


  // If we get to here, the HStores have been written.
  if (wal != null) {
    // Remove this region's previous flush seqid from FSHLog.oldestFlushingSeqNums.
    wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
  }

  // Record latest flush time
  // Update the region's last flush time.
  this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
  // Update the last flushed sequence id for region
  if (this.rsServices != null) {
    // Set completeSequenceId to the WAL seqid of the most recent flush.
    completeSequenceId = flushSeqId;
  }

  // C. Finally notify anyone waiting on memstore to clear:
  // e.g. checkResources().
  synchronized (this) {
    notifyAll(); // FindBugs NN_NAKED_NOTIFY
  }


  long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
  long memstoresize = this.memstoreSize.get();
  String msg = "Finished memstore flush of ~" +
    StringUtils.humanReadableInt(flushsize) + "/" + flushsize +
    ", currentsize=" +
    StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
    " for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
    ", compaction requested=" + compactionRequested +
    ((wal == null) ? "; wal=null" : "");
  LOG.info(msg);
  status.setStatus(msg);
  this.recentFlushes.add(new Pair<Long, Long>(time / 1000, flushsize));

  // Return whether a compaction is needed.
  return compactionRequested;
}
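The core idea of internalFlushcache is a two-phase flush: swap the active memstore into a snapshot while writers are briefly blocked, then write the snapshot out while new writes continue into a fresh memstore. The sketch below illustrates only that idea. All names are hypothetical, a list of strings stands in for the kvset, "flushing" just returns the snapshot instead of writing an hfile, and having writers take the read side of the lock is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of a snapshot-then-flush memstore; not HBase code.
final class SnapshotFlushSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private List<String> active = new ArrayList<>();   // stands in for kvset
    private List<String> snapshot = new ArrayList<>(); // stands in for the snapshot

    /** Writers share the read lock, so they only block while a snapshot is taken. */
    void put(String kv) {
        updatesLock.readLock().lock();
        try {
            synchronized (this) {
                active.add(kv);
            }
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    /** Phase 1: under the write lock, move the active data into the snapshot. */
    void prepare() {
        updatesLock.writeLock().lock();
        try {
            synchronized (this) {
                snapshot = active;
                active = new ArrayList<>();
            }
        } finally {
            updatesLock.writeLock().unlock();
        }
    }

    /** Phase 2: persist the snapshot (here: just hand it back) and clear it. */
    List<String> flushAndCommit() {
        synchronized (this) {
            List<String> flushed = snapshot;
            snapshot = new ArrayList<>();
            return flushed;
        }
    }

    synchronized int activeSize() {
        return active.size();
    }
}
```

A write that arrives between prepare() and flushAndCommit() lands in the new active list, so it is not part of the flushed data and is picked up by the next flush.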


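Flush when a region's memstore reaches the configured size

This kind of flush request is issued when the size of a region's memstore reaches the configured upper limit.

It is started through MemStoreFlusher.FlushHandler.run --> flushRegion(final FlushRegionEntry fqe).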


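private boolean flushRegion(final FlushRegionEntry fqe) {
  HRegion region = fqe.region;
  // If the region is not a meta region and some store in the region has reached the
  // storefile count configured by hbase.hstore.blockingStoreFiles (default 7):
  if (!region.getRegionInfo().isMetaRegion() &&
      isTooManyStoreFiles(region)) {
    // Check whether the flush request has waited longer than the allowed time,
    // configured by hbase.hstore.blockingWaitTime (default 90000 ms). If it has,
    // log a message and go ahead with the flush.
    if (fqe.isMaximumWait(this.blockingWaitTime)) {
      LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
        "ms on a compaction to clean up 'too many store files'; waited " +
        "long enough... proceeding with flush of " +
        region.getRegionNameAsString());
    } else {
      // The flush request has not yet waited for the maximum acceptable time.
      // The check below covers the case where it has also never been requeued
      // (put back into the queue).
      // flushQueue is ordered by the expiry time of each FlushRegionEntry. By default
      // this is first in, first out, unless FlushRegionEntry.requeue has been called
      // to set an explicit expiry time.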

      // If this is first time we've been put off, then emit a log message.
      if (fqe.getRequeueCount() <= 0) {
        // Note: We don't impose blockingStoreFiles constraint on meta regions
        LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +
          "store files; delaying flush up to " + this.blockingWaitTime + "ms");
        // Check whether a split request is needed. If so, issue it; if not, issue a
        // compaction request.
        if (!this.server.compactSplitThread.requestSplit(region)) {
          try {
            // Issue a compaction request, because the store has too many files.
            // Whether compaction is performed can be controlled with
            // COMPACTION_ENABLED when the table is created (TRUE/FALSE).
            this.server.compactSplitThread.requestSystemCompaction(
                region, Thread.currentThread().getName());
          } catch (IOException e) {
            LOG.error(
              "Cache flush failed for region " + Bytes.toStringBinary(region.getRegionName()),
              RemoteExceptionHandler.checkIOException(e));
          }
        }
      }


      // Put back on the queue. Have it come back out of the queue
      // after a delay of this.blockingWaitTime / 100 ms.
      // Requeue the current flush request in flushQueue so that it runs again after
      // 900 ms by default.
      this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
      // Tell a lie, it's not flushed but it's ok
      return true;
    }
  }
  // Run the flush with the global-flush argument set to false, meaning that the
  // memstore size reached the configured upper limit.
  // The flow is not analysed again here; see the analysis of
  // MemStoreFlusher.flushRegion (global).
  return flushRegion(region, false);
}
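The requeue step relies on the queue ordering entries by expiry time, so a requeued request does not come out again until its delay has passed. A minimal sketch of that behaviour with java.util.concurrent.DelayQueue follows. The Entry class here is hypothetical and only mirrors the method names used above (requeue, getRequeueCount, isMaximumWait); it is not the HBase FlushRegionEntry.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a delayed, requeueable flush request; not HBase code.
final class RequeueSketch {
    static final class Entry implements Delayed {
        final String region;
        final long createTime = System.currentTimeMillis();
        private long whenToExpire = createTime; // expires immediately by default
        private int requeueCount = 0;

        Entry(String region) {
            this.region = region;
        }

        /** Pushes the expiry into the future; call before adding to the queue again. */
        Entry requeue(long delayMs) {
            whenToExpire = System.currentTimeMillis() + delayMs;
            requeueCount++;
            return this;
        }

        int getRequeueCount() {
            return requeueCount;
        }

        /** True once the entry has existed for longer than maxWaitMs. */
        boolean isMaximumWait(long maxWaitMs) {
            return System.currentTimeMillis() - createTime > maxWaitMs;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(whenToExpire - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    /** Polls with a timeout and returns null on timeout or interruption. */
    static Entry pollWithin(DelayQueue<Entry> queue, long waitMs) {
        try {
            return queue.poll(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

With the default blockingWaitTime of 90000 ms the delay is 90000 / 100 = 900 ms, so a blocked request is re-examined roughly once per 900 ms until either the storefile count drops or the 90000 ms maximum wait is reached.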

