MemStore Flush Process
To reduce the impact of a flush on reads and writes, HBase uses an approach similar to two-phase commit, splitting the flush into three phases:
prepare phase: iterate over every MemStore in the region, take a snapshot of each MemStore's current data set (kvset), and create a fresh, empty kvset. All subsequent writes go into the new kvset, and for the duration of the flush, reads first consult the kvset and the snapshot, falling back to HFiles only if the key is found in neither. The prepare phase acquires an updateLock that blocks write requests and releases it when done; since nothing in this phase is expensive, the lock is held only briefly.
flush phase: iterate over the MemStores and persist the snapshots produced in the prepare phase as temporary files, all placed under the .tmp directory. Because this involves disk I/O, it is comparatively slow.
commit phase: iterate over the MemStores, move the temporary files produced by the flush phase into the corresponding ColumnFamily directories, create a storefile and Reader for each HFile, add the storefile to the HStore's storefiles list, and finally clear the snapshot created in the prepare phase.
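The three phases can be sketched with a toy in-memory store. MiniMemStore below is a hypothetical stand-in, not HBase's real classes: prepare swaps the active kvset into a snapshot under a short write lock, flush/commit persist and then clear the snapshot without blocking writers, and reads consult the kvset, then the snapshot, then the persisted "store files".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the three flush phases; "store files" are just in-memory copies here.
class MiniMemStore {
    private final ReentrantReadWriteLock updateLock = new ReentrantReadWriteLock();
    private volatile ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
    private volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();
    private final List<ConcurrentSkipListMap<String, String>> storeFiles = new ArrayList<>();

    void put(String key, String value) {
        updateLock.readLock().lock(); // writers share the read side, so they run concurrently
        try {
            kvset.put(key, value);
        } finally {
            updateLock.readLock().unlock();
        }
    }

    // prepare phase: swap kvset into snapshot under the write lock;
    // only a pointer swap, so the write-blocking window is short
    void prepare() {
        updateLock.writeLock().lock();
        try {
            snapshot = kvset;
            kvset = new ConcurrentSkipListMap<>();
        } finally {
            updateLock.writeLock().unlock();
        }
    }

    // flush + commit phases: persist the snapshot, then clear it; no lock, writers keep going
    void flushAndCommit() {
        storeFiles.add(new ConcurrentSkipListMap<>(snapshot)); // flush: persist the snapshot
        snapshot = new ConcurrentSkipListMap<>();              // commit: drop the snapshot
    }

    // reads check kvset first, then snapshot, then the persisted files
    String get(String key) {
        String v = kvset.get(key);
        if (v == null) v = snapshot.get(key);
        if (v == null) {
            for (ConcurrentSkipListMap<String, String> sf : storeFiles) {
                v = sf.get(key);
                if (v != null) break;
            }
        }
        return v;
    }

    int storeFileCount() { return storeFiles.size(); }
}
```

Writers take the read side of the ReentrantReadWriteLock, which is why prepare (taking the write side) is the only moment writes are blocked.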
Log Analysis
/******* MemStoreFlusher initialization ********/
2018-07-06 09:39:24,880 INFO [regionserver/host/ip:16020] regionserver.MemStoreFlusher: globalMemStoreLimit=1.5 G, globalMemStoreLimitLowMark=1.5 G, maxHeap=3.8 G
/******* prepare phase ********/
2018-07-06 18:33:31,329 INFO [MemStoreFlusher.1] regionserver.HRegion: Started memstore flush for [table],,1528539945017.80ab9764ae70fa97b75057c376726653., current region memstore size 21.73 MB, and 1/1 column families' memstores are being flushed.
/******* flush phase ********/
2018-07-06 18:33:31,696 INFO [MemStoreFlusher.1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=40056, memsize=21.7 M, hasBloomFilter=true, into tmp file hdfs://ns/hbase/data/default/[table]/80ab9764ae70fa97b75057c376726653/.tmp/f71e7e8c15774da683bdecaf7cf6cb99
/******* commit phase ********/
2018-07-06 18:33:31,718 INFO [MemStoreFlusher.1] regionserver.HStore: Added hdfs://ns/hbase/data/default/[table]/80ab9764ae70fa97b75057c376726653/d/f71e7e8c15774da683bdecaf7cf6cb99, entries=119995, sequenceid=40056, filesize=7.3 M
Source Code Analysis
MemStoreFlusher Initialization
public MemStoreFlusher(final Configuration conf,
final HRegionServer server) {
...
// hbase.server.thread.wakefrequency: how often the flush threads wake up to check MemStores
this.threadWakeFrequency =
conf.getLong(HConstants.THREAD_WAKE_FREQUENCY, 10 * 1000);
// Get the JVM's maximum heap size
long max = -1L;
final MemoryUsage usage = HeapMemorySizeUtil.safeGetHeapMemoryUsage();
if (usage != null) {
max = usage.getMax();
}
float globalMemStorePercent = HeapMemorySizeUtil.getGlobalMemStorePercent(conf, true);
// Upper and lower watermarks for total MemStore usage as a fraction of the heap
this.globalMemStoreLimit = (long) (max * globalMemStorePercent);
this.globalMemStoreLimitLowMarkPercent =
HeapMemorySizeUtil.getGlobalMemStoreLowerMark(conf, globalMemStorePercent);
this.globalMemStoreLimitLowMark =
(long) (this.globalMemStoreLimit * this.globalMemStoreLimitLowMarkPercent);
// Maximum time a flush is blocked; lowering it speeds up flushes, but compaction must keep pace or store files pile up
this.blockingWaitTime = conf.getInt("hbase.hstore.blockingWaitTime",
90000);
// Number of flusher threads; more threads flush faster but put more load on HDFS
int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
this.flushHandlers = new FlushHandler[handlerCount];
LOG.info("globalMemStoreLimit="
+ TraditionalBinaryPrefix.long2String(this.globalMemStoreLimit, "", 1)
+ ", globalMemStoreLimitLowMark="
+ TraditionalBinaryPrefix.long2String(this.globalMemStoreLimitLowMark, "", 1)
+ ", maxHeap=" + TraditionalBinaryPrefix.long2String(max, "", 1));
}
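The watermark arithmetic in the constructor can be isolated into a small sketch. The configuration property names mentioned in the comments are real; the heap size and percentages used in the test are illustrative values, not HBase defaults:

```java
// Hypothetical distillation of how MemStoreFlusher derives its two global watermarks.
class MemStoreLimits {
    final long globalMemStoreLimit;        // max heap * global memstore upper-limit fraction
    final long globalMemStoreLimitLowMark; // limit * lower-mark fraction

    MemStoreLimits(long maxHeap, float upperPercent, float lowMarkPercent) {
        this.globalMemStoreLimit = (long) (maxHeap * upperPercent);
        this.globalMemStoreLimitLowMark = (long) (globalMemStoreLimit * lowMarkPercent);
    }
}
```

Crossing the low watermark triggers opportunistic flushing (flushOneForGlobalPressure below); crossing the upper limit blocks writers until memory is reclaimed.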
Flush Startup
public class HRegionServer {
private void startServiceThreads() {
...
this.cacheFlusher.start(uncaughtExceptionHandler);
}
}
public class MemStoreFlusher {
synchronized void start(UncaughtExceptionHandler eh) {
ThreadFactory flusherThreadFactory = Threads.newDaemonThreadFactory(
server.getServerName().toShortString() + "-MemStoreFlusher", eh);
for (int i = 0; i < flushHandlers.length; i++) {
flushHandlers[i] = new FlushHandler("MemStoreFlusher." + i);
flusherThreadFactory.newThread(flushHandlers[i]);
flushHandlers[i].start();
}
}
}
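The startup pattern above, a naming ThreadFactory plus individually started daemon handler threads, can be approximated without HBase classes. FlusherStartup and its thread names are hypothetical:

```java
import java.util.concurrent.ThreadFactory;

// Hypothetical sketch of MemStoreFlusher.start(): build daemon threads through a
// ThreadFactory, name them MemStoreFlusher.<i>, and start each one.
class FlusherStartup {
    static Thread[] start(int handlerCount) {
        ThreadFactory factory = r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // daemon, so the JVM can exit without waiting on flushers
            return t;
        };
        Thread[] handlers = new Thread[handlerCount];
        for (int i = 0; i < handlerCount; i++) {
            handlers[i] = factory.newThread(() -> { /* the flush poll loop would run here */ });
            handlers[i].setName("MemStoreFlusher." + i);
            handlers[i].start();
        }
        return handlers;
    }
}
```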
FlushHandler: executing flushes on multiple threads
private final BlockingQueue<FlushQueueEntry> flushQueue =
new DelayQueue<FlushQueueEntry>(); // unbounded BlockingQueue
private class FlushHandler extends HasThread {
@Override
public void run() {
while (!server.isStopped()) {
FlushQueueEntry fqe = null;
try {
wakeupPending.set(false); // allow someone to wake us up again
fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS); // take a flush request from the queue, blocking up to threadWakeFrequency ms if it is empty
if (fqe == null || fqe instanceof WakeupFlushThread) {
// Either no flush request arrived, or the request is a wakeup asking for a global flush check.
if (isAboveLowWaterMark()) {
// Check whether total MemStore usage exceeds max_heap * hbase.regionserver.global.memstore.lowerLimit (default 0.35).
// If it does, flush the region with the largest MemStore.
if (!flushOneForGlobalPressure()) {
// We are above the low watermark, yet no region could be flushed.
// This is unlikely unless the server is shutting down, or another thread is already flushing regions.
// Just sleep briefly, then wake up any blocked threads so they can check again.
// (Update paths in HRegionServer make writers wait when total MemStore usage exceeds the configured maximum, until a flush frees memory.)
Thread.sleep(1000);
wakeUpIfBlocking();
}
// Enqueue another WakeupFlushThread request to trigger the next global check
wakeupFlushThread();
}
continue;
}
// A normal flush request:
// a single region's MemStore exceeded hbase.hregion.memstore.flush.size (default 128 MB), so flush it
FlushRegionEntry fre = (FlushRegionEntry) fqe;
if (!flushRegion(fre)) {
break;
}
} catch (Exception ex) {
...
}
}
// Cleanup when the MemStoreFlusher thread exits, normally on regionserver stop
synchronized (regionsInQueue) {
regionsInQueue.clear();
flushQueue.clear();
}
// Notify any other waiting threads
wakeUpIfBlocking();
}
}
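The queue mechanics of the run() loop can be demonstrated with a plain java.util.concurrent.DelayQueue: entries become visible to poll() only once their delay expires, and a sentinel entry plays the role of WakeupFlushThread. FlushEntry here is a hypothetical stand-in, not HBase's FlushQueueEntry:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical DelayQueue element: a flush request that becomes ready after delayMillis,
// or a wakeup sentinel when regionName is null.
class FlushEntry implements Delayed {
    final String regionName;   // null marks a wakeup sentinel
    final long readyAtMillis;

    FlushEntry(String regionName, long delayMillis) {
        this.regionName = regionName;
        this.readyAtMillis = System.currentTimeMillis() + delayMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }

    boolean isWakeup() { return regionName == null; }
}
```

The delay is what makes the requeue-with-back-off in flushRegion work: a requeued entry stays invisible to the handler threads until its delay has elapsed.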
Picking the region with the largest MemStore across all regions and flushing it
private boolean flushOneForGlobalPressure() {
// Take a copy of all online regions, sorted by size
SortedMap<Long, Region> regionsBySize = server.getCopyOfOnlineRegionsSortedBySize();
Set<Region> excludedRegions = new HashSet<Region>();
// Added in 2.0 for region replicas:
// if the largest replica region's MemStore exceeds 4x the largest primary region's MemStore,
// proactively trigger a StoreFile Refresher to update the file list;
// the multiplier comes from hbase.region.replica.storefile.refresh.memstore.multiplier
double secondaryMultiplier
= ServerRegionReplicaUtil.getRegionReplicaStoreFileRefreshMultiplier(conf);
boolean flushedOne = false;
while (!flushedOne) {
// regionsBySize is ordered by MemStore size, largest first. Pick the region with the
// largest MemStore that satisfies the conditions below; if none does, the helper returns null.
// bestFlushableRegion:
// 1. region.writestate.flushing == false && writestate.writesEnabled == true
// 2. every store in the region has fewer store files than hbase.hstore.blockingStoreFiles (default 7)
Region bestFlushableRegion = getBiggestMemstoreRegion(regionsBySize, excludedRegions, true);
// bestAnyRegion:
// 1. region.writestate.flushing == false && writestate.writesEnabled == true
// The store file count is NOT checked here.
Region bestAnyRegion = getBiggestMemstoreRegion(regionsBySize, excludedRegions, false);
// bestRegionReplica:
// 1. region.replicaId != 0
Region bestRegionReplica = getBiggestMemstoreOfRegionReplica(regionsBySize, excludedRegions);
// If neither bestAnyRegion nor bestRegionReplica exists, there is nothing to flush
if (bestAnyRegion == null && bestRegionReplica == null) {
return false;
}
Region regionToFlush;
if (bestFlushableRegion != null &&
bestAnyRegion.getMemstoreSize() > 2 * bestFlushableRegion.getMemstoreSize()) {
// Decide which region most needs a flush:
// if bestAnyRegion (the largest MemStore overall) uses more than twice the memory of
// bestFlushableRegion (the largest MemStore among regions under the store file limit),
// flush bestAnyRegion anyway. This avoids making lots of little flushes under pressure,
// which would trigger lots of compactions.
// Original code comment:
// Even if it's not supposed to be flushed, pick a region if it's more than twice as big as the best flushable one - otherwise when we're under pressure we make lots of little flushes and cause lots of compactions, etc, which just makes life worse!
regionToFlush = bestAnyRegion;
} else {
if (bestFlushableRegion == null) {
// No candidate region is under the store file limit, i.e. every region has some store
// exceeding the configured maximum store file count, so flush the region with the
// largest MemStore
regionToFlush = bestAnyRegion;
} else {
/**
 * Some region is still under the configured maximum store file count, so flush that one first.
 * This keeps a handful of write-hot regions (which compact heavily) from monopolizing
 * flushes while write-cold regions never get flushed.
 */
regionToFlush = bestFlushableRegion;
}
}
...
if (regionToFlush == null ||
(bestRegionReplica != null &&
ServerRegionReplicaUtil.isRegionReplicaStoreFileRefreshEnabled(conf) &&
(bestRegionReplica.getMemstoreSize()
> secondaryMultiplier * regionToFlush.getMemstoreSize()))) {
/**
 * Region replica path:
 * a region replica exists and its MemStore is more than n times the size of the largest
 * primary region's MemStore, so trigger a StoreFile Refresher to update the file list.
 *
 * See "Write blocking caused by oversized replica MemStores" below.
 */
flushedOne = refreshStoreFilesAndReclaimMemory(bestRegionReplica);
if (!flushedOne) { // the refresh reclaimed nothing; exclude this replica and try the next candidate
excludedRegions.add(bestRegionReplica);
}
} else {
/**
 * Perform the flush with the global-pressure flag set to true.
 * If the flush fails, add the region to excludedRegions so this pass skips it and
 * moves on to the region with the next-largest MemStore.
 */
flushedOne = flushRegion(regionToFlush, true, true);
if (!flushedOne) {
excludedRegions.add(regionToFlush);
}
}
}
return true;
}
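The selection policy above reduces to a small piece of arithmetic. The sketch below is a hypothetical distillation, not HBase code: sizes are in arbitrary units, -1 means "no such candidate", and it mirrors the 2x rule and the replica multiplier check:

```java
// Hypothetical distillation of the regionToFlush decision in flushOneForGlobalPressure().
class FlushChoice {
    // Returns "flushable", "any", or "replica" to say which candidate wins.
    static String choose(long bestFlushable, long bestAny, long bestReplica,
                         double secondaryMultiplier) {
        long regionToFlush;
        String winner;
        if (bestFlushable >= 0 && bestAny > 2 * bestFlushable) {
            regionToFlush = bestAny;       winner = "any";       // avoid many tiny flushes
        } else if (bestFlushable < 0) {
            regionToFlush = bestAny;       winner = "any";       // everyone over the store file limit
        } else {
            regionToFlush = bestFlushable; winner = "flushable";
        }
        if (bestReplica >= 0 && bestReplica > secondaryMultiplier * regionToFlush) {
            return "replica"; // refresh the replica's store files instead of flushing
        }
        return winner;
    }
}
```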
flushRegion
Writing a region's data to disk
private boolean flushRegion(final FlushRegionEntry fqe) {
Region region = fqe.region;
if (!region.getRegionInfo().isMetaRegion() &&
isTooManyStoreFiles(region)) {
if (fqe.isMaximumWait(this.blockingWaitTime)) {
LOG.info("Waited " + (EnvironmentEdgeManager.currentTime() - fqe.createTime) +
"ms on a compaction to clean up 'too many store files'; waited " +
"long enough... proceeding with flush of " +
region.getRegionInfo().getRegionNameAsString());
} else {
// If this is first time we've been put off, then emit a log message.
if (fqe.getRequeueCount() <= 0) {
// Note: We don't impose blockingStoreFiles constraint on meta regions
LOG.warn("Region " + region.getRegionInfo().getRegionNameAsString() + " has too many " +
"store files; delaying flush up to " + this.blockingWaitTime + "ms");
if (!this.server.compactSplitThread.requestSplit(region)) {
try {
this.server.compactSplitThread.requestSystemCompaction(
region, Thread.currentThread().getName());
} catch (IOException e) {
LOG.error("Cache flush failed for region " +
Bytes.toStringBinary(region.getRegionInfo().getRegionName()),
RemoteExceptionHandler.checkIOException(e));
}
}
}
// Put back on the queue. Have it come back out of the queue
// after a delay of this.blockingWaitTime / 100 ms.
this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
// Tell a lie, it's not flushed but it's ok
return true;
}
}
return flushRegion(region, false, fqe.isForceFlushAllStores());
}
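The back-off logic above can be distilled into one hypothetical helper: while the region has too many store files, the request is requeued with a delay of blockingWaitTime / 100 ms; once blockingWaitTime has elapsed, the flush proceeds regardless:

```java
// Hypothetical distillation of the "too many store files" gate in flushRegion(FlushRegionEntry).
class TooManyFilesGate {
    // Returns the requeue delay in ms, or 0 to proceed with the flush now.
    static long nextDelay(long waitedMillis, long blockingWaitTime, boolean tooManyFiles) {
        if (!tooManyFiles) return 0;                    // no store file pressure
        if (waitedMillis >= blockingWaitTime) return 0; // waited long enough, flush anyway
        return blockingWaitTime / 100;                  // put the request back on the delay queue
    }
}
```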
refreshStoreFilesAndReclaimMemory
When region replicas are enabled: if a region replica exists and its MemStore is more than n times the size of the largest primary region's MemStore, a StoreFile Refresher is triggered to update the file list.
See "Write blocking caused by oversized replica MemStores" below.
private boolean refreshStoreFilesAndReclaimMemory(Region region) {
try {
return region.refreshStoreFiles();
} catch (IOException e) {
LOG.warn("Refreshing store files failed with exception", e);
}
return false;
}
Write blocking caused by oversized replica MemStores
A replica region's MemStore is never flushed on its own initiative; it flushes only when it receives the primary region's flush.
A RegionServer may host region replicas alongside primary regions of other tables.
Those replicas may hold memory indefinitely, either because replication lag means the flush marker has not arrived, or because the primary region simply has not flushed.
That can push total memory usage over the watermark and block normal writes.
To guard against this, HBase has a parameter called hbase.region.replica.storefile.refresh.memstore.multiplier, with a default value of 4.
It means that if the largest replica region's MemStore exceeds 4x the MemStore of the largest primary region, a StoreFile Refresher is triggered proactively to update the file list; if a flush really has happened on the primary, the replica's in-memory data can then be released.
However, this only fixes the unflushed-data case caused by replication lag. If the replica's primary region genuinely has never flushed, the memory still cannot be released, and write blocking persists.