该文章基于业务需求背景,因场景需求进行参数调优,下文会尽可能针对段合并策略(SegmentMergePolicy)的全参数进行说明。
主要介绍TieredMergePolicy,它是Lucene4以后的默认段的合并策略,之前采用的合并策略为LogMergePolicy,建议自行熟悉LogMergePolicy后再了解TieredMergePolicy,这样对于两种段合并策略的优缺点就一目了然,后续即可根据不同业务使用对应的策略,下面对两种策略的差异做一个简单总结:
一句话描述:TieredMergePolicy:找出大小接近且最优的段集。下面对该概括进行详细分析。频繁且大量的段合并会造成进程CPU飙升。
MERGE_TYPE中描述了IndexWriter在不同状态下调用合并策略的三种类型:
三中类型触发过程处理逻辑一致,下文仅对NATURAL进行介绍。
maxMergeAtOnce的缺省值为10,描述了在NATURAL类型下执行一次合并操作最多包含的段的个数(Maximum number of segments to be merged at a time during "normal" mergin)。
segsPerTier的默认值为10,描述了每一层(层级的概念类似LogMergePolicy,这里不做赘述)中需要包含segsPerTier个段才允许合并,例外情况就是当段集中包含的被删除的文档数量达到某个值(下文会介绍),就不用考虑segsPerTier中的段的个数。
mergeFactor描述了执行一次合并操作最多包含的段的个数,该值计算方式如下:
final int mergeFactor = (int) Math.min(maxMergeAtOnce, segsPerTier);
SegmentSize描述了一个段的大小,他是该段中除去被删除文档的索引信息的所有索引文件的大小的综合。
maxMergedSegmentBytes缺省值为5G,它有两个用途:
另外值得一提的是,该值为内存杀手。内存有限的情况下,该值为必配项(5G的merge行为,在非受控堆外,将内存撑满)如下pmap图可见一般:
hitTooLarge是一个布尔值,当OneMerge中所有段的大小总和接近maxMergedSegmentBytes,hitTooLarge会被置为true,该值影响OneMerge的打分。
deletesPctAllowed的默认值为33(百分比),自定义该值时允许的值域为[20,50],该值有两个用途:
(SegmentSize > (maxMergedSegmentBytes / 2)) && (totalDelPct <= deletesPctAllowed || segDelPct <= deletesPctAllowed)
int allowedDelCount = (int) (deletesPctAllowed * totalMaxDoc / 100);
floorSegmentBytes缺省值为2MB(2 * 1024 * 1024),该值描述了段的大小segmentSize小于floorSegmentBytes的段,他们的segmentSize都当做floorSegmentBytes。
(源码原文:Segments smaller than this are "rounded up" to this size, ie treated as equal (floor) size for merge selection)
使计算出来的allowedSegCount较小,这样能尽快的将小段(tiny Segment)合并,另外该值还会影响OneMerge的打分(下文会介绍)。
设置了不合适的floorSegmentBytes后会发生以下的问题:
SegmentSize多小为小段(tiny Segment),这个定义取决于不同的业务,如果某个业务中认为小于TinySegmentSize的段都为小段,那么floorSegmentBytes的值大于TinySegmentSize即可。
详细处理逻辑可以通过阅读Lucene源码得到(org.apache.lucene.index.TieredMergePolicy#findMerges),如下代码段为入口:
public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos infos, MergeContext mergeContext) throws IOException {
final Set merging = mergeContext.getMergingSegments();
// Compute total index bytes & print details about the index
long totIndexBytes = 0;
long minSegmentBytes = Long.MAX_VALUE;
int totalDelDocs = 0;
int totalMaxDoc = 0;
long mergingBytes = 0;
List sortedInfos = getSortedBySegmentSize(infos, mergeContext);
Iterator iter = sortedInfos.iterator();
while (iter.hasNext()) {
SegmentSizeAndDocs segSizeDocs = iter.next();
final long segBytes = segSizeDocs.sizeInBytes;
if (verbose(mergeContext)) {
String extra = merging.contains(segSizeDocs.segInfo) ? " [merging]" : "";
if (segBytes >= maxMergedSegmentBytes) {
extra += " [skip: too large]";
} else if (segBytes < floorSegmentBytes) {
extra += " [floored]";
}
message(" seg=" + segString(mergeContext, Collections.singleton(segSizeDocs.segInfo)) + " size=" + String.format(Locale.ROOT, "%.3f", segBytes / 1024 / 1024.) + " MB" + extra, mergeContext);
}
if (merging.contains(segSizeDocs.segInfo)) {
mergingBytes += segSizeDocs.sizeInBytes;
iter.remove();
// if this segment is merging, then its deletes are being reclaimed already.
// only count live docs in the total max doc
totalMaxDoc += segSizeDocs.maxDoc - segSizeDocs.delCount;
} else {
totalDelDocs += segSizeDocs.delCount;
totalMaxDoc += segSizeDocs.maxDoc;
}
minSegmentBytes = Math.min(segBytes, minSegmentBytes);
totIndexBytes += segBytes;
}
assert totalMaxDoc >= 0;
assert totalDelDocs >= 0;
final double totalDelPct = 100 * (double) totalDelDocs / totalMaxDoc;
int allowedDelCount = (int) (deletesPctAllowed * totalMaxDoc / 100);
// If we have too-large segments, grace them out of the maximum segment count
// If we're above certain thresholds of deleted docs, we can merge very large segments.
int tooBigCount = 0;
iter = sortedInfos.iterator();
// remove large segments from consideration under two conditions.
// 1> Overall percent deleted docs relatively small and this segment is larger than 50% maxSegSize
// 2> overall percent deleted docs large and this segment is large and has few deleted docs
while (iter.hasNext()) {
SegmentSizeAndDocs segSizeDocs = iter.next();
double segDelPct = 100 * (double) segSizeDocs.delCount / (double) segSizeDocs.maxDoc;
if (segSizeDocs.sizeInBytes > maxMergedSegmentBytes / 2 && (totalDelPct <= deletesPctAllowed || segDelPct <= deletesPctAllowed)) {
iter.remove();
tooBigCount++; // Just for reporting purposes.
totIndexBytes -= segSizeDocs.sizeInBytes;
allowedDelCount -= segSizeDocs.delCount;
}
}
allowedDelCount = Math.max(0, allowedDelCount);
final int mergeFactor = (int) Math.min(maxMergeAtOnce, segsPerTier);
// Compute max allowed segments in the index
long levelSize = Math.max(minSegmentBytes, floorSegmentBytes);
long bytesLeft = totIndexBytes;
double allowedSegCount = 0;
while (true) {
final double segCountLevel = bytesLeft / (double) levelSize;
if (segCountLevel < segsPerTier || levelSize == maxMergedSegmentBytes) {
allowedSegCount += Math.ceil(segCountLevel);
break;
}
allowedSegCount += segsPerTier;
bytesLeft -= segsPerTier * levelSize;
levelSize = Math.min(maxMergedSegmentBytes, levelSize * mergeFactor);
}
// allowedSegCount may occasionally be less than segsPerTier
// if segment sizes are below the floor size
allowedSegCount = Math.max(allowedSegCount, segsPerTier);
if (verbose(mergeContext) && tooBigCount > 0) {
message(" allowedSegmentCount=" + allowedSegCount + " vs count=" + infos.size() +
" (eligible count=" + sortedInfos.size() + ") tooBigCount= " + tooBigCount, mergeContext);
}
return doFindMerges(sortedInfos, maxMergedSegmentBytes, mergeFactor, (int) allowedSegCount, allowedDelCount, MERGE_TYPE.NATURAL,
mergeContext, mergingBytes >= maxMergedSegmentBytes);
}