互联网中最常见的一个问题:限流,即在一段时间内,限制访问某个接口的请求数。
要实现限流(或熔断降级),方法有很多,最基本的如计数器法、滑动时间窗口算法(计数器法的升级版)、令牌桶算法、漏桶算法等。
本文重点介绍sentinel的滑动时间窗口算法,从最简单的原始计数器算法逐渐演变到滑动时间窗口算法。
计数器算法的实现比较简单,设置一个初始的时间和一段滑动时间窗口interval,如果请求这个区间内到达,并且没有超过控制值,就放行,否则抛出异常。超时则重置该时间窗口。
/**
* 最简单的计数器限流算法
*/
public class Counter {
public long timeStamp = System.currentTimeMillis(); // 当前时间
public int reqCount = 0; // 初始化计数器
public final int limit = 100; // 时间窗口内最大请求数
public final long interval = 1000 * 60; // 时间窗口ms
public boolean limit() {
long now = System.currentTimeMillis();
if (now < timeStamp + interval) {
// 在时间窗口内
reqCount++;
// 判断当前时间窗口内是否超过最大请求控制数
return reqCount <= limit;
} else {
timeStamp = now;
// 超时后重置
reqCount = 1;
return true;
}
}
}
但是这种方法有一个问题:无法计算当前时间往前推的一段时间内的请求数,只能计算某个时间段,即timeStamp(/now) + interval这个时间段的值。统计的精度太低。
由此,出现了滑动时间窗口的算法。增加时间窗口,划分出来的时间窗口是细粒度,分别统计单个窗口的统计数量,已达到精确统计。
滑动时间窗口
在上图中,整个红色的矩形框表示一个时间窗口,在我们的例子中,一个时间窗口就是一分钟。然后我们将时间窗口进行划分,比如图中,我们就将滑动窗口划成了6格,所以每格代表的是10秒钟。每过10秒钟,我们的时间窗口就会往右滑动一格。每一个格子都有自己独立的计数器counter,比如当一个请求 在0:35秒的时候到达,那么0:30~0:39对应的counter就会加1。计数器算法其实就是滑动窗口算法。只是它没有对时间窗口做进一步地划分,所以只有1格。由此可见,当滑动窗口的格子划分的越多,那么滑动窗口的滚动就越平滑,限流的统计就会越精确。
/**2 * 滑动时间窗口限流实现
3 * 假设某个服务最多只能每秒钟处理100个请求,我们可以设置一个1秒钟的滑动时间窗口,
4 * 窗口中有10个格子,每个格子100毫秒,每100毫秒移动一次,每次移动都需要记录当前服务请求的次数*/
public class SlidingTimeWindow {
//服务访问次数,可以放在Redis中,实现分布式系统的访问计数8
Long counter = 0L;
//使用LinkedList来记录滑动窗口的10个格子。
LinkedList<Long> slots = new LinkedList<Long>();
public static void main(String[] args) throws InterruptedException {
SlidingTimeWindow timeWindow = new SlidingTimeWindow();
new Thread(new Runnable() {
@Override
public void run() {
try {
timeWindow.doCheck();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}).start();
while (true){
//TODO 判断限流标记
timeWindow.counter++;
Thread.sleep(new Random().nextInt(15));
}
}
综上,滑动时间窗口的统计方法是为了解决计数器算法统计精度不够的弊端才出现的。因为统计数算法只是单纯地统计了一个时间段的数量,一旦超过,则重置。而滑动时间窗口则是将窗口细分,每个窗口分别统计对应时间段的数量,这样的话,就可以通过计算某几个时间窗口的数量和来统计某个时间点之前的一段时间的数量。
首先需要了解统计时间的算法是在哪里需要用到,入口在哪?
这涉及到sentinel的核心调用链路,在请求访问时。sentinel会创建一条调用链路
NodeSelectorSlot --> ClusterBuilderSlot --> LOgSlot --> StatisticSlot 。。。。
其中的statisticsSlot特别关键
其调用逻辑是:fireEntry(…) 先执行后面的限流、垄断降级的Slot
根据他们执行的结果来统计各种指标(成功与否、qps、请求时间点。。)
如果成功,则调用sentinel中独立的统计中心ArrayMetric,做成功的统计(此时就需要用到统计中心,也就需要用到滑动时间窗口算法)
即下面代码中
node.increaseThreadNum();
node.addPassRequest(count);
如果失败,那么就统计相应的异常指标。
即node.increaseBlockQps(count);
@Override
public void entry(Context context, ResourceWrapper resourceWrapper, DefaultNode node, int count,
boolean prioritized, Object... args) throws Throwable {
try {
// Do some checking.
fireEntry(context, resourceWrapper, node, count, prioritized, args);
// Request passed, add thread count and pass count.
// 如果请求都通过了,增加线程数和通过的计数
node.increaseThreadNum();
node.addPassRequest(count);
if (context.getCurEntry().getOriginNode() != null) {
// Add count for origin node.
context.getCurEntry().getOriginNode().increaseThreadNum();
context.getCurEntry().getOriginNode().addPassRequest(count);
}
if (resourceWrapper.getEntryType() == EntryType.IN) {
// Add count for global inbound entry node for global statistics.
Constants.ENTRY_NODE.increaseThreadNum();
Constants.ENTRY_NODE.addPassRequest(count);
}
// Handle pass event with registered entry callback handlers.
for (ProcessorSlotEntryCallback<DefaultNode> handler : StatisticSlotCallbackRegistry.getEntryCallbacks()) {
handler.onPass(context, resourceWrapper, node, count, args);
}
} catch (PriorityWaitException ex) {
node.increaseThreadNum();
if (context.getCurEntry().getOriginNode() != null) {
// Add count for origin node.
context.getCurEntry().getOriginNode().increaseThreadNum();
}
if (resourceWrapper.getEntryType() == EntryType.IN) {
// Add count for global inbound entry node for global statistics.
Constants.ENTRY_NODE.increaseThreadNum();
}
// Handle pass event with registered entry callback handlers.
for (ProcessorSlotEntryCallback<DefaultNode> handler : StatisticSlotCallbackRegistry.getEntryCallbacks()) {
handler.onPass(context, resourceWrapper, node, count, args);
}
} catch (BlockException e) {
// Blocked, set block exception to current entry.
context.getCurEntry().setBlockError(e);
// Add block count.
node.increaseBlockQps(count);
if (context.getCurEntry().getOriginNode() != null) {
context.getCurEntry().getOriginNode().increaseBlockQps(count);
}
if (resourceWrapper.getEntryType() == EntryType.IN) {
// Add count for global inbound entry node for global statistics.
Constants.ENTRY_NODE.increaseBlockQps(count);
}
// Handle block event with registered entry callback handlers.
for (ProcessorSlotEntryCallback<DefaultNode> handler : StatisticSlotCallbackRegistry.getEntryCallbacks()) {
handler.onBlocked(e, context, resourceWrapper, node, count, args);
}
throw e;
} catch (Throwable e) {
// Unexpected internal error, set error to current entry.
context.getCurEntry().setError(e);
throw e;
}
}
StatisticNode中一个ArrayMetric是做统计的核心
private transient Metric rollingCounterInMinute = new ArrayMetric(60, 60 * 1000, false);
我们看他的构造方法
private final LeapArray<MetricBucket> data;
public ArrayMetric(int sampleCount, int intervalInMs) {
this.data = new OccupiableBucketLeapArray(sampleCount, intervalInMs);
}
public ArrayMetric(int sampleCount, int intervalInMs, boolean enableOccupy) {
if (enableOccupy) {
this.data = new OccupiableBucketLeapArray(sampleCount, intervalInMs);
} else {
this.data = new BucketLeapArray(sampleCount, intervalInMs);
}
}
关注new OccupiableBucketLeapArray(sampleCount, intervalInMs); intervalInMs是时间间隔,而sampleCount则是滑动时间窗的格数
public LeapArray(int sampleCount, int intervalInMs) {
AssertUtil.isTrue(sampleCount > 0, "bucket count is invalid: " + sampleCount);
AssertUtil.isTrue(intervalInMs > 0, "total time interval of the sliding window should be positive");
AssertUtil.isTrue(intervalInMs % sampleCount == 0, "time span needs to be evenly divided");
this.windowLengthInMs = intervalInMs / sampleCount;//窗口的长度
this.intervalInMs = intervalInMs;
this.intervalInSecond = intervalInMs / 1000.0;
this.sampleCount = sampleCount;
this.array = new AtomicReferenceArray<>(sampleCount);
}
这里的这段构造方法就是在划分一个时间窗口,如下绿色部分。1s被划分成俩部分,每个的时间长度是500ms
由此我们可以想到,要想做限流的逻辑判断
需要先获得该时间窗口,拿出里面的统计数据,和现有的数据进行对比。
那么
要怎么获得该时间窗口,如何判断该获取哪个时间窗口呢?
以下代码就是回答这个问题
@Override
public void addRtAndSuccess(long rt, int successCount) {
//增加成功数,到某个滑动窗口里面
rollingCounterInSecond.addSuccess(successCount);
rollingCounterInSecond.addRT(rt);
rollingCounterInMinute.addSuccess(successCount);
rollingCounterInMinute.addRT(rt);
}
@Override
public void addSuccess(int count) {
//获取当前的时间窗口
WindowWrap<MetricBucket> wrap = data.currentWindow();
wrap.value().addSuccess(count);
}
将当前时间传进去
public WindowWrap<T> currentWindow() {
//将当前时间传进去
return currentWindow(TimeUtil.currentTimeMillis());
}
该方法是本文的核心,需要重点熟知
public WindowWrap<T> currentWindow(long timeMillis) {
if (timeMillis < 0) {
return null;
}
/*long timeId = timeMillis / windowLengthInMs;
return (int)(timeId % array.length());
计算出当前时间应该落在那个滑动窗口下了,比如分成了多块的窗口,是1还是2 。。。*/
int idx = calculateTimeIdx(timeMillis);
//计算是位于当前桶的哪个位置timeMillis - timeMillis % windowLengthInMs;
// Calculate current bucket start time.
long windowStart = calculateWindowStart(timeMillis);
/*
* Get bucket item at given time from the array.
*
* (1) Bucket is absent, then just create a new bucket and CAS update to circular array.
* (2) Bucket is up-to-date, then just return the bucket.
* (3) Bucket is deprecated, then reset current bucket and clean all deprecated buckets.
*/
while (true) {
//根据坐标先拿出一个时间窗口
WindowWrap<T> old = array.get(idx);
if (old == null) {
/*
* B0 B1 B2 NULL B4
* ||_______|_______|_______|_______|_______||___
* 200 400 600 800 1000 1200 timestamp
* ^
* time=888
* bucket is empty, so create new and update
*
* If the old bucket is absent, then we create a new bucket at {@code windowStart},
* then try to update circular array via a CAS operation. Only one thread can
* succeed to update, while other threads yield its time slice.
*/
WindowWrap<T> window = new WindowWrap<T>(windowLengthInMs, windowStart, newEmptyBucket(timeMillis));
if (array.compareAndSet(idx, null, window)) {
// Successfully updated, return the created bucket.
return window;
} else {
// Contention failed, the thread will yield its time slice to wait for bucket available.
Thread.yield();
}
} else if (windowStart == old.windowStart()) {
/*
* B0 B1 B2 B3 B4
* ||_______|_______|_______|_______|_______||___
* 200 400 600 800 1000 1200 timestamp
* ^
* time=888
* startTime of Bucket 3: 800, so it's up-to-date
*
* If current {@code windowStart} is equal to the start timestamp of old bucket,
* that means the time is within the bucket, so directly return the bucket.
*/
return old;
} else if (windowStart > old.windowStart()) {
/*
* (old)
* B0 B1 B2 NULL B4
* |_______||_______|_______|_______|_______|_______||___
* ... 1200 1400 1600 1800 2000 2200 timestamp
* ^
* time=1676
* startTime of Bucket 2: 400, deprecated, should be reset
*
* If the start timestamp of old bucket is behind provided time, that means
* the bucket is deprecated. We have to reset the bucket to current {@code windowStart}.
* Note that the reset and clean-up operations are hard to be atomic,
* so we need a update lock to guarantee the correctness of bucket update.
*
* The update lock is conditional (tiny scope) and will take effect only when
* bucket is deprecated, so in most cases it won't lead to performance loss.
*/
if (updateLock.tryLock()) {
try {
// Successfully get the update lock, now we reset the bucket.
return resetWindowTo(old, windowStart);
} finally {
updateLock.unlock();
}
} else {
// Contention failed, the thread will yield its time slice to wait for bucket available.
Thread.yield();
}
} else if (windowStart < old.windowStart()) {
// Should not go through here, as the provided time is already behind.
return new WindowWrap<T>(windowLengthInMs, windowStart, newEmptyBucket(timeMillis));
}
}
}
上面的注释已经说的非常清楚了
如果没有找到该桶,那么new一个新的,并加入到array中
找到了,直接返回
超时了,那么重置resetWindowTo(old, windowStart)
上面已经可以获得某个时间窗口,那么就可以获取到相应时间段的数据,那么限流就有数据
只需要将我们设置的限流规则和这些数据进行比对,就可以判断是否超过了限流阈值
我们以FlowSlot为例,看看是怎么使用数据的,checkFlow,然后是checker.checkFlow这个方法做限流逻辑
@Override
public void entry(Context context, ResourceWrapper resourceWrapper, DefaultNode node, int count,
boolean prioritized, Object... args) throws Throwable {
checkFlow(resourceWrapper, context, node, count, prioritized);
fireEntry(context, resourceWrapper, node, count, prioritized, args);
}
void checkFlow(ResourceWrapper resource, Context context, DefaultNode node, int count, boolean prioritized)
throws BlockException {
checker.checkFlow(ruleProvider, resource, context, node, count, prioritized);
}
重点关注canPassCheck
public void checkFlow(Function<String, Collection<FlowRule>> ruleProvider, ResourceWrapper resource,
Context context, DefaultNode node, int count, boolean prioritized) throws BlockException {
if (ruleProvider == null || resource == null) {
return;
}
Collection<FlowRule> rules = ruleProvider.apply(resource.getName());
if (rules != null) {
for (FlowRule rule : rules) {
if (!canPassCheck(rule, context, node, count, prioritized)) {
throw new FlowException(rule.getLimitApp(), rule);
}
}
}
}
public boolean canPassCheck(/*@NonNull*/ FlowRule rule, Context context, DefaultNode node, int acquireCount,
boolean prioritized) {
String limitApp = rule.getLimitApp();
if (limitApp == null) {
return true;
}
if (rule.isClusterMode()) {
return passClusterCheck(rule, context, node, acquireCount, prioritized);
}
return passLocalCheck(rule, context, node, acquireCount, prioritized);
}
获取到rule和相关策略
private static boolean passLocalCheck(FlowRule rule, Context context, DefaultNode node, int acquireCount,
boolean prioritized) {
Node selectedNode = selectNodeByRequesterAndStrategy(rule, context, node);
if (selectedNode == null) {
return true;
}
return rule.getRater().canPass(selectedNode, acquireCount, prioritized);
}
rule.getRater().canPass(selectedNode, acquireCount, prioritized);有三种策略,分别对应的就是快速失败,预热和排队等待。
下面是流程图
我们选择默认形式的判断逻辑
avgUsedTokens获取到qps,然后curCount + acquireCount > count这一步做判断,是否需要限流
@Override
public boolean canPass(Node node, int acquireCount, boolean prioritized) {
int curCount = avgUsedTokens(node);
if (curCount + acquireCount > count) {
if (prioritized && grade == RuleConstant.FLOW_GRADE_QPS) {
long currentTime;
long waitInMs;
currentTime = TimeUtil.currentTimeMillis();
waitInMs = node.tryOccupyNext(currentTime, acquireCount, count);
if (waitInMs < OccupyTimeoutProperty.getOccupyTimeout()) {
node.addWaitingRequest(currentTime + waitInMs, acquireCount);
node.addOccupiedPass(acquireCount);
sleep(waitInMs);
// PriorityWaitException indicates that the request will pass after waiting for {@link @waitInMs}.
throw new PriorityWaitException(waitInMs);
}
}
return false;
}
return true;
}
avgUsedTokens(node) 方法是获取到统计数据的
private int avgUsedTokens(Node node) {
if (node == null) {
return DEFAULT_AVG_USED_TOKENS;
}
return grade == RuleConstant.FLOW_GRADE_THREAD ? node.curThreadNum() : (int)(node.passQps());
}
node.passQps() 这里用到了统计数据!!!
@Override
//这里是计算qps
public double passQps() {
return rollingCounterInSecond.pass() / rollingCounterInSecond.getWindowIntervalInSec();
}
@Override
public long pass() {
data.currentWindow();
long pass = 0;
List<MetricBucket> list = data.values();
for (MetricBucket window : list) {
pass += window.pass();
}
return pass;
}
完。。