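Zipkin is a distributed tracing service that helps track and analyze call latency across multiple services; see the official site, https://zipkin.io/, for more. This article reads through the source code to see how Zipkin implements sampling.
On the Zipkin client side, the sampling rate is controlled entirely by the Sampler class, shown below: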
package com.github.kristofa.brave;

public abstract class Sampler {

    public static final Sampler ALWAYS_SAMPLE = new Sampler() {
        @Override public boolean isSampled(long traceId) {
            return true;
        }
        @Override public String toString() {
            return "AlwaysSample";
        }
    };

    public static final Sampler NEVER_SAMPLE = new Sampler() {
        @Override public boolean isSampled(long traceId) {
            return false;
        }
        @Override public String toString() {
            return "NeverSample";
        }
    };

    public abstract boolean isSampled(long traceId);

    public static Sampler create(float rate) {
        return CountingSampler.create(rate);
    }
}
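Sampler also has two subclasses, BoundarySampler and CountingSampler. According to Zipkin's own description, BoundarySampler is intended for high-traffic use and CountingSampler for low-traffic use. The rest of this article looks at how the two differ.
When creating a Brave instance, we specify the sampling rate and the sampler implementation, as follows: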
@Bean
public Brave brave(SpanCollector spanCollector) {
    Brave.Builder builder = new Brave.Builder(srvId); // service name
    builder.spanCollector(spanCollector);
    builder.traceSampler(Sampler.create(1)); // sampling rate
    return builder.build();
}
builder.traceSampler sets the sampler and, with it, the sampling rate. It can also be set to
builder.traceSampler(CountingSampler.create(1)); // sampling rate
or
builder.traceSampler(BoundarySampler.create(1)); // sampling rate
CountingSampler extends Sampler, implements isSampled, and provides its own static create method. CountingSampler.create() is implemented as follows:
public static Sampler create(final float rate) {
    if (rate == 0) return NEVER_SAMPLE;
    if (rate == 1.0) return ALWAYS_SAMPLE;
    checkArgument(rate >= 0.01f && rate < 1, "rate should be between 0.01 and 1: was %s", rate);
    return new CountingSampler(rate);
}
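This is fairly simple: it first handles the boundary values 0 and 1, then validates the rate, and finally constructs a CountingSampler from it. The main logic in the CountingSampler constructor is a call to randomBitSet, shown below: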
static BitSet randomBitSet(int size, int cardinality, Random rnd) {
    BitSet result = new BitSet(size);
    int[] chosen = new int[cardinality];
    int i;
    for (i = 0; i < cardinality; ++i) {
        chosen[i] = i;
        result.set(i);
    }
    for (; i < size; ++i) {
        int j = rnd.nextInt(i + 1);
        if (j < cardinality) {
            result.clear(chosen[j]);
            result.set(i);
            chosen[j] = i;
        }
    }
    return result;
}
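The returned BitSet records the indexes whose sampling decision is true (see the java.util.BitSet documentation for background). Conceptually, the result looks like this: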
{
    "3": true,
    "23": true,
    "56": true,
    "78": true,
    "89": true,
    "90": true
}
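As a quick check of what randomBitSet produces, here is a minimal standalone sketch. It copies the method quoted above into a hypothetical class named RandomBitSetDemo (the class name and the choice of 6 out of 100 are assumptions for illustration only). The positions of the set bits vary from run to run, but the number of set bits always equals the requested cardinality, because each replacement clears one chosen index and sets another.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical demo class; only randomBitSet comes from the quoted source.
public class RandomBitSetDemo {

    // Picks exactly `cardinality` random indexes out of [0, size).
    public static BitSet randomBitSet(int size, int cardinality, Random rnd) {
        BitSet result = new BitSet(size);
        int[] chosen = new int[cardinality];
        int i;
        // Start by choosing the first `cardinality` indexes.
        for (i = 0; i < cardinality; ++i) {
            chosen[i] = i;
            result.set(i);
        }
        // Each later index may replace one previously chosen index.
        for (; i < size; ++i) {
            int j = rnd.nextInt(i + 1);
            if (j < cardinality) {
                result.clear(chosen[j]);
                result.set(i);
                chosen[j] = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet decisions = randomBitSet(100, 6, new Random());
        System.out.println(decisions);               // six indexes, different on each run
        System.out.println(decisions.cardinality()); // number of set bits
    }
}
```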
So how does the sampler actually use this BitSet? The answer is in its isSampled implementation:
@Override
public synchronized boolean isSampled(long traceIdIgnored) {
    boolean result = sampleDecisions.get(i++);
    if (i == 100) i = 0;
    return result;
}
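Here sampleDecisions is the BitSet, held as a field of CountingSampler. The method is declared synchronized, so it is clearly meant to be thread-safe. Inside, i is a counter that runs from 0 to 99 and then wraps back to 0; each call reads the bit at the current position and advances the counter, so the trace ID itself is ignored. The sampler is invoked from ClientTracer, as follows: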
SpanId newSpanId = getNewSpanId();
if (sample == null) {
    // No sample indication is present.
    if (!traceSampler().isSampled(newSpanId.traceId)) {
        spanAndEndpoint().state().setCurrentClientSpan(null);
        return null;
    }
}
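To make the counting behaviour concrete, the following is a simplified standalone sketch, not Brave's actual class. The name CountingSamplerSketch, the integer outOf100 constructor parameter, and the helper countSampled are assumptions for illustration, and the decision table is filled by shuffling slot indexes rather than by randomBitSet. The isSampled body mirrors the method quoted above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a counting sampler; not the Brave implementation.
public class CountingSamplerSketch {
    private final BitSet sampleDecisions = new BitSet(100);
    private int i; // position in the current window, 0..99

    // outOf100: how many of every 100 calls should be sampled (assumed parameter).
    public CountingSamplerSketch(int outOf100) {
        List<Integer> slots = new ArrayList<>();
        for (int s = 0; s < 100; s++) slots.add(s);
        Collections.shuffle(slots); // stand-in for randomBitSet
        for (int k = 0; k < outOf100; k++) sampleDecisions.set(slots.get(k));
    }

    // Same logic as the quoted isSampled: read one bit, advance, wrap at 100.
    public synchronized boolean isSampled(long traceIdIgnored) {
        boolean result = sampleDecisions.get(i++);
        if (i == 100) i = 0;
        return result;
    }

    // Helper: number of positive decisions over `calls` consecutive calls.
    public static int countSampled(CountingSamplerSketch sampler, int calls) {
        int sampled = 0;
        for (long id = 0; id < calls; id++) {
            if (sampler.isSampled(id)) sampled++;
        }
        return sampled;
    }

    public static void main(String[] args) {
        CountingSamplerSketch sampler = new CountingSamplerSketch(10);
        System.out.println(countSampled(sampler, 100));  // prints 10
        System.out.println(countSampled(sampler, 1000)); // prints 100
    }
}
```

Every complete window of 100 calls yields exactly the configured number of sampled traces, which is where the exactness of this approach comes from; the price is a synchronized method on every sampling decision.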
BoundarySampler also extends Sampler, implements isSampled, and provides its own static create method. BoundarySampler.create() is implemented as follows:
public static Sampler create(float rate) {
    if (rate == 0) return Sampler.NEVER_SAMPLE;
    if (rate == 1.0) return ALWAYS_SAMPLE;
    checkArgument(rate > 0.0001 && rate < 1, "rate should be between 0.0001 and 1: was %s", rate);
    final long boundary = (long) (rate * 10000); // safe cast as less <= 1
    return new BoundarySampler(boundary);
}
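This is simpler than CountingSampler: it does not store decisions in a BitSet. Instead, isSampled takes the salted trace ID modulo 10000 and compares the remainder against the boundary: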
@Override
public boolean isSampled(long traceId) {
    long t = Math.abs(traceId ^ SALT);
    return t % 10000 <= boundary;
}
isSampled is called in the same way as for CountingSampler.
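To get a feel for how a trace-ID-based decision behaves, here is a minimal standalone sketch rather than Brave's actual class. The name BoundarySamplerSketch, passing the salt in through the constructor, and the helper observedRate are all assumptions for illustration (the quoted code only shows that a SALT constant exists, not how it is initialized). The isSampled body mirrors the method quoted above, including the <= comparison, which lets boundary + 1 of the 10000 possible remainders through.

```java
import java.util.Random;

// Hypothetical stand-in for a boundary sampler; not the Brave implementation.
public class BoundarySamplerSketch {
    private final long salt;     // assumed to be an arbitrary fixed value
    private final long boundary; // rate scaled to the range 0..10000

    public BoundarySamplerSketch(float rate, long salt) {
        this.boundary = (long) (rate * 10000);
        this.salt = salt;
    }

    // Same logic as the quoted isSampled: stateless, no lock, depends only on the ID.
    public boolean isSampled(long traceId) {
        long t = Math.abs(traceId ^ salt);
        return t % 10000 <= boundary;
    }

    // Helper: fraction of `traces` pseudo-random trace IDs that get sampled.
    public static double observedRate(BoundarySamplerSketch sampler, int traces, long seed) {
        Random ids = new Random(seed);
        int sampled = 0;
        for (int n = 0; n < traces; n++) {
            if (sampler.isSampled(ids.nextLong())) sampled++;
        }
        return (double) sampled / traces;
    }

    public static void main(String[] args) {
        BoundarySamplerSketch sampler = new BoundarySamplerSketch(0.1f, 12345L);
        // Close to 0.1, but not exactly 0.1, and different for different ID streams.
        System.out.println(observedRate(sampler, 100, 1L));
        System.out.println(observedRate(sampler, 100_000, 1L));
    }
}
```

Because the decision is a pure function of the trace ID, no shared counter or lock is needed, and the same trace ID always gets the same answer. The observed rate only approximates the configured one, and the approximation tends to tighten as the number of traces grows.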
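Summary: comparing the two implementations, BoundarySampler can handle high client traffic, but its effective sampling rate is not exact and fluctuates. This is most likely a consequence of its algorithm, since each decision depends on where the trace ID happens to fall relative to the boundary; under heavy traffic the deviation is negligible. CountingSampler is aimed at lower traffic but is very accurate, because it samples a fixed number out of every 100 calls. My personal recommendation is still BoundarySampler, since traffic may well spike someday.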
The end.