Scenario:
Today we received an alert email saying an instance of one of our production services had been deregistered, and in Spring Boot Admin the instance was indeed shown as offline. Running jps on the server showed the process was alive, yet the /health endpoint reported eureka as DOWN. error.log contained no errors, but warn.log had a few warnings like these:
[2018-09-20T19:12:24.159+08:00] WARN: [TID: N/A] [Apollo-Config-1] DiscoveryClient.java:1291 - Saw local status change event StatusChangeEvent [timestamp=1537441944159, current=DOWN, previous=UP]
[2018-09-20T19:12:24.160+08:00] WARN: [TID: N/A] [Apollo-Config-1] DiscoveryClient.java:1291 - Saw local status change event StatusChangeEvent [timestamp=1537441944160, current=UP, previous=DOWN]
[2018-09-20T19:12:24.160+08:00] WARN: [TID: N/A] [DiscoveryClient-InstanceInfoReplicator-0] DiscoveryClient.java:1291 - Saw local status change event StatusChangeEvent [timestamp=1537441944160, current=DOWN, previous=UP]
[2018-09-20T19:12:24.160+08:00] WARN: [TID: N/A] [DiscoveryClient-InstanceInfoReplicator-0] InstanceInfoReplicator.java:98 - Ignoring onDemand update due to rate limiter
A look at the source of InstanceInfoReplicator.java showed that the number of replications to the Eureka server within a one-minute window had exceeded the limit, so one StatusChangeEvent was dropped.
The warning log already hints at what happened: within a very short window, two StatusChangeEvents fired, one UP -> DOWN and one DOWN -> UP. Apollo itself also uses Eureka, so Apollo handled the pair of events as well (the first two log lines, on the Apollo-Config thread group). Spring Cloud then handled them internally (on the DiscoveryClient-InstanceInfoReplicator thread group), but it only got through the UP -> DOWN event; the DOWN -> UP event was dropped by the rate limiter, leaving the instance's health stuck at DOWN.
The StatusChangeEvent is handled inside com.netflix.discovery.DiscoveryClient#initScheduledTasks; the relevant snippet is:
private void initScheduledTasks() {
    if (clientConfig.shouldFetchRegistry()) {
        // registry cache refresh timer
        int registryFetchIntervalSeconds = clientConfig.getRegistryFetchIntervalSeconds();
        int expBackOffBound = clientConfig.getCacheRefreshExecutorExponentialBackOffBound();
        scheduler.schedule(
                new TimedSupervisorTask(
                        "cacheRefresh",
                        scheduler,
                        cacheRefreshExecutor,
                        registryFetchIntervalSeconds,
                        TimeUnit.SECONDS,
                        expBackOffBound,
                        new CacheRefreshThread()
                ),
                registryFetchIntervalSeconds, TimeUnit.SECONDS);
    }

    if (clientConfig.shouldRegisterWithEureka()) {
        int renewalIntervalInSecs = instanceInfo.getLeaseInfo().getRenewalIntervalInSecs();
        int expBackOffBound = clientConfig.getHeartbeatExecutorExponentialBackOffBound();
        logger.info("Starting heartbeat executor: " + "renew interval is: " + renewalIntervalInSecs);

        // Heartbeat timer
        scheduler.schedule(
                new TimedSupervisorTask(
                        "heartbeat",
                        scheduler,
                        heartbeatExecutor,
                        renewalIntervalInSecs,
                        TimeUnit.SECONDS,
                        expBackOffBound,
                        new HeartbeatThread()
                ),
                renewalIntervalInSecs, TimeUnit.SECONDS);

        // InstanceInfo replicator
        instanceInfoReplicator = new InstanceInfoReplicator(
                this,
                instanceInfo,
                clientConfig.getInstanceInfoReplicationIntervalSeconds(),
                2); // burstSize

        statusChangeListener = new ApplicationInfoManager.StatusChangeListener() {
            @Override
            public String getId() {
                return "statusChangeListener";
            }

            @Override
            public void notify(StatusChangeEvent statusChangeEvent) {
                if (InstanceStatus.DOWN == statusChangeEvent.getStatus() ||
                        InstanceStatus.DOWN == statusChangeEvent.getPreviousStatus()) {
                    // log at warn level if DOWN was involved
                    logger.warn("Saw local status change event {}", statusChangeEvent);
                } else {
                    logger.info("Saw local status change event {}", statusChangeEvent);
                }
                instanceInfoReplicator.onDemandUpdate();
            }
        };

        if (clientConfig.shouldOnDemandUpdateStatusChange()) {
            applicationInfoManager.registerStatusChangeListener(statusChangeListener);
        }

        instanceInfoReplicator.start(clientConfig.getInitialInstanceInfoReplicationIntervalSeconds());
    } else {
        logger.info("Not registering with Eureka server per configuration");
    }
}
The key part is the StatusChangeListener's notify method, which calls instanceInfoReplicator.onDemandUpdate(). That method throttles updates with a token-bucket algorithm:
public boolean onDemandUpdate() {
    if (rateLimiter.acquire(burstSize, allowedRatePerMinute)) {
        if (!scheduler.isShutdown()) {
            scheduler.submit(new Runnable() {
                @Override
                public void run() {
                    logger.debug("Executing on-demand update of local InstanceInfo");

                    Future latestPeriodic = scheduledPeriodicRef.get();
                    if (latestPeriodic != null && !latestPeriodic.isDone()) {
                        logger.debug("Canceling the latest scheduled update, it will be rescheduled at the end of on demand update");
                        latestPeriodic.cancel(false);
                    }

                    InstanceInfoReplicator.this.run();
                }
            });
            return true;
        } else {
            logger.warn("Ignoring onDemand update due to stopped scheduler");
            return false;
        }
    } else {
        logger.warn("Ignoring onDemand update due to rate limiter");
        return false;
    }
}
The last line of the warning log above comes from this method.
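For intuition, here is a minimal token-bucket sketch. This is not Netflix's actual com.netflix.discovery.util.RateLimiter (the class name, the acquire variant taking an explicit timestamp, and the internals below are my own simplification); it only illustrates the acquire(burstSize, allowedRatePerMinute) semantics: the bucket holds at most burstSize tokens and refills at the per-minute rate, so any burst beyond burstSize inside one refill window is rejected.

import java.util.concurrent.atomic.AtomicLong;

// Simplified token bucket for illustration; NOT Netflix's implementation.
public class TokenBucketSketch {
    private final AtomicLong lastRefillMs = new AtomicLong(0);
    private final AtomicLong consumedTokens = new AtomicLong(0);

    // Try to take one token, refilling the bucket based on elapsed time first.
    public boolean acquire(int burstSize, long ratePerMinute, long nowMs) {
        refill(ratePerMinute, nowMs);
        while (true) {
            long consumed = consumedTokens.get();
            if (consumed >= burstSize) {
                return false; // bucket exhausted -> caller logs "Ignoring onDemand update"
            }
            if (consumedTokens.compareAndSet(consumed, consumed + 1)) {
                return true;
            }
        }
    }

    // Return tokens to the bucket according to how much time has passed.
    private void refill(long ratePerMinute, long nowMs) {
        long last = lastRefillMs.get();
        long newTokens = (nowMs - last) * ratePerMinute / 60_000;
        if (newTokens > 0 && lastRefillMs.compareAndSet(last, nowMs)) {
            consumedTokens.set(Math.max(0, consumedTokens.get() - newTokens));
        }
    }

    public static void main(String[] args) {
        TokenBucketSketch limiter = new TokenBucketSketch();
        // burstSize=2, rate=4/min: two back-to-back calls pass, the third is dropped
        System.out.println(limiter.acquire(2, 4, 0));      // true
        System.out.println(limiter.acquire(2, 4, 1));      // true
        System.out.println(limiter.acquire(2, 4, 2));      // false
        System.out.println(limiter.acquire(2, 4, 15_000)); // true (one token refilled: 15s at 4/min)
    }
}

With burstSize = 2, two back-to-back status changes get through and the third is rejected, which is exactly the "Ignoring onDemand update due to rate limiter" line from the log.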
In the source, burstSize is hard-coded to 2 and cannot be changed. allowedRatePerMinute, however, can be tuned through the
eureka.client.instance-info-replication-interval-seconds
property, which defaults to replicating once every 30 seconds.
The per-minute replication limit itself is computed in the InstanceInfoReplicator constructor:
InstanceInfoReplicator(DiscoveryClient discoveryClient, InstanceInfo instanceInfo, int replicationIntervalSeconds, int burstSize) {
    this.discoveryClient = discoveryClient;
    this.instanceInfo = instanceInfo;
    this.scheduler = Executors.newScheduledThreadPool(1,
            new ThreadFactoryBuilder()
                    .setNameFormat("DiscoveryClient-InstanceInfoReplicator-%d")
                    .setDaemon(true)
                    .build());

    this.scheduledPeriodicRef = new AtomicReference<Future>();

    this.started = new AtomicBoolean(false);
    this.rateLimiter = new RateLimiter(TimeUnit.MINUTES);
    this.replicationIntervalSeconds = replicationIntervalSeconds;
    this.burstSize = burstSize;

    this.allowedRatePerMinute = 60 * this.burstSize / this.replicationIntervalSeconds;
    logger.info("InstanceInfoReplicator onDemand update allowed rate per min is {}", allowedRatePerMinute);
}
this.allowedRatePerMinute = 60 * this.burstSize / this.replicationIntervalSeconds evaluates to 60 * 2 / 30 = 4 by default, which is why the fourth StatusChangeEvent was dropped.
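As a quick sanity check of the integer division (the demo class below is hypothetical; the expression is the one from the constructor above):

public class AllowedRateDemo {
    public static void main(String[] args) {
        int burstSize = 2; // hard-coded in DiscoveryClient#initScheduledTasks
        for (int interval : new int[]{30, 24, 40}) {
            // same formula as InstanceInfoReplicator's constructor
            System.out.printf("interval=%ds -> allowedRatePerMinute=%d%n",
                    interval, 60 * burstSize / interval);
        }
        // prints: interval=30s -> 4, interval=24s -> 5, interval=40s -> 3
    }
}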
When running alongside Apollo, I therefore suggest setting eureka.client.instance-info-replication-interval-seconds to 24, which allows up to 60 * 2 / 24 = 5 instance-info replication requests per minute and should make this problem far less likely.
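For reference, the change would look like this in a Spring Boot configuration file (assuming the standard relaxed binding of the property named above):

# application.properties -- allows 60 * 2 / 24 = 5 on-demand updates per minute
eureka.client.instance-info-replication-interval-seconds=24

Note that this interval also drives the periodic InstanceInfo replication schedule (it is the replicationIntervalSeconds passed to the constructor above), so lowering it makes the background replication slightly more frequent as well.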