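1. Heartbeat renewal
How renewal works: when a server receives a heartbeat from a client, it first renews the lease in its own registry; if that succeeds, it broadcasts the heartbeat to the other server nodes.
A renewal arrives in one of two ways:
(1) a heartbeat sent by a client (isReplication=false)
(2) a heartbeat replicated from a peer server during the broadcast (isReplication=true)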
1.1 Receiving the heartbeat request - renewLease
InstanceResource#renewLease handles the heartbeat request: PUT http://{ip}:{port}/eureka/apps/{appName}/{id}
// InstanceResource
@PUT
public Response renewLease(
@HeaderParam(PeerEurekaNode.HEADER_REPLICATION) String isReplication,
@QueryParam("overriddenstatus") String overriddenStatus,
@QueryParam("status") String status,
@QueryParam("lastDirtyTimestamp") String lastDirtyTimestamp) {
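// isReplication: "true" for a heartbeat replicated from a peer server node, otherwise a client heartbeat
boolean isFromReplicaNode = "true".equals(isReplication);
// 1. Renew locally; on success the heartbeat is broadcast to the peers. The broadcast is asynchronous, so the result returned here is the local node's result
boolean isSuccess = registry.renew(app.getName(), id, isFromReplicaNode);
// 2. The local renewal fails in two cases:
// 2.1 the instance is not in this node's registry
// 2.2 the instance's overridden status resolves to UNKNOWN (a lastDirtyTimestamp mismatch is handled separately by validateDirtyTimestamp)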
if (!isSuccess) {
logger.warn("Not Found (Renew): {} - {}", app.getName(), id);
return Response.status(Status.NOT_FOUND).build();
}
Response response;
if (lastDirtyTimestamp != null && serverConfig.shouldSyncWhenTimestampDiffers()) {
// Validate lastDirtyTimestamp
response = this.validateDirtyTimestamp(Long.valueOf(lastDirtyTimestamp), isFromReplicaNode);
// Store the overridden status since the validation found out the node that replicates wins
if (response.getStatus() == Response.Status.NOT_FOUND.getStatusCode()
&& (overriddenStatus != null)
&& !(InstanceStatus.UNKNOWN.name().equals(overriddenStatus))
&& isFromReplicaNode) {
registry.storeOverriddenStatusIfRequired(app.getAppName(), id, InstanceStatus.valueOf(overriddenStatus));
}
} else {
response = Response.ok().build();
}
logger.debug("Found (Renew): {} - {}; reply status={}", app.getName(), id, response.getStatus());
return response;
}
1.2 Local renewal - renew
// PeerAwareInstanceRegistryImpl
public boolean renew(final String appName, final String id, final boolean isReplication) {
if (super.renew(appName, id, isReplication)) {
// After the local renewal succeeds, replicate it to the other nodes
replicateToPeers(Action.Heartbeat, appName, id, null, null, isReplication);
return true;
}
return false;
}
// AbstractInstanceRegistry
public boolean renew(String appName, String id, boolean isReplication) {
RENEW.increment(isReplication);
// 1. Look up the instance in the registry by appName
Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
Lease<InstanceInfo> leaseToRenew = null;
if (gMap != null) {
leaseToRenew = gMap.get(id);
}
// 2.1 The instance does not exist: return false
if (leaseToRenew == null) {
RENEW_NOT_FOUND.increment(isReplication);
return false;
// 2.2 The instance exists
} else {
InstanceInfo instanceInfo = leaseToRenew.getHolder();
if (instanceInfo != null) {
InstanceStatus overriddenInstanceStatus = this.getOverriddenInstanceStatus(instanceInfo, leaseToRenew,
isReplication);
// Return false when the overridden status is UNKNOWN
if (overriddenInstanceStatus == InstanceStatus.UNKNOWN) {
RENEW_NOT_FOUND.increment(isReplication);
return false;
}
if (!instanceInfo.getStatus().equals(overriddenInstanceStatus)) {
instanceInfo.setStatusWithoutDirty(overriddenInstanceStatus);
}
}
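// 3. Increment the renewal counter. "Renews (last min)" on the dashboard shows this counter, and the dashboard's warning check uses it as well
renewsLastMin.increment();
// 4. Update lastUpdateTimestamp, the time of the lease's last renewal (the core step)
leaseToRenew.renew();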
return true;
}
}
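1.3 Dirty-data validation - validateDirtyTimestamp
Validation rule: the copy with the larger lastDirtyTimestamp is the more recent one, because the timestamp is refreshed whenever the instance information changes (for example on a status update). If the server's copy is not the latest, the server returns NOT_FOUND so that the client registers again.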
private Response validateDirtyTimestamp(Long lastDirtyTimestamp,
boolean isReplication) {
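// 1. Look up the local instance and compare it with the timestamp sent by the caller
InstanceInfo appInfo = registry.getInstanceByAppAndId(app.getName(), id, false);
if (appInfo != null) {
// 2. The caller's and the server's timestamps differ, so the instance information is out of sync
if ((lastDirtyTimestamp != null) && (!lastDirtyTimestamp.equals(appInfo.getLastDirtyTimestamp()))) {
// 3.1 The caller's value is larger: the server's copy is stale, so return NOT_FOUND to make the caller register again
if (lastDirtyTimestamp > appInfo.getLastDirtyTimestamp()) {
return Response.status(Status.NOT_FOUND).build();
// 3.2 The server's value is larger: the server already holds the newer data. A replicating peer gets CONFLICT plus the local instance so it can sync; a client simply gets OK
} else if (appInfo.getLastDirtyTimestamp() > lastDirtyTimestamp) {
if (isReplication) {
// true means the request is a data sync between Eureka nodes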
return Response.status(Status.CONFLICT).entity(appInfo).build();
} else {
return Response.ok().build();
}
}
}
}
return Response.ok().build();
}
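The decision made by validateDirtyTimestamp can be condensed into a pure function, which makes the three outcomes easier to see. The following is only a minimal sketch: the class name DirtyTimestampCheck, the method decide and the plain int status codes are made up for illustration and are not part of Eureka.

```java
/** Illustrative model of the dirty-timestamp comparison; not Eureka's own class. */
final class DirtyTimestampCheck {
    static final int OK = 200;
    static final int NOT_FOUND = 404;
    static final int CONFLICT = 409;

    /**
     * @param requestTs     lastDirtyTimestamp sent with the heartbeat (may be null)
     * @param localTs       lastDirtyTimestamp of the locally stored instance,
     *                      or null when the instance is unknown locally
     * @param isReplication true when the heartbeat comes from a peer server
     */
    static int decide(Long requestTs, Long localTs, boolean isReplication) {
        // Nothing to compare, or both sides agree: the heartbeat is accepted.
        if (requestTs == null || localTs == null || requestTs.equals(localTs)) {
            return OK;
        }
        // The sender is newer than the local copy: ask it to register again.
        if (requestTs > localTs) {
            return NOT_FOUND;
        }
        // The local copy is newer: only a replicating peer is told about the conflict.
        return isReplication ? CONFLICT : OK;
    }

    public static void main(String[] args) {
        System.out.println(decide(200L, 100L, false)); // sender newer
        System.out.println(decide(100L, 200L, true));  // local newer, peer request
        System.out.println(decide(100L, 200L, false)); // local newer, client request
    }
}
```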
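1.4 What the client does after the heartbeat request - renew
If the call returns a success status code, nothing else happens; if it returns NOT_FOUND, the client registers again.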
// DiscoveryClient
boolean renew() {
EurekaHttpResponse<InstanceInfo> httpResponse;
try {
httpResponse = eurekaTransport.registrationClient.sendHeartBeat(instanceInfo.getAppName(), instanceInfo.getId(), instanceInfo, null);
logger.debug(PREFIX + "{} - Heartbeat status: {}", appPathIdentifier, httpResponse.getStatusCode());
if (httpResponse.getStatusCode() == Status.NOT_FOUND.getStatusCode()) {
REREGISTER_COUNTER.increment();
logger.info(PREFIX + "{} - Re-registering apps/{}", appPathIdentifier, instanceInfo.getAppName());
long timestamp = instanceInfo.setIsDirtyWithTime();
// The server returned NOT_FOUND: register again
boolean success = register();
if (success) {
instanceInfo.unsetIsDirty(timestamp);
}
return success;
}
return httpResponse.getStatusCode() == Status.OK.getStatusCode();
} catch (Throwable e) {
logger.error(PREFIX + "{} - was unable to send heartbeat!", appPathIdentifier, e);
return false;
}
}
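Stripped of logging and dirty-flag bookkeeping, the client-side handling boils down to three outcomes. The sketch below models the two remote calls as functional interfaces so it can run on its own; HeartbeatClientSketch and its parameters are hypothetical names, not Eureka API.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

/** Illustrative model of the client's reaction to a heartbeat response. */
final class HeartbeatClientSketch {

    /**
     * @param sendHeartbeat stand-in for the HTTP heartbeat call, returns the status code
     * @param register      stand-in for the registration call, returns true on success
     */
    static boolean renew(IntSupplier sendHeartbeat, BooleanSupplier register) {
        try {
            int status = sendHeartbeat.getAsInt();
            if (status == 404) {
                // The server does not know this instance (any more): register again.
                return register.getAsBoolean();
            }
            return status == 200;
        } catch (RuntimeException e) {
            // Any transport failure counts as a failed renewal.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(renew(() -> 200, () -> false)); // plain success
        System.out.println(renew(() -> 404, () -> true));  // re-registration succeeds
    }
}
```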
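1.5 Heartbeat broadcast
The heartbeat broadcast is the step in which a Eureka server, after successfully handling a client's request, replicates that request to the other nodes.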
// PeerAwareInstanceRegistryImpl
private void replicateToPeers(Action action, String appName, String id, InstanceInfo info /* optional */,
InstanceStatus newStatus /* optional */, boolean isReplication) {
Stopwatch tracer = action.getTimer().start();
try {
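// Count the request if it is itself a replication
if (isReplication) {
numberOfReplicationsLastMin.increment();
}
// A request that is already a replication is not forwarded to other nodes again
if (peerEurekaNodes == Collections.EMPTY_LIST || isReplication) {
return;
}
// Iterate over the peer nodes and replicate to each of them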
for (final PeerEurekaNode node : peerEurekaNodes.getPeerEurekaNodes()) {
// If the url represents this host, do not replicate to yourself.
if (peerEurekaNodes.isThisMyUrl(node.getServiceUrl())) {
continue;
}
replicateInstanceActionsToPeers(action, appName, id, info, newStatus, node);
}
} finally {
tracer.stop();
}
}
//PeerAwareInstanceRegistryImpl
private void replicateInstanceActionsToPeers(Action action, String appName,
String id, InstanceInfo info, InstanceStatus newStatus,
PeerEurekaNode node) {
try {
InstanceInfo infoFromRegistry;
CurrentRequestVersion.set(Version.V2);
switch (action) {
case Cancel:
node.cancel(appName, id);
break;
case Heartbeat:
InstanceStatus overriddenStatus = overriddenInstanceStatusMap.get(id);
infoFromRegistry = getInstanceByAppAndId(appName, id, false);
node.heartbeat(appName, id, infoFromRegistry, overriddenStatus, false);
break;
case Register:
node.register(info);
break;
case StatusUpdate:
infoFromRegistry = getInstanceByAppAndId(appName, id, false);
node.statusUpdate(appName, id, newStatus, infoFromRegistry);
break;
case DeleteStatusOverride:
infoFromRegistry = getInstanceByAppAndId(appName, id, false);
node.deleteStatusOverride(appName, id, infoFromRegistry);
break;
}
} catch (Throwable t) {
logger.error("Cannot replicate information to {} for action {}", node.getServiceUrl(), action.name(), t);
} finally {
CurrentRequestVersion.remove();
}
}
// PeerEurekaNode
public void heartbeat(final String appName, final String id,
final InstanceInfo info, final InstanceStatus overriddenStatus,
boolean primeConnection) throws Throwable {
// 1. For a priming request (primeConnection) the result does not matter: send it and return immediately
if (primeConnection) {
// We do not care about the result for priming request.
replicationClient.sendHeartBeat(appName, id, info, overriddenStatus);
return;
}
// 2. Heartbeat succeeded -> nothing more to do
ReplicationTask replicationTask = new InstanceReplicationTask(targetHost, Action.Heartbeat, info, overriddenStatus, false) {
@Override
public EurekaHttpResponse<InstanceInfo> execute() throws Throwable {
return replicationClient.sendHeartBeat(appName, id, info, overriddenStatus);
}
@Override
public void handleFailure(int statusCode, Object responseEntity) throws Throwable {
super.handleFailure(statusCode, responseEntity);
// 2.1 NOT_FOUND (404): the peer does not have the instance, so register it there
if (statusCode == 404) {
logger.warn("{}: missing entry.", getTaskName());
if (info != null) {
register(info);
}
// 2.2 The peer's information is newer than this node's: sync the peer's copy into this node
} else if (config.shouldSyncWhenTimestampDiffers()) {
InstanceInfo peerInstanceInfo = (InstanceInfo) responseEntity;
if (peerInstanceInfo != null) {
syncInstancesIfTimestampDiffers(appName, id, info, peerInstanceInfo);
}
}
}
};
long expiryTime = System.currentTimeMillis() + getLeaseRenewalOf(info);
batchingDispatcher.process(taskId("heartbeat", info), replicationTask, expiryTime);
}
// PeerEurekaNode
private void syncInstancesIfTimestampDiffers(String appName, String id, InstanceInfo info, InstanceInfo infoFromPeer) {
try {
if (infoFromPeer != null) {
if (infoFromPeer.getOverriddenStatus() != null && !InstanceStatus.UNKNOWN.equals(infoFromPeer.getOverriddenStatus())) {
// 1. Update the overriddenStatus
registry.storeOverriddenStatusIfRequired(appName, id, infoFromPeer.getOverriddenStatus());
}
// 2. Update the local registration of the instance
registry.register(infoFromPeer, true);
}
} catch (Throwable e) {
logger.warn("Exception when trying to set information from peer :", e);
}
}
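As the code above shows, heartbeats are not the only action replicated to peers: cancel (a client going offline), register, status update and delete-status-override are replicated as well.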
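The isReplication check in replicateToPeers is what keeps a replicated request from bouncing between servers forever. A minimal sketch of that guard, using plain URL strings instead of PeerEurekaNode objects (the class ReplicationGuardSketch and the method targets are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative model of the "do not re-replicate, do not replicate to yourself" guard. */
final class ReplicationGuardSketch {

    /** Returns the peer URLs a request would be forwarded to. */
    static List<String> targets(List<String> peerUrls, String selfUrl, boolean isReplication) {
        List<String> result = new ArrayList<>();
        if (isReplication) {
            // The request already came from a peer: forwarding it again would loop.
            return result;
        }
        for (String url : peerUrls) {
            if (!url.equals(selfUrl)) { // skip this server's own URL
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> peers = Arrays.asList("http://a/eureka", "http://b/eureka", "http://c/eureka");
        System.out.println(targets(peers, "http://b/eureka", false));
        System.out.println(targets(peers, "http://b/eureka", true));
    }
}
```

With three servers, a client heartbeat received by b is forwarded to a and c once; when a and c process it as a replication, they forward it to nobody.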
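2. Automatic expiration
Besides cancel requests initiated by clients, the server also starts a scheduled task that periodically evicts expired instances. Without it, a client that crashes never gets the chance to send a cancel request, and its instance would stay in the server's registry forever.
2.1 Starting the EvictionTask timer
Debugging leads to the following call chain: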
// AbstractInstanceRegistry
protected void postInit() {
renewsLastMin.start();
if (evictionTaskRef.get() != null) {
evictionTaskRef.get().cancel();
}
evictionTaskRef.set(new EvictionTask());
// Start the timer task
// Note that both delay and period are eureka.server.evictionIntervalTimerInMs (default 60s)
evictionTimer.schedule(evictionTaskRef.get(),
serverConfig.getEvictionIntervalTimerInMs(),
serverConfig.getEvictionIntervalTimerInMs());
}
// EvictionTask
public void run() {
try {
long compensationTimeMs = getCompensationTimeMs();
evict(compensationTimeMs);
} catch (Throwable e) {
logger.error("Could not run the evict task", e);
}
}
//EvictionTask
long getCompensationTimeMs() {
// Current timestamp
long currNanos = getCurrentTimeNano();
// Read the previous timestamp and replace it with the current one
long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
if (lastNanos == 0L) {
return 0L;
}
// Compare the elapsed time (current - previous) with eureka.server.evictionIntervalTimerInMs (default 60s)
// Normally this yields 0; it only compensates for a run that started late
long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
return compensationTime <= 0L ? 0L : compensationTime;
}
Note that the interval between runs is configured through eureka.server.evictionIntervalTimerInMs and defaults to 60s.
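To make the compensation time concrete, the calculation can be written as a pure function of the two timestamps and the configured interval. This is a sketch under that simplification; CompensationTimeSketch and compensationMs are illustrative names, and the real task reads the timestamps from an AtomicLong as shown above.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative model of the eviction task's compensation time. */
final class CompensationTimeSketch {

    /**
     * @param lastNanos  timestamp of the previous run in nanoseconds, 0 on the first run
     * @param currNanos  timestamp of the current run in nanoseconds
     * @param intervalMs configured eviction interval in milliseconds
     * @return how many milliseconds later than scheduled the current run started
     */
    static long compensationMs(long lastNanos, long currNanos, long intervalMs) {
        if (lastNanos == 0L) {
            return 0L; // first run: nothing to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    public static void main(String[] args) {
        long oneSecond = TimeUnit.SECONDS.toNanos(1);
        // Previous run at t=1s, current run at t=66s, interval 60s.
        System.out.println(compensationMs(oneSecond, 66 * oneSecond, 60_000L));
    }
}
```

In this example the run started 5s late (for instance because of a long GC pause), so leases get 5s of extra grace before they count as expired.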
2.2 EvictionTask execution flow
2.3 How expiration is determined
2.3.1 First, the important fields of Lease:
// Lease
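private long evictionTimestamp; // time the lease was first cancelled (set whether the cancel came from a client request or from the eviction task)
private long registrationTimestamp; // registration time (updated on every registration)
private long serviceUpTimestamp; // time the service first came up
private volatile long lastUpdateTimestamp; // time of the last heartbeat
private long duration; // lease duration, 90s by default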
Pay particular attention to lastUpdateTimestamp. Here is where it gets updated:
//Lease
public Lease(T r, int durationInSecs) {
holder = r;
registrationTimestamp = System.currentTimeMillis();
// A new Lease object is created on every registration, so every registration resets lastUpdateTimestamp
lastUpdateTimestamp = registrationTimestamp;
duration = (durationInSecs * 1000);
}
//Lease
public void renew() {
// Called on every renewal; updates lastUpdateTimestamp. duration is 90s by default
lastUpdateTimestamp = System.currentTimeMillis() + duration;
}
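First, it is assigned when the Lease is created, so it is reset every time a client sends a registration request. Note that it is set to the current time here.
Second, it is updated when a client sends a renewal request. Note that it is set to the current time + duration (90s by default) here, which matters later when expiration is checked.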
2.3.2 Checking whether eviction is allowed
// PeerAwareInstanceRegistryImpl
public boolean isLeaseExpirationEnabled() {
// 1. Is self-preservation enabled (eureka.server.enableSelfPreservation, default true)?
// If it is turned off, return true right away
if (!isSelfPreservationModeEnabled()) {
return true;
}
// 2. With self-preservation enabled, expired instances can still be evicted as long as last minute's renewals > the per-minute renewal threshold
return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
}
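The method first checks whether self-preservation is enabled on the server (eureka.server.enableSelfPreservation, default true). If it is not, the method returns true, which means eviction is allowed. If it is, the method compares the number of renewals in the last minute with the per-minute renewal threshold and returns true when the count is above the threshold, false otherwise.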
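The same check, reduced to its three inputs, might look like the sketch below. LeaseExpirationSketch and the parameter names are illustrative, and the numbers in main are arbitrary sample values rather than Eureka defaults.

```java
/** Illustrative model of the "is eviction allowed" check. */
final class LeaseExpirationSketch {

    /**
     * @param selfPreservationEnabled value of the self-preservation switch
     * @param renewsPerMinThreshold   expected minimum number of renewals per minute
     * @param renewsLastMin           renewals actually received in the last minute
     */
    static boolean expirationEnabled(boolean selfPreservationEnabled,
                                     int renewsPerMinThreshold,
                                     long renewsLastMin) {
        if (!selfPreservationEnabled) {
            return true; // switch is off: eviction is always allowed
        }
        // Switch is on: evict only while enough heartbeats keep arriving.
        return renewsPerMinThreshold > 0 && renewsLastMin > renewsPerMinThreshold;
    }

    public static void main(String[] args) {
        System.out.println(expirationEnabled(true, 17, 20)); // healthy: eviction allowed
        System.out.println(expirationEnabled(true, 17, 5));  // too few renewals: protected
    }
}
```

The second call shows the protective case: when most heartbeats stop at once, the server assumes a network problem on its own side and keeps every instance.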
2.3.3 How an instance is judged expired
// Lease
public boolean isExpired(long additionalLeaseMs) {
return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}
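additionalLeaseMs only compensates for a delayed eviction run and is normally 0, so it is ignored here.
evictionTimestamp greater than 0 means a cancel has already been recorded for this instance, so it counts as expired right away.
The key is lastUpdateTimestamp. As discussed above, it is updated on registration and on renewal, and a renewal has already added duration to it once. The Javadoc of isExpired in the Eureka source acknowledges this as a minor bug that makes the effective expiry 2 * duration.
Putting this together, an instance is expired when
a. evictionTimestamp > 0
b. evictionTimestamp <= 0 && current time > registration time + duration (the instance died after registering, before sending any renewal)
c. evictionTimestamp <= 0 && current time > actual time of the last renewal (without the added duration) + 2 * duration (the instance died after sending a renewal)
Adding the eviction task's interval (60s), a server that is allowed to evict (self-preservation enabled but not blocking eviction) needs roughly 90s to 240s to remove an expired instance.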
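Cases b and c can be reproduced with a stripped-down lease that takes the current time as a parameter instead of reading the system clock. This is a sketch for illustration only: LeaseTimingSketch is a made-up class that leaves out evictionTimestamp and additionalLeaseMs.

```java
/** Illustrative lease model with an explicit clock; not Eureka's Lease class. */
final class LeaseTimingSketch {
    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseTimingSketch(long nowMs, long durationMs) {
        this.durationMs = durationMs;
        this.lastUpdateTimestamp = nowMs; // registration stores the plain current time
    }

    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs; // renewal stores current time + duration
    }

    boolean isExpired(long nowMs) {
        return nowMs > lastUpdateTimestamp + durationMs;
    }

    public static void main(String[] args) {
        LeaseTimingSketch lease = new LeaseTimingSketch(0L, 90_000L);
        System.out.println(lease.isExpired(90_001L));  // never renewed: expired after 90s
        lease.renew(10_000L);                          // one heartbeat at t=10s
        System.out.println(lease.isExpired(190_001L)); // expired 180s after the heartbeat
    }
}
```

After a single heartbeat at t=10s the lease stays valid until t=190s, which is 2 * duration after the heartbeat rather than 1 * duration.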
2.3.4 Evicting expired instances
// AbstractInstanceRegistry
public void evict(long additionalLeaseMs) {
// 1. Check whether eviction is allowed (self-preservation), see 2.3.2
if (!isLeaseExpirationEnabled()) {
return;
}
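// 2. Collect all expired instances
List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
if (leaseMap != null) {
for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
Lease<InstanceInfo> lease = leaseEntry.getValue();
// 3. The expiration check, the important part, see 2.3.3
if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {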
expiredLeases.add(lease);
}
}
}
}
// 4. Apply an upper limit to the number of evictions
// To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
// triggering self-preservation. Without that we would wipe out full registry.
int registrySize = (int) getLocalRegistrySize();
int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
int evictionLimit = registrySize - registrySizeThreshold;
int toEvict = Math.min(expiredLeases.size(), evictionLimit);
if (toEvict > 0) {
// 5. Evict expired instances in random order
Random random = new Random(System.currentTimeMillis());
for (int i = 0; i < toEvict; i++) {
// Pick a random item (Knuth shuffle algorithm)
int next = i + random.nextInt(expiredLeases.size() - i);
Collections.swap(expiredLeases, i, next);
Lease<InstanceInfo> lease = expiredLeases.get(i);
String appName = lease.getHolder().getAppName();
String id = lease.getHolder().getId();
EXPIRED.increment();
// 6. Same as a client-triggered cancel: both call internalCancel
internalCancel(appName, id, false);
}
}
}
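Besides the two checks above, once the expired instances have been collected an eviction limit is applied (the code comment explains why), and the instances to evict are then picked at random.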
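The limit and the random selection can be tried out in isolation. The sketch below assumes a renewal percent threshold of 0.85 in its example and works on a generic list; EvictionLimitSketch, evictionLimit and pickToEvict are illustrative names, and unlike the original loop it shuffles a copy instead of the input list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative model of the eviction limit and the partial random shuffle. */
final class EvictionLimitSketch {

    /** Maximum number of instances a single eviction run may remove. */
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int mustKeep = (int) (registrySize * renewalPercentThreshold);
        return registrySize - mustKeep;
    }

    /** Picks min(expired.size(), limit) distinct elements at random. */
    static <T> List<T> pickToEvict(List<T> expired, int limit, Random random) {
        List<T> pool = new ArrayList<>(expired); // leave the caller's list untouched
        int toEvict = Math.max(0, Math.min(pool.size(), limit));
        for (int i = 0; i < toEvict; i++) {
            // Partial Fisher-Yates (Knuth) shuffle: position i receives a random
            // element from the not yet chosen tail of the list.
            int next = i + random.nextInt(pool.size() - i);
            Collections.swap(pool, i, next);
        }
        return new ArrayList<>(pool.subList(0, toEvict));
    }

    public static void main(String[] args) {
        int limit = evictionLimit(100, 0.85); // at most 15 of 100 instances per run
        List<Integer> expired = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            expired.add(i);
        }
        System.out.println(pickToEvict(expired, limit, new Random()).size());
    }
}
```

With 100 registered instances and 40 of them expired, one run removes only 15; the rest wait for later runs, so a clock jump or a long GC pause cannot wipe out the whole registry at once.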
Reference: https://www.cnblogs.com/binarylei/p/11621403.html