Eureka Series: Server-Side Heartbeat Handling and Automatic Expiration

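1. Heartbeat renewal

Heartbeat renewal mechanism: when the server receives a heartbeat request from a client, it first renews the lease on the local node. If that succeeds, it replicates the heartbeat to the other server nodes.

There are two kinds of renewal:

     (1) A renewal initiated by a client (isReplication=false)

     (2) A renewal initiated by a server node during replication (isReplication=true)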

1.1 Receiving the heartbeat request - renewLease

InstanceResource#renewLease handles the heartbeat request: PUT http://{ip}:{port}/eureka/apps/{appName}/{id}

// InstanceResource
@PUT
public Response renewLease(
        @HeaderParam(PeerEurekaNode.HEADER_REPLICATION) String isReplication,
        @QueryParam("overriddenstatus") String overriddenStatus,
        @QueryParam("status") String status,
        @QueryParam("lastDirtyTimestamp") String lastDirtyTimestamp) {
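    // isReplication: "true" means the heartbeat was replicated from a peer server node; otherwise it came from a client
    boolean isFromReplicaNode = "true".equals(isReplication);
    // 1. Renew the lease locally. On success the heartbeat is replicated to the peers. Replication is asynchronous,
    //    so the value returned here only reflects the result on the current node
    boolean isSuccess = registry.renew(app.getName(), id, isFromReplicaNode);
    // 2. The local renewal fails in two cases:
    //      2.1 the instance is not in this node's registry
    //      2.2 the instance's overridden status resolves to UNKNOWN (see AbstractInstanceRegistry#renew)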
    if (!isSuccess) {
        logger.warn("Not Found (Renew): {} - {}", app.getName(), id);
        return Response.status(Status.NOT_FOUND).build();
    }
    
    Response response;
    if (lastDirtyTimestamp != null && serverConfig.shouldSyncWhenTimestampDiffers()) {
        // Validate lastDirtyTimestamp
        response = this.validateDirtyTimestamp(Long.valueOf(lastDirtyTimestamp), isFromReplicaNode);
        // Store the overridden status since the validation found out the node that replicates wins
        if (response.getStatus() == Response.Status.NOT_FOUND.getStatusCode()
                && (overriddenStatus != null)
                && !(InstanceStatus.UNKNOWN.name().equals(overriddenStatus))
                && isFromReplicaNode) {
            registry.storeOverriddenStatusIfRequired(app.getAppName(), id, InstanceStatus.valueOf(overriddenStatus));
        }
    } else {
        response = Response.ok().build();
    }
    logger.debug("Found (Renew): {} - {}; reply status={}", app.getName(), id, response.getStatus());
    return response;
}
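
To make the request shape concrete, here is a minimal sketch of how a heartbeat URL could be assembled from the path and the query parameters that renewLease reads (status, lastDirtyTimestamp). The class and method names are hypothetical and not part of Eureka, and the sketch assumes appName and id contain only URL-safe characters.

```java
final class HeartbeatUrlSketch {
    /**
     * Builds the target of the heartbeat PUT request.
     * baseUrl is assumed to look like "http://localhost:8761/eureka" (hypothetical value).
     */
    static String heartbeatUrl(String baseUrl, String appName, String id,
                               String status, long lastDirtyTimestamp) {
        // Path: /apps/{appName}/{id}; query: the parameters read by renewLease
        return String.format("%s/apps/%s/%s?status=%s&lastDirtyTimestamp=%d",
                baseUrl, appName, id, status, lastDirtyTimestamp);
    }
}
```

A replicated heartbeat would use the same URL; what distinguishes it on the server side is the replication header (PeerEurekaNode.HEADER_REPLICATION) set to "true".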

1.2 Local renewal - renew

// PeerAwareInstanceRegistryImpl
public boolean renew(final String appName, final String id, final boolean isReplication) {
	if (super.renew(appName, id, isReplication)) {
		// After the local renewal succeeds, replicate it to the other nodes
		replicateToPeers(Action.Heartbeat, appName, id, null, null, isReplication);
		return true;
	}
	return false;
}

// AbstractInstanceRegistry
public boolean renew(String appName, String id, boolean isReplication) {
	RENEW.increment(isReplication);
	// 1. Look up the lease in the registry by appName and instance id
	Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
	Lease<InstanceInfo> leaseToRenew = null;
	if (gMap != null) {
		leaseToRenew = gMap.get(id);
	}
	// 2.1 The instance does not exist: return false
	if (leaseToRenew == null) {
		RENEW_NOT_FOUND.increment(isReplication);
		return false;
	// 2.2 The instance exists
	} else {
		InstanceInfo instanceInfo = leaseToRenew.getHolder();
		if (instanceInfo != null) {
			InstanceStatus overriddenInstanceStatus = this.getOverriddenInstanceStatus(instanceInfo, leaseToRenew,
					isReplication);
			// Return false when the resolved status is UNKNOWN
			if (overriddenInstanceStatus == InstanceStatus.UNKNOWN) {
				RENEW_NOT_FOUND.increment(isReplication);
				return false;
			}
			if (!instanceInfo.getStatus().equals(overriddenInstanceStatus)) {
				instanceInfo.setStatusWithoutDirty(overriddenInstanceStatus);

			}
		}
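		// 3. Increment the renewal counter. The dashboard's "Renews (last min)" shows this counter,
		//    and the dashboard warning is also decided from it
		renewsLastMin.increment();
		// 4. Update the lease's lastUpdateTimestamp (the core step)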
		leaseToRenew.renew();
		return true;
	}
}

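1.3 Dirty-data validation - validateDirtyTimestamp

Validation rule: the copy of an instance with the larger lastDirtyTimestamp is the more recent one, because this value is updated whenever the instance information is marked dirty (for example on a status change or before a re-registration). If the server's copy is not the latest, it returns NOT_FOUND so that the client registers again.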

private Response validateDirtyTimestamp(Long lastDirtyTimestamp,
            boolean isReplication) {
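	// 1. Load the local copy of the instance and compare it with the timestamp sent in the request
	InstanceInfo appInfo = registry.getInstanceByAppAndId(app.getName(), id, false);
	if (appInfo != null) {
		// 2. The timestamps differ, so the two sides hold different instance information
		if ((lastDirtyTimestamp != null) && (!lastDirtyTimestamp.equals(appInfo.getLastDirtyTimestamp()))) {
			// 3.1 The request carries the larger value: the server's copy is stale, so return NOT_FOUND to make the sender register again
			if (lastDirtyTimestamp > appInfo.getLastDirtyTimestamp()) {
				return Response.status(Status.NOT_FOUND).build();
			// 3.2 The server holds the larger value: its copy is the newer one
			} else if (appInfo.getLastDirtyTimestamp() > lastDirtyTimestamp) {
				if (isReplication) {
					// Replication between Eureka nodes: return CONFLICT together with the local copy so the peer can sync it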
					return Response.status(Status.CONFLICT).entity(appInfo).build();
				} else {
					return Response.ok().build();
				}
			}
		}

	}
	return Response.ok().build();
}
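
The branches of validateDirtyTimestamp can be condensed into a small decision function. The following is only a sketch: the class name, the method name, and the use of plain int status codes are assumptions made for illustration and are not Eureka code.

```java
final class DirtyTimestampSketch {
    static final int OK = 200;
    static final int NOT_FOUND = 404;
    static final int CONFLICT = 409;

    /**
     * @param requestTs     lastDirtyTimestamp sent with the heartbeat
     * @param localTs       lastDirtyTimestamp stored on this node, or null if the instance is unknown here
     * @param isReplication true when the heartbeat was replicated from a peer node
     */
    static int decide(long requestTs, Long localTs, boolean isReplication) {
        if (localTs == null || requestTs == localTs) {
            return OK;                       // nothing to compare, or both sides agree
        }
        if (requestTs > localTs) {
            return NOT_FOUND;                // local copy is stale: ask the sender to register again
        }
        return isReplication ? CONFLICT : OK; // local copy is newer: only a peer is told to sync
    }
}
```

Note the asymmetry: a client whose timestamp is older than the server's still gets OK, while a peer node gets CONFLICT plus the newer copy.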

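1.4 What the client does after the heartbeat request - renew

When the endpoint returns a success status code there is nothing more to do. When it returns NOT_FOUND, the client registers again.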

// DiscoveryClient
boolean renew() {
    EurekaHttpResponse<InstanceInfo> httpResponse;
    try {
        httpResponse = eurekaTransport.registrationClient.sendHeartBeat(instanceInfo.getAppName(), instanceInfo.getId(), instanceInfo, null);
        logger.debug(PREFIX + "{} - Heartbeat status: {}", appPathIdentifier, httpResponse.getStatusCode());
        if (httpResponse.getStatusCode() == Status.NOT_FOUND.getStatusCode()) {
            REREGISTER_COUNTER.increment();
            logger.info(PREFIX + "{} - Re-registering apps/{}", appPathIdentifier, instanceInfo.getAppName());
            long timestamp = instanceInfo.setIsDirtyWithTime();
            // The server returned NOT_FOUND: register again
            boolean success = register();
            if (success) {
                instanceInfo.unsetIsDirty(timestamp);
            }
            return success;
        }
        return httpResponse.getStatusCode() == Status.OK.getStatusCode();
    } catch (Throwable e) {
        logger.error(PREFIX + "{} - was unable to send heartbeat!", appPathIdentifier, e);
        return false;
    }
}
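
Stripped of logging and dirty-flag bookkeeping, the client-side rule is "renew, and on 404 fall back to register". A minimal sketch, with the HTTP calls replaced by hypothetical functional parameters (sendHeartbeat returns a status code, register returns whether registration succeeded):

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

final class HeartbeatClientSketch {
    /**
     * @param sendHeartbeat stands in for the heartbeat HTTP call and yields its status code
     * @param register      stands in for the registration call and yields its success flag
     * @return true if the lease is considered alive after this round
     */
    static boolean renew(IntSupplier sendHeartbeat, BooleanSupplier register) {
        int status = sendHeartbeat.getAsInt();
        if (status == 404) {
            // The server does not know this instance (or holds a stale copy): register again
            return register.getAsBoolean();
        }
        return status == 200;
    }
}
```

For example, `HeartbeatClientSketch.renew(() -> 404, () -> true)` triggers exactly one registration attempt and reports success.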

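1.5 Heartbeat replication

Heartbeat replication is the process in which a Eureka server, after successfully handling a client's request, synchronizes that request to the other nodes.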

// PeerAwareInstanceRegistryImpl
private void replicateToPeers(Action action, String appName, String id, InstanceInfo info /* optional */,
		InstanceStatus newStatus /* optional */, boolean isReplication) {
	Stopwatch tracer = action.getTimer().start();
	try {
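		// If this call is itself a replication, increment the replication counter
		if (isReplication) {
			numberOfReplicationsLastMin.increment();
		}
		// If this call is already a replication, do not replicate it to other nodes again
		if (peerEurekaNodes == Collections.EMPTY_LIST || isReplication) {
			return;
		}
		// Iterate over the peer nodes and replicate to each of them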
		for (final PeerEurekaNode node : peerEurekaNodes.getPeerEurekaNodes()) {
			// If the url represents this host, do not replicate to yourself.
			if (peerEurekaNodes.isThisMyUrl(node.getServiceUrl())) {
				continue;
			}
			replicateInstanceActionsToPeers(action, appName, id, info, newStatus, node);
		}
	} finally {
		tracer.stop();
	}
}
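
The isReplication check is what keeps a heartbeat from bouncing between nodes forever. Here is a small in-memory sketch of that guard. It is a simplification under stated assumptions: the class is hypothetical, and replication is done synchronously here, whereas Eureka queues it through an asynchronous batching dispatcher.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PeerNodeSketch {
    final String name;
    final List<PeerNodeSketch> peers = new ArrayList<>();
    final Map<String, Integer> renewCounts = new HashMap<>();

    PeerNodeSketch(String name) {
        this.name = name;
    }

    void register(String instanceId) {
        renewCounts.putIfAbsent(instanceId, 0);
    }

    boolean renew(String instanceId, boolean isReplication) {
        if (!renewCounts.containsKey(instanceId)) {
            return false; // unknown instance: nothing renewed, nothing replicated
        }
        renewCounts.merge(instanceId, 1, Integer::sum);
        if (!isReplication) {
            // Only a client-originated heartbeat is forwarded; peers receive it
            // with isReplication=true and therefore do not forward it again
            for (PeerNodeSketch peer : peers) {
                peer.renew(instanceId, true);
            }
        }
        return true;
    }
}
```

With three fully connected nodes, one client heartbeat sent to any node results in exactly one renewal on each node.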

//PeerAwareInstanceRegistryImpl
private void replicateInstanceActionsToPeers(Action action, String appName,
        String id, InstanceInfo info, InstanceStatus newStatus,
        PeerEurekaNode node) {
	try {
		InstanceInfo infoFromRegistry;
		CurrentRequestVersion.set(Version.V2);
		switch (action) {
			case Cancel:
				node.cancel(appName, id);
				break;
			case Heartbeat:
				InstanceStatus overriddenStatus = overriddenInstanceStatusMap.get(id);
				infoFromRegistry = getInstanceByAppAndId(appName, id, false);
				node.heartbeat(appName, id, infoFromRegistry, overriddenStatus, false);
				break;
			case Register:
				node.register(info);
				break;
			case StatusUpdate:
				infoFromRegistry = getInstanceByAppAndId(appName, id, false);
				node.statusUpdate(appName, id, newStatus, infoFromRegistry);
				break;
			case DeleteStatusOverride:
				infoFromRegistry = getInstanceByAppAndId(appName, id, false);
				node.deleteStatusOverride(appName, id, infoFromRegistry);
				break;
		}
	} catch (Throwable t) {
		logger.error("Cannot replicate information to {} for action {}", node.getServiceUrl(), action.name(), t);
	} finally {
		CurrentRequestVersion.remove();
	}
}

// PeerEurekaNode
public void heartbeat(final String appName, final String id,
        final InstanceInfo info, final InstanceStatus overriddenStatus,
        boolean primeConnection) throws Throwable {
		// 1. For a priming request the result does not matter: send the heartbeat and return
		if (primeConnection) {
			// We do not care about the result for priming request.
			replicationClient.sendHeartBeat(appName, id, info, overriddenStatus);
			return;
		}
		// 2. Normal case: a successful heartbeat needs no follow-up; failures are handled in handleFailure
		ReplicationTask replicationTask = new InstanceReplicationTask(targetHost, Action.Heartbeat, info, overriddenStatus, false) {
			@Override
			public EurekaHttpResponse<InstanceInfo> execute() throws Throwable {
				return replicationClient.sendHeartBeat(appName, id, info, overriddenStatus);
			}
		
			@Override
			public void handleFailure(int statusCode, Object responseEntity) throws Throwable {
				super.handleFailure(statusCode, responseEntity);
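				// 2.1 The peer returned NOT_FOUND: register the instance on the peer
				if (statusCode == 404) {
				  logger.warn("{}: missing entry.", getTaskName());
				  if (info != null) {
				      register(info);
				  }
				// 2.2 The peer's copy is newer than the local one: sync the peer's copy into the local node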
				} else if (config.shouldSyncWhenTimestampDiffers()) {
				  InstanceInfo peerInstanceInfo = (InstanceInfo) responseEntity;
				  if (peerInstanceInfo != null) {
				      syncInstancesIfTimestampDiffers(appName, id, info, peerInstanceInfo);
				  }
				}
			}
		};
		long expiryTime = System.currentTimeMillis() + getLeaseRenewalOf(info);
		batchingDispatcher.process(taskId("heartbeat", info), replicationTask, expiryTime);
}

// PeerEurekaNode
private void syncInstancesIfTimestampDiffers(String appName, String id, InstanceInfo info, InstanceInfo infoFromPeer) {
    try {
        if (infoFromPeer != null) {
            if (infoFromPeer.getOverriddenStatus() != null && !InstanceStatus.UNKNOWN.equals(infoFromPeer.getOverriddenStatus())) {
            	// 1. Store the overriddenStatus
            	registry.storeOverriddenStatusIfRequired(appName, id, infoFromPeer.getOverriddenStatus());
            }
            // 2. Update the local registration with the peer's copy
            registry.register(infoFromPeer, true);
        }
    } catch (Throwable e) {
        logger.warn("Exception when trying to set information from peer :", e);
    }
}

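As the code above shows, heartbeats are not the only action replicated between nodes: cancellation, registration, status updates, and deletion of a status override are replicated as well.

2. Automatic expiration

Besides handling cancel requests sent by clients, the server also starts a scheduled task that periodically evicts expired instances. Without it, a client that crashes never gets the chance to send a cancel request, and its instance would stay in the server's registry forever.

2.1 Starting the EvictionTask timer

Debugging leads to the following call chain: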

// AbstractInstanceRegistry
protected void postInit() {
    renewsLastMin.start();
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    evictionTaskRef.set(new EvictionTask());
    // Start the timer task
    // Note that both delay and period are eureka.server.evictionIntervalTimerInMs (default 60s)
    evictionTimer.schedule(evictionTaskRef.get(),
            serverConfig.getEvictionIntervalTimerInMs(),
            serverConfig.getEvictionIntervalTimerInMs());
}

// EvictionTask
public void run() {
    try {
        long compensationTimeMs = getCompensationTimeMs();
        evict(compensationTimeMs);
    } catch (Throwable e) {
        // Log instead of silently swallowing the failure
        logger.error("Could not run the evict task", e);
    }
}

// EvictionTask
long getCompensationTimeMs() {
    // Current timestamp
    long currNanos = getCurrentTimeNano();
    // Read the previous timestamp and store the current one in its place
    long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
    if (lastNanos == 0L) {
        return 0L;
    }
    // Compare the elapsed time (current - previous) with eureka.server.evictionIntervalTimerInMs (default 60s)
    // Normally this returns 0; it only compensates when a run was delayed
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
    long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
    return compensationTime <= 0L ? 0L : compensationTime;
}
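
To see what the compensation time amounts to, the same arithmetic can be written as a pure function with the timestamps passed in. The class and method names are hypothetical; the logic mirrors the listing above.

```java
import java.util.concurrent.TimeUnit;

final class CompensationSketch {
    /**
     * @param lastNanos  timestamp of the previous run in nanoseconds, 0 if there was none
     * @param currNanos  timestamp of the current run in nanoseconds
     * @param intervalMs configured eviction interval in milliseconds
     * @return how many milliseconds the current run is late, never negative
     */
    static long compensationMs(long lastNanos, long currNanos, long intervalMs) {
        if (lastNanos == 0L) {
            return 0L; // first run: nothing to compensate
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }
}
```

If two runs are 75s apart with a 60s interval, the result is 15000 ms, and that extra time is added to every lease's allowance in isExpired so that a delayed task does not evict instances too eagerly.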

Note that the interval between runs is configured by eureka.server.evictionIntervalTimerInMs and defaults to 60s.

2.2 EvictionTask execution flow

[Figure 1: EvictionTask execution flow]

2.3 How expiration is determined

2.3.1 Key fields of Lease

// Lease
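private long evictionTimestamp;     		// Time the lease was first taken offline (set whether triggered by a cancel request or by the eviction task)
private long registrationTimestamp; 		// Registration time (set on every registration)
private long serviceUpTimestamp;    		// Time the service first came up
private volatile long lastUpdateTimestamp;      // Time of the last heartbeat
private long duration;				// Lease duration, 90s by default

lastUpdateTimestamp deserves attention. It is updated in two places: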

//Lease
public Lease(T r, int durationInSecs) {
    holder = r;
    registrationTimestamp = System.currentTimeMillis();
    // Every registration creates a new Lease, so lastUpdateTimestamp is reset on every registration
    lastUpdateTimestamp = registrationTimestamp;
    duration = (durationInSecs * 1000);
}

//Lease
public void renew() {
    // Called on every renewal; sets lastUpdateTimestamp to now + duration (duration defaults to 90s)
    lastUpdateTimestamp = System.currentTimeMillis() + duration;
}

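First, it is assigned when the Lease is created, that is, every time a client sends a registration request. Note that it is set to the current time.

Second, it is updated when a client renews the lease. Note that it is set to the current time + duration (90s by default); this matters later when expiration is checked.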

2.3.2 Checking whether eviction is allowed

// PeerAwareInstanceRegistryImpl
public boolean isLeaseExpirationEnabled() {
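    // 1. Is self-preservation enabled (eureka.server.enableSelfPreservation, default true)?
    // If it is disabled, return true right away
    if (!isSelfPreservationModeEnabled()) {
        return true;
    }
    // 2. Even with self-preservation enabled, expired instances can still be evicted as long as
    //    renewals in the last minute > renewals-per-minute threshold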
    return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
}

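The method first checks whether self-preservation is enabled on the server (eureka.server.enableSelfPreservation, default true). If it is disabled, the method returns true, which means eviction is allowed. If it is enabled, the method compares the number of renewals in the last minute with the renewals-per-minute threshold: it returns true when the count is above the threshold and false otherwise.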
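
The same check written as a pure function makes the three outcomes easy to try out. The class name and the parameter names are hypothetical; the condition is the one from isLeaseExpirationEnabled above.

```java
final class LeaseExpirationSketch {
    /**
     * @param selfPreservationEnabled value of eureka.server.enableSelfPreservation
     * @param renewsPerMinThreshold   minimum number of renewals per minute the server expects
     * @param renewsInLastMin         renewals actually received in the last minute
     * @return true if expired leases may be evicted
     */
    static boolean isLeaseExpirationEnabled(boolean selfPreservationEnabled,
                                            int renewsPerMinThreshold,
                                            long renewsInLastMin) {
        if (!selfPreservationEnabled) {
            return true; // self-preservation off: eviction is always allowed
        }
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }
}
```

For instance, with a threshold of 17 and only 10 renewals in the last minute, the server assumes a network problem rather than 7 dead clients and refuses to evict anything.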

2.3.3 How an instance is judged expired

// Lease
public boolean isExpired(long additionalLeaseMs) {
    return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}

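additionalLeaseMs is only a tolerance for delayed runs. It is normally 0 and is ignored here.

evictionTimestamp > 0 means the instance has already been taken offline once, so it counts as expired immediately.

The key is lastUpdateTimestamp. As discussed above, it is updated on registration and on renewal, and on renewal duration has already been added once.

Putting this together, an instance is expired when any of the following holds:

a. evictionTimestamp > 0

b. evictionTimestamp <= 0 && current time > registration time + duration (the instance died after registering, before its first renewal)

c. evictionTimestamp <= 0 && current time > time of the last actual renewal (without duration) + 2 * duration (the instance died after renewing)

Adding the scheduling interval (60s), a server with self-preservation enabled needs roughly 90s to 240s to evict an expired instance.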
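
Conditions a to c can be checked with a stripped-down lease that takes the clock as a parameter. This is a sketch for experimenting with the timing, not the Eureka class: the name LeaseTimingSketch and the explicit nowMs arguments are assumptions.

```java
final class LeaseTimingSketch {
    final long durationMs;
    long lastUpdateTimestamp;
    long evictionTimestamp;

    LeaseTimingSketch(long nowMs, long durationMs) {
        this.durationMs = durationMs;
        this.lastUpdateTimestamp = nowMs;       // registration: plain current time
    }

    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs; // renewal: current time plus duration
    }

    void cancel(long nowMs) {
        if (evictionTimestamp <= 0) {
            evictionTimestamp = nowMs;
        }
    }

    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return evictionTimestamp > 0
                || nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }
}
```

With a 90s duration, a lease registered at t=1s and never renewed expires just after t=91s, while a lease renewed at t=10s expires just after t=190s, that is 180s after the renewal.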

2.3.4 Taking expired instances offline

// AbstractInstanceRegistry
public void evict(long additionalLeaseMs) {
    // 1. Check whether eviction is allowed (self-preservation), see 2.3.2
    if (!isLeaseExpirationEnabled()) {
        return;
    }

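    // 2. Collect all expired leases
    List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
    for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
        Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
        if (leaseMap != null) {
            for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                Lease<InstanceInfo> lease = leaseEntry.getValue();
                // 3. The expiration condition is the important part, see 2.3.3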
                if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                    expiredLeases.add(lease);
                }
            }
        }
    }

    // 4. Cap the number of evictions with a threshold
    // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
    // triggering self-preservation. Without that we would wipe out full registry.
    int registrySize = (int) getLocalRegistrySize();
    int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
    int evictionLimit = registrySize - registrySizeThreshold;

    int toEvict = Math.min(expiredLeases.size(), evictionLimit);
    if (toEvict > 0) {
        // 5. Evict expired instances in random order
        Random random = new Random(System.currentTimeMillis());
        for (int i = 0; i < toEvict; i++) {
            // Pick a random item (Knuth shuffle algorithm)
            int next = i + random.nextInt(expiredLeases.size() - i);
            Collections.swap(expiredLeases, i, next);
            Lease<InstanceInfo> lease = expiredLeases.get(i);

            String appName = lease.getHolder().getAppName();
            String id = lease.getHolder().getId();
            EXPIRED.increment();
            // 6. Same as a client-triggered cancel: both call internalCancel
            internalCancel(appName, id, false);
        }
    }
}

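Besides the two checks above, a cap is applied after the expired instances have been collected (the code comment explains why), and the instances to evict are then chosen at random.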
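
The cap and the random selection can be tried out in isolation. The sketch below uses hypothetical names and works on a copy of the list; it assumes the default eureka.server.renewalPercentThreshold of 0.85 in the usage example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class EvictionLimitSketch {
    /** Maximum number of leases that one eviction run may remove. */
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    /** Picks min(expired.size(), limit) random elements using a partial Knuth shuffle. */
    static <T> List<T> pickToEvict(List<T> expired, int limit, Random random) {
        List<T> pool = new ArrayList<>(expired); // leave the caller's list untouched
        int toEvict = Math.max(0, Math.min(pool.size(), limit));
        for (int i = 0; i < toEvict; i++) {
            // Swap a random element of the not-yet-chosen tail into position i
            int next = i + random.nextInt(pool.size() - i);
            Collections.swap(pool, i, next);
        }
        return new ArrayList<>(pool.subList(0, toEvict));
    }
}
```

With 100 registered instances and a threshold of 0.85, at most 15 instances are removed per run, even if 40 of them have expired; the rest wait for later runs. Choosing them at random spreads the evictions across applications instead of wiping out a single one.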

Reference: https://www.cnblogs.com/binarylei/p/11621403.html
