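Problem description
- When services are released on the Alibaba Cloud k8s platform, external access is unstable during the release.
- Services call each other through FeignClient.
- k8s Services are used for service exposure and name resolution; no service registry is used.
- Every service runs with a single pod (replica count 1).
- Health check setting: a TCP readiness probe on the service port 8080.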
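Analysis
Debugging:
- Run the release process for shopService while the web service checks shopService in a loop.
- The check code in the web service is shown below; run0() is called in a loop at 100 ms intervals (the Java DNS cache time is set to 0, and all readiness probe parameters are set to 1).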
    // Log labels: 时间 = time, 医院 = hospital (the shop name), 异常 = exception
    private void run0(){
        // Resolved IPs before the request
        log.info("时间:{} ips:{}", getCurrentTimeMillis(), getShopServiceIps());
        try{
            ShopDetailDto shop = shopService.getShopDetailById(10002L);
            // Resolved IPs are fetched again after the request returns
            log.info("时间:{} 医院:{} ips:{}", getCurrentTimeMillis(), shop.getName(), getShopServiceIps());
        }catch (Exception e){
            log.warn("时间:{} ips:{} 异常:{} {}", getCurrentTimeMillis(), getShopServiceIps(), e.getMessage(), e.getLocalizedMessage());
        }
    }
- Log (key parts, simplified; labels: 时间 = time, 医院 = hospital/shop name, 异常 = exception)
19:15:33.322 INFO - 时间:1597227333321 ips:[shop-service/172.20.0.159]
19:15:33.348 INFO - 时间:1597227333348 医院:**** ips:[shop-service/172.20.0.159]
19:15:33.449 INFO - 时间:1597227333448 ips:[shop-service/172.20.0.159]
19:15:33.499 WARN - 时间:1597227333499 ips:[shop-service/172.20.0.159] 异常:Connection reset executing GET http://shop-service/shop/getShopDetailById?shopId=10002 Connection reset executing GET http://shop-service/shop/getShopDetailById?shopId=10002
19:15:33.600 INFO - 时间:1597227333599 ips:[shop-service/172.20.0.159]
19:15:33.602 WARN - 时间:1597227333602 ips:[shop-service/172.20.0.159] 异常:Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002 Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002
19:15:33.702 INFO - 时间:1597227333702 ips:[shop-service/172.20.0.159]
19:15:33.704 WARN - 时间:1597227333703 ips:[shop-service/172.20.0.159] 异常:Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002 Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002
19:15:33.804 INFO - 时间:1597227333804 ips:[shop-service/172.20.0.159]
19:15:33.806 WARN - 时间:1597227333805 ips:[shop-service/172.20.0.159] 异常:Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002 Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002
19:15:33.906 INFO - 时间:1597227333906 ips:[shop-service/172.20.0.159]
19:15:33.907 WARN - 时间:1597227333907 ips:[shop-service/172.20.0.159] 异常:Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002 Failed to connect to shop-service/172.20.0.159:8080 executing GET http://shop-service/shop/getShopDetailById?shopId=10002
19:15:34.008 INFO - 时间:1597227334007 ips:[shop-service/172.20.0.159]
19:15:34.310 WARN - 时间:1597227334309 ips:[shop-service/172.20.0.51] 异常:connect timed out executing GET http://shop-service/shop/getShopDetailById?shopId=10002 connect timed out executing GET http://shop-service/shop/getShopDetailById?shopId=10002
19:15:34.410 INFO - 时间:1597227334410 ips:[shop-service/172.20.0.51]
19:15:38.772 INFO - 时间:1597227338772 医院:**** ips:[shop-service/172.20.0.51]
19:15:38.873 INFO - 时间:1597227338872 ips:[shop-service/172.20.0.51]
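The `getShopServiceIps()` helper called in `run0()` is not shown in the original. A minimal sketch of what such a helper might look like follows, assuming it resolves the Service name with `InetAddress.getAllByName` (the `shop-service/172.20.0.159` entries in the log match the `host/ip` format of `InetAddress.toString()`). The class name `DnsProbe` and both method names are hypothetical, and setting `networkaddress.cache.ttl` is only one possible way to get the "DNS cache time 0" mentioned above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.Arrays;

// Hypothetical reconstruction for illustration; not the author's code.
class DnsProbe {

    // Turn off the JVM's DNS caching so each lookup goes to the resolver.
    // Best called at startup, before the first lookup.
    static void disableDnsCache() {
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    // Returns the addresses currently resolved for the host as a string.
    // InetAddress.toString() prints "hostname/ip".
    static String resolve(String host) {
        try {
            return Arrays.toString(InetAddress.getAllByName(host));
        } catch (UnknownHostException e) {
            return "[unresolved: " + host + "]";
        }
    }

    public static void main(String[] args) {
        disableDnsCache();
        // In the cluster this would be the Service name, e.g. "shop-service".
        System.out.println(resolve("localhost"));
    }
}
```

Logging the resolved addresses right before and right after each request, as `run0()` does, is what makes it possible to see which pod IP DNS was handing out while the failures occurred.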
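Log walkthrough
- Old container IP: 172.20.0.159; new container IP: 172.20.0.51.
- Note: on the shop-name (医院) and exception (异常) lines, ips is fetched after the request, so it is not necessarily the IP the request used. When the values before and after a request agree and there is only one, it can be taken as a reference.
- At first the requests reach the old container and succeed.
- Then there is one Connection reset, which means the connection to the old container had been established but the old container dropped the TCP connection. After that, every connection attempt fails.
- Only after the new container's IP is resolved does a result come back.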
Problem summary
- After the old container has stopped serving, its IP is still present in DNS.
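Related material
Readiness check:
Contrary to my earlier understanding, the readiness probe only checks whether the container may receive traffic, and it keeps running for the whole life of the container. It is not a one-time "startup succeeded" check.
Official definition: Indicates whether the container is ready to respond to requests. If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a container does not provide a readiness probe, the default state is Success. Reference: https://kubernetes.io/zh/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
The readiness check can run at most once per second, and the minimum unhealthy threshold is 1 failure.
Take two consecutive checks A and B: if the service stops between A and B, the container's IP can still be obtained from DNS until check B detects the failure (and if B fails by timing out, the timeout duration is added on top).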
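The A-B reasoning above can be turned into a small worst-case calculation. This is a rough sketch, assuming the service stops right after a successful check and that the console settings map to the Kubernetes probe fields `periodSeconds`, `failureThreshold` and `timeoutSeconds` (that mapping is my assumption). It ignores the extra time needed for the endpoint and DNS update to propagate. The class and method names are made up for illustration.

```java
// Illustrative arithmetic only; names are hypothetical.
class ReadinessWindow {

    // Worst-case seconds during which a stopped service may still be
    // considered ready: failureThreshold consecutive checks must fail
    // (one per period), and the last one may take up to timeoutSeconds
    // to fail if the connection hangs instead of being refused.
    static int worstCaseSeconds(int periodSeconds, int failureThreshold, int timeoutSeconds) {
        return periodSeconds * failureThreshold + timeoutSeconds;
    }

    public static void main(String[] args) {
        // "All 1" settings as used in the debugging session: 1*1 + 1 = 2
        System.out.println(worstCaseSeconds(1, 1, 1));
        // Kubernetes defaults (period 10 s, threshold 3, timeout 1 s): 10*3 + 1 = 31
        System.out.println(worstCaseSeconds(10, 3, 1));
    }
}
```

Even with the fastest settings the window is not zero, which is why tightening the probe alone cannot remove the failed requests.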
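Container stop:
When a container is terminated, a signal (SIGTERM) is sent to the process inside the container. The JVM then runs the shutdown hook registered by Spring, and Spring shuts the service down right away, which drops in-flight connections and closes the port.
Container hooks
A container can synchronously run a predefined hook before it is terminated (if the hook runs longer than the configured timeout, termination no longer waits for it).
Reference: https://kubernetes.io/zh/docs/concepts/containers/container-lifecycle-hooks/#hook-details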
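The mechanism described above can be observed with a plain JVM shutdown hook. This is a minimal sketch rather than Spring's actual implementation; `ShutdownHookDemo` and `registerHook` are hypothetical names. A hook registered through `Runtime.addShutdownHook` runs when the JVM exits normally or receives SIGTERM, but not on SIGKILL.

```java
// Minimal illustration of a JVM shutdown hook; names are hypothetical.
class ShutdownHookDemo {

    // Registers a cleanup task that the JVM runs during shutdown
    // (normal exit or a termination signal such as SIGTERM).
    static Thread registerHook(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerHook(() -> System.out.println("shutdown hook: releasing resources"));
        System.out.println("main finished");
        // The hook runs after main returns, as the JVM shuts down.
    }
}
```

Because the hook only starts once the signal has arrived, any code placed there runs too late to keep new traffic away from the pod, which is the gap the container hook described next is meant to fill.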
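Solution
Goals:
- As far as possible, let requests that are already connected finish before the service stops.
- Make the readiness check fail before the service stops, so that the container's IP is removed from DNS.
Approach:
- Use an HTTP readiness check (so the service itself controls the returned status).
- Use the synchronous pre-termination hook to trigger our own code, which flips the health status and runs the wind-down logic.
Health check class example: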
    import lombok.SneakyThrows;
    import lombok.extern.slf4j.Slf4j;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;
    import javax.annotation.PreDestroy;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    /**
     * @author 许文
     * @date 2020-08-13 16:51
     */
    @Slf4j
    @RestController
    public class HealthCheck {

        // volatile: written by the shutdown request thread, read by probe request threads
        private static volatile boolean health = true;

        // Readiness probe target: 200 while healthy, 500 once shutdown has been requested
        @GetMapping("/healthCheck")
        public void getHealthCheckStatus(HttpServletResponse response){
            int httpCode = health ? 200 : 500;
            response.setStatus(httpCode);
            log.debug("health check: {} httpCode:{}", health, httpCode);
        }

        // Called by the pre-stop hook; only accepted from inside the container
        @SneakyThrows
        @GetMapping("/healthCheck/shutdown")
        public void shutdown(HttpServletRequest request){
            // Read the request information and validate it
            String remoteHost = request.getRemoteHost();
            String remoteAddr = request.getRemoteAddr();
            String requestUrl = request.getRequestURL().toString();
            if (allEquals(remoteAddr, remoteHost, "127.0.0.1")
                    && requestUrl.equals("http://127.0.0.1:8080/healthCheck/shutdown")
            ){
                health = false;
                log.error("received shutdown request remoteHost:{} remoteAddr:{} requestUrl:{}", remoteHost, remoteAddr, requestUrl);
            }else {
                log.error("received invalid shutdown request remoteHost:{} remoteAddr:{} requestUrl:{}", remoteHost, remoteAddr, requestUrl);
                return;
            }
            // TODO: wind-down logic can be implemented here
            // Block the hook for 20 s so the readiness probe has time to see the 500
            for (int i = 20; i > 0; i--) {
                log.warn("shutting down in: {}", i);
                Thread.sleep(1000);
            }
            log.error("shut down");
        }

        @PreDestroy
        public void preDestroy() {
            /* A method annotated with @PreDestroy is called by Spring on normal shutdown.
             * By the time it runs, the HTTP server may already be closed; this has not been tested in detail.
             */
            log.error("service terminating...");
        }

        // True only if every argument equals the first one (and the first is not null)
        private static boolean allEquals(Object ...objects){
            if (objects.length == 0){
                return true;
            }
            Object first = objects[0];
            if (first == null){
                return false;
            }
            for (Object object : objects) {
                if (!first.equals(object)){
                    return false;
                }
            }
            return true;
        }
    }
Service deployment (stateless, Deployment) configuration:
- Pre-stop command: wget http://127.0.0.1:8080/healthCheck/shutdown