Troubleshooting a DOWN Status in the Service Registry

Background

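A project needed to upgrade its Kubernetes cluster. The existing k8s version was old and the deployment layout was not well designed, so we decided to build a new k8s cluster and migrate the applications to it. The migration itself was far from smooth, and I will cover it in a separate article. After the migration, some applications showed a status of DOWN in the service registry.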


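When a service's status is DOWN, calls to it fail with a 404 error. Because the application has health checks configured, I suspected that a health check was failing, so I opened a shell in the container and called the health endpoint to see the result: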

$ curl http://localhost:8080/management/health
{"description":"Remote status from Eureka server","status":"DOWN"}

The service was generated with the JHipster framework, and its configuration file contains the following:

eureka:
    client:
        enabled: true
        healthcheck:
            enabled: true
        fetch-registry: true
        register-with-eureka: true
        instance-info-replication-interval-seconds: 10
        registry-fetch-interval-seconds: 10

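After setting eureka.client.healthcheck.enabled to false, the status in the registry returned to normal, so the health check was clearly the cause. However, the application reported no errors at all, and even with the log level set to the most verbose setting nothing showed up, which made the investigation much harder.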

Investigation

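Because some services were healthy and others were not, my first idea was to compare a healthy service with an unhealthy one. I tried the following: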

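  • Comparing the configuration files of the two services
  • Comparing the network packets of the two services
  • Running the service locally (it has too many dependencies, and I never got it to start)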

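None of these turned up the cause. Looking back now that the problem is solved, the difference between the two services could have been found this way; I simply was not careful enough at the time.

With comparison getting nowhere, the only option left was to dig into the source code and find the root cause. The first question is where to start reading. The health check response contains the message Remote status from Eureka server, which can be used as a keyword for a global search in IDEA (the sources must be downloaded along with the Maven dependencies), or you can search on GitHub; either way leads to where the string comes from. It turns out to be in org.springframework.cloud.netflix.eureka.EurekaHealthIndicator.getStatus: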

private Status getStatus(Builder builder) {
    Status status = new Status(this.eurekaClient.getInstanceRemoteStatus().toString(),
            "Remote status from Eureka server");
    DiscoveryClient discoveryClient = getDiscoveryClient();
    if (discoveryClient != null && clientConfig.shouldFetchRegistry()) {
        long lastFetch = discoveryClient.getLastSuccessfulRegistryFetchTimePeriod();
        if (lastFetch < 0) {
            status = new Status("UP",
                    "Eureka discovery client has not yet successfully connected to a Eureka server");
        }
        else if (lastFetch > clientConfig.getRegistryFetchIntervalSeconds() * 2000) {
            status = new Status("UP",
                    "Eureka discovery client is reporting failures to connect to a Eureka server");
            builder.withDetail("renewalPeriod", instanceConfig.getLeaseRenewalIntervalInSeconds());
            builder.withDetail("failCount", lastFetch / clientConfig.getRegistryFetchIntervalSeconds());
        }
    }
    return status;
}
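To make the branching in getStatus easier to follow, here is a minimal standalone sketch of the same decision. The class name, method name and shortened messages are my own (hypothetical); the null check on the discovery client is folded into the fetchRegistry flag, and only the thresholds mirror the snippet above. With registry-fetch-interval-seconds set to 10, the failure threshold works out to 20000 ms.

```java
// Standalone sketch of the branching in EurekaHealthIndicator.getStatus.
// Names and messages are hypothetical; only the thresholds follow the original.
final class EurekaStatusSketch {

    static String describe(String remoteStatus, boolean fetchRegistry,
                           long lastFetchMillis, int fetchIntervalSeconds) {
        // Default: report whatever status the Eureka server holds for this instance.
        String result = remoteStatus + ": Remote status from Eureka server";
        if (fetchRegistry) {
            if (lastFetchMillis < 0) {
                // No successful registry fetch has happened yet.
                result = "UP: not yet connected to a Eureka server";
            } else if (lastFetchMillis > fetchIntervalSeconds * 2000L) {
                // Last successful fetch is older than twice the fetch interval.
                result = "UP: failing to connect to a Eureka server";
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describe("DOWN", true, 3_000, 10));
        System.out.println(describe("DOWN", true, 25_000, 10));
    }
}
```

Read this way, a response of DOWN with the text Remote status from Eureka server suggests that the client is reaching Eureka without trouble and that the DOWN value is simply the status the registry holds for the instance, so the cause has to be looked for elsewhere.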

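I did not have the service's source code, and with so many dependencies running it locally was not realistic, so it helps to build a small local demo to understand the whole health check flow. This is where experience comes in. From reading the source, the health check flow is roughly as follows: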

  • The status is obtained from org.springframework.cloud.netflix.eureka.EurekaHealthCheckHandler.getStatus
  • EurekaHealthCheckHandler holds an org.springframework.boot.actuate.health.CompositeHealthIndicator, which carries out the actual health check logic
  • CompositeHealthIndicator holds a set of health indicators and runs each one in turn (by calling its health method)
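The flow above can be sketched without Spring as a map of named checks plus an aggregation rule. This is a simplified model, not the real CompositeHealthIndicator: the class name, the use of Supplier and the severity list are assumptions made for illustration. The point is that a single DOWN indicator is enough to make the aggregate DOWN.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Simplified, hypothetical model of a composite health check.
final class CompositeHealthSketch {

    // Most severe status first; the aggregate is the most severe status seen.
    private static final List<String> SEVERITY =
            Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    static String aggregate(Map<String, Supplier<String>> indicators) {
        String worst = "UNKNOWN";
        for (Supplier<String> indicator : indicators.values()) {
            String status = indicator.get();            // run one indicator
            int rank = SEVERITY.indexOf(status);
            if (rank >= 0 && rank < SEVERITY.indexOf(worst)) {
                worst = status;                         // keep the most severe status
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        Map<String, Supplier<String>> indicators = new LinkedHashMap<>();
        indicators.put("diskSpaceHealthIndicator", () -> "UP");
        indicators.put("redisHealthIndicator", () -> "DOWN");
        indicators.put("dbHealthIndicator", () -> "UP");
        System.out.println(aggregate(indicators));
    }
}
```

Because the aggregate only carries the final status, it says nothing about which indicator caused it, which is why the individual indicators have to be inspected next.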

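With the flow roughly understood, it was time to bring out Arthas. Since Arthas has to run inside the container, you may want to read my earlier article on using Arthas in Docker first.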

  • Watch the getStatus method; it does return DOWN
➜ watch org.springframework.cloud.netflix.eureka.EurekaHealthCheckHandler getStatus "{returnObj}" -x 2

Affect(class count: 1 , method count: 1) cost in 107 ms, listenerId: 4
method=org.springframework.cloud.netflix.eureka.EurekaHealthCheckHandler.getStatus location=AtExit
ts=2021-03-24 09:38:03; [cost=13.776747ms] result=@ArrayList[
    @InstanceStatus[
        UP=@InstanceStatus[UP],
        DOWN=@InstanceStatus[DOWN],
        STARTING=@InstanceStatus[STARTING],
        OUT_OF_SERVICE=@InstanceStatus[OUT_OF_SERVICE],
        UNKNOWN=@InstanceStatus[UNKNOWN],
        $VALUES=@InstanceStatus[][isEmpty=false;size=5],
        name=@String[DOWN],
        ordinal=@Integer[1],
    ],
]
  • Watch the health method of CompositeHealthIndicator
➜ watch org.springframework.boot.actuate.health.CompositeHealthIndicator health "{returnObj,target.indicators}" -x 2
Press Q or Ctrl+C to abort.
Affect(class count: 2 , method count: 1) cost in 194 ms, listenerId: 6
method=org.springframework.boot.actuate.health.CompositeHealthIndicator.health location=AtExit
ts=2021-03-24 09:46:04; [cost=11.390849ms] result=@ArrayList[
    @Health[
        status=@Status[DOWN],
        details=@UnmodifiableMap[isEmpty=false;size=7],
    ],
    @LinkedHashMap[
        @String[discoveryClient]:@Holder[org.springframework.cloud.client.discovery.health.DiscoveryCompositeHealthIndicator$Holder@47625d8a],
        @String[diskSpaceHealthIndicator]:@DiskSpaceHealthIndicator[org.springframework.boot.actuate.health.DiskSpaceHealthIndicator@3f01e628],
        @String[redisHealthIndicator]:@RedisHealthIndicator[org.springframework.boot.actuate.health.RedisHealthIndicator@17b54981],
        @String[dbHealthIndicator]:@DataSourceHealthIndicator[org.springframework.boot.actuate.health.DataSourceHealthIndicator@10534a8a],
        @String[refreshScopeHealthIndicator]:@RefreshScopeHealthIndicator[org.springframework.cloud.health.RefreshScopeHealthIndicator@2284c82d],
        @String[configServerHealthIndicator]:@ConfigServerHealthIndicator[org.springframework.cloud.config.client.ConfigServerHealthIndicator@4ec50d1a],
        @String[hystrixHealthIndicator]:@HystrixHealthIndicator[org.springframework.cloud.netflix.hystrix.HystrixHealthIndicator@5c5c6962],
    ],
]

This gives us an important piece of information: the service has the following health check components in total

org.springframework.cloud.client.discovery.health.DiscoveryCompositeHealthIndicator$Holder
org.springframework.boot.actuate.health.DiskSpaceHealthIndicator
org.springframework.boot.actuate.health.RedisHealthIndicator
org.springframework.boot.actuate.health.DataSourceHealthIndicator
org.springframework.cloud.health.RefreshScopeHealthIndicator
org.springframework.cloud.config.client.ConfigServerHealthIndicator
org.springframework.cloud.netflix.hystrix.HystrixHealthIndicator

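In theory, checking them one by one would reveal which one is failing, but there is a quicker way: since these components all extend AbstractHealthIndicator, watching that class is enough.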

  • Watch the health method of AbstractHealthIndicator
➜ watch org.springframework.boot.actuate.health.AbstractHealthIndicator health "{returnObj,target}" -x 2
...
method=org.springframework.boot.actuate.health.AbstractHealthIndicator.health location=AtExit
ts=2021-03-24 09:50:55; [cost=7.652594ms] result=@ArrayList[
    @Health[
        status=@Status[DOWN],
        details=@UnmodifiableMap[isEmpty=false;size=1],
    ],
    @RedisHealthIndicator[
        VERSION=@String[version],
        REDIS_VERSION=@String[redis_version],
        redisConnectionFactory=@JedisConnectionFactory[org.springframework.data.redis.connection.jedis.JedisConnectionFactory@4c91526e],
    ],
]...

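So RedisHealthIndicator is the check that fails and turns the overall result into DOWN. But how do we find out what the actual error is? It is not in the logs. Take a look at the source of health: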

public abstract class AbstractHealthIndicator implements HealthIndicator {
    @Override
    public final Health health() {
        Health.Builder builder = new Health.Builder();
        try {
            doHealthCheck(builder);
        }
        catch (Exception ex) {
            builder.down(ex);
        }
        return builder.build();
    }
}
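The effect of that try/catch can be reproduced in a few lines of plain Java. This is a hedged sketch rather than Spring Boot's code: the Check interface, the map-based result and the rule that a check which does not throw counts as UP are all simplifications I introduced. It shows why nothing reaches the logs: the exception is neither logged nor rethrown, it is only recorded in the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the try/catch in AbstractHealthIndicator.health().
final class SwallowedErrorSketch {

    /** A health check that may throw, standing in for doHealthCheck. */
    interface Check {
        void run() throws Exception;
    }

    static Map<String, String> health(Check check) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            check.run();
            result.put("status", "UP");      // simplification: no exception means UP
        } catch (Exception ex) {
            // The exception is swallowed here: no log statement, no rethrow.
            result.put("status", "DOWN");
            result.put("error", ex.getClass().getName() + ": " + ex.getMessage());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> h = health(() -> {
            throw new IllegalStateException("Could not get a resource from the pool");
        });
        System.out.println(h);
    }
}
```

The error text therefore only exists inside the returned health object. In this case the health endpoint output shown earlier contained nothing but the status and description, so the exception has to be captured at the point where it is handed to the builder.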

If an exception is thrown, execution reaches builder.down(ex); so watching that method is enough to see the error:

➜ watch org.springframework.boot.actuate.health.Health$Builder down "{params}" -x 2
....
@RedisConnectionFailureException[org.springframework.data.redis.RedisConnectionFailureException: Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool],
    ],
]

We finally have the error message:

redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool

The Redis configuration is as follows:

# Redis host
host=redis
# Redis port
port=6379
# Auth password
password=*****
# Timeout, in ms
timeout=100000

Run curl inside the container (there is no ping command):

# curl redis
curl: (6) Could not resolve host: redis

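DNS cannot resolve redis, even though a redis Service does exist in k8s. It turned out that the application and redis were in two different namespaces. In Kubernetes, a Service in another namespace has to be addressed with a name in the following format: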

$svc_name.$namespace.svc.cluster.local

Run curl again:

# curl redis.k2-infrastructure.svc.cluster.local
Failed to connect to redis.k2-infrastructure.svc.cluster.local port 80: No route to host

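It still fails, but the error shows that DNS resolution now works (curl is trying port 80 rather than the Redis port 6379, so the connection error itself is not surprising). Does that mean the connection settings have to be changed? No: adding a DNS search domain makes the short name resolvable without touching the connection settings.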
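The two curl checks above can also be done from the JVM, which is handy when the image has no network tools at all. The sketch below is a hypothetical helper, not part of the application: it builds the cross-namespace name in the $svc_name.$namespace.svc.cluster.local form and asks the JVM's resolver whether each name resolves. It assumes the cluster domain is cluster.local, as in the format above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for checking Service name resolution from inside a pod.
final class ServiceDnsCheck {

    // Builds the name in the $svc_name.$namespace.svc.cluster.local form.
    static String fqdn(String service, String namespace) {
        return service + "." + namespace + ".svc.cluster.local";
    }

    // Returns true if the JVM's resolver can resolve the host name.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String shortName = "redis";
        String fullName = fqdn("redis", "k2-infrastructure");
        System.out.println(shortName + " resolves: " + resolves(shortName));
        System.out.println(fullName + " resolves: " + resolves(fullName));
    }
}
```

Run inside the affected pod, the short name would be expected to fail and the full name to succeed, matching what curl reported.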

Solution

In the Rancher deployment settings, add a search domain named $namespace.svc.cluster.local, where $namespace is the namespace that redis runs in


Alternatively, add dnsConfig in the YAML:

apiVersion: apps/v1
kind: Deployment
...
spec:
 ....
    spec:
      ....
      dnsPolicy: ClusterFirst
      dnsConfig:
        searches:
          - xx-infrastructure.svc.cluster.local
status: {}
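Why an extra search domain is enough can be shown with a simplified model of how a resolver expands a short name. This is a sketch under assumptions: the class is hypothetical, app-ns is a made-up namespace for the application, the search list is the usual three-entry list a pod gets under ClusterFirst plus the added domain, and ndots is taken as 5. A name with fewer dots than ndots is tried with each search domain appended before it is tried as-is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of resolver search-list expansion.
final class SearchListSketch {

    static List<String> candidates(String name, List<String> searches, int ndots) {
        List<String> out = new ArrayList<>();
        if (name.endsWith(".")) {               // trailing dot: already fully qualified
            out.add(name);
            return out;
        }
        long dots = name.chars().filter(c -> c == '.').count();
        if (dots >= ndots) {
            out.add(name);                      // enough dots: try the name as-is first
        }
        for (String domain : searches) {
            out.add(name + "." + domain);       // then each search domain in order
        }
        if (dots < ndots) {
            out.add(name);                      // few dots: the bare name is tried last
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> searches = Arrays.asList(
                "app-ns.svc.cluster.local",     // hypothetical application namespace
                "svc.cluster.local",
                "cluster.local",
                "k2-infrastructure.svc.cluster.local"); // domain added via dnsConfig
        candidates("redis", searches, 5).forEach(System.out::println);
    }
}
```

With the added entry, one of the candidates for redis is redis.k2-infrastructure.svc.cluster.local, which is the name that resolved in the curl test, so host=redis in the application configuration can stay unchanged.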
