Temporary workaround for Spring Boot Admin flapping between Offline and Up because of Actuator health timeouts

Symptoms

Starting at 7:23 in the morning, Spring Boot Admin (SBA) repeatedly sent service offline / up notifications.

Temporary workaround

Change the SBA configuration:

spring.boot.admin.monitor.period=180000
spring.boot.admin.monitor.status-lifetime=180000
spring.boot.admin.monitor.read-timeout=120000

Change the Zuul configuration:

ribbon.ReadTimeout=120000
ribbon.ConnectTimeout=120000

TODO

  • What caused the actuator health endpoint to slow down so suddenly?
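
One way to approach this open question is to record health-check latencies over time and find the moment they first degrade, so that the time can be compared with deployments or infrastructure changes. The sketch below is only an illustration: the function name `first_slow`, the threshold, and the sample data are all made up for this example and do not come from the incident.

```python
def first_slow(samples, threshold_ms):
    """Return the timestamp of the first slow or timed-out probe, or None.

    samples: iterable of (timestamp, latency_ms) pairs in chronological
    order; latency_ms is None when the probe timed out.
    """
    for timestamp, latency_ms in samples:
        if latency_ms is None or latency_ms > threshold_ms:
            return timestamp
    return None


# Made-up sample data, for illustration only.
samples = [("07:21", 180), ("07:22", 240), ("07:23", None), ("07:24", 31000)]
print(first_slow(samples, threshold_ms=5000))  # prints 07:23
```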

Investigation

The SBA log shows:

Couldn't retrieve status for Application [id=20e256cd, name=ADMIN-SERVICE, managementUrl=http://172.19.222.xxx:yyyy/, healthUrl=http://172.19.222.xxx:yyyy/health, serviceUrl=http://172.19.222.xxx:yyyy/]

org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://172.19.222.xxx:yyyy/health": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out

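Services in availability zones B and D show the same symptom;

Ruled out the network: monitor / gateway / rest all run in zone B and only rest-d1 runs in zone D, yet services in both zones time out;

Ruled out the SBA service: restarting the SBA service did not help;

Ruled out the SBA host: rebooting the SBA machine did not help;

Narrowed it down to actuator: calling the actuator health endpoint by hand also times out;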

http http://172.19.222.xxx:yyyy/health

http: error: Request timed out (30s).
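
The manual check above can also be scripted so that it reports how long each attempt took. The following is a minimal sketch using only the Python standard library; the function name `probe_health` and the 30-second default are assumptions chosen to mirror the command above, not part of the original setup.

```python
import socket
import time
import urllib.error
import urllib.request


def probe_health(url, timeout_s=30.0):
    """Issue one GET against url and return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            outcome = "http %d" % resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx answers still count as a response, e.g. 503 when health is DOWN.
        outcome = "http %d" % exc.code
    except (socket.timeout, TimeoutError):
        # Timeout while waiting for or reading the response.
        outcome = "timeout"
    except urllib.error.URLError as exc:
        # Timeouts raised while connecting arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            outcome = "timeout"
        else:
            outcome = "error: %s" % exc.reason
    return outcome, time.monotonic() - start
```

Calling `probe_health` with the healthUrl from the SBA log and the default timeout should reproduce the manual check, and the elapsed time shows whether a successful response came close to the limit.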

Changed the actuator configuration to disable health checks that are unused or unimportant; no effect;

management.health.db.enabled=false
management.health.mail.enabled=false
management.health.redis.enabled=false
management.health.mongo.enabled=false
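
Once the endpoint does answer, it can be worth confirming that the disabled indicators really disappeared from the response. The sketch below lists the indicators present in a health JSON body. It assumes the body has detailed output enabled, and that indicators appear either as top-level keys next to `status` (the Spring Boot 1.x layout) or nested under `details` / `components` (the layouts used by later versions). The function name `indicator_statuses` is made up for this example.

```python
import json


def indicator_statuses(health_json):
    """Map each health indicator in the JSON body to its reported status."""
    doc = json.loads(health_json)
    # Newer layouts nest the indicators under one key.
    nested = doc.get("components") or doc.get("details")
    if nested is None:
        # Older layout: indicators sit beside the aggregate "status" key.
        nested = {k: v for k, v in doc.items() if k != "status"}
    # Skip plain values such as a top-level "description" string.
    return {
        name: body.get("status")
        for name, body in nested.items()
        if isinstance(body, dict)
    }
```

If `db`, `mail`, `redis`, or `mongo` still show up in the result, the properties above were not picked up by the running instance.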

Checked the MySQL monitoring and confirmed that the database is healthy;

Changed the SBA configuration to raise the timeouts; no effect;

spring.boot.admin.monitor.period=180000
spring.boot.admin.monitor.status-lifetime=180000
spring.boot.admin.monitor.read-timeout=120000

Changed the Zuul configuration to raise the timeouts; this worked.

ribbon.ReadTimeout=120000
ribbon.ConnectTimeout=120000

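References

Spring Boot Admin with Eureka and Actuator: service health status requests time out https://www.bitdoom.com/2018/03/21/p140/

After starting the spring boot admin project, many services showed status DOWN, which turned out to be caused by the actuator health endpoint responding so slowly that requests timed out. After investigation, the database-related health check properties under management had to be disabled, which solved the problem.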

Long health request + Read Timeout https://github.com/codecentric/spring-boot-admin/issues/494

The values shown in the ui are fetched via the zuul proxy. You can use zuul.host.socket-timeout-millis (default: 10000) and zuul.host.connect-timeout-millis (default: 2000) to control these timeouts.
But you should better fix your slow /health responses, as they are made quite often.

spring-cloud-zuul timeout configuration does not work https://stackoverflow.com/questions/49525707/spring-cloud-zuul-timeout-configuration-does-not-work

Try to define the below properties instead if you are using Zuul with Eureka.
ribbon:
  ReadTimeout: 60000
  ConnectTimeout: 20000
If you are using Zuul with Eureka, Zuul will use RibbonRoutingFilter for routing instead of SimpleHostRoutingFilter. In this case, HTTP requests are handled by Ribbon.

Table 3. Spring Boot Admin Server configuration options
https://codecentric.github.io/spring-boot-admin/1.5.5/#spring-boot-admin-server
