Monitoring service health with Prometheus on Eureka Server

Background

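  Process-level monitoring is usually handled by dedicated components. In an earlier incident, one service used more DB resources than its allocated quota, its health checks failed, and its instances dropped out of Eureka one after another. Business-level monitoring raises no alert when requests are not routed to the affected node, or are routed there without hitting the alert threshold, so the business looked healthy for a while even as instances kept going offline. As the registry, Eureka Server can detect changes in registration state earlier. It covers two cases: instances that have died (fewer registered instances) and instances whose status is not UP.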

Monitoring approach

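  • Eureka Server periodically collects registration data: the instance count and the instance status of each application
  • Prometheus periodically scrapes the data collected by Eureka Server
  • Grafana queries the data and alerts on it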

Collecting Eureka registration data

Metric definitions

  • Instance status

    type:Gauge

eureka_instance_status{client="{client}",status="{status}"}

client: eureka client application name

status enum values

| Status | Value |
|---|---|
| UP | 1 |
| DOWN | 5 |
| STARTING | 2 |
| OUT_OF_SERVICE | 3 |
| UNKNOWN | 4 |

If the average over the most recent time window is greater than 1, the state is abnormal and an alert fires.
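The rule works because UP is the smallest value: a window of all-UP samples averages exactly 1, and any other status pushes the average above 1. The following is a minimal sketch of that check. The `Status` enum and `StatusAlert` class are hypothetical stand-ins written for this illustration, not Eureka or Prometheus types.

```java
import java.util.List;

// Hypothetical stand-in for Eureka's InstanceStatus, using the values from the table above.
enum Status {
    UP(1), STARTING(2), OUT_OF_SERVICE(3), UNKNOWN(4), DOWN(5);

    final int value;

    Status(int value) {
        this.value = value;
    }
}

final class StatusAlert {

    // Average of the numeric status values in the window; NaN if the window is empty.
    static double average(List<Status> samples) {
        return samples.stream().mapToInt(s -> s.value).average().orElse(Double.NaN);
    }

    // Every value is at least 1, so the average exceeds 1 only if some sample was not UP.
    static boolean shouldAlert(List<Status> samples) {
        return average(samples) > 1.0;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(List.of(Status.UP, Status.UP, Status.UP)));   // prints false
        System.out.println(shouldAlert(List.of(Status.UP, Status.DOWN, Status.UP))); // prints true
    }
}
```

An empty window yields NaN, which compares as not greater than 1, so this sketch stays silent when there is no data.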

  • Instance count

    type:Gauge

eureka_instance_count{client="{client}",count="{count}"}

client: eureka client application name

count: client count
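The definitions above write the status and the count inside the braces, while the Java code in this post registers both gauges with a single `client` label under `mall_`-prefixed names and publishes the number as the sample value. The sketch below shows what a scraped line would then look like in the Prometheus text format. `ExpositionSample` is a hypothetical helper for illustration only; in the real service the client library produces this output.

```java
import java.util.Map;

final class ExpositionSample {

    // Renders one sample line in the form: name{label="value",...} value
    static String line(String name, Map<String, String> labels, double value) {
        StringBuilder sb = new StringBuilder(name);
        if (!labels.isEmpty()) {
            sb.append('{');
            boolean first = true;
            for (Map.Entry<String, String> e : labels.entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(e.getKey()).append("=\"").append(escape(e.getValue())).append('"');
                first = false;
            }
            sb.append('}');
        }
        return sb.append(' ').append(value).toString();
    }

    // Label values escape backslash, double quote and newline.
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // prints mall_eureka_instance_count{client="MGMALL-CONFIG"} 3.0
        System.out.println(line("mall_eureka_instance_count", Map.of("client", "MGMALL-CONFIG"), 3));
        // prints mall_eureka_instance_status{client="MGMALL-CONFIG"} 1.0
        System.out.println(line("mall_eureka_instance_status", Map.of("client", "MGMALL-CONFIG"), 1));
    }
}
```

Keeping the number as the sample value rather than as a label avoids creating a new time series every time the count or status changes.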

Java pom dependencies




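<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient</artifactId>
    <version>0.6.0</version>
</dependency>

<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>0.6.0</version>
</dependency>

<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_httpserver</artifactId>
    <version>0.6.0</version>
</dependency>

<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_pushgateway</artifactId>
    <version>0.6.0</version>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.1.4</version>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
    <version>1.1.4</version>
</dependency>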

Java code

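import com.netflix.appinfo.InstanceInfo.InstanceStatus;
import com.netflix.discovery.shared.Applications;
import com.netflix.eureka.registry.PeerAwareInstanceRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class InstanceStateCollector {

    private static final Logger log = LoggerFactory.getLogger(InstanceStateCollector.class);

    @Autowired
    PeerAwareInstanceRegistry registry;

    @Autowired
    PrometheusMetricsService metricsService;

    // Runs every 5 seconds; requires @EnableScheduling on a configuration class.
    @Scheduled(cron = "*/5 * * * * ?")
    public void collect() {
        Applications applications = registry.getApplications();

        applications.getRegisteredApplications().forEach(registeredApplication -> {
            Integer count = registeredApplication.size();
            String client = registeredApplication.getName();

            log.debug("client :{}, count :{}", client, count);
            metricsService.metricInstanceCount(client, count);

            // The status gauge is labelled by client only, so each instance overwrites the previous value.
            registeredApplication.getInstances().forEach(instance -> {
                String instanceId = instance.getInstanceId();
                log.debug("client :{}, instance :{}, status :{}", client, instanceId, instance.getStatus());
                metricsService.metricInstanceStatus(client, statusValue(instance.getStatus()));
            });
        });
    }

    // Maps a status to the numeric value listed in the status enum table.
    private static int statusValue(InstanceStatus status) {
        switch (status) {
            case UP: return 1;
            case STARTING: return 2;
            case OUT_OF_SERVICE: return 3;
            case DOWN: return 5;
            default: return 4; // UNKNOWN
        }
    }
}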

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import org.springframework.stereotype.Service;

@Service
public class PrometheusMetricsService {

    /**
     * Instance status gauge.
     * mall_eureka_instance_status{client="{client}"}, value = numeric status
     */
    private static final String EUREKA_INSTANCE_STATUS = "mall_eureka_instance_status";

    /**
     * Instance count gauge.
     * mall_eureka_instance_count{client="{client}"}, value = instance count
     */
    private static final String EUREKA_INSTANCE_COUNT = "mall_eureka_instance_count";

    private static final String LABEL_CLIENT = "client";

    private final Gauge instanceStatusGauge;
    private final Gauge instanceCountGauge;

    public PrometheusMetricsService(CollectorRegistry registry) {
        instanceStatusGauge = Gauge
                .build(EUREKA_INSTANCE_STATUS, "instance status")
                .labelNames(LABEL_CLIENT)
                .register(registry);

        instanceCountGauge = Gauge
                .build(EUREKA_INSTANCE_COUNT, "instance count")
                .labelNames(LABEL_CLIENT)
                .register(registry);
    }

    /**
     * Records the status of an instance.
     *
     * @param client      client name, i.e. the application name
     * @param statusValue numeric status value
     */
    void metricInstanceStatus(String client, Integer statusValue) {
        instanceStatusGauge.labels(client).set(statusValue);
    }

    /**
     * Records the number of instances.
     *
     * @param client client name, i.e. the application name
     * @param count  number of registered instances
     */
    void metricInstanceCount(String client, Integer count) {
        instanceCountGauge.labels(client).set(count);
    }
}
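One point to watch in the collector: a gauge is only updated for applications that appear in the current registry snapshot. If an application with no remaining instances is left out of the snapshot (an assumption worth checking against your Eureka version), its count gauge would keep the last non-zero value and a "below 1" alert would never fire. A minimal sketch of one way to handle this is to remember every client seen so far and publish 0 for those that are missing. `ClientCountTracker` is a hypothetical helper, not part of the original code.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class ClientCountTracker {

    // Every client name that has appeared in any earlier snapshot.
    private final Set<String> knownClients = new HashSet<>();

    // Returns the counts to publish: the current snapshot plus 0 for each
    // previously seen client that is absent from it.
    Map<String, Integer> countsToPublish(Map<String, Integer> snapshot) {
        Map<String, Integer> result = new LinkedHashMap<>(snapshot);
        for (String client : knownClients) {
            result.putIfAbsent(client, 0);
        }
        knownClients.addAll(snapshot.keySet());
        return result;
    }

    public static void main(String[] args) {
        ClientCountTracker tracker = new ClientCountTracker();
        tracker.countsToPublish(Map.of("MGMALL-CONFIG", 2));
        // MGMALL-CONFIG has disappeared from the second snapshot, so it is reported as 0.
        System.out.println(tracker.countsToPublish(Map.of())); // prints {MGMALL-CONFIG=0}
    }
}
```

In the scheduled task, each entry of the returned map would be passed to `metricInstanceCount(client, count)` instead of only the applications found in the snapshot.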

Scraping Eureka Server data with Prometheus

prometheus.yml

  - job_name: 'mgmall-eureka'
    scrape_interval: 10s 
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['10.124.129.42:19110']

Maintaining Grafana dashboards

Dashboard

mall_eureka_instance_count{client="MGMALL-CONFIG"}
.....

![image-20190531140258528](/Users/yugj/Library/Application Support/typora-user-images/image-20190531140258528.png)

Alerting

avg() query(A,10s,now) is below 1
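Read literally, the condition averages the samples of query A from the last 10 seconds and fires when that average is below 1. The sketch below mirrors that reading in plain Java to make the window behaviour visible. It is an illustration, not Grafana's internal evaluation, and `WindowedAverage` and `Sample` are hypothetical names. Because the scrape interval in prometheus.yml is also 10s, a 10-second window may hold a single sample or none at all, so the no-data case deserves an explicit decision.

```java
import java.util.List;
import java.util.OptionalDouble;

final class WindowedAverage {

    // One scraped sample: timestamp in epoch seconds and the gauge value.
    static final class Sample {
        final long epochSecond;
        final double value;

        Sample(long epochSecond, double value) {
            this.epochSecond = epochSecond;
            this.value = value;
        }
    }

    // Average of the samples whose timestamp lies in (now - windowSeconds, now].
    static OptionalDouble average(List<Sample> series, long now, long windowSeconds) {
        return series.stream()
                .filter(s -> s.epochSecond > now - windowSeconds && s.epochSecond <= now)
                .mapToDouble(s -> s.value)
                .average();
    }

    // True only when the window holds data and its average is below the threshold.
    static boolean isBelow(List<Sample> series, long now, long windowSeconds, double threshold) {
        OptionalDouble avg = average(series, now, windowSeconds);
        return avg.isPresent() && avg.getAsDouble() < threshold;
    }

    public static void main(String[] args) {
        List<Sample> series = List.of(new Sample(100, 2.0), new Sample(110, 0.0));
        System.out.println(isBelow(series, 110, 10, 1.0)); // prints true: only the 0.0 sample is in the window
        System.out.println(isBelow(series, 200, 10, 1.0)); // prints false: the window is empty
    }
}
```

This sketch treats an empty window as "do not alert"; whether missing data should instead raise an alert is a separate choice to make in the alert rule settings.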

![image-20190531140319350](/Users/yugj/Library/Application Support/typora-user-images/image-20190531140319350.png)

Reposted from: https://my.oschina.net/yugj/blog/3056695
