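Background
Process-level monitoring is usually handled by existing components. Earlier, one of our services exceeded its allotted DB resource quota, which caused its health checks to fail, and its instances dropped out of Eureka one after another. Business-level monitoring raises no alert when requests are not routed to the affected node, or when they are routed there but never hit a threshold. As a result the business looked healthy for a short while even as instances kept going offline. The Eureka server, as the registry, can detect changes in registration state earlier. It covers two cases: an instance has died (fewer instances are registered), or an instance's status is not UP.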
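Monitoring approach
- Eureka periodically collects registration data: the number of instances and the status of each instance
- Prometheus periodically scrapes the data collected by the Eureka server
- Grafana queries the data and raises alerts on it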
Collecting Eureka registration data
Metric definitions
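- Instance status
  type: Gauge
  mall_eureka_instance_status{client="{client}"} {status}
  client: Eureka client application name
  status: the sample value, one of the codes below

  | Status | Value |
  |---|---|
  | UP | 1 |
  | DOWN | 5 |
  | STARTING | 2 |
  | OUT_OF_SERVICE | 3 |
  | UNKNOWN | 4 |

  If the average value over the most recent time window n is greater than 1, at least one sample was not UP; treat this as abnormal and raise an alert.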
- Instance count
  type: Gauge
  mall_eureka_instance_count{client="{client}"} {count}
  client: Eureka client application name
  count: the sample value, the number of registered instances of that client
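To make the status rule above concrete, here is a minimal sketch that mirrors the status table in a plain Java enum and checks whether the average of a handful of recent samples exceeds 1. `StatusCode`, `StatusAlertRule`, and `shouldAlert` are hypothetical names used only for this illustration; they are not part of Eureka, Prometheus, or Grafana, and in the setup described here the averaging would be done by Grafana, not by application code.

```java
import java.util.List;

/** Hypothetical mirror of the status table above; this is not the Eureka enum itself. */
enum StatusCode {
    UP(1), STARTING(2), OUT_OF_SERVICE(3), UNKNOWN(4), DOWN(5);

    final int value;

    StatusCode(int value) {
        this.value = value;
    }
}

/** Hypothetical helper that applies the "average greater than 1" rule to recent samples. */
final class StatusAlertRule {

    private StatusAlertRule() {
    }

    /** Returns true when the mean of the sampled status values is above 1, the value of UP. */
    static boolean shouldAlert(List<StatusCode> samples) {
        if (samples.isEmpty()) {
            // Assumption: a missing series is caught by the instance count metric instead.
            return false;
        }
        double sum = 0;
        for (StatusCode sample : samples) {
            sum += sample.value;
        }
        return sum / samples.size() > 1.0;
    }
}
```

Because UP is the smallest value in the table, the average can only equal 1 when every sample in the window was UP, which is why a simple "greater than 1" threshold is enough.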
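Java pom dependencies

    <dependency>
        <groupId>io.prometheus</groupId>
        <artifactId>simpleclient</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>io.prometheus</groupId>
        <artifactId>simpleclient_hotspot</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>io.prometheus</groupId>
        <artifactId>simpleclient_httpserver</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>io.prometheus</groupId>
        <artifactId>simpleclient_pushgateway</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
        <version>1.1.4</version>
    </dependency>
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-core</artifactId>
        <version>1.1.4</version>
    </dependency>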
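Java code

    import com.netflix.appinfo.InstanceInfo.InstanceStatus;
    import com.netflix.eureka.registry.PeerAwareInstanceRegistry;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    public class InstanceStateCollector {
        private static final Logger log = LoggerFactory.getLogger(InstanceStateCollector.class);
        @Autowired
        PeerAwareInstanceRegistry registry;
        @Autowired
        PrometheusMetricsService metricsService;

        // Runs every 5 seconds; scheduling must be enabled with @EnableScheduling.
        @Scheduled(cron = "*/5 * * * * ?")
        public void collect() {
            registry.getApplications().getRegisteredApplications().forEach(application -> {
                String client = application.getName();
                int count = application.size();
                log.debug("client :{}, count :{}", client, count);
                metricsService.metricInstanceCount(client, count);
                application.getInstances().forEach(instance -> {
                    log.debug("client :{}, instance :{}, status :{}", client, instance.getInstanceId(), instance.getStatus());
                    metricsService.metricInstanceStatus(client, statusValue(instance.getStatus()));
                });
            });
        }

        // Maps an Eureka status to the numeric code used by the status metric.
        private static int statusValue(InstanceStatus status) {
            switch (status) {
                case UP: return 1;
                case STARTING: return 2;
                case OUT_OF_SERVICE: return 3;
                case DOWN: return 5;
                default: return 4; // UNKNOWN
            }
        }
    }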

    import io.prometheus.client.CollectorRegistry;
    import io.prometheus.client.Gauge;
    import org.springframework.stereotype.Service;

    @Service
    public class PrometheusMetricsService {

        /**
         * Instance status gauge.
         * mall_eureka_instance_status{client="{client}"}, sample value = status code
         */
        private static final String EUREKA_INSTANCE_STATUS = "mall_eureka_instance_status";

        /**
         * Instance count gauge.
         * mall_eureka_instance_count{client="{client}"}, sample value = instance count
         */
        private static final String EUREKA_INSTANCE_COUNT = "mall_eureka_instance_count";

        private static final String LABEL_CLIENT = "client";

        private final Gauge instanceStatusGauge;
        private final Gauge instanceCountGauge;

        public PrometheusMetricsService(CollectorRegistry registry) {
            instanceStatusGauge = Gauge
                    .build(EUREKA_INSTANCE_STATUS, "instance status")
                    .labelNames(LABEL_CLIENT)
                    .register(registry);
            instanceCountGauge = Gauge
                    .build(EUREKA_INSTANCE_COUNT, "instance count")
                    .labelNames(LABEL_CLIENT)
                    .register(registry);
        }

        /**
         * Records the status of an instance.
         *
         * @param client      client name (application name)
         * @param statusValue numeric status code
         */
        void metricInstanceStatus(String client, Integer statusValue) {
            instanceStatusGauge.labels(client).set(statusValue);
        }

        /**
         * Records the number of registered instances.
         *
         * @param client client name (application name)
         * @param count  instance count
         */
        void metricInstanceCount(String client, Integer count) {
            instanceCountGauge.labels(client).set(count);
        }
    }
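One thing to be aware of: the status gauge is labelled by `client` only, so when an application has several instances, each call for the same client overwrites the previous one and the exported value is whatever the last instance reported. A possible way around this, sketched below as an assumption and not something the original code does, is to reduce each application's instances to the largest status code before setting the gauge. Under the numbering in the table, the largest code is 1 only when every instance is UP. `WorstStatus` and `worstPerClient` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: picks one status code per client before the gauge is set. */
final class WorstStatus {

    private WorstStatus() {
    }

    /**
     * Takes the numeric status codes of all instances, grouped by client name,
     * and returns the largest code per client. Clients without instances are skipped.
     */
    static Map<String, Integer> worstPerClient(Map<String, List<Integer>> statusesByClient) {
        Map<String, Integer> worst = new LinkedHashMap<>();
        statusesByClient.forEach((client, codes) ->
                codes.stream()
                        .mapToInt(Integer::intValue)
                        .max()
                        .ifPresent(max -> worst.put(client, max)));
        return worst;
    }
}
```

With this approach the collector would gather the codes for each application first and then call `metricInstanceStatus` once per client. Adding an instance label to the gauge would be another option, at the cost of stale series when instances disappear.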
Scraping the Eureka server with Prometheus
prometheus.yml

    scrape_configs:
      - job_name: 'mgmall-eureka'
        scrape_interval: 10s
        metrics_path: '/actuator/prometheus'
        static_configs:
          - targets: ['10.124.129.42:19110']
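Before building dashboards, it can help to confirm that the endpoint really exposes the two gauges. The sketch below assumes the response body of `/actuator/prometheus` has already been fetched as a string (for example with `curl` or any HTTP client) and pulls out the value per `client` label for one metric. `ScrapeTextParser` and `valuesByClient` are hypothetical names, and the parser is deliberately simplistic: it handles only the plain `name{labels} value` lines these gauges produce, not the full exposition format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for eyeballing a scrape response; not a full format parser. */
final class ScrapeTextParser {

    private static final Pattern CLIENT_LABEL = Pattern.compile("client=\"([^\"]*)\"");

    private ScrapeTextParser() {
    }

    /** Returns client name -> sample value for every line of the given metric. */
    static Map<String, Double> valuesByClient(String body, String metricName) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String rawLine : body.split("\n")) {
            String line = rawLine.trim();
            // Skip comments (# HELP, # TYPE), blank lines, and other metrics.
            if (line.startsWith("#") || !line.startsWith(metricName + "{")) {
                continue;
            }
            Matcher matcher = CLIENT_LABEL.matcher(line);
            if (!matcher.find()) {
                continue;
            }
            // The sample value is the first token after the closing brace.
            String[] rest = line.substring(line.lastIndexOf('}') + 1).trim().split("\\s+");
            try {
                result.put(matcher.group(1), Double.parseDouble(rest[0]));
            } catch (NumberFormatException e) {
                // Values this sketch does not understand (such as +Inf) are ignored.
            }
        }
        return result;
    }
}
```

Calling `valuesByClient(body, "mall_eureka_instance_count")` on a response should then return one entry per registered application, which is an easy way to spot a client that is missing altogether.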
Maintaining the Grafana dashboard
Dashboard
mall_eureka_instance_count{client="MGMALL-CONFIG"}
.....
![image-20190531140258528](/Users/yugj/Library/Application Support/typora-user-images/image-20190531140258528.png)
Alerting
avg() query(A,10s,now) is below 1
![image-20190531140319350](/Users/yugj/Library/Application Support/typora-user-images/image-20190531140319350.png)