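Consul Monitoring
Consul can be monitored with many different tools. Here we use Prometheus.
Prerequisites
- A Consul server cluster and agents. For cluster setup and configuration, see "Consul Installation, Backup, and Upgrade".
- The telemetry option must be set in the configuration file, as shown below: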
~]# cat /usr/local/consul/consul.d/consul.json
{
  "datacenter": "dc1",
  "client_addr": "0.0.0.0",
  "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "data_dir": "/usr/local/consul/data",
  "retry_interval": "20s",
  "retry_join": ["10.111.67.1","10.111.67.2","10.111.67.3","10.111.67.4","10.111.67.5"],
  "enable_local_script_checks": true,
  "log_file": "/usr/local/consul/logs/",
  "log_level": "debug",
  "enable_debug": true,
  "pid_file": "/var/run/consul.pid",
  "performance": {
    "raft_multiplier": 1
  },
  "telemetry": {
    "prometheus_retention_time": "120s",
    "disable_hostname": true
  }
}
- After Consul has started successfully, test the metrics endpoint with the following command:
~]# curl 127.0.0.1:8500/v1/agent/metrics?format=prometheus
# HELP consul_fsm_register consul_fsm_register
# TYPE consul_fsm_register summary
consul_fsm_register{quantile="0.5"} NaN
consul_fsm_register{quantile="0.9"} NaN
consul_fsm_register{quantile="0.99"} NaN
consul_fsm_register_sum 3.396029010415077
consul_fsm_register_count 8
# HELP consul_http_GET_v1_agent_metrics consul_http_GET_v1_agent_metrics
# TYPE consul_http_GET_v1_agent_metrics summary
consul_http_GET_v1_agent_metrics{quantile="0.5"} 0.5403839945793152
consul_http_GET_v1_agent_metrics{quantile="0.9"} 0.5403839945793152
consul_http_GET_v1_agent_metrics{quantile="0.99"} 0.5403839945793152
consul_http_GET_v1_agent_metrics_sum 366820.44427236915
consul_http_GET_v1_agent_metrics_count 349523
# HELP consul_http_GET_v1_catalog_service__ consul_http_GET_v1_catalog_service__
# TYPE consul_http_GET_v1_catalog_service__ summary
consul_http_GET_v1_catalog_service__{quantile="0.5"} 31258.423828125
consul_http_GET_v1_catalog_service__{quantile="0.9"} 306137.71875
consul_http_GET_v1_catalog_service__{quantile="0.99"} 306137.71875
consul_http_GET_v1_catalog_service___sum 4.0220439955034314e+11
consul_http_GET_v1_catalog_service___count 2.388023e+06
…………………………
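The two prerequisites above (telemetry enabled, metrics endpoint answering) can also be checked from a script. The following is a minimal sketch, not part of the original setup: the function names are hypothetical, it assumes the agent listens on 127.0.0.1:8500 as in the curl test, and it assumes sample lines carry no trailing timestamp. It relies on the fact that Consul only serves the Prometheus format when prometheus_retention_time is greater than its default of 0s.

```python
import json
import re
import urllib.request


def prometheus_export_enabled(config_text: str) -> bool:
    """Return True if telemetry.prometheus_retention_time is a positive duration."""
    config = json.loads(config_text)
    retention = config.get("telemetry", {}).get("prometheus_retention_time", "0s")
    # Pull the numeric parts out of a Go-style duration such as "120s" or "1m30s".
    numbers = re.findall(r"(\d+(?:\.\d+)?)(?:ns|us|ms|s|m|h)", retention)
    return any(float(n) > 0 for n in numbers)


def fetch_metrics(base_url: str = "http://127.0.0.1:8500") -> str:
    """Fetch the agent metrics in Prometheus text format (same URL as the curl test)."""
    url = base_url + "/v1/agent/metrics?format=prometheus"
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(text: str) -> dict:
    """Map each series (name plus labels) to its float value, skipping comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

For example, parse_prometheus_text(fetch_metrics()) on a node gives a dictionary in which consul_fsm_register_count can be looked up directly.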
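Server Monitoring
For the servers we use Prometheus file-based service discovery (file_sd_configs); static configuration (static_configs) would also work.
Because we want alerting for Consul and alerts need to carry the hostname, we use file-based service discovery (file_sd_configs) and attach a consul_node_name label to each host. With static_configs, labels apply to an entire targets list rather than to individual hosts, so per-host labels would need a separate entry for every host in prometheus.yml itself.
The configuration is shown below as a Kubernetes ConfigMap: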
~]# cat prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config-consul
  namespace: prometheus
  labels:
    app: prometheus-consul
    environment: prod
    release: release
data:
  prometheus.yml: |
    global:
      external_labels:
        region: cn-hangzhou
        monitor: consul
        replica: A
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    - job_name: consul-server
      # Scrape interval
      scrape_interval: 60s
      # Scrape timeout
      scrape_timeout: 10s
      # Path to scrape on each target
      metrics_path: "/v1/agent/metrics"
      scheme: http
      params:
        format: ['prometheus']
      file_sd_configs:
      - files:
        - /etc/config/consul-server.json
        refresh_interval: 1m
  consul-server.json: |
    [
      {
        "targets": [
          "10.111.67.1:8500"
        ],
        "labels": {
          "consul_node_name": "Consul-Server-1"
        }
      },
      {
        "targets": [
          "10.111.67.2:8500"
        ],
        "labels": {
          "consul_node_name": "Consul-Server-2"
        }
      },
      {
        "targets": [
          "10.111.67.3:8500"
        ],
        "labels": {
          "consul_node_name": "Consul-Server-3"
        }
      },
      {
        "targets": [
          "10.111.67.4:8500"
        ],
        "labels": {
          "consul_node_name": "Consul-Server-4"
        }
      },
      {
        "targets": [
          "10.111.67.5:8500"
        ],
        "labels": {
          "consul_node_name": "Consul-Server-5"
        }
      }
    ]
Prometheus can now scrape metrics from the Consul servers, and the data can be queried in Prometheus's built-in UI.
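Since consul-server.json repeats the same structure for every server, it can be generated rather than written by hand. The following is a rough sketch under that assumption; build_file_sd and its arguments are hypothetical names, and the node names and addresses are the ones used in this article.

```python
import json


def build_file_sd(servers: dict, port: int = 8500) -> str:
    """Render one file_sd target group per server so each host gets its own label."""
    groups = [
        {
            "targets": [f"{address}:{port}"],
            "labels": {"consul_node_name": node_name},
        }
        for node_name, address in servers.items()
    ]
    return json.dumps(groups, indent=2)


# Node name -> IP address, taken from the server list in this article.
CONSUL_SERVERS = {
    f"Consul-Server-{i}": f"10.111.67.{i}" for i in range(1, 6)
}

if __name__ == "__main__":
    print(build_file_sd(CONSUL_SERVERS))
```

Keeping one target group per host is what makes the per-host consul_node_name label possible; putting all five addresses into a single targets list would give them all the same labels.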
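Client Monitoring
There are far too many Consul clients (hundreds to thousands of hosts) to label each one through file-based discovery (file_sd_configs): hosts are added and removed all the time, so maintaining that file would be too much work. We therefore use Consul-based service discovery (consul_sd_configs) to monitor the clients.
Consul client self-registration
For Prometheus or any other consumer to discover a service, that service must first be registered in Consul. We therefore use a script to generate a simple service registration: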
~]# cat create-consul-registration.sh
#!/bin/bash
# IPv4 address of the ethN/emN interface; this assumes exactly one interface matches.
ADDR=$(ip addr show | awk -F '[ /]+' '/eth[0-9]|em[0-9]/ && /inet/ {print $3}')
CONSUL_CONF_DIR='/usr/local/consul/consul.d'
CONSUL_REGISTER_FILE="$CONSUL_CONF_DIR/consul-members-registration.json"

if [[ -n "$ADDR" && -d "$CONSUL_CONF_DIR" ]]; then
    # Write the service definition that the local agent loads on "consul reload".
    cat > "$CONSUL_REGISTER_FILE" <<EOF
{
  "service": {
    "id": "consul-${ADDR}",
    "name": "consul-members",
    "tags": [
      "prometheus",
      "client",
      "consul-client"
    ],
    "address": "${ADDR}",
    "port": 8500,
    "check": {
      "http": "http://127.0.0.1:8500",
      "interval": "60s"
    }
  }
}
EOF
else
    echo "ip address is empty or the $CONSUL_CONF_DIR does not exist" >&2
    exit 1
fi
Running this script creates the service registration file consul-members-registration.json under /usr/local/consul/consul.d/:
~]# cat /usr/local/consul/consul.d/consul-members-registration.json
{
  "service": {
    "id": "consul-10.111.74.8",
    "name": "consul-members",
    "tags": [
      "prometheus",
      "client",
      "consul-client"
    ],
    "address": "10.111.74.8",
    "port": 8500,
    "check": {
      "http": "http://127.0.0.1:8500",
      "interval": "60s"
    }
  }
}
Then run consul reload to load the configuration:
~]# consul reload
The service is now registered in Consul with the service name consul-members and the service ID consul-10.111.74.8. We can verify this with curl or in a browser.
~]# curl -s 127.0.0.1:8500/v1/agent/services|python -m json.tool
{
    "consul-10.111.74.8": {
        "Address": "10.111.74.8",
        "EnableTagOverride": false,
        "ID": "consul-10.111.74.8",
        "Meta": {},
        "Port": 8500,
        "Service": "consul-members",
        "Tags": [
            "prometheus",
            "client",
            "consul-client"
        ],
        "Weights": {
            "Passing": 1,
            "Warning": 1
        }
    }
}
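When the registration is rolled out to many clients, the same verification can be automated instead of reading the JSON by eye. The following is a small sketch with a hypothetical helper name; it only assumes the response shape of /v1/agent/services shown above (a JSON object keyed by service ID).

```python
import json


def find_registered(agent_services_json: str, name: str, tag: str) -> list:
    """Return the IDs of services with the given name and tag from /v1/agent/services output."""
    services = json.loads(agent_services_json)
    return sorted(
        service_id
        for service_id, service in services.items()
        if service.get("Service") == name and tag in (service.get("Tags") or [])
    )
```

Feeding it the output of the curl command above with name "consul-members" and tag "prometheus" should return the single ID consul-10.111.74.8; an empty list means the agent has not loaded the registration file.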
Prometheus configuration
The configuration is as follows:
~]# cat prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config-consul
  namespace: prometheus
  labels:
    app: prometheus-consul
    environment: prod
    release: release
data:
  prometheus.yml: |
    global:
      external_labels:
        region: cn-hangzhou
        monitor: consul
        replica: A
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    - job_name: consul-client
      # Scrape interval
      scrape_interval: 60s
      # Scrape timeout
      scrape_timeout: 10s
      # Path to scrape on each target
      metrics_path: "/v1/agent/metrics"
      scheme: http
      params:
        format: ['prometheus']
      consul_sd_configs:
      - server: "10.111.67.1:8500"
        services:
        - consul-members
      relabel_configs:
      - action: replace
        source_labels:
        - __meta_consul_dc
        target_label: consul_dc
      - action: replace
        source_labels:
        - __meta_consul_node
        target_label: consul_node_name
      - action: replace
        source_labels:
        - __meta_consul_service
        target_label: consul_service
      - action: replace
        source_labels:
        - __meta_consul_service_id
        target_label: consul_service_id
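Because we want alerting for Consul and alerts need the hostname, service name, service ID, datacenter, and similar information, we have to relabel the discovered targets. The meta labels available for relabeling are: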
- __meta_consul_address: the address of the target
- __meta_consul_dc: the datacenter name for the target
- __meta_consul_tagged_address_<key>: each node tagged address key value of the target
- __meta_consul_metadata_<key>: each node metadata key value of the target
- __meta_consul_node: the node name defined for the target
- __meta_consul_service_address: the service address of the target
- __meta_consul_service_id: the service ID of the target
- __meta_consul_service_metadata_<key>: each service metadata key value of the target
- __meta_consul_service_port: the service port of the target
- __meta_consul_service: the name of the service the target belongs to
- __meta_consul_tags: the list of tags of the target joined by the tag separator
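To make the effect of the four replace rules concrete, here is a simplified simulation of what happens to one discovered target. It is only a sketch of the behaviour, not Prometheus code: apply_relabel is a hypothetical name, the node name host-a is made up, and the sketch covers only replace rules with the default regex plus two standard behaviours (labels starting with "__" are dropped after relabeling, and instance defaults to __address__).

```python
# (source label, target label) pairs mirroring relabel_configs in the ConfigMap.
RELABEL_RULES = [
    ("__meta_consul_dc", "consul_dc"),
    ("__meta_consul_node", "consul_node_name"),
    ("__meta_consul_service", "consul_service"),
    ("__meta_consul_service_id", "consul_service_id"),
]


def apply_relabel(discovered: dict) -> dict:
    """Copy selected meta labels to permanent labels, then drop all "__" labels."""
    labels = dict(discovered)
    for source, target in RELABEL_RULES:
        value = labels.get(source, "")
        if value:  # an empty replacement value leaves no label behind
            labels[target] = value
    # instance falls back to the scrape address when no rule has set it.
    if "instance" not in labels and "__address__" in labels:
        labels["instance"] = labels["__address__"]
    return {key: value for key, value in labels.items() if not key.startswith("__")}


if __name__ == "__main__":
    target = {
        "__address__": "10.111.74.8:8500",
        "__meta_consul_dc": "dc1",
        "__meta_consul_node": "host-a",  # hypothetical node name
        "__meta_consul_service": "consul-members",
        "__meta_consul_service_id": "consul-10.111.74.8",
    }
    for key, value in sorted(apply_relabel(target).items()):
        print(f"{key}={value}")
```

Without these rules, every __meta_consul_* label would be discarded before the samples are stored, so alerts would have no datacenter, node, or service information to show.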