Consul Monitoring

Consul can be monitored with many different tools. Here we use Prometheus.

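Prerequisites

  • A Consul server cluster and agents are available. For cluster setup and configuration, see "Consul安装备份升级" (Consul installation, backup, and upgrade).

  • The telemetry option must be set in the agent configuration file, as shown below: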

    ~]# cat /usr/local/consul/consul.d/consul.json 
    {
      "datacenter": "dc1",
      "client_addr": "0.0.0.0",
      "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
      "data_dir": "/usr/local/consul/data",
      "retry_interval": "20s",
      "retry_join": ["10.111.67.1","10.111.67.2","10.111.67.3","10.111.67.4","10.111.67.5"],
      "enable_local_script_checks": true,
      "log_file": "/usr/local/consul/logs/",
      "log_level": "debug",
      "enable_debug": true,
      "pid_file": "/var/run/consul.pid",
      "performance": {
          "raft_multiplier": 1
      },
      "telemetry": {
          "prometheus_retention_time": "120s",
          "disable_hostname": true
      }
    }
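  • Once the agent has started successfully, test the metrics endpoint with the following command (the URL is quoted so that the shell does not try to interpret the "?"):

    ~]# curl '127.0.0.1:8500/v1/agent/metrics?format=prometheus'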
    # HELP consul_fsm_register consul_fsm_register
    # TYPE consul_fsm_register summary
    consul_fsm_register{quantile="0.5"} NaN
    consul_fsm_register{quantile="0.9"} NaN
    consul_fsm_register{quantile="0.99"} NaN
    consul_fsm_register_sum 3.396029010415077
    consul_fsm_register_count 8
    # HELP consul_http_GET_v1_agent_metrics consul_http_GET_v1_agent_metrics
    # TYPE consul_http_GET_v1_agent_metrics summary
    consul_http_GET_v1_agent_metrics{quantile="0.5"} 0.5403839945793152
    consul_http_GET_v1_agent_metrics{quantile="0.9"} 0.5403839945793152
    consul_http_GET_v1_agent_metrics{quantile="0.99"} 0.5403839945793152
    consul_http_GET_v1_agent_metrics_sum 366820.44427236915
    consul_http_GET_v1_agent_metrics_count 349523
    # HELP consul_http_GET_v1_catalog_service__ consul_http_GET_v1_catalog_service__
    # TYPE consul_http_GET_v1_catalog_service__ summary
    consul_http_GET_v1_catalog_service__{quantile="0.5"} 31258.423828125
    consul_http_GET_v1_catalog_service__{quantile="0.9"} 306137.71875
    consul_http_GET_v1_catalog_service__{quantile="0.99"} 306137.71875
    consul_http_GET_v1_catalog_service___sum 4.0220439955034314e+11
    consul_http_GET_v1_catalog_service___count 2.388023e+06
    …………………………
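
To make the structure of this output concrete, the sketch below parses Prometheus text-format sample lines into (name, labels, value) tuples. It is a simplified illustration rather than a full parser: it assumes no timestamps and no escaped quotes inside label values, and the names parse_samples and SAMPLE_OUTPUT are introduced here purely for the example.

```python
import re

# Matches `name{label="v",...} value` or `name value` sample lines.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a plain sample line
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        # float() accepts "NaN" and exponent notation such as 2.388023e+06
        samples.append((name, labels, float(raw_value)))
    return samples


# A few lines copied from the curl output above.
SAMPLE_OUTPUT = """\
# HELP consul_fsm_register consul_fsm_register
# TYPE consul_fsm_register summary
consul_fsm_register{quantile="0.5"} NaN
consul_fsm_register_sum 3.396029010415077
consul_fsm_register_count 8
"""

if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE_OUTPUT):
        print(name, labels, value)
```

A summary metric therefore appears as one sample per quantile plus a _sum and a _count sample, which is what Prometheus stores after each scrape.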

Server Monitoring

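For server monitoring we use Prometheus file-based service discovery (file_sd_configs); static configuration (static_configs) would also work.

Because we want to alert on Consul and the alerts need to carry the host name, we use file-based discovery (file_sd_configs) and attach a consul_node_name label to every host. static_configs attaches labels to a whole target group rather than to a single target, so per-host labels would require one single-target group per host written directly in prometheus.yml. With file_sd_configs the same per-host groups live in a separate file that Prometheus re-reads without a restart.

The configuration is shown below. It is a Kubernetes (k8s) ConfigMap manifest.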

~]# cat prometheus-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config-consul
  namespace: prometheus
  labels:
    app: prometheus-consul
    environment: prod
    release: release
data:
  prometheus.yml: |
    global:
      external_labels:
        region: cn-hangzhou
        monitor: consul
        replica: A
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: consul-server
      # Scrape interval
      scrape_interval: 60s
      # Scrape timeout
      scrape_timeout: 10s
      # Metrics path on the scrape target
      metrics_path: "/v1/agent/metrics"
      scheme: http
      params:
        format: ['prometheus']
      file_sd_configs:
      - files:
        - /etc/config/consul-server.json
        refresh_interval: 1m

  consul-server.json: |
    [
        {
            "targets": [
                "10.111.67.1:8500"
            ],
            "labels": {
                "consul_node_name": "Consul-Server-1"
            }
        },
        {
            "targets": [
                "10.111.67.2:8500"
            ],
            "labels": {
                "consul_node_name": "Consul-Server-2"
            }
        },
        {
            "targets": [
                "10.111.67.3:8500"
            ],
            "labels": {
                "consul_node_name": "Consul-Server-3"
            }
        },
        {
            "targets": [
                "10.111.67.4:8500"
            ],
            "labels": {
                "consul_node_name": "Consul-Server-4"
            }
        },
        {
            "targets": [
                "10.111.67.5:8500"
            ],
            "labels": {
                "consul_node_name": "Consul-Server-5"
            }
        }
    ]

At this point Prometheus can scrape metrics from the Consul servers, and the data can be queried in the UI that ships with Prometheus.
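
The consul-server.json target list above is repetitive, so it could be generated rather than written by hand. The following is a minimal sketch under that assumption; the function name build_file_sd_targets and its parameters are hypothetical, and the "Consul-Server-N" naming simply mirrors the labels used above.

```python
import json


def build_file_sd_targets(server_ips, port=8500, name_prefix="Consul-Server-"):
    """Build one single-target group per server so each host gets its own label."""
    groups = []
    for index, ip in enumerate(server_ips, start=1):
        groups.append({
            "targets": [f"{ip}:{port}"],
            "labels": {"consul_node_name": f"{name_prefix}{index}"},
        })
    return groups


if __name__ == "__main__":
    # The five server addresses used in this article.
    ips = [f"10.111.67.{n}" for n in range(1, 6)]
    print(json.dumps(build_file_sd_targets(ips), indent=4))
```

The printed JSON has the same shape as the consul-server.json entry in the ConfigMap: a list of groups, each with a "targets" list and a "labels" map.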

Client Monitoring

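Consul clients are far more numerous than servers, running on hundreds or even thousands of hosts. Labelling every host through file-based discovery (file_sd_configs) would make the target file too costly to maintain as hosts are added and removed. For client monitoring we therefore use Consul-based service discovery (consul_sd_configs).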

Consul Client Self-Registration

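For Prometheus (or any other consumer) to discover a service, the service must first be registered in Consul. We therefore use a script to generate a simple service registration: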

~]# cat create-consul-registration.sh 
#!/bin/bash

# Take the first IPv4 address bound to an ethN or emN interface.
ADDR=$(ip addr show | awk -F '[ /]+' '/eth[0-9]|em[0-9]/ && /inet / {print $3; exit}')
CONSUL_CONF_DIR='/usr/local/consul/consul.d'
CONSUL_REGISTER_FILE="$CONSUL_CONF_DIR/consul-members-registration.json"

if [[ -n "$ADDR" && -d "$CONSUL_CONF_DIR" ]]; then
    # The EOF delimiter is unquoted so that ${ADDR} is expanded in the heredoc.
    cat > "${CONSUL_REGISTER_FILE}" <<EOF
{
    "service": {
        "id": "consul-${ADDR}",
        "name": "consul-members",
        "tags": [
            "prometheus",
            "client",
            "consul-client"
        ],
        "address": "${ADDR}",
        "port": 8500,
        "check": {
            "http": "http://127.0.0.1:8500",
            "interval": "60s"
        }
    }
}
EOF
else
    echo "ip address is empty or $CONSUL_CONF_DIR does not exist" >&2
    exit 1
fi

Running this script creates the service registration file consul-members-registration.json under /usr/local/consul/consul.d/:

~]# cat /usr/local/consul/consul.d/consul-members-registration.json 
{
    "service": {
        "id": "consul-10.111.74.8",
        "name": "consul-members",
        "tags": [
            "prometheus",
            "client",
            "consul-client"
        ],
        "address": "10.111.74.8",
        "port": 8500,
        "check": {
            "http": "http://127.0.0.1:8500",
            "interval": "60s"
        }
    }
}

Then run consul reload to load the configuration:

~]# consul reload

The service is now registered in Consul with the service name consul-members and the service ID consul-10.111.74.8. This can be verified with curl or in a browser.

~]# curl -s 127.0.0.1:8500/v1/agent/services|python -m json.tool
{
    "consul-10.111.74.8": {
        "Address": "10.111.74.8",
        "EnableTagOverride": false,
        "ID": "consul-10.111.74.8",
        "Meta": {},
        "Port": 8500,
        "Service": "consul-members",
        "Tags": [
            "prometheus",
            "client",
            "consul-client"
        ],
        "Weights": {
            "Passing": 1,
            "Warning": 1
        }
    }
}
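
When this check has to be repeated on many hosts, the response can be inspected programmatically instead of by eye. The sketch below assumes the JSON has already been fetched (for example with the curl command above) and only shows the lookup step; find_service and RESPONSE are names introduced for this illustration, and RESPONSE is a trimmed copy of the output shown above.

```python
import json


def find_service(agent_services, service_name):
    """Return the entries of /v1/agent/services whose Service field matches."""
    return {
        service_id: entry
        for service_id, entry in agent_services.items()
        if entry.get("Service") == service_name
    }


# Trimmed copy of the response shown above.
RESPONSE = json.loads("""
{
    "consul-10.111.74.8": {
        "Address": "10.111.74.8",
        "ID": "consul-10.111.74.8",
        "Port": 8500,
        "Service": "consul-members",
        "Tags": ["prometheus", "client", "consul-client"]
    }
}
""")

if __name__ == "__main__":
    matches = find_service(RESPONSE, "consul-members")
    for service_id, entry in matches.items():
        print(service_id, entry["Address"], entry["Port"])
```

An empty result would mean that the registration file was not loaded, for instance because consul reload was not run.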

Prometheus Configuration

The configuration is as follows:

~]# cat prometheus-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config-consul
  namespace: prometheus
  labels:
    app: prometheus-consul
    environment: prod
    release: release
data:
  prometheus.yml: |
    global:
      external_labels:
        region: cn-hangzhou
        monitor: consul
        replica: A
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: consul-client
      # Scrape interval
      scrape_interval: 60s
      # Scrape timeout
      scrape_timeout: 10s
      # Metrics path on the scrape target
      metrics_path: "/v1/agent/metrics"
      scheme: http
      params:
        format: ['prometheus']
      consul_sd_configs:
      - server: "10.111.67.1:8500"
        services:
        - consul-members
      relabel_configs:
      - action: replace
        source_labels:
        - __meta_consul_dc
        target_label: consul_dc
      - action: replace
        source_labels:
        - __meta_consul_node
        target_label: consul_node_name
      - action: replace
        source_labels:
        - __meta_consul_service
        target_label: consul_service
      - action: replace
        source_labels:
        - __meta_consul_service_id
        target_label: consul_service_id

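Because we want to alert on Consul, and the alerts need the host name, service name, service ID, datacenter, and similar information, we have to relabel. Labels that start with __meta_ are discarded once relabeling has finished, so any value needed in alerts must be copied to a regular label. The meta labels available during relabeling are: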

  • __meta_consul_address: the address of the target
  • __meta_consul_dc: the datacenter name for the target
  • __meta_consul_tagged_address_<key>: each node tagged address key value of the target
  • __meta_consul_metadata_<key>: each node metadata key value of the target
  • __meta_consul_node: the node name defined for the target
  • __meta_consul_service_address: the service address of the target
  • __meta_consul_service_id: the service ID of the target
  • __meta_consul_service_metadata_<key>: each service metadata key value of the target
  • __meta_consul_service_port: the service port of the target
  • __meta_consul_service: the name of the service the target belongs to
  • __meta_consul_tags: the list of tags of the target joined by the tag separator
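
To illustrate what the relabel_configs rules above do with these meta labels, here is a simplified model of the replace action with its default regex and replacement. It is a sketch and not how Prometheus is implemented: it leaves out regex matching and the job and instance labels that Prometheus adds, and the node name "example-node" is a made-up value.

```python
def apply_replace_rules(discovered_labels, rules):
    """Simplified model of `action: replace` with default regex and replacement."""
    labels = dict(discovered_labels)
    for rule in rules:
        # Source label values are joined with the default separator ";".
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        if value:
            labels[rule["target_label"]] = value
        else:
            # An empty result removes the target label.
            labels.pop(rule["target_label"], None)
    # Labels starting with "__" are dropped once relabeling has finished.
    return {k: v for k, v in labels.items() if not k.startswith("__")}


# Labels as they might look for the client registered earlier.
DISCOVERED = {
    "__address__": "10.111.74.8:8500",
    "__meta_consul_dc": "dc1",
    "__meta_consul_node": "example-node",  # hypothetical node name
    "__meta_consul_service": "consul-members",
    "__meta_consul_service_id": "consul-10.111.74.8",
}

# The four rules from the consul-client job above.
RULES = [
    {"source_labels": ["__meta_consul_dc"], "target_label": "consul_dc"},
    {"source_labels": ["__meta_consul_node"], "target_label": "consul_node_name"},
    {"source_labels": ["__meta_consul_service"], "target_label": "consul_service"},
    {"source_labels": ["__meta_consul_service_id"], "target_label": "consul_service_id"},
]

if __name__ == "__main__":
    for key, value in sorted(apply_replace_rules(DISCOVERED, RULES).items()):
        print(f"{key}={value}")
```

Only the four copied labels survive in this model, which is why each value needed in an alert gets its own replace rule.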