Prometheus -1-3


7.5  kube-state-metrics 

Install and configure Alertmanager: send alerts to a QQ mailbox

cat alertmanager-cm.yaml

kind: ConfigMap

apiVersion: v1

metadata:

  name: alertmanager

  namespace: monitor-sa

data:

  alertmanager.yml: |-

    global:

      resolve_timeout: 1m

      smtp_smarthost: 'smtp.163.com:25'

      smtp_from: 'cadr***@163.com'

      smtp_auth_username: '1815871****'

      smtp_auth_password: 'BDBPRMLNZGKWRFJP'

      smtp_require_tls: false

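    route:                      # alert routing policy

      group_by: [alertname]    # label used to group alerts

      group_wait: 10s            # wait 10s after the first alert of a new group, so alerts of the same group go out together

      group_interval: 10s        # wait before notifying about new alerts added to a group that was already notified

      repeat_interval: 10m      # interval before re-sending a still-firing alert; reduces duplicate emails

      receiver: default-receiver  # default receiver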

    receivers:

    - name: 'default-receiver'

      email_configs:

      - to: '137855***@qq.com'

        send_resolved: true

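Notes on the alertmanager configuration file:

smtp_smarthost: 'smtp.163.com:25'

# SMTP server address and port of the mailbox that sends the emails

smtp_from: 'cadr****@163.com'

# The mailbox the alerts are sent from

smtp_auth_username: '181****'

# The account used to authenticate with the sending mailbox; this is not the mailbox name

smtp_auth_password: 'BDBPRMLNZGKWRFJP'

# The authorization code of the sending mailbox, not its login password

email_configs:

- to: '1378******@qq.com'

# "to" names the recipient mailbox. I send to my QQ mailbox; use your own address here, and it should differ from the smtp_from mailbox.

# Apply the file with kubectl apply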

kubectl apply -f alertmanager-cm.yaml
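Before applying the file, a quick local check can catch a forgotten SMTP key. This is only a minimal sketch and not part of the original setup: the function name missing_smtp_keys is made up for illustration, and it merely looks for the four key names explained above without validating their values.

```shell
#!/bin/sh
# Hypothetical helper: list required SMTP keys that do not appear in a file.
# It only looks for the key names; it does not validate their values.
missing_smtp_keys() {
  for key in smtp_smarthost smtp_from smtp_auth_username smtp_auth_password; do
    grep -Eq "^[[:space:]]*${key}:" "$1" || echo "$key"
  done
}
```

Running missing_smtp_keys alertmanager-cm.yaml prints nothing when all four keys are present.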

cat prometheus-alertmanager-cfg.yaml

kind: ConfigMap

apiVersion: v1

metadata:

  labels:

    app: prometheus

  name: prometheus-config

  namespace: monitor-sa

data:

  prometheus.yml: |

    rule_files:

    - /etc/prometheus/rules.yml

    alerting:

      alertmanagers:

      - static_configs:

        - targets: ["localhost:9093"]

    global:

      scrape_interval: 15s

      scrape_timeout: 10s

      evaluation_interval: 1m

    scrape_configs:

    - job_name: 'kubernetes-node'

      kubernetes_sd_configs:

      - role: node

      relabel_configs:

      - source_labels: [__address__]

        regex: '(.*):10250'

        replacement: '${1}:9100'

        target_label: __address__

        action: replace

      - action: labelmap

        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubernetes-node-cadvisor'

      kubernetes_sd_configs:

      - role:  node

      scheme: https

      tls_config:

        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      relabel_configs:

      - action: labelmap

        regex: __meta_kubernetes_node_label_(.+)

      - target_label: __address__

        replacement: kubernetes.default.svc:443

      - source_labels: [__meta_kubernetes_node_name]

        regex: (.+)

        target_label: __metrics_path__

        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    - job_name: 'kubernetes-apiserver'

      kubernetes_sd_configs:

      - role: endpoints

      scheme: https

      tls_config:

        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      relabel_configs:

      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]

        action: keep

        regex: default;kubernetes;https

    - job_name: 'kubernetes-service-endpoints'

      kubernetes_sd_configs:

      - role: endpoints

      relabel_configs:

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]

        action: keep

        regex: true

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]

        action: replace

        target_label: __scheme__

        regex: (https?)

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]

        action: replace

        target_label: __metrics_path__

        regex: (.+)

      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]

        action: replace

        target_label: __address__

        regex: ([^:]+)(?::\d+)?;(\d+)

        replacement: $1:$2

      - action: labelmap

        regex: __meta_kubernetes_service_label_(.+)

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: kubernetes_namespace

      - source_labels: [__meta_kubernetes_service_name]

        action: replace

        target_label: kubernetes_name

    - job_name: kubernetes-pods

      kubernetes_sd_configs:

      - role: pod

      relabel_configs:

      - action: keep

        regex: true

        source_labels:

        - __meta_kubernetes_pod_annotation_prometheus_io_scrape

      - action: replace

        regex: (.+)

        source_labels:

        - __meta_kubernetes_pod_annotation_prometheus_io_path

        target_label: __metrics_path__

      - action: replace

        regex: ([^:]+)(?::\d+)?;(\d+)

        replacement: $1:$2

        source_labels:

        - __address__

        - __meta_kubernetes_pod_annotation_prometheus_io_port

        target_label: __address__

      - action: labelmap

        regex: __meta_kubernetes_pod_label_(.+)

      - action: replace

        source_labels:

        - __meta_kubernetes_namespace

        target_label: kubernetes_namespace

      - action: replace

        source_labels:

        - __meta_kubernetes_pod_name

        target_label: kubernetes_pod_name

    - job_name: 'kubernetes-schedule'

      scrape_interval: 5s

      static_configs:

      - targets: ['192.168.40.130:10251']

    - job_name: 'kubernetes-controller-manager'

      scrape_interval: 5s

      static_configs:

      - targets: ['192.168.40.130:10252']

    - job_name: 'kubernetes-kube-proxy'

      scrape_interval: 5s

      static_configs:

      - targets: ['192.168.40.130:10249','192.168.40.131:10249']

    - job_name: 'kubernetes-etcd'

      scheme: https

      tls_config:

        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt

        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt

        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key

      scrape_interval: 5s

      static_configs:

      - targets: ['192.168.40.130:2379']

  rules.yml: |

    groups:

    - name: example

      rules:

      - alert: kube-proxy CPU usage above 80%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 80%"

      - alert: kube-proxy CPU usage above 90%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 90%"

      - alert: scheduler CPU usage above 80%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 80%"

      - alert: scheduler CPU usage above 90%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 90%"

      - alert: controller-manager CPU usage above 80%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 80%"

      - alert: controller-manager CPU usage above 90%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 90%"

      - alert: apiserver CPU usage above 80%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 80%"

      - alert: apiserver CPU usage above 90%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 90%"

      - alert: etcd CPU usage above 80%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 80%"

      - alert: etcd CPU usage above 90%

        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.job}} on {{$labels.instance}} exceeds 90%"

      - alert: kube-state-metrics CPU usage above 80%

        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.k8s_app}} on {{$labels.instance}} exceeds 80%"

          value: "{{ $value }}%"

          threshold: "80%"

      - alert: kube-state-metrics CPU usage above 90%

        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.k8s_app}} on {{$labels.instance}} exceeds 90%"

          value: "{{ $value }}%"

          threshold: "90%"

      - alert: coredns CPU usage above 80%

        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "CPU usage of component {{$labels.k8s_app}} on {{$labels.instance}} exceeds 80%"

          value: "{{ $value }}%"

          threshold: "80%"

      - alert: coredns CPU usage above 90%

        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "CPU usage of component {{$labels.k8s_app}} on {{$labels.instance}} exceeds 90%"

          value: "{{ $value }}%"

          threshold: "90%"

      - alert: kube-proxy open file descriptors > 600

        expr: process_open_fds{job=~"kubernetes-kube-proxy"}  > 600

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"

          value: "{{ $value }}"

      - alert: kube-proxy open file descriptors > 1000

        expr: process_open_fds{job=~"kubernetes-kube-proxy"}  > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-schedule open file descriptors > 600

        expr: process_open_fds{job=~"kubernetes-schedule"}  > 600

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-schedule open file descriptors > 1000

        expr: process_open_fds{job=~"kubernetes-schedule"}  > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-controller-manager open file descriptors > 600

        expr: process_open_fds{job=~"kubernetes-controller-manager"}  > 600

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-controller-manager open file descriptors > 1000

        expr: process_open_fds{job=~"kubernetes-controller-manager"}  > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-apiserver open file descriptors > 600

        expr: process_open_fds{job=~"kubernetes-apiserver"}  > 600

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-apiserver open file descriptors > 1000

        expr: process_open_fds{job=~"kubernetes-apiserver"}  > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-etcd open file descriptors > 600

        expr: process_open_fds{job=~"kubernetes-etcd"}  > 600

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"

          value: "{{ $value }}"

      - alert: kubernetes-etcd open file descriptors > 1000

        expr: process_open_fds{job=~"kubernetes-etcd"}  > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"

          value: "{{ $value }}"

      - alert: coredns

        expr: process_open_fds{k8s_app=~"kube-dns"}  > 600

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Add-on {{$labels.k8s_app}} ({{$labels.instance}}): more than 600 open file descriptors"

          value: "{{ $value }}"

      - alert: coredns

        expr: process_open_fds{k8s_app=~"kube-dns"}  > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          description: "Add-on {{$labels.k8s_app}} ({{$labels.instance}}): more than 1000 open file descriptors"

          value: "{{ $value }}"

      - alert: kube-proxy

        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"}  > 2000000000

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"

          value: "{{ $value }}"

      - alert: scheduler

        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"}  > 2000000000

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"

          value: "{{ $value }}"

      - alert: kubernetes-controller-manager

        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"}  > 2000000000

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"

          value: "{{ $value }}"

      - alert: kubernetes-apiserver

        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"}  > 2000000000

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"

          value: "{{ $value }}"

      - alert: kubernetes-etcd

        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"}  > 2000000000

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): virtual memory usage exceeds 2G"

          value: "{{ $value }}"

      - alert: kube-dns

        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"}  > 2000000000

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Add-on {{$labels.k8s_app}} ({{$labels.instance}}): virtual memory usage exceeds 2G"

          value: "{{ $value }}"

      - alert: HttpRequestsAvg

        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-controller-manager|kubernetes-apiserver"}[1m])) by (job, instance) > 1000

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): TPS exceeds 1000"

          value: "{{ $value }}"

          threshold: "1000"

      - alert: Pod_restarts

        expr: kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0

        for: 2s

        labels:

          severity: warning

        annotations:

          description: "Container {{$labels.container}} of pod {{$labels.pod}} in namespace {{$labels.namespace}} has restarted; this metric was scraped from {{$labels.instance}}"

          value: "{{ $value }}"

          threshold: "0"

      - alert: Pod_waiting

        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} of pod {{$labels.pod}} failed to start and is stuck waiting"

          value: "{{ $value }}"

          threshold: "1"

      - alert: Pod_terminated

        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} of pod {{$labels.pod}} has terminated"

          value: "{{ $value }}"

          threshold: "1"

      - alert: Etcd_leader

        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): there is currently no leader"

          value: "{{ $value }}"

          threshold: "0"

      - alert: Etcd_leader_changes

        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): the leader has changed"

          value: "{{ $value }}"

          threshold: "0"

      - alert: Etcd_failed

        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): proposals are failing"

          value: "{{ $value }}"

          threshold: "0"

      - alert: Etcd_db_total_size

        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Component {{$labels.job}} ({{$labels.instance}}): database size exceeds 10G"

          value: "{{ $value }}"

          threshold: "10G"

      - alert: Endpoint_ready

        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1

        for: 2s

        labels:

          team: admin

        annotations:

          description: "Namespace {{$labels.namespace}} ({{$labels.instance}}): endpoint {{$labels.endpoint}} is not ready"

          value: "{{ $value }}"

          threshold: "1"

    - name: Physical node status alerts

      rules:

      - alert: Physical node CPU usage

        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "CPU usage on {{ $labels.instance }} is too high"

          description: "CPU usage on {{ $labels.instance }} exceeds 90% (current usage: {{ $value }}); needs investigation"

      - alert: Physical node memory usage

        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "Memory usage on {{ $labels.instance }} is too high"

          description: "Memory usage on {{ $labels.instance }} exceeds 90% (current usage: {{ $value }}); needs investigation"

      - alert: InstanceDown

        expr: up == 0

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "{{ $labels.instance }}: server is down"

          description: "{{ $labels.instance }}: the server has stopped responding to scrapes"

      - alert: Physical node disk IO utilization

        expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100 > 60

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "{{$labels.instance}}: disk IO utilization is too high"

          description: "{{$labels.instance}}: disk IO utilization above 60% (current: {{$value}})"

      - alert: Inbound network bandwidth

        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "{{$labels.instance}}: inbound network bandwidth is too high"

          description: "{{$labels.instance}}: inbound network bandwidth averaged above 100M over 5 minutes. RX bandwidth: {{$value}}"

      - alert: Outbound network bandwidth

        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "{{$labels.instance}}: outbound network bandwidth is too high"

          description: "{{$labels.instance}}: outbound network bandwidth averaged above 100M over 5 minutes. TX bandwidth: {{$value}}"

      - alert: TCP sessions

        expr: node_netstat_Tcp_CurrEstab > 1000

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "{{$labels.instance}}: too many TCP connections in ESTABLISHED state"

          description: "{{$labels.instance}}: more than 1000 TCP connections in ESTABLISHED state (current: {{$value}})"

      - alert: Disk capacity

        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80

        for: 2s

        labels:

          severity: critical

        annotations:

          summary: "{{$labels.mountpoint}}: disk partition usage is too high"

          description: "{{$labels.mountpoint}}: disk partition usage above 80% (current: {{$value}}%)"
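A note on the two bandwidth rules: they divide the byte rate by 100 and compare the result with 102400, so it helps to convert the threshold back into plain units. The arithmetic below is my own reading of the expression, not something stated by the rule itself; the function names are made up for illustration.

```shell
#!/bin/sh
# The bandwidth rules fire when (bytes_per_second / 100) > 102400.
# Convert that threshold back to bytes/s and bits/s.
threshold_bytes_per_sec() { echo $((102400 * 100)); }
threshold_bits_per_sec() { echo $(( $(threshold_bytes_per_sec) * 8 )); }
```

That works out to 10240000 bytes/s, roughly 82 Mbit/s per instance. If a different limit is intended, the constant in the expression is the place to adjust.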

Note: explanation of the static scrape targets in the configuration file. The ConfigMap above uses 192.168.40.130 and 192.168.40.131 for the nodes, while the snippet below and the rest of this walkthrough use 192.168.172.163 (god63) and 192.168.172.164 (god64); replace them with the IPs of your own nodes.

- job_name: 'kubernetes-schedule'

  scrape_interval: 5s

  static_configs:

  - targets: ['192.168.172.163:10251']

  # IP of node god63 : scheduler port

- job_name: 'kubernetes-controller-manager'

  scrape_interval: 5s

  static_configs:

  - targets: ['192.168.172.163:10252']

  # IP of node god63 : controller-manager port

- job_name: 'kubernetes-kube-proxy'

  scrape_interval: 5s

  static_configs:

  - targets: ['192.168.172.163:10249','192.168.172.164:10249']

  # IPs of nodes god63 and god64 : kube-proxy port

- job_name: 'kubernetes-etcd'

  scheme: https

  tls_config:

    ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt

    cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt

    key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key

  scrape_interval: 5s

  static_configs:

  - targets: ['192.168.172.163:2379']

  # IP of node god63 : etcd port
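The relabel_configs entries in prometheus.yml rewrite target addresses with regular expressions. To get a feel for what they do, the two address rewrites can be imitated with sed. This is only an approximation under stated assumptions: sed stands in for Prometheus's regex engine, the function names are made up, and only the case where the regex matches is modeled.

```shell
#!/bin/sh
# kubernetes-node job: regex '(.*):10250' with replacement '${1}:9100'.
# Prometheus anchors relabel regexes at both ends, hence the explicit ^ and $.
rewrite_node_address() {
  printf '%s\n' "$1" | sed -E 's/^(.*):10250$/\1:9100/'
}

# kubernetes-pods / kubernetes-service-endpoints jobs: __address__ and the
# prometheus.io/port annotation are joined with ';', matched against
# '([^:]+)(?::\d+)?;(\d+)' and replaced by '$1:$2'.
# sed has no non-capturing groups or \d, so the optional port becomes group 2
# and the annotation port becomes group 3.
rewrite_annotated_port() {
  printf '%s;%s\n' "$1" "$2" | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}
```

For example, rewrite_node_address 192.168.40.130:10250 points the target at the node-exporter port 9100, and rewrite_annotated_port 10.244.1.5:8080 9102 swaps in the annotated port.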

Update the resource manifests:

# kubectl delete -f prometheus-cfg.yaml

configmap "prometheus-config" deleted

# kubectl apply -f prometheus-alertmanager-cfg.yaml 

Upload alertmanager.tar.gz to every k8s node

cat prometheus-alertmanager-deploy.yaml

---

apiVersion: apps/v1

kind: Deployment

metadata:

  name: prometheus-server

  namespace: monitor-sa

  labels:

    app: prometheus

spec:

  replicas: 1

  selector:

    matchLabels:

      app: prometheus

      component: server

    #matchExpressions:

    #- {key: app, operator: In, values: [prometheus]}

    #- {key: component, operator: In, values: [server]}

  template:

    metadata:

      labels:

        app: prometheus

        component: server

      annotations:

        prometheus.io/scrape: 'false'

    spec:

      nodeName: god64

      serviceAccountName: monitor

      containers:

      - name: prometheus

        image: prom/prometheus:v2.2.1

        imagePullPolicy: IfNotPresent

        command:

        - "/bin/prometheus"

        args:

        - "--config.file=/etc/prometheus/prometheus.yml"

        - "--storage.tsdb.path=/prometheus"

        - "--storage.tsdb.retention=24h"

        - "--web.enable-lifecycle"

        ports:

        - containerPort: 9090

          protocol: TCP

        volumeMounts:

        - mountPath: /etc/prometheus

          name: prometheus-config

        - mountPath: /prometheus/

          name: prometheus-storage-volume

        - name: k8s-certs

          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/

      - name: alertmanager

        image: prom/alertmanager:v0.14.0

        imagePullPolicy: IfNotPresent

        args:

        - "--config.file=/etc/alertmanager/alertmanager.yml"

        - "--log.level=debug"

        ports:

        - containerPort: 9093

          protocol: TCP

          name: alertmanager

        volumeMounts:

        - name: alertmanager-config

          mountPath: /etc/alertmanager

        - name: alertmanager-storage

          mountPath: /alertmanager

        - name: localtime

          mountPath: /etc/localtime

      volumes:

        - name: prometheus-config

          configMap:

            name: prometheus-config

        - name: prometheus-storage-volume

          hostPath:

            path: /data

            type: Directory

        - name: k8s-certs

          secret:

            secretName: etcd-certs

        - name: alertmanager-config

          configMap:

            name: alertmanager

        - name: alertmanager-storage

          hostPath:

            path: /data/alertmanager

            type: DirectoryOrCreate

        - name: localtime

          hostPath:

            path: /usr/share/zoneinfo/Asia/Shanghai

Create the etcd-certs secret. The prometheus Deployment mounts it, so it has to exist before the Deployment is applied.

kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
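The secret is built from three files, so it can be worth confirming they exist on the node before running the command. A minimal sketch of such a check; the function name missing_etcd_certs is made up for illustration, and the default directory is simply the one used in the command above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print every etcd certificate file that is
# missing from the given directory (defaults to the path used above).
missing_etcd_certs() {
  dir=${1:-/etc/kubernetes/pki/etcd}
  for f in server.key server.crt ca.crt; do
    [ -f "$dir/$f" ] || echo "$dir/$f"
  done
}
```

Calling missing_etcd_certs with no argument prints nothing when all three files are in place.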

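# Apply the Deployment, then check whether prometheus was deployed successfully

kubectl apply -f prometheus-alertmanager-deploy.yaml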

kubectl get pods -n monitor-sa | grep prometheus

prometheus-server-5bc47cc46d-nzgbn  1/1    Running      0          5s

The output above shows the pod in Running status, which means prometheus was deployed successfully.

cat alertmanager-svc.yaml

---

apiVersion: v1

kind: Service

metadata:

  labels:

    name: prometheus

    kubernetes.io/cluster-service: 'true'

  name: alertmanager

  namespace: monitor-sa

spec:

  ports:

  - name: alertmanager

    nodePort: 30066    # port reachable from machines outside the cluster

    port: 9093            # port used for access between services inside kubernetes

    protocol: TCP

    targetPort: 9093    # container port

  selector:

    app: prometheus

  sessionAffinity: None

  type: NodePort

kubectl apply -f alertmanager-svc.yaml

# Check which host ports the services are mapped to

kubectl get svc -n monitor-sa

NAME          TYPE      CLUSTER-IP      EXTERNAL-IP  PORT(S)          AGE

alertmanager  NodePort  10.104.175.75   <none>       9093:30066/TCP   28m

prometheus    NodePort  10.106.198.175  <none>       9090:31994/TCP   22h

Note: the output above shows that the prometheus service is exposed on port 31994 and the alertmanager service on port 30066.
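If the NodePort is needed in a script rather than read by eye, it can be pulled out of the kubectl get svc output. A small sketch, assuming the default column layout where PORT(S) is the fifth column; the function name node_port is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get svc` output on stdin and print the
# NodePort of the named service, e.g. 30066 from "9093:30066/TCP".
# Assumes the default column layout, where PORT(S) is the fifth column.
node_port() {
  awk -v svc="$1" '$1 == svc { split($5, p, "[:/]"); print p[2] }'
}
```

Usage: kubectl get svc -n monitor-sa | node_port alertmanager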

Open the prometheus web UI and click Status -> Targets.

The target list shows that kubernetes-controller-manager and kubernetes-schedule cannot connect to their ports.

This can be fixed as follows:

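vim /etc/kubernetes/manifests/kube-scheduler.yaml

Make the following changes:

Change --bind-address=127.0.0.1 to --bind-address=192.168.172.163.

Under the httpGet: fields, change host from 127.0.0.1 to 192.168.172.163.

Delete --port=0.

The kubernetes-controller-manager target fails for the same reason, so /etc/kubernetes/manifests/kube-controller-manager.yaml presumably needs the same changes.

After the changes, run systemctl restart kubelet on each k8s node.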
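The three manual edits can also be previewed with sed before touching the manifest. This is a sketch under assumptions: the function name patch_bind_address is made up, it prints the patched manifest to stdout instead of editing in place, and it expects the flags to appear exactly as written above.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a static pod manifest with the three
# edits applied. $1 = manifest path, $2 = node IP to bind to.
patch_bind_address() {
  sed -e "s/--bind-address=127\.0\.0\.1/--bind-address=$2/" \
      -e "s/host: 127\.0\.0\.1/host: $2/" \
      -e '/--port=0/d' "$1"
}
```

Usage: patch_bind_address /etc/kubernetes/manifests/kube-scheduler.yaml 192.168.172.163 and compare the output with the original before replacing the file.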

kubectl get cs then shows:

NAME                 STATUS    MESSAGE

controller-manager   Healthy   ok

scheduler            Healthy   ok

etcd-0               Healthy   {"health":"true"}

ss -antulp | grep :10251

ss -antulp | grep :10252

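The corresponding ports are now listening on the host.

Click Status -> Targets again. If the kubernetes-kube-proxy targets are still down, that is because the kube-proxy metrics port 10249 listens on 127.0.0.1 by default and has to be changed to listen on the node's address. Modify it as follows; in production it is better to make this change when k8s is installed, which carries less risk: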

kubectl edit configmap kube-proxy -n kube-system

Change the metricsBindAddress entry to metricsBindAddress: 0.0.0.0:10249

Then restart the kube-proxy pods

# kubectl get pods -n kube-system | grep kube-proxy | awk '{print $1}' | xargs kubectl delete pods -n kube-system

# ss -antulp | grep :10249

The output looks like this:

tcp LISTEN 0 128 [::]:10249
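To check this without reading the ss output by eye, the bound address can be filtered out with awk. A rough sketch: the function name bound_addresses is made up, and it assumes the local address is the fifth column, which is what ss prints when the Netid column is shown.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -antulp` output on stdin and print the local
# addresses bound to the given port. Assumes the local address is column 5.
bound_addresses() {
  awk -v suffix=":$1" '{
    start = length($5) - length(suffix) + 1
    if (start > 0 && substr($5, start) == suffix) print $5
  }'
}
```

Usage: ss -antulp | bound_addresses 10249. A result of 127.0.0.1:10249 means the port is still bound to loopback only.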

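Click Alerts to see the alerting rules and their current state.

Expand kubernetes-etcd to see the details of that alert.

FIRING means prometheus has already sent the alert to alertmanager, and the alert is visible in Alertmanager.

Open the alertmanager web UI by entering 192.168.172.163:30066 in the browser; the alert is listed there.

The alert email then arrives in my QQ mailbox.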
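To test the email path without waiting for a real rule to fire, an alert can be posted to Alertmanager by hand. The sketch below only builds the JSON body; the function name test_alert_payload and the label values are made up, and the /api/v1/alerts path in the usage line is assumed from the v0.14.0 image used in this setup (later Alertmanager releases expose /api/v2/alerts instead).

```shell
#!/bin/sh
# Hypothetical helper: print a one-alert JSON payload for Alertmanager.
# $1 = alertname, $2 = severity. Values are inserted verbatim, so they must
# not contain double quotes or backslashes.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"description":"manual test alert"}}]\n' "$1" "$2"
}
```

Usage: test_alert_payload TestAlert warning | curl -s -XPOST -H 'Content-Type: application/json' -d @- http://192.168.172.163:30066/api/v1/alerts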

After modifying any of the prometheus configuration files, re-apply them with kubectl so the changes take effect, in the following order:

# kubectl delete -f alertmanager-cm.yaml

# kubectl apply -f alertmanager-cm.yaml

# kubectl delete -f prometheus-alertmanager-cfg.yaml

# kubectl apply -f prometheus-alertmanager-cfg.yaml

# kubectl delete -f prometheus-alertmanager-deploy.yaml

# kubectl apply -f prometheus-alertmanager-deploy.yaml
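Since the same six commands get repeated after every change, they can be wrapped in a small function. A minimal sketch, not part of the original setup: the names run and reapply_monitoring and the DRY_RUN switch are made up for illustration, and the file names are the ones used throughout this post.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is printed instead of run.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Delete and re-apply the three manifests in the order listed above.
reapply_monitoring() {
  for f in alertmanager-cm.yaml prometheus-alertmanager-cfg.yaml prometheus-alertmanager-deploy.yaml; do
    run kubectl delete -f "$f"
    run kubectl apply -f "$f"
  done
}
```

Running DRY_RUN=1 reapply_monitoring only prints the six kubectl commands, which is a cheap way to confirm the order before running it for real.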
