Monitoring a Kubernetes Cluster with Prometheus Operator

Prometheus Operator (hereafter simply "the Operator") provides the following capabilities:

1. Create/Destroy: easily launch a Prometheus instance in a Kubernetes namespace, so a specific application or team can adopt the Operator with little effort.
2. Simple configuration: configure the essentials of Prometheus, such as version, storage and replicas, through Kubernetes resources.
3. Target services via labels: scrape-target configuration is generated automatically from ordinary Kubernetes label queries, with no need to learn a Prometheus-specific configuration language.

The Prometheus Operator architecture is shown in the diagram below:
[architecture diagram]

In the architecture above, the components run in the Kubernetes cluster in different ways:

  • Operator: deploys and manages Prometheus Server according to custom resources (Custom Resource Definitions, CRDs), and watches for changes to those custom resources, reacting accordingly; it is the control center of the whole system.
  • Prometheus: declares the desired state of a Prometheus deployment; the Operator makes sure the running deployment always matches this definition.
  • Prometheus Server: the Prometheus Server cluster the Operator deploys from the contents of the Prometheus custom resource; these custom resources can be thought of as the handle for managing the StatefulSets behind the Prometheus Server cluster.
  • ServiceMonitor: declares which services to monitor, describing a set of targets Prometheus should scrape. It selects the corresponding Service endpoints via labels, and Prometheus Server pulls metrics from the selected Services (a minimal example follows this list).
  • Service: simply put, the object Prometheus monitors.
  • Alertmanager: declares the desired state of an Alertmanager deployment; the Operator makes sure the running deployment always matches this definition.
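
For illustration, a minimal ServiceMonitor sketch; the name, labels, namespace and port name below are hypothetical and are not taken from the manifests deployed later in this article:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app         # scrape Services carrying this label
  namespaceSelector:
    matchNames:
    - default                  # look for matching Services in this namespace
  endpoints:
  - port: web                  # must match a named port on the Service
    interval: 30s              # scrape interval

The Operator converts each ServiceMonitor like this into scrape configuration for the Prometheus Servers it manages.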

The installation and deployment steps follow.

https://github.com/prometheus-operator/kube-prometheus/tree/release-0.5

Prometheus Operator releases have Kubernetes version requirements; see the compatibility matrix on GitHub for details.

1. Download the YAML files from GitHub and group them into directories

[root@master70 manifests]# ll
total 12
drwxr-xr-x 2 root root  241 Aug 14 19:57 01-node-exporter
drwxr-xr-x 2 root root  189 Aug 14 19:58 02-alertmanage
drwxr-xr-x 2 root root  272 Aug 14 19:58 03-kube-state-metrics
drwxr-xr-x 2 root root  254 Aug 14 19:58 04-grafana
drwxr-xr-x 2 root root 4096 Aug 14 19:58 05-prometheus-adapter
drwxr-xr-x 2 root root 4096 Aug 14 19:59 06-prometheus
drwxr-xr-x 2 root root  228 Aug 14 20:03 add
drwxr-xr-x 2 root root 4096 Aug 14 20:06 setup
[root@master70 manifests]# pwd
/root/2monitor/kube-prometheus-release-0.5/manifests
## How the files were grouped, exported from shell history
  210  mkdir 01-node-exporter
  211  mv node-exporter-* 01-node-exporter/
  212  ll
  213  mkdir 02-alertmanage
  214  mv alertmanager-* 02-alertmanage/
  215  ll
  216  mkdir 03-kube-state-metrics
  217  mv kube-state-metrics-* 03-kube-state-metrics/
  218  ll
  219  mkdir 04-grafana
  220  mv grafana-* 04-grafana/
  221  ll
  222  mkdir 05-prometheus-adapter
  223  mv prometheus-adapter-* 05-prometheus-adapter/
  224  ll
  225  vim prometheus-clusterRole.yaml 
  226  vim prometheus-clusterRoleBinding.yaml 
  227  mkdor 06-prometheus
  228  mkdir 06-prometheus
  229  ll
  230  mv prometheus-* 06-prometheus/
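
The same grouping can be done in one pass; a cleaned-up sketch of the commands above (note that prometheus-adapter-* must be moved before prometheus-*):

mkdir -p 01-node-exporter 02-alertmanage 03-kube-state-metrics 04-grafana 05-prometheus-adapter 06-prometheus
mv node-exporter-*       01-node-exporter/
mv alertmanager-*        02-alertmanage/
mv kube-state-metrics-*  03-kube-state-metrics/
mv grafana-*             04-grafana/
mv prometheus-adapter-*  05-prometheus-adapter/
mv prometheus-*          06-prometheus/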

2. First, install the CRDs and the Operator resource objects from the setup directory:

[root@master70 manifests]# cd setup/
[root@master70 setup]# ll
total 952
-rw-r--r-- 1 root root     60 Jun 24 19:32 0namespace-namespace.yaml
-rw-r--r-- 1 root root 268343 Jun 24 19:32 prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
-rw-r--r-- 1 root root  12635 Jun 24 19:32 prometheus-operator-0podmonitorCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 348686 Jun 24 19:32 prometheus-operator-0prometheusCustomResourceDefinition.yaml
-rw-r--r-- 1 root root   3605 Jun 24 19:32 prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
-rw-r--r-- 1 root root  23305 Jun 24 19:32 prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
-rw-r--r-- 1 root root 279736 Jun 24 19:32 prometheus-operator-0thanosrulerCustomResourceDefinition.yaml
-rw-r--r-- 1 root root    425 Jun 24 19:32 prometheus-operator-clusterRoleBinding.yaml
-rw-r--r-- 1 root root   1665 Jun 24 19:32 prometheus-operator-clusterRole.yaml
-rw-r--r-- 1 root root   1943 Jun 24 19:32 prometheus-operator-deployment.yaml
-rw-r--r-- 1 root root    239 Jun 24 19:32 prometheus-operator-serviceAccount.yaml
-rw-r--r-- 1 root root    422 Jun 24 19:32 prometheus-operator-service.yaml

Create them directly:

[root@master70 setup]# kubectl  create -f .
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
Error from server (AlreadyExists): error when creating "0namespace-namespace.yaml": namespaces "monitoring" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-0podmonitorCustomResourceDefinition.yaml": customresourcedefinitions.apiextensions.k8s.io "podmonitors.monitoring.coreos.com" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-0thanosrulerCustomResourceDefinition.yaml": customresourcedefinitions.apiextensions.k8s.io "thanosrulers.monitoring.coreos.com" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-clusterRole.yaml": clusterroles.rbac.authorization.k8s.io "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-clusterRoleBinding.yaml": clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-deployment.yaml": deployments.apps "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-service.yaml": services "prometheus-operator" already exists
Error from server (AlreadyExists): error when creating "prometheus-operator-serviceAccount.yaml": serviceaccounts "prometheus-operator" already exists

If the output reports that the CRDs (or other objects) already exist, delete them first:

[root@master70 setup]# kubectl delete -f .
namespace "monitoring" deleted
customresourcedefinition.apiextensions.k8s.io "alertmanagers.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "podmonitors.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "prometheusrules.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
customresourcedefinition.apiextensions.k8s.io "thanosrulers.monitoring.coreos.com" deleted
clusterrole.rbac.authorization.k8s.io "prometheus-operator" deleted
clusterrolebinding.rbac.authorization.k8s.io "prometheus-operator" deleted
deployment.apps "prometheus-operator" deleted
service "prometheus-operator" deleted
serviceaccount "prometheus-operator" deleted



[root@master70 setup]# kubectl delete crd prometheuses.monitoring.coreos.com
[root@master70 setup]# kubectl delete crd prometheusrules.monitoring.coreos.com
[root@master70 setup]# kubectl delete crd servicemonitors.monitoring.coreos.com
[root@master70 setup]# kubectl delete crd alertmanagers.monitoring.coreos.com

Then recreate everything:

[root@master70 setup]# kubectl  create -f .
namespace/monitoring created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com created
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created    
[root@master70 setup]# kubectl  get crd -n monitoring
NAME                                          CREATED AT
alertmanagers.monitoring.coreos.com           2020-08-14T12:10:54Z
bgpconfigurations.crd.projectcalico.org       2020-08-14T10:57:22Z
bgppeers.crd.projectcalico.org                2020-08-14T10:57:22Z
blockaffinities.crd.projectcalico.org         2020-08-14T10:57:22Z
clusterauthtokens.cluster.cattle.io           2020-08-14T10:57:46Z
clusterinformations.crd.projectcalico.org     2020-08-14T10:57:22Z
clusteruserattributes.cluster.cattle.io       2020-08-14T10:57:46Z
felixconfigurations.crd.projectcalico.org     2020-08-14T10:57:22Z
globalnetworkpolicies.crd.projectcalico.org   2020-08-14T10:57:22Z
globalnetworksets.crd.projectcalico.org       2020-08-14T10:57:22Z
hostendpoints.crd.projectcalico.org           2020-08-14T10:57:22Z
ipamblocks.crd.projectcalico.org              2020-08-14T10:57:22Z
ipamconfigs.crd.projectcalico.org             2020-08-14T10:57:22Z
ipamhandles.crd.projectcalico.org             2020-08-14T10:57:22Z
ippools.crd.projectcalico.org                 2020-08-14T10:57:22Z
networkpolicies.crd.projectcalico.org         2020-08-14T10:57:23Z
networksets.crd.projectcalico.org             2020-08-14T10:57:23Z
podmonitors.monitoring.coreos.com             2020-08-14T12:10:54Z
prometheuses.monitoring.coreos.com            2020-08-14T12:10:54Z
prometheusrules.monitoring.coreos.com         2020-08-14T12:10:54Z
servicemonitors.monitoring.coreos.com         2020-08-14T12:10:54Z
thanosrulers.monitoring.coreos.com            2020-08-14T12:10:55Z


# Check that the Operator pod is Running
[root@master70 setup]# kubectl get pod -n monitoring -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES
prometheus-operator-574fd8ccd9-xlx8t   2/2     Running   0          14m   10.42.1.11   node71              
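
Before deploying the remaining components, you can optionally wait for the new CRDs to be registered; a sketch using kubectl wait with the CRD names created above:

kubectl wait --for condition=Established --timeout=60s \
  crd/prometheuses.monitoring.coreos.com \
  crd/servicemonitors.monitoring.coreos.com \
  crd/podmonitors.monitoring.coreos.com \
  crd/alertmanagers.monitoring.coreos.com \
  crd/prometheusrules.monitoring.coreos.com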

3. Change the Service type for grafana, prometheus and alertmanager to NodePort (or expose them externally through an Ingress instead; a sketch follows the manifests below).

[root@master70 manifests]# cat 02-alertmanage/alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    alertmanager: main
  name: alertmanager-main
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    port: 9093
    targetPort: web
    nodePort: 30093
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: ClientIP
[root@master70 manifests]# cat 04-grafana/grafana-service.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: http
    port: 3000
    targetPort: http
    nodePort: 30030
  selector:
    app: grafana
[root@master70 manifests]# cat 06-prometheus/prometheus-service.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: web
    nodePort: 30090
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP
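
If you prefer Ingress over NodePort, a minimal sketch for Grafana is below. It assumes an ingress controller is already running in the cluster; grafana.example.com is a placeholder host, and the networking.k8s.io/v1beta1 API is used because release-0.5 targets Kubernetes versions where Ingress v1 is not yet available:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.example.com          # placeholder hostname
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana         # the grafana Service shown above
          servicePort: 3000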

4. Deploy node-exporter

[root@master70 manifests]# kubectl  apply -f 01-node-exporter/
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created

[root@master70 manifests]# kubectl  get pod -n monitoring                            
NAME                                   READY   STATUS    RESTARTS   AGE
node-exporter-m9pgp                    2/2     Running   0          3m56s
node-exporter-tsmfs                    2/2     Running   0          3m56s
prometheus-operator-574fd8ccd9-xlx8t   2/2     Running   0          25m
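
A quick sanity check, as a sketch: confirm the DaemonSet has a pod on every node and that the ServiceMonitor exists, since the ServiceMonitor is what tells Prometheus to scrape the exporters:

kubectl get daemonset node-exporter -n monitoring
kubectl get servicemonitor node-exporter -n monitoring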

5. Deploy kube-state-metrics

[root@master70 manifests]# kubectl apply -f 02-kube-state-metrics/
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created

[root@master70 manifests]# kubectl  get pod -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
kube-state-metrics-bdb8874fd-xptb5     3/3     Running   0          58s

6. Deploy Prometheus.

First, extend the Prometheus configuration with auto-discovery so that Services/Pods in the Kubernetes cluster are discovered automatically.

  • Create the auto-discovery (additional scrape config) file:
[root@master70 06-prometheus]# cat ../add/prometheus-additional.yaml 
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
  • For a Service in the cluster to be auto-discovered, it must declare prometheus.io/scrape=true in its annotations (a hypothetical annotated Service is sketched after the next command). Save the file above as prometheus-additional.yaml and create a Secret object from it:
[root@master70 add]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
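For reference, a hypothetical Service carrying the annotations that the relabel rules above key on (name, namespace, path and port are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-app                          # placeholder name
  namespace: default
  annotations:
    prometheus.io/scrape: "true"        # matched by the keep rule above
    prometheus.io/scheme: "http"        # optional, rewrites __scheme__
    prometheus.io/path: "/metrics"      # optional, rewrites __metrics_path__
    prometheus.io/port: "8080"          # rewritten into __address__
spec:
  selector:
    app: my-app
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080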
  • Then reference this extra configuration from the prometheus resource object via the additionalScrapeConfigs field (the last three lines below):
[root@master70 06-prometheus]# tail prometheus-prometheus.yaml 
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.15.2
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
[root@master70 manifests]# kubectl  apply -f 06-prometheus/
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-operator created
prometheus.monitoring.coreos.com/k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-rules created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
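
Once applied, the Operator creates a StatefulSet named prometheus-k8s from the Prometheus custom resource; a quick check (sketch):

kubectl get statefulset,pod -n monitoring | grep prometheus-k8s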




Also apply the prometheus-adapter manifests; prometheus-adapter exposes the Kubernetes resource metrics API (metrics.k8s.io) backed by Prometheus:
[root@master70 manifests]# kubectl  apply -f 05-prometheus-adapter/
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io configured
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created

  • Access Prometheus on the NodePort exposed earlier (30090):
[root@master70 manifests]# kubectl  get svc -n monitoring
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
kube-state-metrics    ClusterIP   None                   8443/TCP,9443/TCP   3m45s
node-exporter         ClusterIP   None                   9100/TCP            8m29s
prometheus-adapter    ClusterIP   10.43.114.84           443/TCP             37s
prometheus-k8s        NodePort    10.43.45.89            9090:30090/TCP      68s
prometheus-operated   ClusterIP   None                   9090/TCP            69s
prometheus-operator   ClusterIP   None                   8443/TCP            30m
  • A number of alerting rules are already defined in the operator manifests; see the listing sketch below.


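These rules are delivered as PrometheusRule custom resources and can be listed directly (a sketch):

kubectl get prometheusrules -n monitoring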

However, switching to the Targets page does not show the corresponding scrape jobs. Check the Prometheus Pod logs:

$ kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
......
level=error ts=2020-08-14T02:38:27.800Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:261: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-08-14T02:38:27.801Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"

Every error is of the form "xxx is forbidden", so this is an RBAC permissions problem. From the prometheus resource object we know that Prometheus is bound to a ServiceAccount named prometheus-k8s, which in turn is bound to a ClusterRole named prometheus-k8s.
Edit the ClusterRole manifest to grant the missing permissions:

[root@master70 06-prometheus]# cat prometheus-clusterRole.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

[root@master70 06-prometheus]# kubectl  apply -f .
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured

Re-apply the manifests under the prometheus directory; after a short wait, the endpoints targets show up on the Prometheus dashboard.



7. Deploy Grafana

[root@master70 manifests]# kubectl apply -f 07-grafana/
secret/grafana-datasources created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-statefulset created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created


[root@master70 manifests]# kubectl  get pod,svc -n monitoring|grep grafana
pod/grafana-5c55845445-9hsjj               1/1     Running   0          46s
service/grafana               NodePort    10.43.173.121           3000:30030/TCP      48s

Access Grafana.
The default credentials are admin/admin; you will be asked to change the password on first login.
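
A quick reachability check against the NodePort (a sketch; <node-ip> is any node's address, and /api/health is Grafana's unauthenticated health endpoint):

curl -s http://<node-ip>:30030/api/health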


Many official dashboard templates are already provided out of the box.



8. Deploy the alerting components

[root@master70 manifests]# kubectl  apply -f 08-alertmanage/
alertmanager.monitoring.coreos.com/main created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager created

[root@master70 manifests]# kubectl  get pod,svc -n monitoring|grep alertman
pod/alertmanager-main-0                    2/2     Running   0          16s
pod/alertmanager-main-1                    2/2     Running   0          16s
pod/alertmanager-main-2                    2/2     Running   0          15s
service/alertmanager-main       NodePort    10.43.34.192            9093:30093/TCP               16s
service/alertmanager-operated   ClusterIP   None                    9093/TCP,9094/TCP,9094/UDP   16s

Access the Alertmanager page.



The default view is the alerts page; check whether any alerts are firing.
Clicking Status in the menu bar shows the running configuration, which comes from the alertmanager-secret.yaml file created earlier (the configuration is stored base64-encoded):
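
To inspect the live configuration from the command line instead, decode the Secret directly (a sketch):

kubectl get secret alertmanager-main -n monitoring \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d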



9. Configure DingTalk alerting

[root@master70 add]# cat 08-dingtalk-hook.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingding-config
  namespace: monitoring
data:
  config.yml: |-
    templates:
     - /etc/prometheus-webhook-dingtalk/templet.tmpl
    targets:
      webhook2:
        url: https://oapi.dingtalk.com/robot/send?access_token=8c1293fb6418399f21526087f08cf6d241192531664a0de7e3a86004652
  templet.tmpl: |-
                {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
                {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
                
                {{ define "__text_alert_list" }}{{ range . }}
                **Labels**
                {{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
                {{ end }}
                **Annotations**
                {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
                {{ end }}
                **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
                {{ end }}{{ end }}
                
                {{ define "default.__text_alert_list" }}{{ range . }}
                ---
                **告警级别:** {{ .Labels.severity | upper }}
                
                **运营团队:** {{ .Labels.team | upper }}
                
                **触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
                
                **事件信息:** 
                {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
                
                
                {{ end }}
                
                **事件标签:**
                {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
                {{ end }}{{ end }}
                {{ end }}
                {{ end }}
                {{ define "default.__text_alertresovle_list" }}{{ range . }}
                ---
                **告警级别:** {{ .Labels.severity | upper }}
                
                **运营团队:** {{ .Labels.team | upper }}
                
                **触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
                
                **结束时间:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
                
                **事件信息:**
                {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
                
                
                {{ end }}
                
                **事件标签:**
                {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
                {{ end }}{{ end }}
                {{ end }}
                {{ end }}
                
                {{/* Default */}}
                {{ define "default.title" }}{{ template "__subject" . }}{{ end }}
                {{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
                {{ if gt (len .Alerts.Firing) 0 -}}
                
                ![警报 图标](https://ss0.bdstatic.com/70cFuHSh_Q1YnxGkpoWK1HF6hhy/it/u=3626076420,1196179712&fm=15&gp=0.jpg)
                **====侦测到故障====**
                {{ template "default.__text_alert_list" .Alerts.Firing }}
                
                
                {{- end }}
                
                {{ if gt (len .Alerts.Resolved) 0 -}}
                {{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
                
                
                {{- end }}
                {{- end }}
                
                {{/* Legacy */}}
                {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
                {{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
                {{ template "__text_alert_list" .Alerts.Firing }}
                {{- end }}
                
                {{/* Following names for compatibility */}}
                {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
                {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingtalk-hook
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk-hook
  template:
    metadata:
      labels:
        app: dingtalk-hook
    spec:
      containers:
      - name: dingtalk-hook
        image: timonwong/prometheus-webhook-dingtalk:v1.2.2
        args:
          - '--web.listen-address=0.0.0.0:8060'
          - '--log.level=info'
          - '--config.file=/etc/prometheus-webhook-dingtalk/config.yml'
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 8060
        resources:
          requests:
            cpu: 100m
            memory: 32Mi
          limits:
            cpu: 200m
            memory: 64Mi
        volumeMounts:
          - mountPath: /etc/prometheus-webhook-dingtalk/
            name: config-yml
          - mountPath: /etc/prometheus-webhook-dingtalk/
            name: template-conf
      volumes:
        - configMap:
            defaultMode: 420
            name: dingding-config
          name: config-yml
        - configMap:
            name: dingding-config
          name: template-conf 
---
apiVersion: v1
kind: Service
metadata:
  name: dingtalk-hook
  namespace: monitoring
spec:
  ports:
    - port: 8060
      protocol: TCP
      targetPort: 8060
      name: http
  selector:
    app: dingtalk-hook
  type: ClusterIP

Applying it fails with an error:

[root@master70 add]# kubectl  apply -f 08-dingtalk-hook.yaml 
configmap/dingding-config created
service/dingtalk-hook created
The Deployment "dingtalk-hook" is invalid: spec.template.spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/etc/prometheus-webhook-dingtalk/": must be unique

As a workaround, first comment out the second mount:

          - mountPath: /etc/prometheus-webhook-dingtalk/
            name: template-conf

After the Deployment has been created, uncomment those lines and apply again:

[root@master70 add]# vim 08-dingtalk-hook.yaml 
[root@master70 add]# kubectl  apply -f 08-dingtalk-hook.yaml 
configmap/dingding-config unchanged
deployment.apps/dingtalk-hook created
service/dingtalk-hook unchanged
[root@master70 add]# vim 08-dingtalk-hook.yaml   
[root@master70 add]# kubectl  apply -f 08-dingtalk-hook.yaml 
configmap/dingding-config unchanged
deployment.apps/dingtalk-hook configured
service/dingtalk-hook unchanged
[root@master70 add]# kubectl  get pod -n monitoring|grep ding
dingtalk-hook-85894948b7-dcwrg         1/1     Running   0          2m16s
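
Note that the second mount is not actually needed: config.yml and templet.tmpl are both keys of the same dingding-config ConfigMap, and a ConfigMap volume projects all of its keys as files, so a single mount already places both files under /etc/prometheus-webhook-dingtalk/. A sketch of the simplified volumeMounts/volumes section:

        volumeMounts:
          - mountPath: /etc/prometheus-webhook-dingtalk/
            name: config
      volumes:
        - name: config
          configMap:
            name: dingding-config       # both config.yml and templet.tmpl are projected here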

Configure Alertmanager.
The configuration defines two receivers: email and a webhook (the DingTalk hook above).
The Secret file stores the configuration base64-encoded.

[root@master70 08-alertmanage]# cat alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
data:
  alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KICBzbXRwX3NtYXJ0aG9zdDogJ3NtdHAuMTYzLmNvbToyNScKICBzbXRwX2Zyb206ICdjZG5samNAMTYzLmNvbScKICBzbXRwX2F1dGhfdXNlcm5hbWU6ICdjZG5samNAMTYzLmNvbScKICBzbXRwX2F1dGhfcGFzc3dvcmQ6ICdsamNjLmNvbScKICBzbXRwX2hlbGxvOiAnMTYzLmNvbScKICBzbXRwX3JlcXVpcmVfdGxzOiBmYWxzZQpyb3V0ZToKICBncm91cF9ieTogWydqb2InLCAnc2V2ZXJpdHknXQogIGdyb3VwX3dhaXQ6IDMwcwogIGdyb3VwX2ludGVydmFsOiAzMHMKICByZXBlYXRfaW50ZXJ2YWw6IDFoCiAgcmVjZWl2ZXI6IGRlZmF1bHQKICByb3V0ZXM6CiAgLSByZWNlaXZlcjogd2ViaG9vawogICAgbWF0Y2hfcmU6CiAgICAgIHNldmVyaXR5OiBjcml0aWNhbHx3YXJuaW5nfG5vbmV8ZXJyb3IKcmVjZWl2ZXJzOgotIG5hbWU6ICdkZWZhdWx0JwogIGVtYWlsX2NvbmZpZ3M6CiAgLSB0bzogJzgwNzAxMTUyNUBxcS5jb20nCiAgICBzZW5kX3Jlc29sdmVkOiB0cnVlCi0gbmFtZTogJ3dlYmhvb2snCiAgd2ViaG9va19jb25maWdzOgogIC0gdXJsOiAnaHR0cDovL2Rpbmd0YWxrLWhvb2s6ODA2MC9kaW5ndGFsay93ZWJob29rMi9zZW5kJwogICAgc2VuZF9yZXNvbHZlZDogdHJ1ZQp0ZW1wbGF0ZXM6Ci0gJyoudG1wbCc=

The decoded configuration is:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxx'
  smtp_hello: '163.com'
  smtp_require_tls: false
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1h
  receiver: default
  routes:
  - receiver: webhook
    match_re:
      severity: critical|warning|none|error
receivers:
- name: 'default'
  email_configs:
  - to: '[email protected]'
    send_resolved: true
- name: 'webhook'
  webhook_configs:
  - url: 'http://dingtalk-hook:8060/dingtalk/webhook2/send'
    send_resolved: true
templates:
- '*.tmpl'
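
When changing this configuration, it is easier to keep the plain alertmanager.yaml on disk and let kubectl do the base64 encoding; a sketch (the --dry-run=client form needs kubectl 1.18+, older clients use plain --dry-run):

kubectl create secret generic alertmanager-main \
  --from-file=alertmanager.yaml -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -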

Apply this file:

[root@master70 08-alertmanage]# kubectl  apply -f alertmanager-secret.yaml
secret/alertmanager-main configured

Modify an alerting rule to test the DingTalk notification:

The alerting rules file used in this article is prometheus-k8s-rules.yaml.
Link: https://pan.baidu.com/s/1O8XNLxvM6ERmhGFjD3u7eg
Extraction code: trna
## Lower the threshold and duration so the alert fires, then re-apply the rules file
    - alert: CPU报警
      annotations:
        description: '{{$labels.instance}}: NodeCpu usage above 85% (current value:
          {{ $value }}'
        summary: '{{$labels.instance}}: High NodeCpu usage detected'
      expr: 100 - (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100) > 1
      for: 1s
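
After the test fires, the rule can be restored to a realistic threshold; a sketch that averages idle CPU per instance (using node-exporter's node_cpu_seconds_total, matching the 85% mentioned in the description):

    - alert: CPU报警
      annotations:
        description: '{{$labels.instance}}: NodeCpu usage above 85% (current value: {{ $value }})'
        summary: '{{$labels.instance}}: High NodeCpu usage detected'
      expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
      for: 5m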

The DingTalk notification is received successfully.

To be continued.
