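I. Prometheus Overview

    Prometheus is an open-source systems monitoring and alerting toolkit.

Key features:

  • A multi-dimensional data model (time series are identified by a metric name and key/value pairs)

  • A flexible query language (PromQL)

  • No dependency on distributed storage

  • Data collection over HTTP using a pull model

  • Pushing time series is supported through the Pushgateway

  • Targets are discovered through service discovery or static configuration

  • Multiple modes of graphing and dashboarding support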

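    The Prometheus architecture is shown below:

[Image 1: Prometheus architecture diagram]

    Prometheus components include the Prometheus server, the Pushgateway, Alertmanager, the web UI, and others.

    The Prometheus server periodically pulls data from its targets and persists it to disk. Prometheus can be configured with rules that it evaluates on a schedule, either to aggregate existing data and record it as new time series or to generate alerts. When an alert condition fires, Prometheus pushes the alert to the configured Alertmanager. Alertmanager then groups, deduplicates, and routes the alerts it receives to the configured receivers. The collected data can also be visualized with Grafana or any other consumer of the API.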
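
To make the rule evaluation described above more concrete, here is a minimal sketch of an alerting rule file in the Prometheus 2.x YAML rule format, written out by a small shell helper. The helper name, group name, alert name, 5-minute window, and severity label are all illustrative assumptions; none of them come from the kube-prometheus manifests.

```shell
# Write a sample alerting rule file in the Prometheus 2.x rule format.
# Everything named here (example.rules, TargetDown, 5m, warning) is
# illustrative and not taken from the kube-prometheus manifests.
write_example_rule() {
  # The quoted 'EOF' keeps the shell from expanding $labels.
  cat > "$1" <<'EOF'
groups:
- name: example.rules
  rules:
  - alert: TargetDown
    expr: up == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Target {{ $labels.instance }} has been down for 5 minutes"
EOF
}

write_example_rule example.rules.yaml
```

With a rule like this, Prometheus evaluates `up == 0` at every rule evaluation interval; once the expression has held for 5 minutes the alert fires and is sent to Alertmanager. If promtool is available, `promtool check rules example.rules.yaml` should validate the file.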


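II. Installing Prometheus Operator

1. Prometheus Operator simplifies deploying, managing, and running Prometheus and Alertmanager clusters on Kubernetes. The commands below download release v0.18.0, apply the operator bundle, and then deploy the kube-prometheus stack: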

# wget https://codeload.github.com/coreos/prometheus-operator/tar.gz/v0.18.0 -O prometheus-operator-0.18.0.tar.gz
# tar -zxvf prometheus-operator-0.18.0.tar.gz
# cd prometheus-operator-0.18.0
# kubectl apply -f bundle.yaml 
clusterrolebinding "prometheus-operator" configured
clusterrole "prometheus-operator" configured
serviceaccount "prometheus-operator" created
deployment "prometheus-operator" created
# cd contrib/kube-prometheus
# hack/cluster-monitoring/deploy
namespace "monitoring" created
clusterrolebinding "prometheus-operator" created
clusterrole "prometheus-operator" created
serviceaccount "prometheus-operator" created
service "prometheus-operator" created
deployment "prometheus-operator" created
Waiting for Operator to register custom resource definitions...done!
clusterrolebinding "node-exporter" created
clusterrole "node-exporter" created
daemonset "node-exporter" created
serviceaccount "node-exporter" created
service "node-exporter" created
clusterrolebinding "kube-state-metrics" created
clusterrole "kube-state-metrics" created
deployment "kube-state-metrics" created
rolebinding "kube-state-metrics" created
role "kube-state-metrics-resizer" created
serviceaccount "kube-state-metrics" created
service "kube-state-metrics" created
secret "grafana-credentials" created
secret "grafana-credentials" created
configmap "grafana-dashboard-definitions-0" created
configmap "grafana-dashboards" created
configmap "grafana-datasources" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-k8s-rules" created
serviceaccount "prometheus-k8s" created
servicemonitor "alertmanager" created
servicemonitor "kube-apiserver" created
servicemonitor "kube-controller-manager" created
servicemonitor "kube-scheduler" created
servicemonitor "kube-state-metrics" created
servicemonitor "kubelet" created
servicemonitor "node-exporter" created
servicemonitor "prometheus-operator" created
servicemonitor "prometheus" created
service "prometheus-k8s" created
prometheus "k8s" created
role "prometheus-k8s" created
role "prometheus-k8s" created
role "prometheus-k8s" created
clusterrole "prometheus-k8s" created
rolebinding "prometheus-k8s" created
rolebinding "prometheus-k8s" created
rolebinding "prometheus-k8s" created
clusterrolebinding "prometheus-k8s" created
secret "alertmanager-main" created
service "alertmanager-main" created
alertmanager "main" created 
# kubectl get pod -n monitoring
NAME                                   READY     STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2       Running   0          15h
alertmanager-main-1                    2/2       Running   0          15h
alertmanager-main-2                    2/2       Running   0          15h
grafana-567fcdf7b7-44ldd               1/1       Running   0          15h
kube-state-metrics-76b4dc5ffb-2vbh9    4/4       Running   0          15h
node-exporter-9wm8c                    2/2       Running   0          15h
node-exporter-kf6mq                    2/2       Running   0          15h
node-exporter-xtm4r                    2/2       Running   0          15h
prometheus-k8s-0                       2/2       Running   0          15h
prometheus-k8s-1                       2/2       Running   0          15h
prometheus-operator-7466f6887f-9nsk8   1/1       Running   0          15h
# kubectl -n monitoring get svc
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       NodePort    10.244.69.39     <none>        9093:30903/TCP      15h
alertmanager-operated   ClusterIP   None             <none>        9093/TCP,6783/TCP   15h
grafana                 NodePort    10.244.86.54     <none>        3000:30902/TCP      15h
kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP   15h
node-exporter           ClusterIP   None             <none>        9100/TCP            15h
prometheus-k8s          NodePort    10.244.226.104   <none>        9090:30900/TCP      15h
prometheus-operated     ClusterIP   None             <none>        9090/TCP            15h
prometheus-operator     ClusterIP   10.244.9.203     <none>        8080/TCP            15h
# kubectl -n monitoring get endpoints
NAME                    ENDPOINTS                                                        AGE
alertmanager-main       10.244.2.10:9093,10.244.35.4:9093,10.244.91.5:9093               15h
alertmanager-operated   10.244.2.10:9093,10.244.35.4:9093,10.244.91.5:9093 + 3 more...   15h
grafana                 10.244.2.8:3000                                                  15h
kube-state-metrics      10.244.2.9:9443,10.244.2.9:8443                                  15h
node-exporter           192.168.100.102:9100,192.168.100.103:9100,192.168.100.105:9100   15h
prometheus-k8s          10.244.2.11:9090,10.244.35.5:9090                                15h
prometheus-operated     10.244.2.11:9090,10.244.35.5:9090                                15h
prometheus-operator     10.244.35.3:8080                                                 15h
# kubectl -n monitoring get servicemonitors
NAME                      AGE
alertmanager              15h
kube-apiserver            15h
kube-controller-manager   15h
kube-scheduler            15h
kube-state-metrics        15h
kubelet                   15h
node-exporter             15h
prometheus                15h
prometheus-operator       15h
# kubectl get customresourcedefinitions
NAME                                    AGE
alertmanagers.monitoring.coreos.com     11d
prometheuses.monitoring.coreos.com      11d
servicemonitors.monitoring.coreos.com   11d
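
Rather than reading the pod listing by eye, the check can be scripted. The following is a rough sketch of a helper (the name `not_ready_pods` is my own, not part of kubectl) that reads `kubectl get pod` output on stdin and prints every pod whose READY count is incomplete or whose STATUS is not Running. It assumes the default column layout shown above.

```shell
# Hypothetical helper: read `kubectl get pod` output on stdin and print the
# names of pods that are not fully ready. Assumes the default columns
# NAME READY STATUS RESTARTS AGE, with READY in the form "ready/total".
not_ready_pods() {
  awk 'NR > 1 {
    split($2, r, "/")                      # r[1] = ready, r[2] = total
    if (r[1] != r[2] || $3 != "Running") print $1
  }'
}

# Usage against a live cluster (requires a configured kubectl):
#   kubectl -n monitoring get pod | not_ready_pods
```

Empty output means every pod in the listing is Running with all containers ready.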

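Note:

During deployment I changed all image addresses to pull from my local image registry, but some pods still pulled their images from the remote registry, as shown below:

[Image 2: pod still pulling its image from the remote registry]

In my case the alertmanager image could not be pulled. The workaround is to pull the image to one machine first, then package it and distribute it to every node: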

# docker save 23744b2d645c -o alertmanager-v0.14.0.tar.gz
# ansible node -m copy -a 'src=alertmanager-v0.14.0.tar.gz dest=/root'
# ansible node -a 'docker load -i /root/alertmanager-v0.14.0.tar.gz'
192.168.100.104 | SUCCESS | rc=0 >>
Loaded image ID: sha256:23744b2d645c0574015adfba4a90283b79251aee3169dbe67f335d8465a8a63f
192.168.100.103 | SUCCESS | rc=0 >>
Loaded image ID: sha256:23744b2d645c0574015adfba4a90283b79251aee3169dbe67f335d8465a8a63f
# ansible node -a 'docker images quay.io/prometheus/alertmanager'
192.168.100.103 | SUCCESS | rc=0 >>
REPOSITORY                        TAG                 IMAGE ID            CREATED             SIZE
quay.io/prometheus/alertmanager   v0.14.0             23744b2d645c        7 weeks ago         31.9MB

192.168.100.104 | SUCCESS | rc=0 >>
REPOSITORY                        TAG                 IMAGE ID            CREATED             SIZE
quay.io/prometheus/alertmanager   v0.14.0             23744b2d645c        7 weeks ago         31.9MB
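
One thing to watch for: when an image is saved by ID, as above, the archive carries no repository name or tag, which is why `docker load` reports only "Loaded image ID". A variant that saves by name and tag keeps the reference the pod spec expects. The sketch below is a dry run by default and only prints the commands; the `run` and `distribute_image` helpers and the `RUN` switch are my own naming, and the `node` inventory group is the one used above.

```shell
IMAGE="quay.io/prometheus/alertmanager:v0.14.0"
TARBALL="alertmanager-v0.14.0.tar"   # docker save writes a plain, uncompressed tar

# Print each command instead of executing it unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    printf '+ %s\n' "$*"
  fi
}

# Save the image by name:tag, copy it to every host in the ansible group
# "node", and load it there.
distribute_image() {
  run docker save "$IMAGE" -o "$TARBALL"
  run ansible node -m copy -a "src=$TARBALL dest=/root"
  run ansible node -a "docker load -i /root/$TARBALL"
}

distribute_image
```

Running it with `RUN=1` executes the three steps for real; in that case `docker load` should report the image by name rather than by ID.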


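2. Adding etcd monitoring

Prometheus Operator ships with an etcd dashboard, but extra configuration is required before it can display complete monitoring data. Official documentation: Monitoring external etcd


a. Create the secret in the monitoring namespace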

# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/ssl/ca.pem --from-file=/etc/kubernetes/ssl/etcd.pem --from-file=/etc/kubernetes/ssl/etcd-key.pem
secret "etcd-certs" created
# kubectl -n monitoring get secrets etcd-certs
NAME         TYPE      DATA      AGE
etcd-certs   Opaque    3         16h

Note: these certificates were created when the etcd cluster was deployed. Change the paths to wherever your own certificates are stored.


b. Give Prometheus access to the secret by listing it in the Prometheus resource

# vim manifests/prometheus/prometheus-k8s.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  labels:
    prometheus: k8s
spec:
  replicas: 2
  secrets:
  - etcd-certs
  version: v2.2.1
# kubectl -n monitoring replace -f manifests/prometheus/prometheus-k8s.yaml
prometheus "k8s" replaced

Note:

Only the following entry needs to be added:

  secrets:
  - etcd-certs


c. Create the Service, Endpoints, and ServiceMonitor

# vim manifests/prometheus/prometheus-etcd.yaml 
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.168.100.102
    nodeName: etcd1
  - ip: 192.168.100.103
    nodeName: etcd2
  - ip: 192.168.100.104
    nodeName: etcd3
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.pem
      certFile: /etc/prometheus/secrets/etcd-certs/etcd.pem
      keyFile: /etc/prometheus/secrets/etcd-certs/etcd-key.pem
      #use insecureSkipVerify only if you cannot use a Subject Alternative Name
      insecureSkipVerify: true 
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - monitoring
# kubectl -n monitoring create -f manifests/prometheus/prometheus-etcd.yaml

Note 1: change the etcd IP addresses and node names to match your own environment.
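
Since the Endpoints object is the part that changes from cluster to cluster, it can be generated from a list of IPs instead of edited by hand. This is only a sketch: the function name is hypothetical, and it assumes node names follow the etcd1, etcd2, ... pattern used above, which may not match your environment.

```shell
# Hypothetical helper: print an Endpoints manifest for the given etcd IPs.
# Node names are generated as etcd1, etcd2, ... (an assumption; adjust them
# if your nodes are named differently).
etcd_endpoints_yaml() {
  cat <<'EOF'
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
subsets:
- addresses:
EOF
  i=1
  for ip in "$@"; do
    printf '  - ip: %s\n    nodeName: etcd%d\n' "$ip" "$i"
    i=$((i + 1))
  done
  cat <<'EOF'
  ports:
  - name: api
    port: 2379
    protocol: TCP
EOF
}

etcd_endpoints_yaml 192.168.100.102 192.168.100.103 192.168.100.104
```

The output can be pasted into prometheus-etcd.yaml in place of the Endpoints document, or piped to `kubectl -n monitoring apply -f -`.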

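Note 2: secrets listed under spec.secrets of the Prometheus resource are mounted into the Prometheus pods at /etc/prometheus/secrets/<secret-name>/, so for the three entries under tlsConfig only the trailing file names (ca.pem, etcd.pem, etcd-key.pem) need to be changed to your own certificate names. If you are unsure, log in to a prometheus-k8s pod and check: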

# kubectl exec -ti -n monitoring prometheus-k8s-0 /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.pem        etcd-key.pem  etcd.pem


3. Once deployed, Prometheus Operator exposes three NodePorts: 30900 for Prometheus, 30902 for Grafana, and 30903 for Alertmanager.

The Prometheus UI is shown below. If everything is working, all targets should be up.

[Image 3: Prometheus targets page]
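
The "all targets up" check can also be done from the command line through the Prometheus HTTP API. The sketch below assumes curl is installed and that the NodePort is reachable from where you run it; the helper names are my own, and the node address in the example is simply one of the nodes used in this article.

```shell
# NodePort taken from the service listing earlier in this article.
PROM_PORT=30900

# Print the base URL of the Prometheus NodePort service on a given node.
prom_base_url() {
  printf 'http://%s:%s\n' "$1" "$PROM_PORT"
}

# Run an instant PromQL query through the /api/v1/query endpoint.
# $1 = node address, $2 = PromQL expression.
prom_query() {
  curl -sG "$(prom_base_url "$1")/api/v1/query" --data-urlencode "query=$2"
}

# Example: list targets that are currently down.
#   prom_query 192.168.100.102 'up == 0'
```

An empty `result` array in the JSON response means no target is reporting `up == 0`.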


Alertmanager is shown below:

[Images 4 and 5: Alertmanager web UI]

The Grafana dashboards are shown below:

[Image 6: Grafana dashboards]

The etcd dashboards are shown below:

[Images 7 and 8: etcd dashboards]

The Kubernetes cluster dashboards are shown below:

[Images 9 and 10: Kubernetes cluster dashboards]

The node dashboards are shown below:

[Images 11 and 12: node dashboards]