I. Prometheus Overview
Prometheus is an open-source systems monitoring and alerting toolkit.
Key features:
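A multi-dimensional data model (a time series is identified by a metric name and key/value pairs)
A flexible query language (PromQL)
No dependency on distributed storage
Data is collected over HTTP using a pull model
Time series can also be pushed through the Pushgateway
Targets are discovered through service discovery or static configuration
Support for multiple kinds of graphs and dashboards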
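The Prometheus architecture is as follows:
Prometheus components include the Prometheus server, Pushgateway, Alertmanager, the web UI, and others.
The Prometheus server periodically pulls data from its sources and persists it to disk. Prometheus can be configured with rules that it evaluates on a schedule: recording rules aggregate data and record new time series, and alerting rules push an alert to the configured Alertmanager when their condition fires. When Alertmanager receives alerts, it groups, deduplicates, and routes them according to its configuration and sends notifications. The collected data can also be visualized through other API clients or Grafana.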
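To make the rule-evaluation step concrete, here is a minimal sketch of an alerting rule in the Prometheus 2.x rule-file format, written out from the shell. The group name, the InstanceDown alert, the five-minute window, the severity label, and the RULE_FILE path are all illustrative assumptions and are not taken from this deployment; how Prometheus Operator loads rules into the cluster is not covered by this sketch.

```shell
# Illustrative only: RULE_FILE and the InstanceDown rule are assumptions.
RULE_FILE="${RULE_FILE:-$(mktemp)}"

# Write a rule group with a single alerting rule: fire when a scrape
# target has been unreachable (up == 0) for five minutes.
cat > "$RULE_FILE" <<'EOF'
groups:
- name: example.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
EOF

echo "wrote $RULE_FILE"
```

A file like this can be validated with promtool check rules before it is loaded.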
II. Installing Prometheus Operator
1. Prometheus Operator simplifies deploying, managing, and running Prometheus and Alertmanager clusters on Kubernetes.
# wget https://codeload.github.com/coreos/prometheus-operator/tar.gz/v0.18.0 -O prometheus-operator-0.18.0.tar.gz
# tar -zxvf prometheus-operator-0.18.0.tar.gz
# cd prometheus-operator-0.18.0
# kubectl apply -f bundle.yaml
clusterrolebinding "prometheus-operator" configured
clusterrole "prometheus-operator" configured
serviceaccount "prometheus-operator" created
deployment "prometheus-operator" created
# cd contrib/kube-prometheus
# hack/cluster-monitoring/deploy
namespace "monitoring" created
clusterrolebinding "prometheus-operator" created
clusterrole "prometheus-operator" created
serviceaccount "prometheus-operator" created
service "prometheus-operator" created
deployment "prometheus-operator" created
Waiting for Operator to register custom resource definitions...done!
clusterrolebinding "node-exporter" created
clusterrole "node-exporter" created
daemonset "node-exporter" created
serviceaccount "node-exporter" created
service "node-exporter" created
clusterrolebinding "kube-state-metrics" created
clusterrole "kube-state-metrics" created
deployment "kube-state-metrics" created
rolebinding "kube-state-metrics" created
role "kube-state-metrics-resizer" created
serviceaccount "kube-state-metrics" created
service "kube-state-metrics" created
secret "grafana-credentials" created
secret "grafana-credentials" created
configmap "grafana-dashboard-definitions-0" created
configmap "grafana-dashboards" created
configmap "grafana-datasources" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-k8s-rules" created
serviceaccount "prometheus-k8s" created
servicemonitor "alertmanager" created
servicemonitor "kube-apiserver" created
servicemonitor "kube-controller-manager" created
servicemonitor "kube-scheduler" created
servicemonitor "kube-state-metrics" created
servicemonitor "kubelet" created
servicemonitor "node-exporter" created
servicemonitor "prometheus-operator" created
servicemonitor "prometheus" created
service "prometheus-k8s" created
prometheus "k8s" created
role "prometheus-k8s" created
role "prometheus-k8s" created
role "prometheus-k8s" created
clusterrole "prometheus-k8s" created
rolebinding "prometheus-k8s" created
rolebinding "prometheus-k8s" created
rolebinding "prometheus-k8s" created
clusterrolebinding "prometheus-k8s" created
secret "alertmanager-main" created
service "alertmanager-main" created
alertmanager "main" created

# kubectl get pod -n monitoring
NAME                                   READY     STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2       Running   0          15h
alertmanager-main-1                    2/2       Running   0          15h
alertmanager-main-2                    2/2       Running   0          15h
grafana-567fcdf7b7-44ldd               1/1       Running   0          15h
kube-state-metrics-76b4dc5ffb-2vbh9    4/4       Running   0          15h
node-exporter-9wm8c                    2/2       Running   0          15h
node-exporter-kf6mq                    2/2       Running   0          15h
node-exporter-xtm4r                    2/2       Running   0          15h
prometheus-k8s-0                       2/2       Running   0          15h
prometheus-k8s-1                       2/2       Running   0          15h
prometheus-operator-7466f6887f-9nsk8   1/1       Running   0          15h

# kubectl -n monitoring get svc
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       NodePort    10.244.69.39     <none>        9093:30903/TCP      15h
alertmanager-operated   ClusterIP   None             <none>        9093/TCP,6783/TCP   15h
grafana                 NodePort    10.244.86.54     <none>        3000:30902/TCP      15h
kube-state-metrics      ClusterIP   None             <none>        8443/TCP,9443/TCP   15h
node-exporter           ClusterIP   None             <none>        9100/TCP            15h
prometheus-k8s          NodePort    10.244.226.104   <none>        9090:30900/TCP      15h
prometheus-operated     ClusterIP   None             <none>        9090/TCP            15h
prometheus-operator     ClusterIP   10.244.9.203     <none>        8080/TCP            15h

# kubectl -n monitoring get endpoints
NAME                    ENDPOINTS                                                        AGE
alertmanager-main       10.244.2.10:9093,10.244.35.4:9093,10.244.91.5:9093               15h
alertmanager-operated   10.244.2.10:9093,10.244.35.4:9093,10.244.91.5:9093 + 3 more...   15h
grafana                 10.244.2.8:3000                                                  15h
kube-state-metrics      10.244.2.9:9443,10.244.2.9:8443                                  15h
node-exporter           192.168.100.102:9100,192.168.100.103:9100,192.168.100.105:9100   15h
prometheus-k8s          10.244.2.11:9090,10.244.35.5:9090                                15h
prometheus-operated     10.244.2.11:9090,10.244.35.5:9090                                15h
prometheus-operator     10.244.35.3:8080                                                 15h

# kubectl -n monitoring get servicemonitors
NAME                      AGE
alertmanager              15h
kube-apiserver            15h
kube-controller-manager   15h
kube-scheduler            15h
kube-state-metrics        15h
kubelet                   15h
node-exporter             15h
prometheus                15h
prometheus-operator       15h

# kubectl get customresourcedefinitions
NAME                                    AGE
alertmanagers.monitoring.coreos.com     11d
prometheuses.monitoring.coreos.com      11d
servicemonitors.monitoring.coreos.com   11d
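Note:
During deployment I changed all image addresses so that images are pulled from a local registry, but some pods still pulled their images from the remote registry, as shown below:
In my case the alertmanager image could not be pulled. The workaround is to pull the image onto one machine first, then package it and distribute it to every node: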
# docker save 23744b2d645c -o alertmanager-v0.14.0.tar.gz
# ansible node -m copy -a 'src=alertmanager-v0.14.0.tar.gz dest=/root'
# ansible node -a 'docker load -i /root/alertmanager-v0.14.0.tar.gz'
192.168.100.104 | SUCCESS | rc=0 >>
Loaded image ID: sha256:23744b2d645c0574015adfba4a90283b79251aee3169dbe67f335d8465a8a63f
192.168.100.103 | SUCCESS | rc=0 >>
Loaded image ID: sha256:23744b2d645c0574015adfba4a90283b79251aee3169dbe67f335d8465a8a63f

# ansible node -a 'docker images quay.io/prometheus/alertmanager'
192.168.100.103 | SUCCESS | rc=0 >>
REPOSITORY                        TAG       IMAGE ID       CREATED       SIZE
quay.io/prometheus/alertmanager   v0.14.0   23744b2d645c   7 weeks ago   31.9MB
192.168.100.104 | SUCCESS | rc=0 >>
REPOSITORY                        TAG       IMAGE ID       CREATED       SIZE
quay.io/prometheus/alertmanager   v0.14.0   23744b2d645c   7 weeks ago   31.9MB
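One detail worth noting: an archive produced by docker save from a bare image ID carries no repository or tag, which is why the load step above reports "Loaded image ID". A possible variant is to save by repository:tag, so that each node loads the image under the name the pod spec asks for. The sketch below only prints the commands unless DRY_RUN is set to 0; the run wrapper, the distribute_image helper, and the archive name are assumptions made for illustration, while the ansible group node is the one used above.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (the default here).
# The printed form is for reading; it does not preserve the original quoting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# distribute_image IMAGE ARCHIVE: hypothetical helper that saves IMAGE by
# repository:tag (so the tag survives 'docker load'), copies the archive to
# the hosts in the ansible group 'node', and loads it there.
distribute_image() {
  image="$1"
  archive="$2"
  run docker save -o "$archive" "$image"
  run ansible node -m copy -a "src=$archive dest=/root"
  run ansible node -a "docker load -i /root/$archive"
}

distribute_image quay.io/prometheus/alertmanager:v0.14.0 alertmanager-v0.14.0.tar
```

Running it with DRY_RUN=0 executes the three steps for real; the dry-run default makes it safe to paste and inspect first.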
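2. Adding etcd monitoring
Prometheus Operator ships with an etcd dashboard, but extra configuration is required before etcd is fully monitored and displayed. Official documentation: Monitoring external etcd
a. Create the secret in the namespace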
# kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/ssl/ca.pem --from-file=/etc/kubernetes/ssl/etcd.pem --from-file=/etc/kubernetes/ssl/etcd-key.pem
secret "etcd-certs" created
# kubectl -n monitoring get secrets etcd-certs
NAME         TYPE      DATA      AGE
etcd-certs   Opaque    3         16h
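Note: these certificates were created when the etcd cluster was deployed; change the paths to wherever your own certificates are stored.
b. Make the secret available to Prometheus through Prometheus Operator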
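Because the secret is built from three files, a missing or misnamed certificate tends to surface only later as a failed scrape. A small pre-flight check can catch that earlier. The check_etcd_certs function below is a hypothetical helper, not part of kubectl or Prometheus Operator; the file names are the ones used in the command above.

```shell
# check_etcd_certs DIR: verify that the three files used for the secret
# exist in DIR and are readable. Returns 0 if all are present, 1 otherwise.
check_etcd_certs() {
  dir="$1"
  missing=0
  for f in ca.pem etcd.pem etcd-key.pem; do
    if [ ! -r "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be chained in front of the secret creation, for example check_etcd_certs /etc/kubernetes/ssl && kubectl -n monitoring create secret generic etcd-certs ..., with the directory adjusted to your own certificate path.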
# vim manifests/prometheus/prometheus-k8s.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  labels:
    prometheus: k8s
spec:
  replicas: 2
  secrets:
  - etcd-certs
  version: v2.2.1
# kubectl -n monitoring replace -f manifests/prometheus/prometheus-k8s.yaml
prometheus "k8s" replaced
Note:
Only the following item needs to be added under spec:
  secrets:
  - etcd-certs
c. Create the Service, Endpoints, and ServiceMonitor
# vim manifests/prometheus/prometheus-etcd.yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.168.100.102
    nodeName: etcd1
  - ip: 192.168.100.103
    nodeName: etcd2
  - ip: 192.168.100.104
    nodeName: etcd3
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.pem
      certFile: /etc/prometheus/secrets/etcd-certs/etcd.pem
      keyFile: /etc/prometheus/secrets/etcd-certs/etcd-key.pem
      #use insecureSkipVerify only if you cannot use a Subject Alternative Name
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - monitoring
# kubectl -n monitoring create -f manifests/prometheus/prometheus-etcd.yaml
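Note 1: change the etcd IP addresses and node names to the ones in your own environment. The manifests carry no namespace of their own, so they are created in monitoring, the namespace the ServiceMonitor's namespaceSelector matches.
Note 2: of the three entries under tlsConfig, only the trailing file names (ca.pem, etcd.pem, etcd-key.pem) need to be changed to match your own certificates. If you are unsure of the names, log in to the prometheus-k8s pod and check: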
# kubectl exec -ti -n monitoring prometheus-k8s-0 /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.pem        etcd-key.pem  etcd.pem
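Before relying on the ServiceMonitor, it can help to confirm from a host that holds the certificates that etcd actually serves metrics over TLS with them. The following is a minimal sketch under the assumption that etcd exposes /metrics on its client port 2379, as the manifest above expects. The function names etcd_metrics_url and probe_etcd_metrics are hypothetical, and the host and certificate directory must be replaced with your own.

```shell
# etcd_metrics_url HOST [PORT]: URL of etcd's metrics endpoint on the client port.
etcd_metrics_url() {
  echo "https://$1:${2:-2379}/metrics"
}

# probe_etcd_metrics HOST CERT_DIR: fetch the first lines of /metrics using the
# same CA, client certificate, and key that were placed in the etcd-certs secret.
probe_etcd_metrics() {
  host="$1"
  cert_dir="$2"
  curl -sS --cacert "$cert_dir/ca.pem" \
       --cert "$cert_dir/etcd.pem" \
       --key "$cert_dir/etcd-key.pem" \
       "$(etcd_metrics_url "$host")" | head -n 5
}
```

For example, probe_etcd_metrics 192.168.100.102 /etc/kubernetes/ssl. If curl rejects the server certificate because the address is not among its Subject Alternative Names, that is the situation the insecureSkipVerify comment in the ServiceMonitor refers to.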
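3. Once Prometheus Operator is deployed, three ports are exposed externally: 30900 for Prometheus, 30902 for Grafana, and 30903 for Alertmanager.
Prometheus looks as follows; if everything is working, all targets should be up.
Alertmanager looks as follows
The Grafana monitoring items look as follows
The etcd monitoring items look as follows
The Kubernetes cluster view looks as follows
The node monitoring view looks as follows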
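As a command-line complement to the views above, the "all targets should be up" check can also be made against the Prometheus HTTP API through the 30900 NodePort. This is a sketch only: prom_query_url and list_down_targets are hypothetical helper names, and the host argument is assumed to be the address of any cluster node.

```shell
# prom_query_url HOST PORT: URL of the Prometheus instant-query endpoint.
prom_query_url() {
  echo "http://$1:$2/api/v1/query"
}

# list_down_targets HOST [PORT]: ask Prometheus for every series where up == 0.
# Prints the raw JSON response; an empty "result" array means no target is down.
list_down_targets() {
  host="$1"
  port="${2:-30900}"
  curl -sS -G "$(prom_query_url "$host" "$port")" \
       --data-urlencode 'query=up == 0'
}
```

For example, list_down_targets 192.168.100.102 queries Prometheus through that node's NodePort.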