什么是 Prometheus
Prometheus 是由 SoundCloud 开源监控告警解决方案,从 2012 年开始编写代码,再到 2015 年 github 上开源以来,已经吸引了 9k+ 关注,以及很多大公司的使用;2016 年 Prometheus 成为继 k8s 后,第二名 CNCF成员。作为新一代开源解决方案,很多理念与 Google SRE 运维之道不谋而合。
基础架构:
image
安装prometheus:
1.创建prometheus和altertmanager(prometheus的报警模块)的configmap文件
下面用到的yaml配置文件已放到github上有需要的可以自己clone
cat prometheus-cm.yaml
apiVersion: v1
kind: Namespace
metadata:
name: kube-ops
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: kube-ops
data:
prometheus.yml: |
global:
scrape_interval: 30s
scrape_timeout: 30s
rule_files:
- /etc/prometheus/rules.yml
alerting:
alertmanagers:
- static_configs:
- targets: ["localhost:9093"]
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-cadvisor'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-node-exporter'
scheme: http
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_role]
action: replace
target_label: kubernetes_role
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:31672'
target_label: __address__
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- job_name: 'kubernetes-services'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter.example.com:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
rules.yml: |
groups:
- name: test-rule
rules:
- alert: NodeFilesystemUsage
expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Filesystem usage detected"
description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
- alert: NodeMemoryUsage
expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High CPU usage detected"
description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"
---
kind: ConfigMap
apiVersion: v1
metadata:
name: alertmanager
namespace: kube-ops
data:
config.yml: |-
global:
resolve_timeout: 5m
route:
receiver: webhook
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
group_by: [alertname]
routes:
- receiver: webhook
group_wait: 10s
match:
team: node
receivers:
- name: webhook
webhook_configs:
- url: 'http://apollo/hooks/dingtalk/'
send_resolved: true
- url: 'http://apollo/hooks/prome/'
send_resolved: true
kubectl apply -f prometheus-cm.yaml
2.部署node-exporter(在每个节点都安装,需要用daemonset模式)
vim node-exporter.yaml
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: node-exporter
namespace: kube-ops
labels:
k8s-app: node-exporter
spec:
template:
metadata:
labels:
k8s-app: node-exporter
spec:
containers:
- image: prom/node-exporter
name: node-exporter
ports:
- containerPort: 9100
protocol: TCP
name: http
---
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: node-exporter
name: node-exporter
namespace: kube-ops
spec:
ports:
- name: http
port: 9100
nodePort: 31672
protocol: TCP
type: NodePort
selector:
k8s-app: node-exporter
kubectl apply -f node-exporter.yaml
3.安装prometheus+alertmanager(注意这里的data数据盘是使用cephfs pv pvc挂载的,需要自行创建,pv pvc的模板前面的github链接中可以找到)
vim prometheus-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
labels:
k8s-app: prometheus
name: prometheus
namespace: kube-ops
spec:
replicas: 1
template:
metadata:
labels:
k8s-app: prometheus
spec:
serviceAccountName: prometheus
securityContext:
runAsUser: 0
containers:
- image: prom/prometheus:v2.0.0
name: prometheus
command:
- "/bin/prometheus"
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention=24h"
ports:
- containerPort: 9090
protocol: TCP
name: http
volumeMounts:
- mountPath: "/prometheus"
name: data
- mountPath: "/etc/prometheus"
name: config-volume
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 200m
memory: 1Gi
- image: quay.io/prometheus/alertmanager:v0.12.0
name: alertmanager
args:
- "-config.file=/etc/alertmanager/config.yml"
- "-storage.path=/alertmanager"
ports:
- containerPort: 9093
protocol: TCP
name: http
volumeMounts:
- name: alertmanager-config-volume
mountPath: /etc/alertmanager
resources:
requests:
cpu: 50m
memory: 50Mi
limits:
cpu: 200m
memory: 200Mi
volumes:
- name: data
persistentVolumeClaim:
claimName: prometheus-pvc
- configMap:
name: prometheus-config
name: config-volume
- name: alertmanager-config-volume
configMap:
name: alertmanager
kubectl apply -f prometheus-deployment.yaml
4.暴露prometheus端口(也可以使用ingress规则暴露,可自行选择,下面的grafana选择用ingress暴露)
vim prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: kube-ops
labels:
k8s-app: prometheus
spec:
selector:
k8s-app: prometheus
type: NodePort
ports:
- name: web
port: 9090
targetPort: http
kubectl apply -f prometheus-svc.yaml
5.安装grafana(prometheus自带的图形展示不行很理想,所以要搭配强大的图形展示工具grafana)
安装grafnan的时候用到了cephfs pv pvc和ingress。
vim grafana.yaml
apiVersion: v1
kind: Service
metadata:
labels:
kubernetes.io/cluster-service: 'true'
kubernetes.io/name: grafana
name: grafana
namespace: kube-ops
spec:
ports:
- port: 3000
targetPort: 3000
selector:
k8s-app: grafana
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
labels:
kubernetes.io/cluster-service: 'true'
kubernetes.io/name: grafana
name: grafana-ingress
namespace: kube-ops
spec:
rules:
- host: grafana.io
http:
paths:
- path: /
backend:
serviceName: grafana
servicePort: 3000
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: prometheus-pv
namespace: kube-ops
spec:
capacity:
storage: 50Gi
accessModes:
- ReadWriteMany
cephfs:
monitors:
- 192.168.0.231:6789
- 192.168.0.242:6789
- 192.168.0.211:6789
path: /data/system/grafana
user: admin
secretRef:
name: ceph-secret
readOnly: false
persistentVolumeReclaimPolicy: Recycle
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-pvc
namespace: kube-ops
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: grafana
namespace: kube-ops
spec:
replicas: 1
template:
metadata:
labels:
task: monitoring
k8s-app: grafana
spec:
containers:
- name: grafana
image: gcr.io/google_containers/heapster-grafana-amd64:v4.4.3
ports:
- containerPort: 3000
protocol: TCP
volumeMounts:
- mountPath: /var
name: grafana
subPath: grafana/data
- mountPath: /ssl
name: ssl
resources:
limits:
cpu: 200m
memory: 200Mi
requests:
cpu: 100m
memory: 100Mi
env:
- name: INFLUXDB_HOST
value: influxdb.kube-system
- name: GF_SERVER_HTTP_PORT
value: "3000"
- name: GF_AUTH_BASIC_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "false"
- name: GF_SERVER_ROOT_URL
value: /
- name: GF_SMTP_ENABLED
value: "true"
- name: GF_ALERTING_ENABLED
value: "true"
- name: GF_ALERTING_EXECUTE_ALERTS
value: "true"
readinessProbe:
httpGet:
path: /login
port: 3000
initialDelaySeconds: 30
timeoutSeconds: 2
volumes:
- name: ssl
hostPath:
path: /etc/ssl/certs
- name: grafana
persistentVolumeClaim:
claimName: prometheus-pvc
kubectl apply -f grafana.yaml
这里的ingress是选择 grafana.io作为域名,需要将本地的hosts解析成nodeip(没做lb的话)。
进入到以下界面,填写数据信息
image.png
随后设置dashboard页面,推荐下面的两款。
https://grafana.com/dashboards/1621
https://grafana.com/dashboards/315
这里选择使用315
效果展示如下图所示
image.png