Table of Contents
Background
Environment preparation
1. Disk preparation
2. Disk partitioning and formatting
Local storage deployment
1. Label the node
2. Create the local PV StorageClass and prometheus-pv
Prometheus-stack deployment
1. Download the helm chart package
2. values.yaml parameter walkthrough
3. Deploy prometheus-stack
4. Check the deployment
When a k8s cluster's Prometheus monitoring data shares a single NFS (Network File System) with business data, the following problems can appear:
Business impact: business data and monitoring data should be kept isolated. In principle we can tolerate losing monitoring data, but business data must never be lost.
Read/write performance: business services and the monitoring system mount files or directories shared over NFS; when both perform heavy reads and writes at the same time, they interfere with each other.
Stability: NFS is demanding on the network environment; if the network is unstable, the file share easily breaks down.
Storage space: Prometheus does have a retention mechanism for monitoring data, but it only expires data past the retention window. A sudden flood of monitoring data can consume a large share of the NFS capacity and, in the extreme case, fill it up completely.
NFS scaling: NFS scales poorly; expanding it requires manual configuration and the procedure is cumbersome.
You need a normally running cluster, preferably version >= 1.21; versions below 1.21 may hit compatibility problems. The kube-prometheus compatibility matrix:
kube-prometheus stack | Kubernetes 1.21 | Kubernetes 1.22 | Kubernetes 1.23 | Kubernetes 1.24 | Kubernetes 1.25 | Kubernetes 1.26 | Kubernetes 1.27
release-0.9           | ✔               | ✔               | ✗               | ✗               | ✗               | ✗               | ✗
release-0.10          | ✗               | ✔               | ✔               | ✗               | ✗               | ✗               | ✗
release-0.11          | ✗               | ✗               | ✔               | ✔               | ✗               | ✗               | ✗
release-0.12          | ✗               | ✗               | ✗               | ✔               | ✔               | ✗               | ✗
main                  | ✗               | ✗               | ✗               | ✗               | ✗               | ✔               | ✔
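Before picking a release from the matrix, confirm the server version of your cluster. A quick check (note: on kubectl 1.28+ the --short flag was removed and plain kubectl version prints the same information):

# print client and server versions; the Server Version line is what the matrix refers to
kubectl version --short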
Pick one node in the cluster and attach a dedicated disk to it. Ideally the disk sits on an array such as RAID 50 to improve fault tolerance.
# create a partition table first (required on a blank disk), then give all of sdb to a single partition
parted /dev/sdb mklabel gpt
parted /dev/sdb mkpart primary 0% 100%
# write the file system
mkfs -t ext4 /dev/sdb1
# get the partition UUID, used in fstab for automatic mounting at boot
blkid /dev/sdb1
# create the mount point
mkdir -p /monitoring
# append the mount to fstab (substitute the UUID reported by blkid), then verify the entry
echo "/dev/disk/by-uuid/93a76705-814a-4a5e-85f0-88fe03d7837c /monitoring ext4 defaults 0 1" >> /etc/fstab
cat /etc/fstab | grep monitoring
/dev/disk/by-uuid/93a76705-814a-4a5e-85f0-88fe03d7837c /monitoring ext4 defaults 0 1
# mount everything listed in fstab
mount -a
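A quick verification that the mount is active (the capacity figures will depend on your disk):

# confirm sdb1 is mounted at /monitoring with the expected size
lsblk /dev/sdb
df -h /monitoring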
Label the node chosen above so that both the PV and the Prometheus pod can be pinned to it:
kubectl label node node156 prometheus=deploy
cd /home/sunwenbo/local-pv
kubectl apply -f local-pv-storage.yaml
kubectl apply -f local-pv.yaml
local-pv-storage.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
#reclaimPolicy: Retain        # note: left unset here; for a statically created local PV the reclaim policy is set on the PV itself (see local-pv.yaml)
#volumeBindingMode: Immediate # note: the no-provisioner cannot create PVs dynamically, so binding has to wait for the first consumer
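One side effect of WaitForFirstConsumer worth knowing: a PVC referencing this class sits in Pending, without errors, until a pod that mounts it is actually scheduled; only then does binding to the PV happen. You can watch this during the install later:

# a PVC on local-storage stays Pending until its consuming pod is scheduled
kubectl get pvc -n kube-prometheus -w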
local-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv
spec:
  capacity:
    storage: 200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  #persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /monitoring/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: prometheus
              operator: In
              values:
                - "deploy"
To explain: remember the node-labeling step above? The nodeAffinity configured here pins the PV to the designated node by matching that label.
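Two sanity checks are worth running before relying on this PV. The second matters because a local PV does not create its backing directory; /monitoring/prometheus must already exist on the node:

# confirm the label landed on the intended node
kubectl get nodes -l prometheus=deploy
# on node156: create the directory the PV points at
mkdir -p /monitoring/prometheus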
Check the StorageClass
root@master01:/home/sunwenbo/local-pv# kubectl get storageclasses.storage.k8s.io
NAME                   PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-storage          kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  17h
nfs-016                nfs.csi.k8s.io                 Retain          Immediate              false                  59d
nfs-018                nfs.csi.k8s.io                 Retain          Immediate              false                  44d
nfs-retain (default)   nfs.csi.k8s.io                 Retain          Immediate              false                  62d
Check the PV
Note: a freshly created PV's status is Available, because no PVC exists yet. The output below was captured after my deployment: you can see that prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 is bound to prometheus-pv. Where that PVC comes from is explained below.
root@master01:/home/sunwenbo/local-pv# kubectl get pv | grep prometheus
prometheus-pv   200Gi   RWO   Retain   Bound   kube-prometheus/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0   local-storage   23m
wget https://github.com/prometheus-community/helm-charts/releases/download/kube-prometheus-stack-45.27.2/kube-prometheus-stack-45.27.2.tgz
tar xf kube-prometheus-stack-45.27.2.tgz
cd kube-prometheus-stack
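Alternatively, the same chart version can be fetched through the Helm repository (assuming outbound access to the prometheus-community repo):

# add the upstream repo and pull the identical chart version
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack --version 45.27.2 --untar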
The modified parts of values.yaml are as follows.
# alertmanager persistence: NFS-backed, 4Gi of space
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-retain
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 4Gi
# grafana persistent storage, environment variables, and plugins
grafana:
  enabled: true
  namespaceOverride: ""
  forceDeployDatasources: false
  persistence:
    type: pvc
    enabled: true
    storageClassName: nfs-retain
    accessModes:
      - ReadWriteOnce
    size: 2Gi
    finalizers:
      - kubernetes.io/pvc-protection
  env:
    GF_AUTH_ANONYMOUS_ENABLED: "true"
    GF_AUTH_ANONYMOUS_ORG_NAME: "Main Org."
    GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
  plugins:
    - grafana-worldmap-panel
    - grafana-piechart-panel
  # grafana service exposure
  service:
    portName: http-web
    port: 30080
    externalIPs: ["10.1.2.15"]
prometheus:
  # expose prometheus via an external IP
  service:
    externalIPs: ["10.1.2.15"]
  # note: in this chart, prometheus settings live under prometheusSpec
  prometheusSpec:
    # keep monitoring data for 15 days
    retention: 15d
    # schedule the prometheus pod with node affinity matching the label set earlier
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: prometheus
                  operator: In
                  values:
                    - deploy
    # cpu/memory requests and limits for prometheus
    resources:
      requests:
        memory: 10Gi
        cpu: 10
      limits:
        memory: 50Gi
        cpu: 10
    # prometheus data persisted via local-storage
    storageSpec:
      ## Using PersistentVolumeClaim
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi
    # add a gpu-metrics scrape job
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - nvidia-device-plugin
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node
The full values.yaml has been uploaded to CSDN and can be downloaded without points: https://download.csdn.net/download/weixin_43798031/88046678
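Before installing, the merged values can be sanity-checked by rendering the chart locally; helm lint and helm template only validate and render the templates, they do not touch the cluster:

# validate the chart against the overrides, then render to make sure everything parses
helm lint . -f values.yaml
helm template kube-prometheus-stack . -f values.yaml -n kube-prometheus > /dev/null && echo "render OK"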
helm upgrade -i kube-prometheus-stack -f values.yaml . -n kube-prometheus
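One caveat: helm upgrade -i does not create the target namespace on a first install. If kube-prometheus does not exist yet, create it beforehand or add the --create-namespace flag:

# either pre-create the namespace...
kubectl create namespace kube-prometheus
# ...or let helm create it on first install
helm upgrade -i kube-prometheus-stack -f values.yaml . -n kube-prometheus --create-namespace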
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get deployments.apps -n kube-prometheus
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
kube-prometheus-stack-grafana              1/1     1            1           123m
kube-prometheus-stack-kube-state-metrics   1/1     1            1           123m
kube-prometheus-stack-operator             1/1     1            1           123m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get daemonsets.apps -n kube-prometheus
NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-prometheus-stack-prometheus-node-exporter   148       148       148     148          148         <none>          123m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get statefulsets.apps -n kube-prometheus
NAME                                              READY   AGE
alertmanager-kube-prometheus-stack-alertmanager   1/1     123m
prometheus-kube-prometheus-stack-prometheus       1/1     123m
Services
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get svc -n kube-prometheus
NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                            ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   123m
kube-prometheus-stack-alertmanager               ClusterIP   10.111.20.147    <none>        9093/TCP                     123m
kube-prometheus-stack-grafana                    ClusterIP   10.104.171.223   10.1.2.15     30080/TCP                    123m
kube-prometheus-stack-kube-state-metrics         ClusterIP   10.107.110.116   <none>        8080/TCP                     123m
kube-prometheus-stack-operator                   ClusterIP   10.107.180.72    <none>        443/TCP                      123m
kube-prometheus-stack-prometheus                 ClusterIP   10.102.115.147   10.1.2.15     9090/TCP                     123m
kube-prometheus-stack-prometheus-export          ClusterIP   10.109.169.13    10.1.2.15     30081/TCP                    3d5h
kube-prometheus-stack-prometheus-node-exporter   ClusterIP   10.101.152.90    <none>        9100/TCP                     123m
prometheus-operated                              ClusterIP   None             <none>        9090/TCP                     123m
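With the externalIPs set, both UIs should answer on the node that holds 10.1.2.15; their built-in health endpoints make for a quick check (adjust the IP if yours differs):

# prometheus liveness endpoint
curl -s http://10.1.2.15:9090/-/healthy
# grafana health endpoint
curl -s http://10.1.2.15:30080/api/health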
PV and PVC
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get pv | grep prometh
prometheus-pv                              200Gi   RWO   Retain   Bound   kube-prometheus/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0   local-storage   127m
pvc-43823533-9a35-4ace-b0a3-5853e3b4099e   4Gi     RWO   Retain   Bound   kube-prometheus/alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0   nfs-retain   60d
pvc-cef3dd98-7090-47ac-8cec-c52c78e9237f   2Gi     RWO   Retain   Bound   kube-prometheus/kube-prometheus-stack-grafana   nfs-retain   129m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get pvc -n kube-prometheus
NAME                                                                                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0    Bound    pvc-43823533-9a35-4ace-b0a3-5853e3b4099e   4Gi        RWO            nfs-retain      60d
kube-prometheus-stack-grafana                                                                           Bound    pvc-cef3dd98-7090-47ac-8cec-c52c78e9237f   2Gi        RWO            nfs-retain      127m
prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0            Bound    prometheus-pv                              200Gi      RWO            local-storage   127m
To explain: the volumeClaimTemplate dynamically creates a PVC for us; since the PV was created in advance, this PVC binds to it automatically.
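To confirm the claim really landed on the pre-created local PV rather than an NFS-provisioned one, print the volume it bound to (the long PVC name is split across lines for readability):

# print the volume backing the prometheus PVC; expected output: prometheus-pv
kubectl -n kube-prometheus get pvc \
  prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 \
  -o jsonpath='{.spec.volumeName}{"\n"}'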