Installing prometheus-stack with Helm and persisting data on a local PV

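Contents

Background:

Environment preparation:

1. Disk preparation

2. Disk partitioning and formatting

Deploying local storage

1. Label the node

2. Create the local PV StorageClass and prometheus-pv

Deploying prometheus-stack

1. Download the Helm chart

2. values.yaml parameters explained

3. Deploy prometheus-stack

4. Check the deployment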


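Background:

In this k8s cluster, Prometheus monitoring data and business data shared a single NFS (Network File System), which can cause the following problems:

  • Impact on business: business data and monitoring data should be isolated. In principle we can tolerate losing monitoring data, but business data must never be lost.

  • Read/write performance: business services and the monitoring system mount files or directories from the same NFS share, so when both do heavy reads and writes at the same time they interfere with each other.

  • Stability: NFS depends heavily on the network. If the network is unstable, file sharing easily fails.

  • Storage space: Prometheus does reclaim old monitoring data, but only by retention period. A sudden burst of monitoring data can take up a lot of NFS space and, in the extreme case, fill the NFS completely.

  • NFS expansion: NFS scales poorly. Expanding it requires manual configuration, which is tedious.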

Environment preparation:

A healthy, running cluster, preferably version >= 1.21. Versions below 1.21 may have compatibility problems.

The kube-prometheus compatibility matrix lists release-0.9, release-0.10, release-0.11, release-0.12, and main against Kubernetes 1.21 through 1.27. Check the upstream matrix for the exact pairing of release and cluster version.
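The version requirement above can be checked with a small script before going any further. This is a minimal sketch, not part of the original setup: the helper name k8s_version_ok is hypothetical, and the 1.21 floor is simply the minimum recommended above.

```shell
# Hypothetical helper: succeeds when a Kubernetes version string such as
# "v1.24.3" is at least the given minimum (default 1.21).
k8s_version_ok() {
  version="${1#v}"              # strip a leading "v"
  min="${2:-1.21}"
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  minor="${minor%%[!0-9]*}"     # drop suffixes such as "+" in "21+"
  min_major="${min%%.*}"
  min_minor="${min#*.}"
  if [ "$major" -gt "$min_major" ]; then
    return 0
  fi
  [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]
}

# Example: read the kubelet version of the first node and check it.
#   k8s_version_ok "$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')" \
#     && echo "cluster version is new enough"
```

The check only compares major and minor numbers, which is all the recommendation above depends on.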

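1. Disk preparation

Pick one node in the cluster and attach a dedicated disk to it. Ideally back the disk with a RAID array such as RAID 50 to improve fault tolerance.

2. Disk partitioning and formatting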

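# If the disk has no partition table yet, create one first (this wipes /dev/sdb)
parted -s /dev/sdb mklabel gpt

# Give all of sdb's space to a single partition
parted -s /dev/sdb mkpart primary 0% 100%

# Create the filesystem
mkfs -t ext4 /dev/sdb1

# Get the partition's UUID, used in fstab to mount automatically at boot
blkid /dev/sdb1

# Create the mount point
mkdir -p /monitoring

# Add an entry for the partition to /etc/fstab, then check it
cat /etc/fstab | grep monitoring
/dev/disk/by-uuid/93a76705-814a-4a5e-85f0-88fe03d7837c /monitoring ext4 defaults 0 1

# Mount
mount -a

# Create the directory the local PV will point at (local volumes are not created automatically)
mkdir -p /monitoring/prometheus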
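The /etc/fstab entry above can also be generated instead of typed by hand. The following is a minimal sketch under that assumption: the helper name fstab_entry is hypothetical, and it simply reproduces the entry format used above (by-uuid device path, defaults, 0 1).

```shell
# Hypothetical helper: print an fstab line for a filesystem UUID.
# Arguments: UUID, mount point, filesystem type (defaults to ext4).
fstab_entry() {
  uuid="$1"
  mount_point="$2"
  fs_type="${3:-ext4}"
  printf '/dev/disk/by-uuid/%s %s %s defaults 0 1\n' "$uuid" "$mount_point" "$fs_type"
}

# Example (run as root); "blkid -s UUID -o value" prints only the UUID:
#   fstab_entry "$(blkid -s UUID -o value /dev/sdb1)" /monitoring >> /etc/fstab
```

Mounting by UUID rather than by /dev/sdb1 keeps the entry valid even if device names change after a reboot.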

Deploying local storage

1. Label the node

kubectl label node node156 prometheus=deploy

2. Create the local PV StorageClass and prometheus-pv

cd /home/sunwenbo/local-pv 
kubectl apply -f local-pv-storage.yaml 
kubectl apply -f local-pv.yaml

local-pv-storage.yaml

kind: StorageClass                                                                                                                                                                                                                         
apiVersion: storage.k8s.io/v1                                                                                                                                                                                                              
metadata:                                                                                                                                                                                                                                  
  name: local-storage                                                                                                                                                                                                                      
provisioner: kubernetes.io/no-provisioner                                                                                                                                                                                                  
volumeBindingMode: WaitForFirstConsumer  
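# reclaimPolicy: Retain           Note: not set here; it only applies to dynamically provisioned PVs, and the manually created PV carries its own persistentVolumeReclaimPolicy
# volumeBindingMode: Immediate    Note: not used; local PVs cannot be provisioned dynamically, and WaitForFirstConsumer delays binding until the pod is scheduled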

local-pv.yaml

apiVersion: v1                                                                                                                                                                                                                             
kind: PersistentVolume                                                                                                                                                                                                                     
metadata:                                                                                                                                                                                                                                  
  name: prometheus-pv                                                                                                                                                                                                                      
spec:                                                                                                                                                                                                                                      
  capacity:                                                                                                                                                                                                                                
    storage: 200Gi                                                                                                                                                                                                                         
  volumeMode: Filesystem                                                                                                                                                                                                                   
  accessModes:                                                                                                                                                                                                                             
  - ReadWriteOnce                                                                                                                                                                                                                          
  persistentVolumeReclaimPolicy: Retain                                                                                                                                                                                                    
    #persistentVolumeReclaimPolicy: Delete                                                                                                                                                                                                 
  storageClassName: local-storage                                                                                                                                                                                                          
  local:                                                                                                                                                                                                                                   
    path: /monitoring/prometheus                                                                                                                                                                                                           
  nodeAffinity:                                                                                                                                                                                                                            
    required:                                                                                                                                                                                                                              
      nodeSelectorTerms:                                                                                                                                                                                                                   
      - matchExpressions:                                                                                                                                                                                                                  
        - key: prometheus                                                                                                                                                                                                                  
          operator: In                                                                                                                                                                                                                     
          values:                                                                                                                                                                                                                          
          - "deploy"    

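A quick explanation: remember the labeling step above? The nodeAffinity here matches that label, which tells Kubernetes that this PV lives on the labeled node, so any pod using it is scheduled there.

Check the StorageClass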

root@master01:/home/sunwenbo/local-pv# kubectl  get storageclasses.storage.k8s.io                                                                                                                                                          
NAME                   PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE                                                                                                                    
local-storage          kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  17h                                                                                                                    
nfs-016                nfs.csi.k8s.io                 Retain          Immediate              false                  59d                                                                                                                    
nfs-018                nfs.csi.k8s.io                 Retain          Immediate              false                  44d                                                                                                                    
nfs-retain (default)   nfs.csi.k8s.io                 Retain          Immediate              false                  62d   

Check the PV

Note: at this point a PV normally shows Available, because no PVC has been created yet. The output below was taken after my deployment, so prometheus-pv is already bound to the PVC prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0. Where that PVC comes from is explained further down.

root@master01:/home/sunwenbo/local-pv# kubectl  get pv | grep prometheus
prometheus-pv                              200Gi      RWO            Retain           Bound    kube-prometheus/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0           local-storage            23m

Deploying prometheus-stack

1. Download the Helm chart

wget https://github.com/prometheus-community/helm-charts/releases/download/kube-prometheus-stack-45.27.2/kube-prometheus-stack-45.27.2.tgz
tar xf kube-prometheus-stack-45.27.2.tgz
cd kube-prometheus-stack

2. values.yaml parameters explained

The modified parts are as follows:

# Alertmanager persistence: NFS-backed, 4Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-retain
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 4Gi

# Grafana persistence, environment variables, and plugins
grafana:
  enabled: true
  namespaceOverride: ""
  forceDeployDatasources: false
  persistence:
    type: pvc
    enabled: true
    storageClassName: nfs-retain
    accessModes:
      - ReadWriteOnce
    size: 2Gi
    finalizers:
      - kubernetes.io/pvc-protection
  env:
    GF_AUTH_ANONYMOUS_ENABLED: "true"
    GF_AUTH_ANONYMOUS_ORG_NAME: "Main Org."
    GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
  plugins:
    - grafana-worldmap-panel
    - grafana-piechart-panel
  # Expose the Grafana service
  service:
    portName: http-web
    port: 30080
    externalIPs: ["10.1.2.15"]

prometheus:
  # Expose Prometheus on an external IP
  service:
    externalIPs: ["10.1.2.15"]

  # The settings below live under prometheusSpec in the chart's values.yaml
  prometheusSpec:
    # Keep monitoring data for 15 days
    retention: 15d

    # Schedule Prometheus onto the labeled node with node affinity
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: prometheus
              operator: In
              values:
              - deploy

    # CPU and memory requests and limits for Prometheus
    resources:
      requests:
        memory: 10Gi
        cpu: 10
      limits:
        memory: 50Gi
        cpu: 10

    # Persist Prometheus data on local-storage
    storageSpec:
      ## Using PersistentVolumeClaim
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi

    # Extra scrape job for GPU metrics
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - nvidia-device-plugin
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node
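Whether a 15d retention fits into the 200Gi volume depends on how many samples Prometheus ingests. As a rough sketch, the usual sizing rule is retention time in seconds times ingested samples per second times bytes per sample. The helper name prometheus_disk_gib and the ingestion rate in the example are assumptions for illustration, not values from this cluster.

```shell
# Rough disk estimate for a Prometheus retention window, in GiB (rounded up).
# Arguments: retention in days, ingested samples per second, bytes per sample.
# The default of 2 bytes per sample is an assumption at the upper end of the
# 1-2 bytes that Prometheus typically needs per sample.
prometheus_disk_gib() {
  days="$1"
  samples_per_second="$2"
  bytes_per_sample="${3:-2}"
  gib=1073741824
  bytes=$(( days * 86400 * samples_per_second * bytes_per_sample ))
  echo $(( (bytes + gib - 1) / gib ))
}

# Hypothetical ingestion rate of 50,000 samples per second, 15 days retention
prometheus_disk_gib 15 50000   # prints 121
```

On a running Prometheus the real ingestion rate can be read with the query rate(prometheus_tsdb_head_samples_appended_total[5m]). A job scraped every second, like gpu-metrics above, raises that rate considerably compared with the default interval.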

The complete values.yaml has been uploaded to CSDN and can be downloaded without points:

https://download.csdn.net/download/weixin_43798031/88046678

3. Deploy prometheus-stack

helm upgrade -i kube-prometheus-stack -f values.yaml . -n kube-prometheus

4. Check the deployment

root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl  get deployments.apps  -n kube-prometheus                                                                                                                                      
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE                                                                                                                                                            
kube-prometheus-stack-grafana              1/1     1            1           123m                                                                                                                                                           
kube-prometheus-stack-kube-state-metrics   1/1     1            1           123m                                                                                                                                                           
kube-prometheus-stack-operator             1/1     1            1           123m                                                                                                                                                           
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl  get daemonsets.apps -n kube-prometheus                                                                                                                                        
NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE                                                                                                                  
kube-prometheus-stack-prometheus-node-exporter   148       148       148     148          148         <none>          123m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl  get statefulsets.apps  -n kube-prometheus                                                                                                                                     
NAME                                              READY   AGE                                                                                                                                                                              
alertmanager-kube-prometheus-stack-alertmanager   1/1     123m                                                                                                                                                                             
prometheus-kube-prometheus-stack-prometheus       1/1     123m   

Services

root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl  get svc -n kube-prometheus
NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                            ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   123m
kube-prometheus-stack-alertmanager               ClusterIP   10.111.20.147    <none>        9093/TCP                     123m
kube-prometheus-stack-grafana                    ClusterIP   10.104.171.223   10.1.2.15     30080/TCP                    123m
kube-prometheus-stack-kube-state-metrics         ClusterIP   10.107.110.116   <none>        8080/TCP                     123m
kube-prometheus-stack-operator                   ClusterIP   10.107.180.72    <none>        443/TCP                      123m
kube-prometheus-stack-prometheus                 ClusterIP   10.102.115.147   10.1.2.15     9090/TCP                     123m
kube-prometheus-stack-prometheus-export          ClusterIP   10.109.169.13    10.1.2.15     30081/TCP                    3d5h
kube-prometheus-stack-prometheus-node-exporter   ClusterIP   10.101.152.90    <none>        9100/TCP                     123m
prometheus-operated                              ClusterIP   None             <none>        9090/TCP                     123m

PV and PVC

root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl  get pv | grep prometh
prometheus-pv                              200Gi      RWO            Retain           Bound    kube-prometheus/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0           local-storage            127m
pvc-43823533-9a35-4ace-b0a3-5853e3b4099e   4Gi        RWO            Retain           Bound    kube-prometheus/alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0   nfs-retain               60d
pvc-cef3dd98-7090-47ac-8cec-c52c78e9237f   2Gi        RWO            Retain           Bound    kube-prometheus/kube-prometheus-stack-grafana                                                                          nfs-retain               129m


root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl  get pvc -n kube-prometheus                                                                                                                                                    
NAME                                                                                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE                                   
alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0   Bound    pvc-43823533-9a35-4ace-b0a3-5853e3b4099e   4Gi        RWO            nfs-retain      60d                                   
kube-prometheus-stack-grafana                                                                          Bound    pvc-cef3dd98-7090-47ac-8cec-c52c78e9237f   2Gi        RWO            nfs-retain      127m                                  
prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0           Bound    prometheus-pv                              200Gi      RWO            local-storage   127m                                  

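A quick explanation: the volumeClaimTemplate makes Kubernetes create a PVC for the Prometheus pod automatically. Because prometheus-pv was created beforehand with the same storageClassName, access mode, and capacity, the PVC binds to it once the pod is scheduled onto the labeled node.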
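The long PVC name in the output above follows a pattern: a StatefulSet names each claim after its volumeClaimTemplate, the StatefulSet itself, and the pod ordinal. The sketch below rebuilds the name under the assumption, consistent with the output above, that the operator names the StatefulSet prometheus-<name> and the template prometheus-<name>-db. The helper name prometheus_pvc_name is hypothetical.

```shell
# Hypothetical helper: derive the PVC name for a Prometheus replica.
# Arguments: name of the Prometheus custom resource, pod ordinal (default 0).
prometheus_pvc_name() {
  cr_name="$1"
  ordinal="${2:-0}"
  statefulset="prometheus-${cr_name}"   # StatefulSet name, as seen in the output above
  template="${statefulset}-db"          # volumeClaimTemplate name (assumed pattern)
  # StatefulSet PVCs are named <template>-<statefulset>-<ordinal>
  echo "${template}-${statefulset}-${ordinal}"
}

prometheus_pvc_name kube-prometheus-stack-prometheus
# prints prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0
```

Knowing the name in advance is handy when a manually created PV should be reserved for exactly this claim.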
