In this example, the Prometheus Server outside the Kubernetes cluster is at 172.23.1.12.
Create the service-discovery account prometheus in the monitoring namespace and grant it permissions.
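If the monitoring namespace does not exist yet, create it first (skip this step if your cluster already has it):
kubectl create namespace monitoring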
This file creates a ServiceAccount named prometheus in the Kubernetes cluster, which Prometheus will later use to access the Kubernetes API and pull monitoring data. It also creates the corresponding authorization objects: a ClusterRole named prometheus that grants get, list and watch on the nodes, nodes/proxy, services, endpoints and pods resources of the core API group, and on the ingresses resource in the extensions API group; it additionally grants get on configmaps, on nodes/metrics, and on the non-resource URL /metrics. Finally, the prometheus ClusterRole is bound to the prometheus ServiceAccount created above: the ClusterRoleBinding ties the ServiceAccount and the ClusterRole together, so the ServiceAccount acquires the permissions defined in the ClusterRole.
root@deploy:/yaml/promethrus-case# cat case4-prom-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: monitoring-token
  namespace: monitoring
  annotations:
    kubernetes.io/service-account.name: "prometheus"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
#apiVersion: rbac.authorization.k8s.io/v1beta1
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
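Apply the manifest, then optionally confirm the granted read access by impersonating the ServiceAccount (the list-nodes check below is just one example of the verbs granted above):
kubectl apply -f case4-prom-rbac.yaml
kubectl auth can-i list nodes --as=system:serviceaccount:monitoring:prometheus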
root@deploy:/yaml/promethrus-case# kubectl get secrets -n monitoring
NAME TYPE DATA AGE
monitoring-token kubernetes.io/service-account-token 3 14m
root@deploy:/yaml/promethrus-case# kubectl describe secrets monitoring-token -n monitoring
Name: monitoring-token
Namespace: monitoring
Labels: <none>
Annotations: kubernetes.io/service-account.name: prometheus
kubernetes.io/service-account.uid: cff3f380-8d51-4d25-a71b-1d6d5cf1a39c
Type: kubernetes.io/service-account-token
Data
====
namespace: 10 bytes
token: eyJhbGciOiJSUzI1NiIsImtpZCI6Ik4zUTREdWdUMUp5Wk9KSmczbnBFdUk3eXVHYW53THRQVFpsSzhsbVcyS2MifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtb25pdG9yaW5nIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im1vbml0b3JpbmctdG9rZW4iLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImNmZjNmMzgwLThkNTEtNGQyNS1hNzFiLTFkNmQ1Y2YxYTM5YyIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDptb25pdG9yaW5nOnByb21ldGhldXMifQ.g8SAI74UbRs7wUN-2xKWsO_G_3grvBjXCSsGk2_7Te5W2No0jXkD4g57ofWFnLYI7QKQE9XfiE2cn3X0Rq8RJdqBQrZWXBc1jubDViv71ktDGeHtooJFeul4v9IXn5y2wowhl3VLGDEtMyXTb7bk8E6Q5akTupsJ_aw_DtAsuLiVEX51Ldl8FBrXXB453xyCyKWgcSv5dW5J7BJ4wrWZHAIaYXx7QNmF88wennsx5RXeTZ41o378zSfTc0yVKUbSggU-9_kkROdESKbqwGG7zhaWGvOA_OHaKI9ULfMr-Q-Uqw5BMJEs313m_fU4lozHNcSVU9AJexTqn1toW06j3w
ca.crt: 1302 bytes
Save the token to the file k8s.token on the Prometheus Server node; it will be used for authentication later.
root@prometheus-server:~# vim /apps/prometheus/k8s.token
eyJhbGciOiJSUzI1NiIsImtpZCI6Ik4zUTREdWdUMUp5Wk9KSmczbnBFdUk3eXVHYW53THRQVFpsSzhsbVcyS2MifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtb25pdG9yaW5nIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im1vbml0b3JpbmctdG9rZW4iLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImNmZjNmMzgwLThkNTEtNGQyNS1hNzFiLTFkNmQ1Y2YxYTM5YyIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDptb25pdG9yaW5nOnByb21ldGhldXMifQ.g8SAI74UbRs7wUN-2xKWsO_G_3grvBjXCSsGk2_7Te5W2No0jXkD4g57ofWFnLYI7QKQE9XfiE2cn3X0Rq8RJdqBQrZWXBc1jubDViv71ktDGeHtooJFeul4v9IXn5y2wowhl3VLGDEtMyXTb7bk8E6Q5akTupsJ_aw_DtAsuLiVEX51Ldl8FBrXXB453xyCyKWgcSv5dW5J7BJ4wrWZHAIaYXx7QNmF88wennsx5RXeTZ41o378zSfTc0yVKUbSggU-9_kkROdESKbqwGG7zhaWGvOA_OHaKI9ULfMr-Q-Uqw5BMJEs313m_fU4lozHNcSVU9AJexTqn1toW06j3w
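Optionally, verify the token before wiring it into Prometheus by calling the API server directly (https://172.23.0.11:6443 is the API server address used in the scrape configuration below; -k skips certificate verification, matching insecure_skip_verify):
curl -sk -H "Authorization: Bearer $(cat /apps/prometheus/k8s.token)" https://172.23.0.11:6443/api/v1/nodes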
Add the following job configuration to the Prometheus configuration file /apps/prometheus/prometheus.yml.
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]

  # API server discovery
  - job_name: "kubernetes-apiserver"
    kubernetes_sd_configs:
      - role: endpoints
        api_server: https://172.23.0.11:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /apps/prometheus/k8s.token
    scheme: https
    # tls_config:
    #   insecure_skip_verify: true
    # bearer_token_file: /apps/prometheus/k8s.token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
      # Custom relabeling: rewrite the discovered port, scheme, etc.
      - source_labels: [__address__]
        regex: '(.*):6443'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - source_labels: [__scheme__]
        regex: https
        replacement: http
        target_label: __scheme__
        action: replace

  # Node discovery
  - job_name: 'kubernetes-node-monitor'
    scheme: http
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /apps/prometheus/k8s.token
    kubernetes_sd_configs:
      - role: node
        api_server: https://172.23.0.11:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /apps/prometheus/k8s.token
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
        regex: '(.*)'
        replacement: '${1}'
        action: replace
        target_label: LOC
      - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
        regex: '(.*)'
        replacement: 'NODE'
        action: replace
        target_label: Type
      - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
        regex: '(.*)'
        replacement: 'k8s-test'
        action: replace
        target_label: Env
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # All Pods in the specified namespaces
  - job_name: 'k8s-发现指定namespace的所有Pod'
    kubernetes_sd_configs:
      - role: pod
        api_server: https://172.23.0.11:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /apps/prometheus/k8s.token
        namespaces:
          names:
            - magedu
            - monitoring
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Pod discovery with filter conditions
  - job_name: 'k8s-指定发现条件的Pod'
    kubernetes_sd_configs:
      - role: pod
        api_server: https://172.23.0.11:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /apps/prometheus/k8s.token
    relabel_configs:
      # source_labels reads the Pod annotation __meta_kubernetes_pod_annotation_prometheus_io_scrape,
      # a boolean string indicating whether this Pod should be scraped by Prometheus. If the value is
      # "true" the Pod's metrics are collected; otherwise the Pod is dropped. relabel_configs rules are
      # evaluated in order, so if this keep rule matches no Pods, the rules that follow have nothing to act on.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_label_pod_template_hash]
        regex: '(.*)'
        replacement: 'k8s-test'
        action: replace
        target_label: Env
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: pod_ip
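For reference, the last job above only keeps Pods that carry the prometheus.io/scrape annotation. A minimal Pod sketch showing the three annotations the relabel rules read (the Pod name, image and port are hypothetical, for illustration only):
apiVersion: v1
kind: Pod
metadata:
  name: demo-metrics-app              # hypothetical name
  namespace: magedu
  annotations:
    prometheus.io/scrape: "true"      # kept by the keep rule (regex: true)
    prometheus.io/path: "/metrics"    # copied into __metrics_path__
    prometheus.io/port: "8080"        # merged into __address__ as <pod_ip>:8080
spec:
  containers:
    - name: app
      image: example/metrics-app:latest   # hypothetical image
      ports:
        - containerPort: 8080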
Reload the configuration. This can be done from any node, as long as it can reach Prometheus over the network.
The prerequisite is that the --web.enable-lifecycle flag has been added in /etc/systemd/system/prometheus.service.
curl -X POST http://172.23.1.12:9090/-/reload
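For reference, a sketch of how the lifecycle flag sits in the unit's ExecStart line (the binary and config paths are assumed to match this environment):
# /etc/systemd/system/prometheus.service (excerpt, sketch)
ExecStart=/apps/prometheus/prometheus \
  --config.file=/apps/prometheus/prometheus.yml \
  --web.enable-lifecycle
After editing the unit file, run systemctl daemon-reload and restart prometheus so the flag takes effect.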
The lab environment from this point on: the Prometheus Server outside the k8s cluster (172.23.1.12).
1. Prometheus configuration file
root@prometheus-server:/apps/prometheus# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "prometheus-k8s-node"
    static_configs:
      - targets: ["172.23.0.20:9100","172.23.0.21:9100","172.23.0.11:9100"]

  - job_name: "prometheus-worknode"
    static_configs:
      - targets: ["172.23.1.11:9100","172.23.1.13:9100"]

  - job_name: "prometheus-cadvisor"
    static_configs:
      - targets: ["172.23.0.10:8080","172.23.0.20:8080","172.23.0.21:8080"]
Official site: Consul by HashiCorp
Consul is a distributed key/value data store cluster, nowadays commonly used for service registration and discovery.
Binary releases: Consul Versions | HashiCorp Releases
Environment: node01 192.168.0.122, node02 192.168.0.123, node03 192.168.0.124; node01 acts as the cluster Leader.
# Install consul on all nodes
root@consul-node01:/usr/local/src# unzip consul_1.15.1_linux_amd64.zip
Archive: consul_1.15.1_linux_amd64.zip
inflating: consul
root@consul-node01:/usr/local/src# cp consul /usr/local/bin/
root@consul-node01:/usr/local/src# consul -v
Consul v1.15.1
Revision 7c04b6a0
Build Date 2023-03-07T20:35:33Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# Create the data directory
root@consul-node01:/usr/local/src# mkdir -p /data/consul
# Parameters
consul agent -server  # run consul in server mode
-bootstrap            # bootstrap mode for the initial deployment
-bind                 # listen address for cluster communication
-client               # listen address for client access
-data-dir             # directory where data is stored
-ui                   # enable the built-in static web UI server, i.e. the web interface you can log in to
-node                 # name of this node, must be unique within the cluster
-datacenter=dc1       # datacenter name, defaults to dc1
-join                 # join an existing consul environment
Start the services
# node01
root@consul-node01:~# nohup consul agent -server -bootstrap -bind=192.168.0.122 -client=192.168.0.122 -data-dir=/data/consul -ui -node=192.168.0.122 &
[1] 3114
# Join node02 to the cluster
root@consul-node02:~# nohup consul agent -bind=192.168.0.123 -client=192.168.0.123 -data-dir=/data/consul -node=192.168.0.123 -join=192.168.0.122 &
[1] 31855
# Join node03 to the cluster
root@consul-node03:~# nohup consul agent -bind=192.168.0.124 -client=192.168.0.124 -data-dir=/data/consul -node=192.168.0.124 -join=192.168.0.122 &
[1] 32195
View the logs: nohup.out is written to whatever directory you ran the command from.
root@consul-node01:~# tail -f nohup.out
2023-03-09T15:18:48.578+0800 [INFO] agent.server.serf.lan: serf: EventMemberJoin: 192.168.0.123 192.168.0.123
2023-03-09T15:18:48.579+0800 [INFO] agent.server: member joined, marking health alive: member=192.168.0.123 partition=default
2023-03-09T15:19:26.034+0800 [INFO] agent.server.serf.lan: serf: EventMemberJoin: 192.168.0.124 192.168.0.124
2023-03-09T15:19:26.035+0800 [INFO] agent.server: member joined, marking health alive: member=192.168.0.124 partition=default
Log in to the web UI to check: http://192.168.0.122:8500/
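Cluster membership can also be checked from the CLI; -http-addr points the consul client at node01's client address configured above:
consul members -http-addr=http://192.168.0.122:8500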
Register the services into Consul's Services by writing data through its API. The prerequisite is that node-exporter is running, so first deploy node-exporter on the three nodes.
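As a rough sketch of the node-exporter deployment on each node (the release version, download URL and install path below are assumptions, not taken from this environment; adjust to whatever you actually use):
# Run on node01/node02/node03; v1.5.0 and /usr/local are assumed
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xzf node_exporter-1.5.0.linux-amd64.tar.gz -C /usr/local/
nohup /usr/local/node_exporter-1.5.0.linux-amd64/node_exporter --web.listen-address=":9100" &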
# This can be run on any server, as long as it can reach the Leader in this example.
# Register node01 and node02 first
curl -X PUT -d '{"id": "node-exporter122","name": "node-exporter122","address": "192.168.0.122","port": 9100,"tags": ["node-exporter"],"checks": [{"http": "http://192.168.0.122:9100/","interval": "5s"}]}' http://192.168.0.122:8500/v1/agent/service/register
curl -X PUT -d '{"id": "node-exporter123","name": "node-exporter123","address": "192.168.0.123","port": 9100,"tags": ["node-exporter"],"checks": [{"http": "http://192.168.0.123:9100/","interval": "5s"}]}' http://192.168.0.122:8500/v1/agent/service/register
How do you delete a registered service?
# This removes the service named "node-exporter122" from Consul. Note that the removal is permanent and cannot be undone; if you are not sure, back up your data first.
curl -X PUT http://192.168.0.122:8500/v1/agent/service/deregister/node-exporter122
Key fields of a Consul-based scrape job:
static_configs:     # statically configured data sources
consul_sd_configs:  # Consul-based service discovery
relabel_configs:    # relabeling
services: []        # match all services registered in Consul
Edit the Prometheus ConfigMap file case3-1-prometheus-cfg.yaml and append the configuration below.
- job_name: 'consul'
  honor_labels: true
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
    - server: 192.168.0.122:8500
      # Target service names to discover; an empty list means all services. Specific service
      # names can also be listed (e.g. node-exporter122 and node-exporter123 in this example).
      services: []
    - server: 192.168.0.123:8500
      services: []
    - server: 192.168.0.124:8500
      services: []
  relabel_configs:
    - source_labels: [