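Prometheus is an open-source systems monitoring and alerting toolkit that collects metrics from its targets using a pull model. Its core collection mechanisms are:
Data model: Prometheus stores all data as time series. Each time series is uniquely identified by a metric name and a set of key-value pairs called labels. A series is written as: <metric name>{<label name>="<label value>", ...}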
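To make the identifier notation concrete, the following is a minimal Python sketch that renders a metric name and a label set as a series identifier. The helper name `format_series` and the sample metric are illustrative assumptions, not part of Prometheus; the escaping of label values follows the text exposition format (backslash, double quote, newline).

```python
def format_series(name, labels):
    """Render a series identifier as name{label="value",...}."""
    def escape(value):
        # Escape backslash first so later escapes are not doubled.
        return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return name
    # Sort labels so the same label set always renders identically.
    body = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{body}}}"


print(format_series("http_requests_total", {"method": "GET", "code": "200"}))
```

Two series with the same name but different label values are distinct series, which is why every extra label value multiplies the number of series stored.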
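Metric types: counter, gauge, histogram, and summary.
Collection methods: targets can be listed statically or found through service discovery (file-based or Kubernetes), as in the configurations that follow.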
global:
  scrape_interval: 15s          # default scrape interval
  evaluation_interval: 15s      # rule evaluation interval
  external_labels:              # labels that identify this server to external systems
    monitor: 'production'

scrape_configs:
  - job_name: 'prometheus'      # Prometheus monitoring itself
    scrape_interval: 10s        # overrides the global setting
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'   # node-level metrics
    metrics_path: '/metrics'    # metrics path
    static_configs:
      - targets:
          - 'node-exporter:9100'
          - 'node-exporter2:9100'
    relabel_configs:            # label rewriting
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: 'instance'
        replacement: '$1'
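The relabel rule in the node_exporter job can be hard to picture, so here is a simplified Python model of a relabel rule with the default `replace` action. The function name `relabel_replace` is hypothetical, and the sketch assumes the regex is valid in both Go (RE2) and Python syntax. It mirrors two documented behaviours: source label values are joined with `;`, and the regex must match the whole joined value.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Simplified model of a relabel rule with action: replace."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored at both ends
    if match is None:
        return dict(labels)             # no match: labels stay unchanged
    # Translate Go-style $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


target = {"__address__": "node-exporter:9100"}
print(relabel_replace(target, ["__address__"], "(.*):9100", "instance"))
```

With this rule the `instance` label carries the host name without the port. A target on any other port does not match the anchored regex and keeps its labels as they were.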

scrape_configs:
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 5m    # how often the files are re-read
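The files matched by the glob above hold a JSON list of target groups, each with a `targets` list and an optional `labels` map. The following sketch, under stated assumptions, generates such a file from Python. The helper `write_targets`, the file name `nodes.json`, and the `env` label are illustrative. Writing to a temporary file and renaming it is a common precaution so that a reader never sees a half-written file; the `.tmp` suffix keeps the temporary file out of the `*.json` glob.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Write a file_sd target file via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


groups = [
    {
        "targets": ["node-exporter:9100", "node-exporter2:9100"],
        "labels": {"env": "production"},
    }
]

# Demo: write into a throwaway directory instead of /etc/prometheus/targets.
with tempfile.TemporaryDirectory() as target_dir:
    demo_path = os.path.join(target_dir, "nodes.json")
    write_targets(demo_path, groups)
    with open(demo_path) as handle:
        print(handle.read())
```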

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node                  # discover cluster nodes
    scheme: https                   # applies to the rewritten :9100 address, so node-exporter must serve TLS
    tls_config:
      insecure_skip_verify: true    # skips certificate verification; avoid outside test clusters
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap            # map Kubernetes node labels to Prometheus labels
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'    # replace the kubelet port with the node-exporter port
        target_label: __address__
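The `labelmap` action works on label names rather than label values. As a rough illustration, the Python sketch below copies every label whose name fully matches the regex to a new label named by the replacement. The function name `labelmap` and the sample labels (`zone`, the IP address) are hypothetical, and the sketch assumes the regex is valid in both Go and Python syntax.

```python
import re


def labelmap(labels, regex, replacement="$1"):
    """Simplified model of a relabel rule with action: labelmap."""
    # Translate Go-style $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    for name, value in labels.items():
        match = re.fullmatch(regex, name)  # matched against the label NAME
        if match:
            result[match.expand(template)] = value
    return result


discovered = {
    "__address__": "10.0.0.5:10250",
    "__meta_kubernetes_node_label_zone": "eu-1",
}
print(labelmap(discovered, r"__meta_kubernetes_node_label_(.+)"))
```

Labels that start with `__` are removed once relabeling finishes, so copying the `__meta_` labels to plain names is what keeps them on the stored series.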

{
  "title": "CPU Usage by Instance",
  "description": "Real-time CPU usage for each node",
  "type": "gauge",
  "datasource": "Prometheus",
  "gridPos": { "x": 0, "y": 0, "w": 8, "h": 6 },
  "targets": [
    {
      "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
      "legendFormat": "{{instance}}",
      "interval": "15s",
      "refId": "A"
    }
  ],
  "options": {
    "min": 0,
    "max": 100,
    "unit": "percent",
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "orange", "value": 70 },
        { "color": "red", "value": 85 }
      ]
    },
    "showThresholdLabels": true,
    "showThresholdMarkers": true
  }
}

In the panel below, "sort": 2 orders the shared tooltip by value, descending.
{
  "title": "Memory Usage Trend",
  "type": "graph",
  "datasource": "Prometheus",
  "gridPos": { "x": 8, "y": 0, "w": 16, "h": 8 },
  "targets": [
    {
      "expr": "(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (1024*1024)",
      "legendFormat": "Used ({{instance}})",
      "step": "60s",
      "refId": "A"
    },
    {
      "expr": "node_memory_MemTotal_bytes / (1024*1024)",
      "legendFormat": "Total ({{instance}})",
      "step": "60s",
      "refId": "B"
    }
  ],
  "options": {
    "legend": {
      "show": true,
      "values": true,
      "min": false,
      "max": true,
      "current": true,
      "total": false,
      "alignAsTable": true
    },
    "tooltip": {
      "shared": true,
      "sort": 2
    },
    "yaxes": [
      {
        "format": "MB",
        "min": 0
      },
      {
        "show": false
      }
    ],
    "lines": true,
    "fill": 1,
    "linewidth": 2
  }
}

Label usage recommendations:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - regex: '.*_token'           # drop sensitive labels; labeldrop matches label names
    action: labeldrop

Collection optimization:
groups:
  - name: node_rules
    rules:
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100

Advanced Grafana tips:
"templating": {
  "list": [
    {
      "name": "instance",
      "datasource": "Prometheus",
      "query": "label_values(node_cpu_seconds_total, instance)",
      "refresh": 1,
      "multi": true
    }
  ]
}
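
Prometheus cannot scrape metrics:
Check that the target serves metrics: curl http://target:port/metrics
Inspect the discovered targets: http://prometheus:9090/service-discovery
Follow the Prometheus logs: journalctl -u prometheus -f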
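When checking a target by hand, it helps to know what a healthy `/metrics` response looks like. The sketch below parses a small sample of the text exposition format into a dictionary. The function `parse_metrics` and the sample values are illustrative assumptions, and the parser is deliberately minimal: it skips `# HELP` and `# TYPE` lines and ignores optional timestamps.

```python
def parse_metrics(text):
    """Parse text exposition lines into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # The series ends at the closing brace; label values may hold spaces.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        samples[series] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples


SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_load1 0.42
"""

for series, value in parse_metrics(SAMPLE).items():
    print(series, value)
```

If the curl output is HTML, an error page, or empty instead of lines in this shape, the problem lies with the target or its `metrics_path` rather than with Prometheus.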

Grafana shows no data:
Test the data source connection: Grafana > Data Sources > Test
Run count({__name__=~".+"}) at http://prometheus:9090/graph to confirm that Prometheus holds data.
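The same check can be scripted against the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the base URL `http://prometheus:9090` is taken from the examples above and assumed to be reachable from wherever the script runs.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(base_url, expr, timeout=5.0):
    """Run an instant query and return the result list from the response."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return payload["data"]["result"]


# Only builds the URL; call instant_query(...) where Prometheus is reachable.
print(build_query_url("http://prometheus:9090", 'count({__name__=~".+"})'))
```

An empty result list from `instant_query` means Prometheus itself has no matching series, which points at scraping rather than at the Grafana data source.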
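
Performance problems:
Watch Prometheus's own memory usage: process_resident_memory_bytes
Watch the sample ingestion rate: rate(prometheus_tsdb_head_samples_appended_total[1m])
Avoid .* regex matchers
Use rate() rather than irate() over long time ranges

# Federation: pull selected series from another Prometheus server
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus:9090'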
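
With Prometheus collection and Grafana visualization configured sensibly, you can build a capable monitoring system that tracks system state in real time and helps you locate and resolve problems quickly. Review your collection strategy regularly and drop metrics you do not use to keep storage efficient.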