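Prometheus was inspired by Google's Borgmon monitoring system (similarly, Kubernetes evolved from Google's Borg system). Former Google engineers began developing it as open-source software at SoundCloud in 2012, and an early version was released publicly in early 2015. Prometheus has the following characteristics: it is easy to manage, it monitors the internal state of running services, it has a powerful data model, and all collected monitoring data is stored as metrics in a built-in time series database (TSDB). Recent versions of the Grafana visualization tool also fully support Prometheus, so Grafana can be used to build more polished monitoring charts.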
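1. Prometheus ecosystem components
Prometheus Server: the main server, which collects and stores time series data
Client libraries: instrument application code so that metrics are exposed from inside the monitored application
Pushgateway: a push gateway that accepts metrics from short-lived jobs
Exporters: data-collection components written for specific applications, for example HAProxy, StatsD, and Graphite
Alertmanager: the component dedicated to handling alerts
2. Architecture
Prometheus Server: contains the storage engine and the computation engine.
Retrieval: the scraping component, which actively pulls metric data from Pushgateway or from exporters.
Service discovery: dynamically discovers the targets to monitor.
TSDB: the core storage and query layer for the data.
HTTP server: exposes the HTTP service to the outside.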
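3. Collection layer
The collection layer handles two kinds of jobs: short-lived jobs and long-lived jobs.
Short-lived jobs: push their metrics directly to Pushgateway through its API when they exit.
Long-lived jobs: the Retrieval component pulls data directly from the job or from an exporter.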
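The short-lived case can be sketched in shell as follows. This is a minimal illustration, not part of the original setup: the job name, the metric names, and the helper functions are hypothetical, and the Pushgateway address is assumed to be the node01 address used later in this walkthrough. The only real interface relied on is the Pushgateway HTTP path /metrics/job/<job>/instance/<instance>, which accepts metrics in the text exposition format.

```shell
#!/usr/bin/env bash
# Sketch: a short-lived job that reports its result to Pushgateway on exit.
# PUSHGATEWAY, JOB_NAME and the demo_job_* metric names are illustrative assumptions.

PUSHGATEWAY="${PUSHGATEWAY:-192.168.255.101:9091}"
JOB_NAME="${JOB_NAME:-demo_batch_job}"

# Build the Pushgateway URL for a job/instance grouping key.
pushgateway_url() {
  local job="$1" instance="$2"
  printf 'http://%s/metrics/job/%s/instance/%s' "$PUSHGATEWAY" "$job" "$instance"
}

# Render two gauges in the Prometheus text exposition format.
render_metrics() {
  local exit_code="$1" duration="$2"
  printf '# TYPE demo_job_exit_code gauge\n'
  printf 'demo_job_exit_code %s\n' "$exit_code"
  printf '# TYPE demo_job_duration_seconds gauge\n'
  printf 'demo_job_duration_seconds %s\n' "$duration"
}

# Push the rendered metrics; curl reads the request body from stdin.
push_on_exit() {
  local exit_code="$1" started="$2"
  local duration=$(( $(date +%s) - started ))
  render_metrics "$exit_code" "$duration" |
    curl --silent --show-error --data-binary @- \
      "$(pushgateway_url "$JOB_NAME" "$(hostname)")"
}

# In a real job, register the push as an EXIT trap before doing the work:
#   started=$(date +%s)
#   trap 'push_on_exit "$?" "$started"' EXIT
#   ... job body ...
```

Prometheus then scrapes Pushgateway like any other target, so the job's last result stays visible after the process itself has gone.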
4. Application layer
The application layer has two main parts: Alertmanager and data visualization.
IP | Services | Hostname
192.168.255.101 | Prometheus Server, Pushgateway, Alertmanager, Node Exporter | node01
192.168.255.102 | Node Exporter | node02
192.168.255.103 | Node Exporter | node03
1. Download the Prometheus Server package
[root@node01 ~]# wget https://github.com/prometheus/prometheus/releases/download/v2.29.1/prometheus-2.29.1.linux-amd64.tar.gz
2. Extract the archive
[root@node01 ~]# tar -zxf prometheus-2.29.1.linux-amd64.tar.gz -C /usr/local/cluster/
3. Create a symbolic link
[root@node01 ~]# ln -s /usr/local/cluster/prometheus-2.29.1.linux-amd64/ /usr/local/cluster/prometheus
4. Edit the configuration file
[root@node01 ~]# vim /usr/local/cluster/prometheus/prometheus.yml
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.255.101:9090"]
  - job_name: "pushgateway"
    static_configs:
      - targets: ["192.168.255.101:9091"]
        labels:
          # instance must be indented under labels, otherwise Prometheus fails to load the file
          instance: pushgateway
  - job_name: "node exporter"
    static_configs:
      - targets: ["192.168.255.101:9100","192.168.255.102:9100","192.168.255.103:9100"]
1. Download the Pushgateway package
[root@node01 ~]# wget https://github.com/prometheus/pushgateway/releases/download/v1.4.1/pushgateway-1.4.1.linux-amd64.tar.gz
2. Extract the archive
[root@node01 ~]# tar -zxf pushgateway-1.4.1.linux-amd64.tar.gz -C /usr/local/cluster/
3. Create a symbolic link
[root@node01 ~]# ln -s /usr/local/cluster/pushgateway-1.4.1.linux-amd64/ /usr/local/cluster/pushgateway
[root@node01 ~]# wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
[root@node01 ~]# tar -zxf alertmanager-0.23.0.linux-amd64.tar.gz -C /usr/local/cluster/
[root@node01 ~]# ln -s /usr/local/cluster/alertmanager-0.23.0.linux-amd64/ /usr/local/cluster/alertmanager
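The download, extract, and link steps repeat for every component, and the version number appears in three places each time. One way to keep them from drifting apart is to derive everything from a single component name and version. The helper functions below are hypothetical and only a sketch; they assume the GitHub release naming pattern and the /usr/local/cluster install directory used in this walkthrough.

```shell
#!/usr/bin/env bash
# Sketch: derive the URL, tarball name and directory from one version string.
# Function names are illustrative; INSTALL_DIR follows this walkthrough.

INSTALL_DIR="/usr/local/cluster"

# tarball_name <component> <version>
# e.g. prometheus 2.29.1 -> prometheus-2.29.1.linux-amd64.tar.gz
tarball_name() {
  printf '%s-%s.linux-amd64.tar.gz' "$1" "$2"
}

# release_url <component> <version> -> GitHub release download URL
release_url() {
  printf 'https://github.com/prometheus/%s/releases/download/v%s/%s' \
    "$1" "$2" "$(tarball_name "$1" "$2")"
}

# install_component <component> <version>: download, unpack, and symlink.
install_component() {
  local name="$1" version="$2" tarball
  tarball="$(tarball_name "$name" "$version")"
  wget "$(release_url "$name" "$version")" &&
    tar -zxf "$tarball" -C "$INSTALL_DIR/" &&
    ln -sfn "$INSTALL_DIR/${tarball%.tar.gz}" "$INSTALL_DIR/$name"
}

# Example (not executed here):
#   install_component prometheus 2.29.1
#   install_component pushgateway 1.4.1
#   install_component alertmanager 0.23.0
```

Using ln -sfn rather than ln -s lets the same function repoint the link when a newer version is installed later.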
Install Node Exporter on every node in the cluster.
1. Download the Node Exporter package
[root@node01 ~]# wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
2. Extract the archive
[root@node01 ~]# tar -zxf node_exporter-1.2.2.linux-amd64.tar.gz -C /usr/local/cluster/
3. Create a symbolic link
[root@node01 ~]# ln -s /usr/local/cluster/node_exporter-1.2.2.linux-amd64/ /usr/local/cluster/node_exporter
4. Start the service
[root@node01 ~]# nohup /usr/local/cluster/node_exporter/node_exporter > /usr/local/cluster/node_exporter/node_exporter.log 2>&1 &
5. Manage the service with systemctl
[root@node01 ~]# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/cluster/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
[root@node01 ~]# systemctl daemon-reload
[root@node01 ~]# systemctl start node_exporter.service
1. Run Prometheus Server in the background
[root@node01 ~]# nohup /usr/local/cluster/prometheus/prometheus --config.file=/usr/local/cluster/prometheus/prometheus.yml > /usr/local/cluster/prometheus/prometheus.log 2>&1 &
[3] 106868
2. Startup failed: no prometheus process is running
[root@node01 ~]# ps -ef | grep prometheus
root 130614 62112 0 11:59 pts/0 00:00:00 grep --color=auto prometheus
[root@node01 ~]# netstat -anp | grep 106868
3. Check the log
[root@node01 ~]# more /usr/local/cluster/prometheus/prometheus.log
nohup: ignoring input
level=error ts=2023-07-22T03:57:46.081Z caller=main.go:350 msg="Error loading config (--config.file=/usr/local/cluster/prometheus/prometheus.yml)" err="parsing YAML file /usr/local/cluster/prometheus/prometheus.yml: yaml: unmarshal errors:\n line 31: field instance not found in type struct { Targets []string \"yaml:\\\"targets\\\"\"; Labels model.LabelSet \"yaml:\\\"labels\\\"\" }"
4. Line 31 of the configuration file is mis-indented: instance: pushgateway must be nested one level under labels
[root@node01 ~]# vim /usr/local/cluster/prometheus/prometheus.yml
  - job_name: "pushgateway"
    static_configs:
      - targets: ["192.168.255.101:9091"]
        labels:
          instance: pushgateway
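Before restarting, the edited file can be validated offline. The Prometheus tarball ships a promtool binary next to prometheus, and its check config subcommand parses the file and reports errors like the one in the log above. The wrapper function and its default paths are illustrative, assuming the install layout used in this walkthrough.

```shell
#!/usr/bin/env bash
# Sketch: validate prometheus.yml with promtool before (re)starting the server.
# The function name and default paths are assumptions based on this walkthrough.

# check_prometheus_config [promtool_path] [config_path]
# Returns promtool's exit status, or 127 if promtool is not executable.
check_prometheus_config() {
  local promtool="${1:-/usr/local/cluster/prometheus/promtool}"
  local config="${2:-/usr/local/cluster/prometheus/prometheus.yml}"
  if [ ! -x "$promtool" ]; then
    echo "promtool not found at $promtool" >&2
    return 127
  fi
  "$promtool" check config "$config"
}

# Example (not executed here):
#   check_prometheus_config && echo "config OK, safe to start Prometheus"
```

A non-zero exit status means the file would also fail to load at startup, so the restart can be skipped until the error is fixed.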
5. Start Prometheus again after fixing the configuration file
[root@node01 ~]# nohup /usr/local/cluster/prometheus/prometheus --config.file=/usr/local/cluster/prometheus/prometheus.yml > /usr/local/cluster/prometheus/prometheus.log 2>&1 &
[1] 66642
6. Check the process
[root@node01 ~]# ps -ef | grep prometheus
root 66642 87617 0 02:04 pts/2 00:00:00 /usr/local/cluster/prometheus/prometheus --config.file=/usr/local/cluster/prometheus/prometheus.yml
root 75184 87617 0 02:05 pts/2 00:00:00 grep --color=auto prometheus
7. Check the log for "Server is ready to receive web requests"
[root@node01 ~]# tail -100f /usr/local/cluster/prometheus/prometheus.log
nohup: ignoring input
level=info ts=2023-07-22T18:04:30.186Z caller=main.go:390 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2023-07-22T18:04:30.186Z caller=main.go:428 msg="Starting Prometheus" version="(version=2.29.1, branch=HEAD, revision=dcb07e8eac34b5ea37cd229545000b857f1c1637)"
level=info ts=2023-07-22T18:04:30.186Z caller=main.go:433 build_context="(go=go1.16.7, user=root@364730518a4e, date=20210811-14:48:27)"
level=info ts=2023-07-22T18:04:30.186Z caller=main.go:434 host_details="(Linux 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 node01 (none))"
level=info ts=2023-07-22T18:04:30.186Z caller=main.go:435 fd_limits="(soft=1024, hard=4096)"
level=info ts=2023-07-22T18:04:30.186Z caller=main.go:436 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2023-07-22T18:04:30.192Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2023-07-22T18:04:30.197Z caller=main.go:812 msg="Starting TSDB ..."
level=info ts=2023-07-22T18:04:30.201Z caller=head.go:815 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2023-07-22T18:04:30.201Z caller=head.go:829 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=18.856µs
level=info ts=2023-07-22T18:04:30.201Z caller=head.go:835 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2023-07-22T18:04:30.202Z caller=head.go:892 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2023-07-22T18:04:30.202Z caller=head.go:898 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=34.866µs wal_replay_duration=903.49µs total_replay_duration=976.097µs
level=info ts=2023-07-22T18:04:30.203Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2023-07-22T18:04:30.204Z caller=main.go:839 fs_type=XFS_SUPER_MAGIC
level=info ts=2023-07-22T18:04:30.204Z caller=main.go:842 msg="TSDB started"
level=info ts=2023-07-22T18:04:30.204Z caller=main.go:969 msg="Loading configuration file" filename=/usr/local/cluster/prometheus/prometheus.yml
level=info ts=2023-07-22T18:04:30.216Z caller=main.go:1006 msg="Completed loading of configuration file" filename=/usr/local/cluster/prometheus/prometheus.yml totalDuration=12.121714ms db_storage=902ns remote_storage=4.468µs web_handler=341ns query_engine=2.685µs scrape=11.038657ms scrape_sd=124.555µs notify=31.409µs notify_sd=14.867µs rules=3.366µs
level=info ts=2023-07-22T18:04:30.216Z caller=main.go:784 msg="Server is ready to receive web requests."
http://192.168.255.101:9090/
At this point the three Node Exporters and Prometheus have started successfully.
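The same check can be made from the command line through the Prometheus HTTP API: an instant query for the built-in up metric returns one series per scrape target, with value 1 for reachable targets and 0 otherwise. The sketch below assumes the node01 address from this walkthrough; the function names are hypothetical, and the grep-based counting is a rough shortcut for which a real JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Sketch: ask Prometheus which targets are up via /api/v1/query.
# PROMETHEUS and the function names are illustrative assumptions.

PROMETHEUS="${PROMETHEUS:-192.168.255.101:9090}"

# instant_query <promql>: run an instant query; curl URL-encodes the expression.
instant_query() {
  curl --silent --show-error --get \
    --data-urlencode "query=$1" \
    "http://$PROMETHEUS/api/v1/query"
}

# count_up: read a query response on stdin and count samples whose value is "1".
# This pattern-matches the JSON text rather than parsing it.
count_up() {
  grep -o '"value":\[[^]]*,"1"]' | wc -l | tr -d ' '
}

# Example (not executed here):
#   instant_query 'up' | count_up
```

With the configuration shown earlier, a count lower than the number of configured targets points at a component that has not been started yet, such as Pushgateway at this stage.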
1. Start Pushgateway with nohup
[root@node01 ~]# nohup /usr/local/cluster/pushgateway/pushgateway --web.listen-address=":9091" > /usr/local/cluster/pushgateway/pushgateway.log 2>&1 &
[3] 94973
2. Check the process
[root@node01 ~]# ps -ef | grep pushgateway
root 94973 87617 0 02:20 pts/2 00:00:00 /usr/local/cluster/pushgateway/pushgateway --web.listen-address=:9091
root 96236 87617 0 02:20 pts/2 00:00:00 grep --color=auto pushgateway
3. Check the log
[root@node01 ~]# tail -100f /usr/local/cluster/pushgateway/pushgateway.log
nohup: ignoring input
level=info ts=2023-07-22T18:20:39.894Z caller=main.go:85 msg="starting pushgateway" version="(version=1.4.1, branch=HEAD, revision=6fa509bbf4f082ab8455057aafbb5403bd6e37a5)"
level=info ts=2023-07-22T18:20:39.894Z caller=main.go:86 build_context="(go=go1.16.4, user=root@da864be5f3f0, date=20210528-14:30:10)"
level=info ts=2023-07-22T18:20:39.896Z caller=main.go:139 listen_address=:9091
level=info ts=2023-07-22T18:20:39.901Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
1. Start Alertmanager with nohup
[root@node01 ~]# nohup /usr/local/cluster/alertmanager/alertmanager --config.file=/usr/local/cluster/alertmanager/alertmanager.yml > /usr/local/cluster/alertmanager/alertmanager.log 2>&1 &
[5] 128248
2. Check the process
[root@node01 ~]# ps -ef | grep alertmanager
root 128248 87617 1 02:23 pts/2 00:00:00 /usr/local/cluster/alertmanager/alertmanager --config.file=/usr/local/cluster/alertmanager/alertmanager.yml
root 129880 87617 0 02:23 pts/2 00:00:00 grep --color=auto alertmanager
3. Check the log
[root@node01 ~]# tail -100f /usr/local/cluster/alertmanager/alertmanager.log
nohup: ignoring input
level=info ts=2023-07-22T18:23:49.256Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=HEAD, revision=61046b17771a57cfd4c4a51be370ab930a4d7d54)"
level=info ts=2023-07-22T18:23:49.256Z caller=main.go:226 build_context="(go=go1.16.7, user=root@e21a959be8d2, date=20210825-10:48:55)"
level=info ts=2023-07-22T18:23:49.262Z caller=cluster.go:184 component=cluster msg="setting advertise address explicitly" addr=192.168.255.101 port=9094
level=info ts=2023-07-22T18:23:49.270Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2023-07-22T18:23:49.331Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/usr/local/cluster/alertmanager/alertmanager.yml
level=info ts=2023-07-22T18:23:49.331Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/usr/local/cluster/alertmanager/alertmanager.yml
level=info ts=2023-07-22T18:23:49.334Z caller=main.go:518 msg=Listening address=:9093
level=info ts=2023-07-22T18:23:49.334Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
level=info ts=2023-07-22T18:23:51.270Z caller=cluster.go:696 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000230156s
level=info ts=2023-07-22T18:23:59.277Z caller=cluster.go:688 component=cluster msg="gossip settled; proceeding" elapsed=10.006653682s
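With all components running, a quick liveness pass saves reading four separate logs. Prometheus, Pushgateway, and Alertmanager each expose a /-/healthy endpoint, while Node Exporter can be probed through its /metrics page. The sketch below assumes the ports seen in the logs above (9090, 9091, 9093) and the Node Exporter port from the scrape configuration (9100); the function names and the HOST variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: probe each component's HTTP endpoint on one host.
# HOST and the function names are illustrative assumptions.

HOST="${HOST:-192.168.255.101}"

# health_url <service> -> URL to probe, using the ports from this setup.
health_url() {
  case "$1" in
    prometheus)    printf 'http://%s:9090/-/healthy' "$HOST" ;;
    pushgateway)   printf 'http://%s:9091/-/healthy' "$HOST" ;;
    alertmanager)  printf 'http://%s:9093/-/healthy' "$HOST" ;;
    node_exporter) printf 'http://%s:9100/metrics' "$HOST" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# check_service <service>: print OK or FAIL; curl --fail treats HTTP errors as failures.
check_service() {
  local url
  url="$(health_url "$1")" || return 1
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "OK   $1 ($url)"
  else
    echo "FAIL $1 ($url)"
    return 1
  fi
}

# Example (not executed here):
#   for s in prometheus pushgateway alertmanager node_exporter; do
#     check_service "$s"
#   done
```

Running the loop with HOST set to 192.168.255.102 or 192.168.255.103 is only meaningful for node_exporter, since the other three services run on node01 alone.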