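I recently needed to set up a monitoring cluster. After researching and trying several options, I settled on Prometheus + Thanos. Let me first talk about the drawbacks of the other options I looked at, since Prometheus was only my final choice.
Whichever option you pick, Grafana handles the data display, so I won't cover the display side.
The setup steps are easy to find on the Prometheus website. I deployed everything with Docker. The steps are as follows.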
curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh --mirror Aliyun
Start Docker CE
sudo systemctl enable docker
sudo systemctl start docker
Check that the installation works
docker run hello-world
* docker run -d -p 9090:9090 --restart=on-failure:5 prom/prometheus
// The command below differs from the configuration above
docker run -d -p 9090:9090 -v prometheus:/data/monitor/lib --name prometheusv0 prometheus:v1
* Configure prometheus.yml
find / -name prometheus.yml
vim
Add
// The official docs recommend this, but in practice it didn't seem to make much difference
global:
  external_labels:
    replica: A
* Restart Prometheus
docker ps
docker container restart {container id}
wget https://dl.google.com/go/go1.11.2.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.11.2.linux-amd64.tar.gz
vim /etc/profile
export PATH=$PATH:/usr/local/go/bin
export GOPATH=/usr/local/gopath
export GOROOT=/usr/local/go
source /etc/profile
mkdir /usr/local/gopath
docker pull prom/pushgateway
docker run -d -p 9091:9091 prom/pushgateway --persistence.file /data/monitor/pushgateway/data/data
yum install git
go get -d github.com/improbable-eng/thanos/...
cd ${GOPATH}/src/github.com/improbable-eng/thanos
make
docker run -d -p 3000:3000 grafana/grafana
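After installation, Prometheus should be reachable on a single machine. If it isn't, following the Prometheus GitHub page will get it configured correctly. Thanos still has to be configured before the cluster is up. What the Thanos GitHub page explains least clearly is how the ports on the two machines should be configured, and that cost me some time. Assume the cluster is built on two machines, 45.77.154.164 and 149.28.61.87.
Run on both machines: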
./thanos sidecar \
--http-address 0.0.0.0:19191 \
--grpc-address 0.0.0.0:19091 \
--prometheus.url http://localhost:9090 \
--tsdb.path /usr/local/gopath/src/github.com/prometheus/tsdb/testdata/ \
--cluster.address 0.0.0.0:19391 \
--cluster.peers 45.77.154.164:19391 \
--cluster.peers 149.28.61.87:19391 \
--log.level=debug
./thanos query \
--http-address 0.0.0.0:19192 \
--log.level=debug \
--cluster.peers 149.28.61.87:19391 \
--cluster.peers 45.77.154.164:19391 \
--query.replica-label replica
You may get the error connect: no route to host
Open the cluster port
firewall-cmd --zone=public --add-port=19391/tcp --permanent
systemctl restart firewalld
Open the query port
firewall-cmd --zone=public --add-port=19192/tcp --permanent
systemctl restart firewalld
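To confirm from the other machine that the firewall change took effect, a plain TCP connect is enough. Below is a minimal sketch using only the JDK; the class name `PortCheckSketch` and the method `canConnect` are made up for illustration, and the hosts and ports to probe are passed on the command line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheckSketch {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // "No route to host", "Connection refused" and timeouts all end up here
            System.err.println(host + ":" + port + " -> " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java PortCheckSketch <host> <port> [<host> <port> ...]
        if (args.length == 0 || args.length % 2 != 0) {
            System.out.println("usage: PortCheckSketch <host> <port> [<host> <port> ...]");
            return;
        }
        for (int i = 0; i < args.length; i += 2) {
            String host = args[i];
            int port = Integer.parseInt(args[i + 1]);
            System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
        }
    }
}
```

For the setup above it could be run as `java PortCheckSketch 149.28.61.87 19391 149.28.61.87 19192` from the first machine, and with the other IP from the second.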
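I went through so many attempts that my notes from back then may not be very clear, but as I remember it, with this setup you can query the metrics of the other machine through the Thanos port of either machine. At that point the cluster is up. Our production metric volume is small, so I haven't run into any large-data problems yet.
Although the official recommendation is to collect metrics by pull, in practice I didn't find a particularly simple service discovery option, and configuring every machine's IP by hand was too tedious. So I used pushgateway and push instead.
Most services today are deployed on multiple machines. If the machines aren't distinguished from each other, the queried metric jumps between the actual values of the individual machines. Because of these inconveniences I wrote a jar to wrap things up. What it mainly does can be seen from its main classes: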
import java.io.IOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.UnknownHostException;
import java.util.Enumeration;

import io.prometheus.client.CollectorRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Get the machine's IP
public class BusinessMonitorConstant {
    private static final Logger log = LoggerFactory.getLogger(BusinessMonitorConstant.class);
    public static final CollectorRegistry REGISTRY = new CollectorRegistry();
    private static final InetAddress inetAddress = findFirstNonLoopbackAddress();

    public static String getIp() {
        return inetAddress == null ? "" : inetAddress.getHostAddress();
    }

    public static InetAddress findFirstNonLoopbackAddress() {
        InetAddress result = null;
        try {
            int lowest = Integer.MAX_VALUE;
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            while (nics.hasMoreElements()) {
                NetworkInterface ifc = nics.nextElement();
                if (!ifc.isUp()) {
                    continue;
                }
                log.trace("Testing interface: " + ifc.getDisplayName());
                if (ifc.getIndex() < lowest || result == null) {
                    lowest = ifc.getIndex();
                } else {
                    // An address was already found on a lower-indexed interface
                    continue;
                }
                Enumeration<InetAddress> addrs = ifc.getInetAddresses();
                while (addrs.hasMoreElements()) {
                    InetAddress address = addrs.nextElement();
                    if (address instanceof Inet4Address && !address.isLoopbackAddress()) {
                        log.trace("Found non-loopback interface: " + ifc.getDisplayName());
                        result = address;
                    }
                }
            }
        } catch (IOException e) {
            log.error("Cannot get first non-loopback address", e);
        }
        if (result != null) {
            return result;
        }
        // Fall back to whatever the JDK resolves for the local host
        try {
            return InetAddress.getLocalHost();
        } catch (UnknownHostException e) {
            log.warn("Unable to retrieve localhost");
            return null;
        }
    }
}
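import java.io.IOException;
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;

import javax.annotation.PostConstruct;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.exporter.PushGateway;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
// Assumption: Spring's StringUtils; the commons-lang one works the same way here
import org.springframework.util.StringUtils;

@Service
public class MetricsSchedule {
    private static final Logger log = LoggerFactory.getLogger(MetricsSchedule.class);
    private final CollectorRegistry registry = BusinessMonitorConstant.REGISTRY;
    private PushGateway pushGateway;
    private String ip;

    @PostConstruct
    private void init() {
        String pushUrl = "read from a config file or another dynamic configuration system";
        if (StringUtils.isEmpty(pushUrl)) {
            log.error("do not set prometheus.url in common config. ignore push");
            return;
        }
        URL url;
        // send metrics to remote
        try {
            url = new URL(pushUrl);
        } catch (MalformedURLException e) {
            log.error("new URL error. url: {}", pushUrl, e);
            return;
        }
        pushGateway = new PushGateway(url);
        InetAddress firstNonLoopbackAddress = BusinessMonitorConstant.findFirstNonLoopbackAddress();
        if (firstNonLoopbackAddress != null) {
            ip = firstNonLoopbackAddress.getHostAddress();
        }
    }

    // Push on a schedule here so that each system doesn't have to write its own scheduled push
    @Scheduled(fixedRate = 20000L)
    public void scheduleUpMetrics() {
        try {
            if (pushGateway == null) {
                log.debug("do not set prometheus.url in common config. ignore push");
                return;
            }
            // The machine's IP is used as the job name of the push
            pushGateway.pushAdd(registry, ip);
        } catch (IOException e) {
            log.error("prometheus post metrics io error", e);
            // Written this way because I used to push to a domain name during testing, so the URL
            // is fetched again on failure. Once the push URL is fixed, re-fetching it is unnecessary.
            String pushUrl = "read from a config file or another configuration subsystem";
            URL url;
            // send metrics to remote
            try {
                url = new URL(pushUrl);
            } catch (MalformedURLException e1) {
                log.error("new URL error. url: {}", pushUrl, e1);
                return;
            }
            pushGateway = new PushGateway(url);
        }
    }
}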
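Since `pushAdd(registry, ip)` passes the IP as the job name, each machine ends up in its own Pushgateway group, which is what keeps one machine's push from overwriting another's. Pushgateway addresses a group by a URL path of the form `/metrics/job/<job>` followed by optional `/<label>/<value>` pairs. The sketch below only builds that path with the JDK so the grouping is visible; `PushPathSketch`, `pushPath` and the job name `order_service` are hypothetical names for illustration, and the encoding is simplified compared with what a real client does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class PushPathSketch {

    /** Builds the Pushgateway path for a job plus optional grouping labels. */
    static String pushPath(String job, Map<String, String> groupingKey) {
        StringBuilder path = new StringBuilder("/metrics/job/").append(encode(job));
        for (Map.Entry<String, String> label : groupingKey.entrySet()) {
            path.append('/').append(encode(label.getKey()))
                .append('/').append(encode(label.getValue()));
        }
        return path.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // IP used as the job name, as in MetricsSchedule
        System.out.println(pushPath("10.0.0.1", Collections.<String, String>emptyMap()));
        // prints /metrics/job/10.0.0.1

        // Hypothetical alternative: keep a service name as the job, put the IP in an instance label
        Map<String, String> key = new LinkedHashMap<>();
        key.put("instance", "10.0.0.1");
        System.out.println(pushPath("order_service", key));
        // prints /metrics/job/order_service/instance/10.0.0.1
    }
}
```

Either way, two machines with different IPs push to different paths, so their values stay separate.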
Other people only need to use this class and can ignore the rest
import java.util.Map;

import com.google.common.collect.Maps;
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Histogram;
import io.prometheus.client.Summary;

/** Usage sample:
 *
 * MetricsProxy.registerCounter("common_business_monitor").inc();
 *
 *
 * When using it, wrap the call in try {} catch {}
 * @date 2018/11/30
 **/
public class MetricsProxy {
    // No need for volatile: a duplicate register throws, and callers wrap the call
    // in try/catch, so it doesn't have to be that strict
    private static final Map<String, Counter> counters = Maps.newConcurrentMap();
    private static final Map<String, Gauge> gauges = Maps.newConcurrentMap();
    private static final Map<String, Histogram> histograms = Maps.newConcurrentMap();
    private static final Map<String, Summary> summaries = Maps.newConcurrentMap();
    private static final CollectorRegistry registry = BusinessMonitorConstant.REGISTRY;
    private static String nameSupplement = "loaded from configuration to distinguish environments";

    public static Counter registerCounter(String name) {
        Counter counter = counters.get(name);
        // Without the second check under the lock, concurrent registerCounter calls
        // right after a restart could throw an exception
        if (counter == null) {
            synchronized (counters) {
                counter = counters.get(name);
                if (counter == null) {
                    counter = Counter.build(name + "_" + nameSupplement, "help msg").register(registry);
                    counters.put(name, counter);
                }
            }
        }
        return counter;
    }

    public static Gauge registerGauge(String name) {
        Gauge gauge = gauges.get(name);
        if (gauge == null) {
            synchronized (gauges) {
                gauge = gauges.get(name);
                if (gauge == null) {
                    gauge = Gauge.build(name + "_" + nameSupplement, "help msg").register(registry);
                    gauges.put(name, gauge);
                }
            }
        }
        return gauge;
    }

    public static Summary registerSummary(String name) {
        Summary summary = summaries.get(name);
        if (summary == null) {
            synchronized (summaries) {
                summary = summaries.get(name);
                if (summary == null) {
                    summary = Summary.build(name + "_" + nameSupplement, "help msg").register(registry);
                    summaries.put(name, summary);
                }
            }
        }
        return summary;
    }

    public static Histogram registerHistogram(String name) {
        Histogram histogram = histograms.get(name);
        if (histogram == null) {
            synchronized (histograms) {
                histogram = histograms.get(name);
                if (histogram == null) {
                    histogram = Histogram.build(name + "_" + nameSupplement, "help msg").register(registry);
                    histograms.put(name, histogram);
                }
            }
        }
        return histogram;
    }
}
When someone wants to use it, they pull in the jar, and then this single line of code records a counter metric.
MetricsProxy.registerCounter("common_business_monitor").inc();
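The get-or-create logic in MetricsProxy could probably also be written with `ConcurrentHashMap.computeIfAbsent`, which applies the factory at most once per key. The sketch below is not the jar's code: `GetOrCreateSketch` is a made-up generic stand-in that uses plain strings instead of Prometheus collectors, so it runs with the JDK alone.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class GetOrCreateSketch<T> {
    private final ConcurrentHashMap<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> factory;

    public GetOrCreateSketch(Function<String, T> factory) {
        this.factory = factory;
    }

    /** Returns the cached value for name, creating it at most once even under contention. */
    public T get(String name) {
        return cache.computeIfAbsent(name, factory);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();
        // Stand-in for something like Counter.build(...).register(registry)
        GetOrCreateSketch<String> metrics = new GetOrCreateSketch<>(name -> {
            created.incrementAndGet();
            return "metric:" + name;
        });

        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> metrics.get("common_business_monitor"));
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("factory calls: " + created.get()); // factory calls: 1
    }
}
```

The trade-off is that the factory runs while the map holds a lock on that key's bin, so it should stay short and must not touch the same map.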
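Every metric recorded this way carries the machine's IP, so to get the real total you have to wrap the query in sum.
I found that the increase function for counters is actually not exact. Subtracting with offset is exact, but it can't handle a machine restart, after which the difference comes out negative. increase does not take an exact difference. It estimates by extrapolating from the samples inside the window, so over a short window the number is off. The longer the window, the more accurate it gets, and the more samples there are in each interval, the more accurate it generally is.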
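The restart problem can be shown with a few numbers. The sketch below compares a plain last-minus-first difference (what subtracting with offset amounts to) with a reset-aware sum that treats any drop as a restart from zero. It is a simplification for illustration: the class and method names are made up, and the extrapolation to the window edges that the real increase also performs is left out.

```java
public class CounterIncreaseSketch {

    /** Exact difference between the last and first sample in the window. */
    static double offsetDelta(double[] samples) {
        return samples[samples.length - 1] - samples[0];
    }

    /** Sums the growth between neighbours; a drop is treated as a restart from zero. */
    static double resetAwareIncrease(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double previous = samples[i - 1];
            double current = samples[i];
            total += (current >= previous) ? current - previous : current;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] steady = {100, 110, 125, 140};
        double[] restarted = {100, 110, 5, 20}; // the process restarted after the second sample

        System.out.println(offsetDelta(steady));            // 40.0
        System.out.println(resetAwareIncrease(steady));     // 40.0
        System.out.println(offsetDelta(restarted));         // -80.0
        System.out.println(resetAwareIncrease(restarted));  // 30.0
    }
}
```

Without a restart both approaches agree. With one, the plain difference goes negative while the reset-aware sum still reports the growth it can see.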
That's all I can think of for now.