Setting up a prometheus + thanos monitoring cluster

I recently needed to set up a monitoring cluster. After researching and trying several options, I settled on prometheus + thanos. Let me first go over the drawbacks of the other options I evaluated, since Prometheus was the one I arrived at last.

Weighing the options

Whichever option you pick, Grafana handles the visualization, so I won't cover the display side here.

  1. graphite + whisper + carbon. whisper and carbon are the default components when using graphite. They can be replaced, but the effort is considerable. And while a cluster can be built, cross-node access is very slow.
  2. influxdb + Telegraf. InfluxDB only supports clustering in the commercial edition. The open-source edition can be sharded behind a proxy, but that does not suit our query patterns. [Single-node capacity](https://jasper-zhang1.gitbooks.io/influxdb/content/Guide/hardware_sizing.html)
  3. After ruling out the two options above, I finally found prometheus + thanos.

Single-node setup

The setup steps are easy to find on the Prometheus website; I used Docker throughout. The process is as follows.

  1. Install docker
 curl -fsSL get.docker.com -o get-docker.sh
 sudo sh get-docker.sh --mirror Aliyun
 Enable and start docker ce
 sudo systemctl enable docker
 sudo systemctl start docker
 Verify the installation
 docker run hello-world
  1. Install prometheus
* docker run -d -p 9090:9090 --restart=on-failure:5 prom/prometheus
// Unlike the command above, this one mounts a volume for the data directory (prometheus:v1 here is a locally tagged image). Note that --restart is a docker flag and must come before the image name, or docker passes it to the prometheus binary.
docker run -d -p 9090:9090 -v prometheus:/data/monitor/lib --name prometheusv0 prometheus:v1


* Configure prometheus.yml
find / -name prometheus.yml
vim
Add the following
// The official docs recommend this, but in practice it did not seem to matter much
global:
  external_labels:
    replica: A
* Restart prometheus
docker ps
docker container restart {container id}
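Putting the pieces together, prometheus.yml ends up looking roughly like the sketch below. This is a minimal example under my assumptions: the scrape job name and the pushgateway target (used later in this post) are illustrative, not from the original config.

```yaml
global:
  scrape_interval: 15s
  external_labels:
    replica: A          # use a different value (e.g. B) on the second machine

scrape_configs:
  # assumed job: scrape the local pushgateway that services push to
  - job_name: pushgateway
    honor_labels: true  # keep the job/instance labels set by the pusher
    static_configs:
      - targets: ['localhost:9091']
```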
  1. Install go
wget https://dl.google.com/go/go1.11.2.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.11.2.linux-amd64.tar.gz
vim  /etc/profile
export PATH=$PATH:/usr/local/go/bin
export GOPATH=/usr/local/gopath
export GOROOT=/usr/local/go
source /etc/profile
mkdir /usr/local/gopath
  1. Install pushgateway
docker pull prom/pushgateway

docker run -d -p 9091:9091 prom/pushgateway  --persistence.file /data/monitor/pushgateway/data/data
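As a quick sanity check (assuming the pushgateway from the step above is listening on localhost:9091; the metric and job names here are made up), you can push a sample by hand:

```shell
# job label under which the sample will appear (made-up name)
job="manual_test"
# one sample in the Prometheus exposition format: "<metric_name> <value>"
payload="manual_test_metric 42"
# push it; the sample then shows up on http://localhost:9091/metrics
echo "$payload" | curl --max-time 2 --data-binary @- \
    "http://localhost:9091/metrics/job/$job" || echo "pushgateway not reachable"
```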
  1. Install git
yum install git
  1. Install thanos
go get -d github.com/improbable-eng/thanos/...
cd ${GOPATH}/src/github.com/improbable-eng/thanos
make
  1. Install grafana
docker run -d -p 3000:3000 grafana/grafana
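You can add the datasource in the Grafana UI, or provision it from a file. A minimal sketch (the datasource name is my choice; the URL assumes the thanos query HTTP port configured later in this post, so that Grafana sees the deduplicated cluster-wide view rather than a single Prometheus):

```yaml
# /etc/grafana/provisioning/datasources/thanos.yaml
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus        # thanos query speaks the Prometheus query API
    access: proxy
    url: http://localhost:19192
```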

Cluster configuration

Once everything is installed, Prometheus should be reachable on a single node (if it is not, the Prometheus GitHub page has enough detail to get the configuration right). Thanos still has to be configured to turn this into a cluster. The least clear part of the Thanos GitHub page is how the ports on the two machines should be wired up, which cost me some time. Assume we are building the cluster on 45.77.154.164 and 149.28.61.87.

Run the following on both machines:

./thanos sidecar \
    --http-address              0.0.0.0:19191 \
    --grpc-address              0.0.0.0:19091 \
    --prometheus.url            http://localhost:9090 \
    --tsdb.path                 /usr/local/gopath/src/github.com/prometheus/tsdb/testdata/ \
    --cluster.address           0.0.0.0:19391 \
    --cluster.peers             45.77.154.164:19391 \
    --cluster.peers             149.28.61.87:19391 \
    --log.level=debug

./thanos query \
      --http-address              0.0.0.0:19192 \
      --log.level=debug        \
      --cluster.peers             149.28.61.87:19391 \
      --cluster.peers             45.77.154.164:19391 \
      --query.replica-label replica

You may get connect: no route to host.
Open the cluster port:
firewall-cmd --zone=public --add-port=19391/tcp --permanent
systemctl restart firewalld
Open the query port (again restarting, since --permanent rules only take effect after a restart or reload):
firewall-cmd --zone=public --add-port=19192/tcp --permanent
systemctl restart firewalld

Because I went through many attempts, my notes from back then may not be entirely precise, but as I recall, at this point you can query the other machine's metrics through either machine's thanos query port. That means the cluster is up. Since our production metric volume is small, I have not run into any large-data issues yet.

Other notes

  1. Although the official recommendation is to pull metrics, in practice there was no really simple service-discovery option for us, and configuring every machine's IP by hand was too tedious. So I used pushgateway and push instead.

  2. Most services today are deployed across multiple machines, and if the machines are not distinguished, a queried metric jumps between the actual values from different machines. Because of these inconveniences I wrote a jar to wrap things up. It mainly does the following:

  • Periodically push data to pushgateway
  • Add job = ip to distinguish data pushed from different machines; otherwise the monitoring server alternates between the metrics of different machines
  • Since the project is deployed in several environments that all share one monitoring stack, also distinguish the environments.

The main classes are:

// Resolve the machine's IP
import java.io.IOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.UnknownHostException;
import java.util.Enumeration;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import io.prometheus.client.CollectorRegistry;

public class BusinessMonitorConstant {
    private static Logger log = LoggerFactory.getLogger(BusinessMonitorConstant.class);
    public static final CollectorRegistry REGISTRY = new CollectorRegistry();
    private static InetAddress inetAddress = findFirstNonLoopbackAddress();

    public static String getIp() {
        return inetAddress == null ? "" : inetAddress.getHostAddress();
    }


    public static InetAddress findFirstNonLoopbackAddress() {
        InetAddress result = null;
        try {
            // Prefer the up interface with the lowest index
            int lowest = Integer.MAX_VALUE;
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            while (nics.hasMoreElements()) {
                NetworkInterface ifc = nics.nextElement();
                if (!ifc.isUp()) {
                    continue;
                }
                log.trace("Testing interface: " + ifc.getDisplayName());
                if (ifc.getIndex() < lowest || result == null) {
                    lowest = ifc.getIndex();
                } else {
                    continue;
                }
                Enumeration<InetAddress> addrs = ifc.getInetAddresses();
                while (addrs.hasMoreElements()) {
                    InetAddress address = addrs.nextElement();
                    if (address instanceof Inet4Address && !address.isLoopbackAddress()) {
                        log.trace("Found non-loopback interface: " + ifc.getDisplayName());
                        result = address;
                    }
                }
            }
        } catch (IOException ex) {
            log.error("Cannot get first non-loopback address", ex);
        }

        if (result != null) {
            return result;
        } else {
            try {
                return InetAddress.getLocalHost();
            } catch (UnknownHostException var7) {
                log.warn("Unable to retrieve localhost");
                return null;
            }
        }
    }
}
@Service
public class MetricsSchedule {

    private CollectorRegistry registry = BusinessMonitorConstant.REGISTRY;
    private static Logger log = LoggerFactory.getLogger(MetricsSchedule.class);


    private PushGateway pushGateway;
    private String ip;

    @PostConstruct
    private void init() {
        String pushUrl = "read from a config file or a dynamic configuration system";
        if (StringUtils.isEmpty(pushUrl)) {
            log.error("do not set prometheus.url in common config. ignore push");
            return;
        }
        URL url;
        //send metrics to remote
        try {
            url = new URL(pushUrl);
        } catch (MalformedURLException e) {
            log.error("new URL error. url: {}", pushUrl, e);
            return;
        }
        pushGateway = new PushGateway(url);
        InetAddress firstNonLoopbackAddress = BusinessMonitorConstant.findFirstNonLoopbackAddress();
        if (firstNonLoopbackAddress != null) {
            ip = firstNonLoopbackAddress.getHostAddress();
        }
    }

    // Push metrics on a schedule, so that each system does not have to implement its own upload
    @Scheduled(fixedRate = 20000L)
    public void scheduleUpMetrics() {
        try {
            if (pushGateway == null) {
                log.debug("do not set prometheus.url in common config. ignore push");
                return;
            }
            pushGateway.pushAdd(registry, ip);
        } catch (IOException e) {
            log.error("prometheus post metrics io error", e);
            // The URL is re-resolved here because I originally tested pushing via a
            // domain name; once the push URL is fixed there is no need to fetch it again
            String pushUrl = "read from a config file or another configuration subsystem";
            // send metrics to remote
            URL url;
            try {
                url = new URL(pushUrl);
            } catch (MalformedURLException e1) {
                log.error("new URL error. url: {}", pushUrl, e1);
                return;
            }
            pushGateway = new PushGateway(url);
        }
    }

}

Other teams only need to use the following class; the others are internal:

/** Usage sample:
 *
 *   MetricsProxy.registerCounter("common_business_monitor").inc();
 *
 * When using it, please wrap calls in try {} catch {}.
 * @date 2018/11/30
 */
public class MetricsProxy {
    // No need for volatile: a duplicate register() throws, and callers wrap calls
    // in try/catch, so we do not have to be that strict
    private static final Map<String, Counter> counters = Maps.newConcurrentMap();
    private static final Map<String, Gauge> gauges = Maps.newConcurrentMap();
    private static final Map<String, Histogram> histograms = Maps.newConcurrentMap();
    private static final Map<String, Summary> summaries = Maps.newConcurrentMap();
    private static final CollectorRegistry registry = BusinessMonitorConstant.REGISTRY;
    private static String nameSupplement = "read from config to distinguish environments";

    public static Counter registerCounter(String name) {
        Counter counter = counters.get(name);
        // Without double-checked locking, concurrent registerCounter calls right after
        // a restart could try to register the same metric twice and throw
        if (counter == null) {
            synchronized (counters) {
                counter = counters.get(name);
                if (counter == null) {
                    Counter register = Counter.build(name + "_" + nameSupplement, "help msg").register(registry);
                    counters.put(name, register);
                    return register;
                } else {
                    return counter;
                }
            }
        } else {
            return counter;
        }
    }

    public static Gauge registerGauge(String name) {
        Gauge gauge = gauges.get(name);
        if (gauge == null) {
            synchronized (gauges) {
                gauge = gauges.get(name);
                if (gauge == null) {
                    Gauge register = Gauge.build(name + "_" + nameSupplement, "help msg").register(registry);
                    gauges.put(name, register);
                    return register;
                } else {
                    return gauge;
                }
            }
        } else {
            return gauge;
        }
    }

    public static Summary registerSummary(String name) {
        Summary summary = summaries.get(name);
        if (summary == null) {
            synchronized (summaries) {
                summary = summaries.get(name);
                if (summary == null) {
                    Summary register = Summary.build(name + "_" + nameSupplement, "help msg").register(registry);
                    summaries.put(name, register);
                    return register;
                } else {
                    return summary;
                }
            }
        } else {
            return summary;
        }
    }

    public static Histogram registerHistogram(String name) {
        Histogram histogram = histograms.get(name);
        if (histogram == null) {
            synchronized (histograms) {
                histogram = histograms.get(name);
                if (histogram == null) {
                    Histogram register = Histogram.build(name + "_" + nameSupplement, "help msg").register(registry);
                    histograms.put(name, register);
                    return register;
                } else {
                    return histogram;
                }
            }
        } else {
            return histogram;
        }
    }
}

To use it, just pull in the jar; this single line then records a counter metric:

 MetricsProxy.registerCounter("common_business_monitor").inc();
  1. Every metric recorded this way carries the machine's IP (as its job label), so to query the real total across machines you have to wrap it in sum().
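For example (the metric name and environment suffix here are illustrative, following the naming scheme of the jar described above):

```promql
# each machine pushes under its own job = <ip> label, so sum across them
sum(common_business_monitor_prod)

# or inspect the per-machine breakdown
sum by (job) (common_business_monitor_prod)
```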

  2. I found that counter's increase function is not exact. The raw offset is accurate, but increase cannot handle the offset going negative after a machine restart. It estimates by extrapolation (linear regression or averaging, as I recall), so over short windows the prediction is off; the longer the window is stretched, and the more points recorded per interval, the more accurate it generally gets.
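A sketch of the effect in query form (metric name illustrative, as above):

```promql
# short range: increase() extrapolates from few points and can visibly miss
increase(common_business_monitor_prod[1m])

# long range: the extrapolation error is amortized and the estimate stabilizes
increase(common_business_monitor_prod[1h])
```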

That is all I can think of for now.
