Meituan CAT Source Code Analysis

Preface

CAT is Meituan's open-source monitoring system, with 14k+ stars on GitHub. It collects real-time metrics at a 300:1 monitoring ratio to provide rich performance metrics, health status, real-time alerting, and more.
The Meituan tech team has also published articles about CAT:
1.https://tech.meituan.com/2018/11/01/cat-in-depth-java-application-monitoring.html
2.https://tech.meituan.com/2018/11/01/cat-pr.html

Meituan CAT Architecture Diagram

Client

Let's start with how the client handles metric data. Below is a demo embedded in business code. First, Cat.newTransaction is called to create a transaction metric; then calls such as logEvent and logMetricForCount produce other metrics; finally the transaction status is set and complete() is called to finish the metric.

Transaction t = Cat.newTransaction("URL", "pageName");

try {
    Cat.logEvent("URL.Server", "serverIp", Event.SUCCESS, "ip=${serverIp}");
    Cat.logMetricForCount("metric.key");
    Cat.logMetricForDuration("metric.key", 5);

    yourBusiness();

    t.setStatus(Transaction.SUCCESS);
} catch (Exception e) {
    t.setStatus(e);
    Cat.logError(e);
} finally {
    t.complete();
}

Cat.newTransaction delegates to MessageProducer's newTransaction.

    public static Transaction newTransaction(String type, String name) {
        try {
            return Cat.getProducer().newTransaction(type, name);
        } catch (Exception e) {
            errorHandler(e);
            return NullMessage.TRANSACTION;
        }
    }

MessageProducer.newTransaction first checks whether the thread-local Context (or its MessageTree) is null; if so, it initializes one. Initialization simply decides whether this tree hits the sample, creates a Context, and stores it in the ThreadLocal. It then creates a DefaultTransaction, sets it on the message tree, pushes it onto the stack, and returns the transaction.

    @Override
    public Transaction newTransaction(String type, String name) {
        // this enable CAT client logging cat message without explicit setup
        if (!m_manager.hasContext()) {
            m_manager.setup();
        }

        DefaultTransaction transaction = new DefaultTransaction(type, name, m_manager);

        m_manager.start(transaction, false);
        return transaction;
    }

    @Override
    public boolean hasContext() {
        Context context = m_context.get();
        boolean has = context != null;

        if (has) {
            MessageTree tree = context.m_tree;

            if (tree == null) {
                return false;
            }
        }
        return has;
    }

    @Override
    public void setup() {
        Context ctx;

        if (m_domain != null) {
            ctx = new Context(m_domain.getId(), m_hostName, m_domain.getIp());
        } else {
            ctx = new Context("Unknown", m_hostName, "");
        }
        double samplingRate = m_configManager.getSampleRatio();

        if (samplingRate < 1.0 && hitSample(samplingRate)) {
            ctx.m_tree.setHitSample(true);
        }
        m_context.set(ctx);
    }

    @Override
    public void start(Transaction transaction, boolean forked) {
        Context ctx = getContext();

        if (ctx != null) {
            ctx.start(transaction, forked);

            if (transaction instanceof TaggedTransaction) {
                TaggedTransaction tt = (TaggedTransaction) transaction;

                m_taggedTransactions.put(tt.getTag(), tt);
            }
        } else if (m_firstMessage) {
            m_firstMessage = false;
            m_logger.warn("CAT client is not enabled because it's not initialized yet");
        }
    }

        public void start(Transaction transaction, boolean forked) {
            if (!m_stack.isEmpty()) {
                // Do NOT make strong reference from parent transaction to forked transaction.
                // Instead, we create a "soft" reference to forked transaction later, via linkAsRunAway()
                // By doing so, there is no need for synchronization between parent and child threads.
                // Both threads can complete() anytime despite the other thread.
                if (!(transaction instanceof ForkedTransaction)) {
                    Transaction parent = m_stack.peek();
                    addTransactionChild(transaction, parent);
                }
            } else {
                m_tree.setMessage(transaction);
            }

            if (!forked) {
                m_stack.push(transaction);
            }
        }
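The stack handling above is what turns flat start/end calls into a tree: the first transaction becomes the tree root, and every later one is attached as a child of whatever is currently on top of the stack. A stripped-down model of just that behavior (TreeBuilder/Node are hypothetical names for illustration, not CAT classes):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal model of CAT's stack-based message tree (hypothetical names).
public class TreeBuilder {
    public static class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final Deque<Node> stack = new ArrayDeque<>();
    private Node root;

    // Mirrors Context.start: the first node becomes the tree root,
    // later nodes become children of the node on top of the stack.
    public void start(String name) {
        Node node = new Node(name);
        if (stack.isEmpty()) {
            root = node;
        } else {
            stack.peek().children.add(node);
        }
        stack.push(node);
    }

    // Simplified Context.end: pop the current node; an empty stack is
    // the point where CAT would flush the whole tree to the server.
    public boolean end() {
        stack.pop();
        return stack.isEmpty();
    }

    public Node root() { return root; }

    public static void main(String[] args) {
        TreeBuilder b = new TreeBuilder();
        b.start("URL");            // root transaction
        b.start("SQL");            // nested child
        b.end();                   // SQL done, URL still open
        boolean flushed = b.end(); // URL done -> tree would be flushed
        System.out.println(flushed + " " + b.root().children.get(0).name);
    }
}
```

Because the parent/child link is established purely through the per-thread stack, nested instrumentation in business code needs no explicit wiring between transactions.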

Cat.logEvent calls newEvent to create an Event, sets nameValuePairs into its data field, sets the status field, and finally calls complete() to finish the metric. complete() sets the event's m_completed field to true and calls m_manager.add to attach the event to the current transaction. Other metric types are handled much like Event, so we won't go into detail.

    @Override
    public void logEvent(String type, String name, String status, String nameValuePairs) {
        Event event = newEvent(type, name);

        if (nameValuePairs != null && nameValuePairs.length() > 0) {
            event.addData(nameValuePairs);
        }

        event.setStatus(status);
        event.complete();
    }

    @Override
    public void complete() {
        setCompleted(true);

        if (m_manager != null) {
            m_manager.add(this);
        }
    }

    @Override
    public void add(Message message) {
        Context ctx = getContext();

        if (ctx != null) {
            ctx.add(message);
        }
    }

    public void add(Message message) {
            if (m_stack.isEmpty()) {
                MessageTree tree = m_tree.copy();

                tree.setMessage(message);
                flush(tree, true);
            } else {
                Transaction parent = m_stack.peek();

                addTransactionChild(message, parent);
            }
        }

Finally, look at the transaction's complete(). If the transaction has already completed, a BadInstrument event is added to its children; otherwise m_completed is set to true, m_manager.end is called to continue processing, and the ThreadLocal is cleared. m_manager.end in turn calls Context.end, which pops transactions off the stack until the stack is empty or the popped transaction is the current one. If the stack is then empty, manager.flush is called to send the message tree to the server.

@Override
    public void complete() {
        try {
            if (isCompleted()) {
                // complete() was called more than once
                DefaultEvent event = new DefaultEvent("cat", "BadInstrument");

                event.setStatus("TransactionAlreadyCompleted");
                event.complete();
                addChild(event);
            } else {
                if (m_durationInMicro == -1) {
                    m_durationInMicro = (System.nanoTime() - m_durationStart) / 1000L;
                }
                setCompleted(true);
                if (m_manager != null) {
                    m_manager.end(this);
                }
            }
        } catch (Exception e) {
            // ignore
        }
    }

    @Override
    public void end(Transaction transaction) {
        Context ctx = getContext();

        if (ctx != null && transaction.isStandalone()) {
            if (ctx.end(this, transaction)) {
                m_context.remove();
            }
        }
    }

    public boolean end(DefaultMessageManager manager, Transaction transaction) {
            if (!m_stack.isEmpty()) {
                Transaction current = m_stack.pop();

                if (transaction == current) {
                    m_validator.validate(m_stack.isEmpty() ? null : m_stack.peek(), current);
                } else {
                    while (transaction != current && !m_stack.empty()) {
                        m_validator.validate(m_stack.peek(), current);

                        current = m_stack.pop();
                    }
                }

                if (m_stack.isEmpty()) {
                    MessageTree tree = m_tree.copy();

                    m_tree.setMessageId(null);
                    m_tree.setMessage(null);

                    if (m_totalDurationInMicros > 0) {
                        adjustForTruncatedTransaction((Transaction) tree.getMessage());
                    }

                    manager.flush(tree, true);
                    return true;
                }
            }

            return false;
        }

Next, look at manager.flush: it calls m_transportManager.getSender to obtain the MessageSender (a TcpSocketSender), then calls its send method to send the message tree.

    public void flush(MessageTree tree, boolean clearContext) {
        MessageSender sender = m_transportManager.getSender();

        if (sender != null && isMessageEnabled()) {
            sender.send(tree);

            if (clearContext) {
                reset();
            }
        } else {
            m_throttleTimes++;

            if (m_throttleTimes % 10000 == 0 || m_throttleTimes == 1) {
                m_logger.info("Cat Message is throttled! Times:" + m_throttleTimes);
            }
        }
    }

TcpSocketSender.send checks whether the message tree should be sampled or discarded; if it should be sent, offer is called, which simply puts the tree onto a queue and returns.

    @Override
    public void send(MessageTree tree) {
        if (!m_configManager.isBlock()) {
            double sampleRatio = m_configManager.getSampleRatio();

            if (tree.canDiscard() && sampleRatio < 1.0 && (!tree.isHitSample())) {
                processTreeInClient(tree);
            } else {
                offer(tree);
            }
        }
    }

    private void offer(MessageTree tree) {
        if (m_configManager.isAtomicMessage(tree)) {
            boolean result = m_atomicQueue.offer(tree);

            if (!result) {
                logQueueFullInfo(tree);
            }
        } else {
            boolean result = m_queue.offer(tree);

            if (!result) {
                logQueueFullInfo(tree);
            }
        }
    }

A dedicated thread calls processNormalMessage, which takes message trees off the queue, serializes them, and calls writeAndFlush to send them to the server.

private void processNormalMessage() {
        while (true) {
            ChannelFuture channel = m_channelManager.channel();

            if (channel != null) {
                try {
                    MessageTree tree = m_queue.poll();

                    if (tree != null) {
                        sendInternal(channel, tree);
                        tree.setMessage(null);
                    } else {
                        try {
                            Thread.sleep(5);
                        } catch (Exception e) {
                            m_active = false;
                        }
                        break;
                    }
                } catch (Throwable t) {
                    m_logger.error("Error when sending message over TCP socket!", t);
                }
            } else {
                try {
                    Thread.sleep(5);
                } catch (Exception e) {
                    m_active = false;
                }
            }
        }
    }

public void sendInternal(ChannelFuture channel, MessageTree tree) {
        if (tree.getMessageId() == null) {
            tree.setMessageId(m_factory.getNextId());
        }

        ByteBuf buf = m_codec.encode(tree);

        int size = buf.readableBytes();

        channel.channel().writeAndFlush(buf);

        if (m_statistics != null) {
            m_statistics.onBytes(size);
        }
    }

Server

On the server side, the counterpart of the client's TcpSocketSender is TcpSocketReceiver. When a message tree arrives at the server, it is parsed in the decode method, which then calls m_handler.handle for processing.

        protected void decode(ChannelHandlerContext ctx, ByteBuf buffer, List<Object> out) throws Exception {
            if (buffer.readableBytes() < 4) {
                return;
            }
            buffer.markReaderIndex();
            int length = buffer.readInt();
            buffer.resetReaderIndex();
            if (buffer.readableBytes() < length + 4) {
                return;
            }
            try {
                if (length > 0) {
                    ByteBuf readBytes = buffer.readBytes(length + 4);

                    readBytes.markReaderIndex();
                    //readBytes.readInt();

                    DefaultMessageTree tree = (DefaultMessageTree) CodecHandler.decode(readBytes);

                    // readBytes.retain();
                    readBytes.resetReaderIndex();
                    tree.setBuffer(readBytes);
                    m_handler.handle(tree);
                    m_processCount++;

                    long flag = m_processCount % CatConstants.SUCCESS_COUNT;

                    if (flag == 0) {
                        m_serverStateManager.addMessageTotal(CatConstants.SUCCESS_COUNT);
                    }
                } else {
                    // client message is error
                    buffer.readBytes(length);
                    BufReleaseHelper.release(buffer);
                }
            } catch (Exception e) {
                m_serverStateManager.addMessageTotalLoss(1);
                m_logger.error(e.getMessage(), e);
            }
        }
    }
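decode above expects frames consisting of a 4-byte big-endian length followed by that many payload bytes, and returns early when the frame is incomplete. A minimal sketch of that framing with plain java.nio (the real CAT codec encodes much more structure inside the payload; `Framing` is a hypothetical name):

```java
import java.nio.ByteBuffer;

// Minimal model of the length-prefixed framing decode() expects:
// a 4-byte big-endian length, then `length` bytes of payload.
public class Framing {
    public static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length); // the prefix buffer.readInt() consumes
        buf.put(payload);
        return buf.array();
    }

    // Returns the payload if a whole frame is available, else null,
    // mirroring decode()'s "readableBytes() < length + 4" early return.
    public static byte[] unframe(byte[] data) {
        if (data.length < 4) {
            return null; // not even the length prefix yet
        }
        ByteBuffer buf = ByteBuffer.wrap(data);
        int length = buf.getInt();
        if (buf.remaining() < length) {
            return null; // incomplete frame, wait for more bytes
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        byte[] framed = frame("tree".getBytes());
        System.out.println(new String(unframe(framed)));
    }
}
```

This is also why decode marks and resets the reader index: if only part of a frame has arrived, the buffer must be left untouched for the next read event.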
 
 

In DefaultMessageHandler's handle method, if m_consumer has not been initialized, lookup is called to initialize it; then its consume method is invoked.

    @Override
    public void handle(MessageTree tree) {
        if (m_consumer == null) {
            m_consumer = lookup(MessageConsumer.class);
        }

        try {
            m_consumer.consume(tree);
        } catch (Throwable e) {
            m_logger.error("Error when consuming message in " + m_consumer + "! tree: " + tree, e);
        }
    }

In RealtimeConsumer's consume method, the message tree's timestamp is passed to m_periodManager.findPeriod to locate the corresponding Period, whose distribute method then processes the tree.

    @Override
    public void consume(MessageTree tree) {
        long timestamp = tree.getMessage().getTimestamp();
        Period period = m_periodManager.findPeriod(timestamp);

        if (period != null) {
            period.distribute(tree);
        } else {
            m_serverStateManager.addNetworkTimeError(1);
        }
    }

Worth highlighting here is how PeriodManager manages Periods: it continually checks the current time to decide whether the next Period needs to be pre-started and whether the previous, already finished Period needs to be ended.

@Override
    public void run() {
        while (m_active) {
            try {
                long now = System.currentTimeMillis();
                long value = m_strategy.next(now);

                if (value > 0) {
                    startPeriod(value);
                } else if (value < 0) {
                    // last period is over,make it asynchronous
                    Threads.forGroup("cat").start(new EndTaskThread(-value));
                }
            } catch (Throwable e) {
                Cat.logError(e);
            }

            try {
                Thread.sleep(1000L);
            } catch (InterruptedException e) {
                break;
            }
        }
    }

    private void startPeriod(long startTime) {
        long endTime = startTime + m_strategy.getDuration();
        Period period = new Period(startTime, endTime, m_analyzerManager, m_serverStateManager, m_logger);

        m_periods.add(period);
        period.start();
    }

    private void endPeriod(long startTime) {
        int len = m_periods.size();

        for (int i = 0; i < len; i++) {
            Period period = m_periods.get(i);

            if (period.isIn(startTime)) {
                period.finish();
                m_periods.remove(i);
                break;
            }
        }
    }
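The contract of m_strategy.next is only visible here through its signed return value: a positive value is the start time of a period to create, a negative value is -startTime of a period to finish, and 0 means nothing to do. A simplified, hypothetical model of that contract for hour-long periods (HourlyStrategy is not a CAT class, and the real strategy also pre-starts periods slightly ahead of time):

```java
// Hypothetical, simplified model of m_strategy.next()'s signed contract.
public class HourlyStrategy {
    static final long HOUR = 60 * 60 * 1000L;
    private long lastStarted = -1; // start time of the newest period created
    private long lastEnded = -1;   // start time of the newest period ended

    public long next(long now) {
        long currentStart = now - now % HOUR;
        if (currentStart > lastStarted) {
            lastStarted = currentStart;
            return currentStart;           // positive: start this period
        }
        long previousStart = currentStart - HOUR;
        if (lastEnded >= 0 && previousStart > lastEnded) {
            lastEnded = previousStart;
            return -previousStart;         // negative: finish the old period
        }
        if (lastEnded < 0) {
            lastEnded = previousStart;     // initialize on first call
        }
        return 0;                          // nothing to do
    }
}
```

Encoding both actions in one signed long keeps the PeriodManager loop to a single strategy call per second.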

Back to the main thread: period.distribute iterates over the PeriodTasks of every analyzer type. When a type has multiple PeriodTasks, the domain is hashed to pick one of them, and that task's enqueue method is called, which puts the message tree onto m_queue.

public void distribute(MessageTree tree) {
        m_serverStateManager.addMessageTotal(tree.getDomain(), 1);
        boolean success = true;
        String domain = tree.getDomain();

        for (Entry<String, List<PeriodTask>> entry : m_tasks.entrySet()) {
            List<PeriodTask> tasks = entry.getValue();
            int length = tasks.size();
            int index = 0;
            boolean manyTasks = length > 1;

            if (manyTasks) {
                index = Math.abs(domain.hashCode()) % length;
            }
            PeriodTask task = tasks.get(index);
            boolean enqueue = task.enqueue(tree);

            if (!enqueue) {
                if (manyTasks) {
                    task = tasks.get((index + 1) % length);
                    enqueue = task.enqueue(tree);

                    if (!enqueue) {
                        success = false;
                    }
                } else {
                    success = false;
                }
            }
        }

        if ((!success) && (!tree.isProcessLoss())) {
            m_serverStateManager.addMessageTotalLoss(tree.getDomain(), 1);

            tree.setProcessLoss(true);
        }
    }

    public boolean enqueue(MessageTree tree) {
        if (m_analyzer.isEligable(tree)) {
            boolean result = m_queue.offer(tree);

            if (!result) { // trace queue overflow
                m_queueOverflow++;

                if (m_queueOverflow % (10 * CatConstants.ERROR_COUNT) == 0) {
                    String date = new SimpleDateFormat("yyyy-MM-dd HH:mm").format(new Date(m_analyzer.getStartTime()));
                    m_logger.warn(m_analyzer.getClass().getSimpleName() + " queue overflow number " + m_queueOverflow + " analyzer time:" + date);
                }
            }
            return result;
        } else {
            return true;
        }
    }
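The routing rule in distribute can be isolated into a few lines: a domain always hashes to the same task index, so all message trees of one domain land on the same analyzer thread and per-domain report state needs no cross-thread locking. A sketch with a hypothetical helper name:

```java
// Minimal model of distribute()'s task selection: the same domain always
// maps to the same PeriodTask index (the source retries the next index
// once if that task's queue is full).
public class TaskSelector {
    public static int select(String domain, int taskCount) {
        if (taskCount <= 1) {
            return 0;
        }
        return Math.abs(domain.hashCode()) % taskCount;
    }

    public static void main(String[] args) {
        // The same domain is always routed to the same analyzer task.
        System.out.println(select("order-service", 4) == select("order-service", 4));
    }
}
```

One wrinkle of this widely used pattern, which the source shares: Math.abs(Integer.MIN_VALUE) is still negative, so a domain whose hashCode is exactly Integer.MIN_VALUE would yield a negative index.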

PeriodTask repeatedly calls m_analyzer.analyze to process the message trees in m_queue.

    @Override
    public void run() {
        try {
            m_analyzer.analyze(m_queue);
        } catch (Exception e) {
            Cat.logError(e);
        }
    }

In AbstractMessageAnalyzer's analyze method, the abstract process method is called to continue.

@Override
    public void analyze(MessageQueue queue) {
        while (!isTimeout() && isActive()) {
            MessageTree tree = queue.poll();

            if (tree != null) {
                try {
                    process(tree);
                } catch (Throwable e) {
                    m_errors++;

                    if (m_errors == 1 || m_errors % 10000 == 0) {
                        Cat.logError(e);
                    }
                }
            }
        }

        while (true) {
            MessageTree tree = queue.poll();

            if (tree != null) {
                try {
                    process(tree);
                } catch (Throwable e) {
                    m_errors++;

                    if (m_errors == 1 || m_errors % 10000 == 0) {
                        Cat.logError(e);
                    }
                }
            } else {
                break;
            }
        }
    }

Let's focus on two analyzers, TransactionAnalyzer and DumpAnalyzer, the core analyzers behind reports and logviews respectively.

TransactionAnalyzer's process method fetches the report for the current hour, then calls processTransaction or processBatchTransaction.

    @Override
    public void process(MessageTree tree) {
        String domain = tree.getDomain();
        TransactionReport report = m_reportManager.getHourlyReport(getStartTime(), domain, true);
        List<Transaction> transactions = tree.findOrCreateTransactions();

        for (Transaction t : transactions) {
            String data = String.valueOf(t.getData());

            if (data.length() > 0 && data.charAt(0) == CatConstants.BATCH_FLAG) {
                processBatchTransaction(tree, report, t, data);
            } else {
                processTransaction(report, tree, t);
            }
        }

        if (System.currentTimeMillis() > m_nextClearTime) {
            m_nextClearTime = m_nextClearTime + TimeHelper.ONE_MINUTE;

            Threads.forGroup("cat").start(new Runnable() {

                @Override
                public void run() {
                    cleanUpReports();
                }
            });
        }
    }

Taking processTransaction as the example: it first calls findOrCreateMachine to find the corresponding Machine, then findOrCreateType for the TransactionType, then findOrCreateName for the TransactionName, and finally processTypeAndName to continue.

    private void processTransaction(TransactionReport report, MessageTree tree, Transaction t) {
        String type = t.getType();
        String name = t.getName();

        if (!m_filterConfigManager.discardTransaction(type, name)) {
            boolean valid = checkForTruncatedMessage(tree, t);

            if (valid) {
                String ip = tree.getIpAddress();
                TransactionType transactionType = findOrCreateType(report.findOrCreateMachine(ip), type);
                TransactionName transactionName = findOrCreateName(transactionType, name, report.getDomain());

                processTypeAndName(t, transactionType, transactionName, tree, t.getDurationInMillis());
            }
        }
    }

First look at the three methods findOrCreateMachine, findOrCreateType, and findOrCreateName. The containment relationship is machine -> type -> name.

   public Machine findOrCreateMachine(String ip) {
      Machine machine = m_machines.get(ip);

      if (machine == null) {
         synchronized (m_machines) {
            machine = m_machines.get(ip);

            if (machine == null) {
               machine = new Machine(ip);
               m_machines.put(ip, machine);
            }
         }
      }

      return machine;
   }

   public TransactionType findOrCreateType(String id) {
      TransactionType type = m_types.get(id);

      if (type == null) {
         synchronized (m_types) {
            type = m_types.get(id);

            if (type == null) {
               type = new TransactionType(id);
               m_types.put(id, type);
            }
         }
      }

      return type;
   }

  public TransactionName findOrCreateName(String id) {
      TransactionName name = m_names.get(id);

      if (name == null) {
         synchronized (m_names) {
            name = m_names.get(id);

            if (name == null) {
               name = new TransactionName(id);
               m_names.put(id, name);
            }
         }
      }

      return name;
   }

Next, processTypeAndName aggregates the overall metrics of the transaction report: total count, failure count, max/min duration, total duration, counts per duration bucket, and so on. It then calls processNameGraph and processTypeRange to continue.

private void processTypeAndName(Transaction t, TransactionType type, TransactionName name, MessageTree tree,
                            double duration) {
        String messageId = tree.getMessageId();

        type.incTotalCount();
        name.incTotalCount();

        type.setSuccessMessageUrl(messageId);
        name.setSuccessMessageUrl(messageId);

        if (!t.isSuccess()) {
            type.incFailCount();
            name.incFailCount();

            String statusCode = formatStatus(t.getStatus());

            findOrCreateStatusCode(name, statusCode).incCount();
        }

        int allDuration = DurationComputer.computeDuration((int) duration);
        double sum = duration * duration;

        if (type.getMax() <= duration) {
            type.setLongestMessageUrl(messageId);
        }
        if (name.getMax() <= duration) {
            name.setLongestMessageUrl(messageId);
        }
        name.setMax(Math.max(name.getMax(), duration));
        name.setMin(Math.min(name.getMin(), duration));
        name.setSum(name.getSum() + duration);
        name.setSum2(name.getSum2() + sum);
        name.findOrCreateAllDuration(allDuration).incCount();

        type.setMax(Math.max(type.getMax(), duration));
        type.setMin(Math.min(type.getMin(), duration));
        type.setSum(type.getSum() + duration);
        type.setSum2(type.getSum2() + sum);
        type.findOrCreateAllDuration(allDuration).incCount();

        long current = t.getTimestamp() / 1000 / 60;
        int min = (int) (current % (60));
        boolean statistic = m_statisticManager.shouldStatistic(type.getId(), tree.getDomain());

        processNameGraph(t, name, min, duration, statistic, allDuration);
        processTypeRange(t, type, min, duration, statistic, allDuration);
    }
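Keeping sum and sum2 (the sum of squared durations) is what lets the report recover mean and standard deviation later without retaining individual samples: mean = sum/count and variance = sum2/count - mean^2. A minimal sketch of the idea (RunningStats is a hypothetical name, not a CAT class):

```java
// Why the report keeps both sum and sum2: count, sum and the sum of
// squares are enough to recover mean and standard deviation.
public class RunningStats {
    private long count;
    private double sum;
    private double sum2;

    public void add(double duration) {
        count++;
        sum += duration;
        sum2 += duration * duration;
    }

    public double mean() {
        return sum / count;
    }

    public double std() {
        double m = mean();
        // population variance: E[X^2] - (E[X])^2
        return Math.sqrt(sum2 / count - m * m);
    }

    public static void main(String[] args) {
        RunningStats s = new RunningStats();
        s.add(10); s.add(20); s.add(30);
        System.out.println(s.mean() + " " + s.std());
    }
}
```

This is the standard trick for streaming aggregation: two running doubles per bucket replace an unbounded list of samples.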

processNameGraph and processTypeRange mainly aggregate the per-minute statistics: count, failure count, total duration, max duration, min duration, and counts per duration bucket.

    private void processNameGraph(Transaction t, TransactionName name, int min, double d, boolean statistic,
                            int allDuration) {
        int dk = formatDurationDistribute(d);

        Duration duration = name.findOrCreateDuration(dk);
        Range range = name.findOrCreateRange(min);

        duration.incCount();
        range.incCount();

        if (!t.isSuccess()) {
            range.incFails();
        }

        range.setSum(range.getSum() + d);
        range.setMax(Math.max(range.getMax(), d));
        range.setMin(Math.min(range.getMin(), d));

        if (statistic) {
            range.findOrCreateAllDuration(allDuration).incCount();
        }
    }

private void processTypeRange(Transaction t, TransactionType type, int min, double d, boolean statistic,
                            int allDuration) {
        Range2 range = type.findOrCreateRange2(min);

        if (!t.isSuccess()) {
            range.incFails();
        }

        range.incCount();
        range.setSum(range.getSum() + d);
        range.setMax(Math.max(range.getMax(), d));
        range.setMin(Math.min(range.getMin(), d));

        if (statistic) {
            range.findOrCreateAllDuration(allDuration).incCount();
        }
    }

A note on report storage: when a PeriodTask finishes, m_analyzer.doCheckpoint is called to persist the report to file and to the database.

    public void finish() {
        try {
            m_analyzer.doCheckpoint(true);
            m_analyzer.destroy();
        } catch (Exception e) {
            Cat.logError(e);
        }
    }

In TransactionAnalyzer's doCheckpoint method, if this is the final checkpoint and the server is not in local mode, the report is stored to both file and database; otherwise only to file.

    @Override
    public synchronized void doCheckpoint(boolean atEnd) {
        if (atEnd && !isLocalMode()) {
            m_reportManager.storeHourlyReports(getStartTime(), StoragePolicy.FILE_AND_DB, m_index);
        } else {
            m_reportManager.storeHourlyReports(getStartTime(), StoragePolicy.FILE, m_index);
        }
    }

Continuing with DefaultReportManager's storeHourlyReports method: besides invoking hook methods, it calls storeFile to persist the reports to file and storeDatabase to persist them to the database.

    @Override
    public void storeHourlyReports(long startTime, StoragePolicy policy, int index) {
        Transaction t = Cat.newTransaction("Checkpoint", m_name);
        Map<String, T> reports = m_reports.get(startTime);
        ReportBucket bucket = null;

        try {
            t.addData("reports", reports == null ? 0 : reports.size());

            if (reports != null) {
                Set<String> errorDomains = new HashSet<String>();

                for (String domain : reports.keySet()) {
                    if (!m_validator.validate(domain)) {
                        errorDomains.add(domain);
                    }
                }
                for (String domain : errorDomains) {
                    reports.remove(domain);
                }
                if (!errorDomains.isEmpty()) {
                    m_logger.info("error domain:" + errorDomains);
                }

                m_reportDelegate.beforeSave(reports);

                if (policy.forFile()) {
                    bucket = m_bucketManager.getReportBucket(startTime, m_name, index);

                    try {
                        storeFile(reports, bucket);
                    } finally {
                        m_bucketManager.closeBucket(bucket);
                    }
                }

                if (policy.forDatabase()) {
                    storeDatabase(startTime, reports);
                }
            }
            t.setStatus(Message.SUCCESS);
        } catch (Throwable e) {
            Cat.logError(e);
            t.setStatus(e);
            m_logger.error(String.format("Error when storing %s reports of %s!", m_name, new Date(startTime)), e);
        } finally {
            cleanup(startTime);
            t.complete();

            if (bucket != null) {
                m_bucketManager.closeBucket(bucket);
            }
        }
    }

storeFile serializes each report to XML, stores it in m_writeDataFile, and writes the corresponding index entry to m_writeIndexFile.

    private void storeFile(Map<String, T> reports, ReportBucket bucket) {
        for (T report : reports.values()) {
            try {
                String domain = m_reportDelegate.getDomain(report);
                String xml = m_reportDelegate.buildXml(report);

                bucket.storeById(domain, xml);
            } catch (Exception e) {
                Cat.logError(e);
            }
        }
    }

    @Override
    public boolean storeById(String id, String report) throws IOException {
        byte[] content = report.getBytes("utf-8");
        int length = content.length;
        byte[] num = String.valueOf(length).getBytes("utf-8");

        m_writeLock.lock();

        try {
            m_writeDataFile.write(num);
            m_writeDataFile.write('\n');
            m_writeDataFile.write(content);
            m_writeDataFile.write('\n');
            m_writeDataFile.flush();

            long offset = m_writeDataFileLength;
            String line = id + '\t' + offset + '\n';
            byte[] data = line.getBytes("utf-8");

            m_writeDataFileLength += num.length + 1 + length + 1;
            m_writeIndexFile.write(data);
            m_writeIndexFile.flush();
            m_idToOffsets.put(id, offset);
            return true;
        } finally {
            m_writeLock.unlock();
        }
    }
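Based on the writer above, each data-file record is "length\ncontent\n", and the index file maps an id to the record's byte offset. A hypothetical reader sketch for this format (assuming UTF-8, as in the writer; ReportFile is an illustrative name, not a CAT class):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the record format storeById() writes: "<length>\n<content>\n",
// addressed by the byte offset recorded in the index file.
public class ReportFile {
    // Append one record; returns the offset the index entry would store.
    public static long append(RandomAccessFile data, String report) throws IOException {
        long offset = data.length();
        data.seek(offset);
        byte[] content = report.getBytes(StandardCharsets.UTF_8);
        data.write(String.valueOf(content.length).getBytes(StandardCharsets.UTF_8));
        data.write('\n');
        data.write(content);
        data.write('\n');
        return offset;
    }

    // Read the record at `offset`: parse the length line, then read
    // exactly that many bytes of content.
    public static String read(RandomAccessFile data, long offset) throws IOException {
        data.seek(offset);
        StringBuilder len = new StringBuilder();
        int b;
        while ((b = data.read()) != '\n') {
            len.append((char) b);
        }
        byte[] content = new byte[Integer.parseInt(len.toString())];
        data.readFully(content);
        return new String(content, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("report", ".dat");
        try (RandomAccessFile data = new RandomAccessFile(path.toFile(), "rw")) {
            append(data, "<report domain=\"a\"/>");
            long second = append(data, "<report domain=\"b\"/>");
            System.out.println(read(data, second));
        } finally {
            Files.delete(path);
        }
    }
}
```

The explicit length prefix is what makes random access by offset safe even though report contents may themselves contain newlines.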

storeDatabase first writes the HourlyReport row to the db, then writes the report content as an HourlyReportContent row, and then calls m_reportDelegate.createHourlyTask to create the weekly, monthly, and other aggregation report tasks.

private void storeDatabase(long startTime, Map<String, T> reports) {
        Date period = new Date(startTime);
        String ip = NetworkInterfaceManager.INSTANCE.getLocalHostAddress();

        for (T report : reports.values()) {
            try {
                String domain = m_reportDelegate.getDomain(report);
                HourlyReport r = m_reportDao.createLocal();

                r.setName(m_name);
                r.setDomain(domain);
                r.setPeriod(period);
                r.setIp(ip);
                r.setType(1);

                m_reportDao.insert(r);

                int id = r.getId();
                byte[] binaryContent = m_reportDelegate.buildBinary(report);
                HourlyReportContent content = m_reportContentDao.createLocal();

                content.setReportId(id);
                content.setContent(binaryContent);
                content.setPeriod(period);
                m_reportContentDao.insert(content);
                m_reportDelegate.createHourlyTask(report);
            } catch (Throwable e) {
                Cat.getProducer().logError(e);
            }
        }
    }

Continuing with createHourlyTask: tasks for the different aggregation periods are created and stored in the db, waiting for worker threads to pick them up.

    @Override
    public boolean createHourlyTask(TransactionReport report) {
        String domain = report.getDomain();

        if (domain.equals(Constants.ALL) || m_configManager.validateDomain(domain)) {
            return m_taskManager.createTask(report.getStartTime(), domain, TransactionAnalyzer.ID,
                  TaskProlicy.ALL_EXCLUED_HOURLY);
        } else {
            return true;
        }
    }

    public boolean createTask(Date period, String domain, String name, TaskCreationPolicy prolicy) {
        try {
            if (prolicy.shouldCreateHourlyTask()) {
                insertToDatabase(period, domain, name, REPORT_HOUR);
            }

            Calendar cal = Calendar.getInstance();
            cal.setTime(period);

            int hour = cal.get(Calendar.HOUR_OF_DAY);
            cal.add(Calendar.HOUR_OF_DAY, -hour);
            Date currentDay = cal.getTime();

            if (prolicy.shouldCreateDailyTask()) {
                insertToDatabase(new Date(currentDay.getTime() - ONE_DAY), domain, name, REPORT_DAILY);
            }

            if (prolicy.shouldCreateWeeklyTask()) {
                int dayOfWeek = cal.get(Calendar.DAY_OF_WEEK);
                if (dayOfWeek == 7) {
                    insertToDatabase(new Date(currentDay.getTime() - 7 * ONE_DAY), domain, name, REPORT_WEEK);
                }
            }
            if (prolicy.shouldCreateMonthTask()) {
                int dayOfMonth = cal.get(Calendar.DAY_OF_MONTH);

                if (dayOfMonth == 1) {
                    cal.add(Calendar.MONTH, -1);
                    insertToDatabase(cal.getTime(), domain, name, REPORT_MONTH);
                }
            }
            return true;
        } catch (DalException e) {
            Cat.logError(e);
            return false;
        }
    }

    protected void insertToDatabase(Date period, String domain, String name, int reportType) throws DalException {
        Task task = m_taskDao.createLocal();
        task.setCreationDate(new Date());
        task.setProducer(NetworkInterfaceManager.INSTANCE.getLocalHostAddress());
        task.setReportDomain(domain);
        task.setReportName(name);
        task.setReportPeriod(period);
        task.setStatus(STATUS_TODO);
        task.setTaskType(reportType);
        m_taskDao.insert(task);
    }
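
The Calendar arithmetic above is easy to misread: it truncates the hourly period to the start of its day, then creates the daily task for the previous day, the weekly task only on Saturdays (Calendar.DAY_OF_WEEK runs 1 = Sunday .. 7 = Saturday, so `dayOfWeek == 7` fires on Saturdays) covering the previous 7 days, and the monthly task only on the 1st covering the previous month. A minimal sketch of the same boundaries using java.time (not CAT code, just an illustration):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Illustrative re-statement of TaskManager.createTask's period math.
public class TaskPeriods {
    // The hourly period truncated to the start of its day.
    static LocalDate dayOf(LocalDateTime period) {
        return period.toLocalDate();
    }

    // The daily task covers the previous day.
    static LocalDate dailyPeriod(LocalDateTime period) {
        return dayOf(period).minusDays(1);
    }

    // The weekly task is only created on Saturdays and covers the last 7 days.
    static LocalDate weeklyPeriod(LocalDateTime period) {
        LocalDate day = dayOf(period);
        return day.getDayOfWeek() == DayOfWeek.SATURDAY ? day.minusDays(7) : null;
    }

    // The monthly task is only created on the 1st and covers the previous month.
    static LocalDate monthlyPeriod(LocalDateTime period) {
        LocalDate day = dayOf(period);
        return day.getDayOfMonth() == 1 ? day.minusMonths(1) : null;
    }
}
```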

In TaskConsumer, the run loop keeps fetching pending tasks and hands each one to processTask.

@Override
    public void run() {
        String localIp = getLoaclIp();
        while (m_running) {
            try {
                if (checkTime()) {
                    Task task = findDoingTask(localIp);
                    if (task == null) {
                        task = findTodoTask();
                    }
                    boolean again = false;
                    if (task != null) {
                        try {
                            task.setConsumer(localIp);
                            if (task.getStatus() == TaskConsumer.STATUS_DOING || updateTodoToDoing(task)) {
                                int retryTimes = 0;
                                while (!processTask(task)) {
                                    retryTimes++;
                                    if (retryTimes < MAX_TODO_RETRY_TIMES) {
                                        taskRetryDuration();
                                    } else {
                                        updateDoingToFailure(task);
                                        again = true;
                                        break;
                                    }
                                }
                                if (!again) {
                                    updateDoingToDone(task);
                                }
                            }
                        } catch (Throwable e) {
                            Cat.logError(task.toString(), e);
                        }
                    } else {
                        taskNotFoundDuration();
                    }
                } else {
                    try {
                        Thread.sleep(60 * 1000);
                    } catch (InterruptedException e) {
                        // Ignore
                    }
                }
            } catch (Throwable e) {
                Cat.logError(e);
            }
        }
        m_stopped = true;
    }

processTask delegates to m_reportFacade.builderReport, which ultimately dispatches to the matching buildHourlyTask / buildDailyTask / buildWeeklyTask / buildMonthlyTask method.

@Override
    protected boolean processTask(Task doing) {
        boolean result = false;
        Transaction t = Cat.newTransaction("Task", doing.getReportName());

        t.addData(doing.toString());
        try {
            result = m_reportFacade.builderReport(doing);
            t.setStatus(Transaction.SUCCESS);
        } catch (Throwable e) {
            Cat.logError(e);
            t.setStatus(e);
        } finally {
            t.complete();
        }
        return result;
    }

    public boolean builderReport(Task task) {
        try {
            if (task == null) {
                return false;
            }
            int type = task.getTaskType();
            String reportName = task.getReportName();
            String reportDomain = task.getReportDomain();
            Date reportPeriod = task.getReportPeriod();
            TaskBuilder reportBuilder = getReportBuilder(reportName);

            if (reportBuilder == null) {
                Cat.logError(new RuntimeException("no report builder for type:" + " " + reportName));
                return false;
            } else {
                boolean result = false;

                if (type == TaskManager.REPORT_HOUR) {
                    result = reportBuilder.buildHourlyTask(reportName, reportDomain, reportPeriod);
                } else if (type == TaskManager.REPORT_DAILY) {
                    result = reportBuilder.buildDailyTask(reportName, reportDomain, reportPeriod);
                } else if (type == TaskManager.REPORT_WEEK) {
                    result = reportBuilder.buildWeeklyTask(reportName, reportDomain, reportPeriod);
                } else if (type == TaskManager.REPORT_MONTH) {
                    result = reportBuilder.buildMonthlyTask(reportName, reportDomain, reportPeriod);
                }
                if (result) {
                    return result;
                } else {
                    m_logger.error(task.toString());
                }
            }
        } catch (Exception e) {
            m_logger.error("Error when building report," + e.getMessage(), e);
            Cat.logError(e);
            return false;
        }
        return false;
    }

Take TransactionReportBuilder's buildMonthlyTask method as an example: it calls queryDailyReportsByDuration to obtain the aggregated monthly report, then builds a MonthlyReport and writes it to the database.

    @Override
    public boolean buildMonthlyTask(String name, String domain, Date period) {
        Date end = null;

        if (period.equals(TimeHelper.getCurrentMonth())) {
            end = TimeHelper.getCurrentDay();
        } else {
            end = TaskHelper.nextMonthStart(period);
        }
        TransactionReport transactionReport = queryDailyReportsByDuration(domain, period, end);
        MonthlyReport report = new MonthlyReport();

        report.setCreationDate(new Date());
        report.setDomain(domain);
        report.setIp(NetworkInterfaceManager.INSTANCE.getLocalHostAddress());
        report.setName(name);
        report.setPeriod(period);
        report.setType(1);
        byte[] binaryContent = DefaultNativeBuilder.build(transactionReport);
        return m_reportService.insertMonthlyReport(report, binaryContent);
    }

Inside queryDailyReportsByDuration, TransactionReportDailyGraphCreator builds the GraphTrend (the per-day trend graph), HistoryTransactionReportMerger merges the daily reports into a month-level, per-machine TransactionReport, and TransactionReportCountFilter aggregates the overall metrics across all machines. The code relies heavily on the visitor pattern; interested readers can dig into the source for the details.

    private TransactionReport queryDailyReportsByDuration(String domain, Date start, Date end) {
        long startTime = start.getTime();
        long endTime = end.getTime();
        double duration = (end.getTime() - start.getTime()) * 1.0 / TimeHelper.ONE_DAY;

        HistoryTransactionReportMerger merger = new HistoryTransactionReportMerger(new TransactionReport(domain)).setDuration(duration);
        TransactionReport transactionReport = merger.getTransactionReport();

        TransactionReportDailyGraphCreator creator = new TransactionReportDailyGraphCreator(transactionReport, (int) duration, start);

        for (; startTime < endTime; startTime += TimeHelper.ONE_DAY) {
            try {
                TransactionReport reportModel = m_reportService.queryReport(domain, new Date(startTime), new Date(startTime + TimeHelper.ONE_DAY));
                creator.createGraph(reportModel);
                reportModel.accept(merger);
            } catch (Exception e) {
                Cat.logError(e);
            }
        }
        transactionReport.setStartTime(start);
        transactionReport.setEndTime(end);

        new TransactionReportCountFilter(m_serverConfigManager.getMaxTypeThreshold(),
                m_atomicMessageConfigManager.getMaxNameThreshold(domain), m_serverConfigManager.getTypeNameLengthLimit())
                .visitTransactionReport(transactionReport);
        return transactionReport;
    }
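
The visitor-style merging above can be illustrated with a stripped-down sketch (hypothetical classes, not CAT's real report model): each report `accept`s a merger, the merger visits it, and totals accumulate into one aggregate.

```java
import java.util.List;

// Minimal visitor-pattern sketch of HistoryTransactionReportMerger's role.
public class VisitorMergeSketch {
    static class Report {
        final long count;
        Report(long count) { this.count = count; }
        // The element drives the traversal by handing itself to the visitor.
        void accept(Merger m) { m.visitReport(this); }
    }

    static class Merger {
        private long total;
        void visitReport(Report r) { total += r.count; }
        long getTotal() { return total; }
    }

    // Mirrors the loop in queryDailyReportsByDuration: each daily report
    // is visited by the same merger, which accumulates the month total.
    static long mergeAll(List<Report> dailyReports) {
        Merger merger = new Merger();
        for (Report r : dailyReports) {
            r.accept(merger);
        }
        return merger.getTotal();
    }
}
```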

home

home is CAT's management console, where you can view reports and logviews. Below we analyze how it handles three kinds of queries: the current report, historical reports, and logview.

First, look at Handler.handleOutbound under transaction. It dispatches on the request's action and renders the result: querying the current report corresponds to HOURLY_REPORT, and querying historical reports corresponds to HISTORY_REPORT.

@Override
    @OutboundActionMeta(name = "t")
    public void handleOutbound(Context ctx) throws ServletException, IOException {
        Cat.logMetricForCount("http-request-transaction");
        
        Model model = new Model(ctx);
        Payload payload = ctx.getPayload();

        normalize(model, payload);
        String domain = payload.getDomain();
        Action action = payload.getAction();
        String ipAddress = payload.getIpAddress();
        String group = payload.getGroup();
        String type = payload.getType();
        String name = payload.getName();
        String ip = payload.getIpAddress();
        Date start = payload.getHistoryStartDate();
        Date end = payload.getHistoryEndDate();

        if (StringUtils.isEmpty(group)) {
            group = m_configManager.queryDefaultGroup(domain);
            payload.setGroup(group);
        }
        model.setGroupIps(m_configManager.queryIpByDomainAndGroup(domain, group));
        model.setGroups(m_configManager.queryDomainGroup(payload.getDomain()));

        switch (action) {
        case HOURLY_REPORT:
            TransactionReport report = getHourlyReport(payload);
            report = m_mergeHelper.mergeAllMachines(report, ipAddress);

            if (report != null) {
                model.setReport(report);
                buildTransactionMetaInfo(model, payload, report);
            }
            break;
        case HISTORY_REPORT:
            report = m_reportService.queryReport(domain, payload.getHistoryStartDate(), payload.getHistoryEndDate());
            report = m_mergeHelper.mergeAllMachines(report, ipAddress);

            if (report != null) {
                model.setReport(report);
                buildTransactionMetaInfo(model, payload, report);
            }
            break;
        case HISTORY_GRAPH:
            report = m_reportService.queryReport(domain, start, end);

            if (Constants.ALL.equalsIgnoreCase(ip)) {
                buildDistributionInfo(model, type, name, report);
            }

            report = m_mergeHelper.mergeAllMachines(report, ip);
            new TransactionTrendGraphBuilder().buildTrendGraph(model, payload, report);
            break;
        case GRAPHS:
            report = getHourlyGraphReport(model, payload);

            if (Constants.ALL.equalsIgnoreCase(ipAddress)) {
                buildDistributionInfo(model, type, name, report);
            }
            if (name == null || name.length() == 0) {
                name = Constants.ALL;
            }

            report = m_mergeHelper.mergeAllNames(report, ip, name);

            model.setReport(report);
            buildTransactionNameGraph(model, report, type, name, ip);
            break;
        case HOURLY_GROUP_REPORT:
            report = getHourlyReport(payload);
            report = filterReportByGroup(report, domain, group);
            report = m_mergeHelper.mergeAllMachines(report, ipAddress);

            if (report != null) {
                model.setReport(report);

                buildTransactionMetaInfo(model, payload, report);
            }
            break;
        case HISTORY_GROUP_REPORT:
            report = m_reportService.queryReport(domain, payload.getHistoryStartDate(), payload.getHistoryEndDate());
            report = filterReportByGroup(report, domain, group);
            report = m_mergeHelper.mergeAllMachines(report, ipAddress);

            if (report != null) {
                model.setReport(report);
                buildTransactionMetaInfo(model, payload, report);
            }
            break;
        case GROUP_GRAPHS:
            report = getHourlyGraphReport(model, payload);
            report = filterReportByGroup(report, domain, group);
            buildDistributionInfo(model, type, name, report);

            if (name == null || name.length() == 0) {
                name = Constants.ALL;
            }
            report = m_mergeHelper.mergeAllNames(report, ip, name);

            model.setReport(report);
            buildTransactionNameGraph(model, report, type, name, ip);
            break;
        case HISTORY_GROUP_GRAPH:
            report = m_reportService.queryReport(domain, start, end);
            report = filterReportByGroup(report, domain, group);

            buildDistributionInfo(model, type, name, report);

            report = m_mergeHelper.mergeAllMachines(report, ip);
            new TransactionTrendGraphBuilder().buildTrendGraph(model, payload, report);
            break;
        }

        if (payload.isXml()) {
            m_xmlViewer.view(ctx, model);
        } else {
            m_jspViewer.view(ctx, model);
        }
    }

An HOURLY_REPORT request first calls getHourlyReport, which asks every server for that machine's real-time report and then merges the responses into a single TransactionReport.

private TransactionReport getHourlyReport(Payload payload) {
        String domain = payload.getDomain();
        String ipAddress = payload.getIpAddress();
        ModelRequest request = new ModelRequest(domain, payload.getDate()).setProperty("type", payload.getType())
                                .setProperty("ip", ipAddress);

        if (m_service.isEligable(request)) {
            ModelResponse response = m_service.invoke(request);
            TransactionReport report = response.getModel();

            return report;
        } else {
            throw new RuntimeException("Internal error: no eligable transaction service registered for " + request + "!");
        }
    }

    @Override
    public ModelResponse invoke(final ModelRequest request) {
        int requireSize = 0;
        final List<ModelResponse<T>> responses = Collections.synchronizedList(new ArrayList<ModelResponse<T>>());
        final Semaphore semaphore = new Semaphore(0);
        final Transaction t = Cat.getProducer().newTransaction("ModelService", getClass().getSimpleName());
        int count = 0;

        t.setStatus(Message.SUCCESS);
        t.addData("request", request);
        t.addData("thread", Thread.currentThread());

        for (final ModelService<T> service : m_allServices) {
            if (!service.isEligable(request)) {
                continue;
            }

            // save current transaction so that child thread can access it
            if (service instanceof ModelServiceWithCalSupport) {
                ((ModelServiceWithCalSupport) service).setParentTransaction(t);
            }
            requireSize++;

            m_configManager.getModelServiceExecutorService().submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        ModelResponse response = service.invoke(request);

                        if (response.getException() != null) {
                            logError(response.getException());
                        }
                        if (response != null && response.getModel() != null) {
                            responses.add(response);
                        }
                    } catch (Exception e) {
                        logError(e);
                        t.setStatus(e);
                    } finally {
                        semaphore.release();
                    }
                }
            });

            count++;
        }

        try {
            semaphore.tryAcquire(count, 10000, TimeUnit.MILLISECONDS); // 10 seconds timeout
        } catch (InterruptedException e) {
            // ignore it
            t.setStatus(e);
        } finally {
            t.complete();
        }

        String requireAll = request.getProperty("requireAll");

        if (requireAll != null && responses.size() != requireSize) {
            String data = "require:" + requireSize + " actual:" + responses.size();
            Cat.logEvent("FetchReportError:" + this.getClass().getSimpleName(), request.getDomain(), Event.SUCCESS, data);

            return null;
        }
        ModelResponse<T> aggregated = new ModelResponse<T>();
        T report = merge(request, responses);

        aggregated.setModel(report);
        return aggregated;
    }
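
invoke() above is a scatter-gather: fan the request out to every eligible service on a thread pool, collect responses into a synchronized list, and block on a semaphore with a timeout. Note that Semaphore.tryAcquire(permits, timeout, unit) acquires all permits or none, so one slow server makes the gather return with however many responses arrived in time. A generic sketch of the same pattern (the worker logic is a stand-in, not CAT's services):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ScatterGather {
    static List<Integer> gather(List<Integer> inputs, long timeoutMs) {
        List<Integer> responses = Collections.synchronizedList(new ArrayList<>());
        Semaphore semaphore = new Semaphore(0);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (Integer input : inputs) {
            pool.submit(() -> {
                try {
                    responses.add(input * 2); // stand-in for service.invoke(request)
                } finally {
                    semaphore.release();      // always release, even on failure
                }
            });
        }
        try {
            // Wait for all workers, but never longer than the timeout.
            semaphore.tryAcquire(inputs.size(), timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return responses;
    }
}
```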

If the report is queried at the ALL dimension, TransactionMergeHelper.mergeAllMachines is called to further aggregate the metrics of all machines.

    public TransactionReport mergeAllMachines(TransactionReport report, String ipAddress) {
        if (StringUtils.isEmpty(ipAddress) || Constants.ALL.equalsIgnoreCase(ipAddress)) {
            AllMachineMerger all = new AllMachineMerger();

            all.visitTransactionReport(report);
            report = all.getReport();
        }
        return report;
    }

A HISTORY_REPORT request first calls m_reportService.queryReport to load the corresponding historical report from the database; take MonthlyReport as an example. After loading, it again decides whether to aggregate all machines' metrics, which we will not repeat here.

    @Override
    public TransactionReport queryMonthlyReport(String domain, Date start) {
        TransactionReport transactionReport = new TransactionReport(domain);

        try {
            MonthlyReport entity = m_monthlyReportDao
                                    .findReportByDomainNamePeriod(start, domain, TransactionAnalyzer.ID,    MonthlyReportEntity.READSET_FULL);
            transactionReport = queryFromMonthlyBinary(entity.getId(), domain);
        } catch (DalNotFoundException e) {
            // ignore
        } catch (Exception e) {
            Cat.logError(e);
        }
        return convert(transactionReport);
    }

Handler.getLogView under logview corresponds to the handling of logview queries.

@Override
    @OutboundActionMeta(name = "m")
    public void handleOutbound(Context ctx) throws ServletException, IOException {
        Model model = new Model(ctx);
        Payload payload = ctx.getPayload();

        model.setAction(payload.getAction());
        model.setPage(ReportPage.LOGVIEW);
        model.setDomain(payload.getDomain());
        model.setDate(payload.getDate());

        String messageId = getMessageId(payload);
        String logView = null;
        MessageId msgId = MessageId.parse(messageId);

        if (checkStorageTime(msgId)) {
            logView = getLogView(messageId, payload.isWaterfall());

            if (logView == null || logView.length() == 0) {
                Cat.logEvent("Logview", msgId.getDomain() + ":Fail", Event.SUCCESS, messageId);
            } else {
                Cat.logEvent("Logview", "Success", Event.SUCCESS, messageId);
            }
        } else {
            Cat.logEvent("Logview", "OldMessage", Event.SUCCESS, messageId);
        }

        switch (payload.getAction()) {
        case VIEW:
            model.setTable(logView);
            break;
        }

        m_jspViewer.view(ctx, model);
    }

    private String getLogView(String messageId, boolean waterfall) {
        try {
            if (messageId != null) {
                MessageId id = MessageId.parse(messageId);
                long timestamp = id.getTimestamp();
                ModelRequest request = new ModelRequest(id.getDomain(), timestamp) //
                                        .setProperty("messageId", messageId) //
                                        .setProperty("waterfall", String.valueOf(waterfall)) //
                                        .setProperty("timestamp", String.valueOf(timestamp));

                if (m_service.isEligable(request)) {
                    ModelResponse response = m_service.invoke(request);
                    String logview = response.getModel();

                    return logview;
                } else {
                    throw new RuntimeException("Internal error: no eligible logview service registered for " + request + "!");
                }
            }
        } catch (Exception e) {
            Cat.logError(e);
            return null;
        }

        return null;
    }

As before, each server is asked for the logview stored on its disk: the MessageId locates the index entry, and the index is then used to read the data back. The merge simply picks the first response that carries a result. Logview writing was analyzed earlier, and lookup is just the write path in reverse, so we won't go into detail.

    @Override
    protected String merge(ModelRequest request, List<ModelResponse<String>> responses) {
        for (ModelResponse<String> response : responses) {
            if (response != null) {
                String model = response.getModel();

                if (model != null) {
                    return model;
                }
            }
        }

        return null;
    }
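
The lookup keys off the structure of the message id. Purely as an illustration, assume the commonly seen `domain-ipHex-hourBucket-sequence` layout (the authoritative format is MessageId.parse in CAT; treat the layout here as an assumption): the hour bucket tells the server which hourly index file to consult.

```java
// Hypothetical parser sketching the shape of a CAT message id; not CAT's
// MessageId class. Assumes "domain-ipHex-hourBucket-sequence", where
// hourBucket is hours since the epoch.
public class MessageIdSketch {
    final String domain;
    final String ipHex;
    final long hourBucket;
    final long sequence;

    MessageIdSketch(String domain, String ipHex, long hourBucket, long sequence) {
        this.domain = domain;
        this.ipHex = ipHex;
        this.hourBucket = hourBucket;
        this.sequence = sequence;
    }

    static MessageIdSketch parse(String id) {
        // Split from the right: the domain itself may contain '-'.
        int i3 = id.lastIndexOf('-');
        int i2 = id.lastIndexOf('-', i3 - 1);
        int i1 = id.lastIndexOf('-', i2 - 1);
        return new MessageIdSketch(
                id.substring(0, i1),
                id.substring(i1 + 1, i2),
                Long.parseLong(id.substring(i2 + 1, i3)),
                Long.parseLong(id.substring(i3 + 1)));
    }

    // Millisecond timestamp of the hour the message belongs to.
    long getTimestamp() {
        return hourBucket * 3600 * 1000L;
    }
}
```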

Closing notes

1. https://zhuanlan.zhihu.com/p/114718897
Ctrip's second-round optimization of CAT. It points out CAT's weaknesses, and its optimization ideas are well worth borrowing.
