Foreword
CAT is an open-source monitoring system from Meituan, currently with 14k+ stars on GitHub. It collects real-time metrics at roughly a 300:1 monitoring ratio and provides rich performance metrics, health status, real-time alerting, and more.
The Meituan tech team has also published articles about CAT:
1.https://tech.meituan.com/2018/11/01/cat-in-depth-java-application-monitoring.html
2.https://tech.meituan.com/2018/11/01/cat-pr.html
Client
Let's start with how the client handles metric data. Below is a demo embedded in business code: first Cat.newTransaction is called to create a transaction metric, then calls like logEvent and logMetricForCount can produce other metrics; finally the transaction status is set and complete() is called to finish the metric.
Transaction t = Cat.newTransaction("URL", "pageName");
try {
Cat.logEvent("URL.Server", "serverIp", Event.SUCCESS, "ip=${serverIp}");
Cat.logMetricForCount("metric.key");
Cat.logMetricForDuration("metric.key", 5);
yourBusiness();
t.setStatus(Transaction.SUCCESS);
} catch (Exception e) {
t.setStatus(e);
Cat.logError(e);
} finally {
t.complete();
}
Cat.newTransaction delegates to MessageProducer's newTransaction.
public static Transaction newTransaction(String type, String name) {
try {
return Cat.getProducer().newTransaction(type, name);
} catch (Exception e) {
errorHandler(e);
return NullMessage.TRANSACTION;
}
}
MessageProducer's newTransaction first checks whether the thread-local Context (or its MessageTree) is null; if so, it initializes one. Initialization simply decides whether this message tree hits the sample, creates a Context, and sets it into the ThreadLocal. It then creates a DefaultTransaction, attaches it to the message tree, pushes it onto the stack, and finally returns the transaction.
@Override
public Transaction newTransaction(String type, String name) {
// this enable CAT client logging cat message without explicit setup
if (!m_manager.hasContext()) {
m_manager.setup();
}
DefaultTransaction transaction = new DefaultTransaction(type, name, m_manager);
m_manager.start(transaction, false);
return transaction;
}
@Override
public boolean hasContext() {
Context context = m_context.get();
boolean has = context != null;
if (has) {
MessageTree tree = context.m_tree;
if (tree == null) {
return false;
}
}
return has;
}
@Override
public void setup() {
Context ctx;
if (m_domain != null) {
ctx = new Context(m_domain.getId(), m_hostName, m_domain.getIp());
} else {
ctx = new Context("Unknown", m_hostName, "");
}
double samplingRate = m_configManager.getSampleRatio();
if (samplingRate < 1.0 && hitSample(samplingRate)) {
ctx.m_tree.setHitSample(true);
}
m_context.set(ctx);
}
@Override
public void start(Transaction transaction, boolean forked) {
Context ctx = getContext();
if (ctx != null) {
ctx.start(transaction, forked);
if (transaction instanceof TaggedTransaction) {
TaggedTransaction tt = (TaggedTransaction) transaction;
m_taggedTransactions.put(tt.getTag(), tt);
}
} else if (m_firstMessage) {
m_firstMessage = false;
m_logger.warn("CAT client is not enabled because it's not initialized yet");
}
}
public void start(Transaction transaction, boolean forked) {
if (!m_stack.isEmpty()) {
// Do NOT make strong reference from parent transaction to forked transaction.
// Instead, we create a "soft" reference to forked transaction later, via linkAsRunAway()
// By doing so, there is no need for synchronization between parent and child threads.
// Both threads can complete() anytime despite the other thread.
if (!(transaction instanceof ForkedTransaction)) {
Transaction parent = m_stack.peek();
addTransactionChild(transaction, parent);
}
} else {
m_tree.setMessage(transaction);
}
if (!forked) {
m_stack.push(transaction);
}
}
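The sampling decision itself (hitSample) is not shown above. Conceptually it only needs to keep roughly samplingRate of all message trees; a minimal sketch, assuming a simple random check (CAT's real implementation may use a counter-based scheme instead):
// Minimal sketch of a ratio-based sampling check, for illustration only;
// CAT's actual hitSample may be counter-based rather than random.
private boolean hitSample(double samplingRate) {
    // keep roughly samplingRate of all trees, e.g. 0.01 keeps about 1%
    return java.util.concurrent.ThreadLocalRandom.current().nextDouble() < samplingRate;
}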
Cat.logEvent calls newEvent to create an Event, sets nameValuePairs into its data field, sets the status field, and finally calls complete() to finish the metric. complete() sets the event's m_completed field to true and then calls m_manager.add to attach the event to the current transaction. Other metric types are handled much like Event, so we won't go into them.
@Override
public void logEvent(String type, String name, String status, String nameValuePairs) {
Event event = newEvent(type, name);
if (nameValuePairs != null && nameValuePairs.length() > 0) {
event.addData(nameValuePairs);
}
event.setStatus(status);
event.complete();
}
@Override
public void complete() {
setCompleted(true);
if (m_manager != null) {
m_manager.add(this);
}
}
@Override
public void add(Message message) {
Context ctx = getContext();
if (ctx != null) {
ctx.add(message);
}
}
public void add(Message message) {
if (m_stack.isEmpty()) {
MessageTree tree = m_tree.copy();
tree.setMessage(message);
flush(tree, true);
} else {
Transaction parent = m_stack.peek();
addTransactionChild(message, parent);
}
}
Finally, look at the transaction's complete(). If the transaction has already been completed, a BadInstrument event is added as a child metric; otherwise m_completed is set to true, m_manager.end is called for further processing, and the ThreadLocal is cleared at the end. m_manager.end in turn calls the Context's end method, which pops transactions off the stack until the stack is empty or the popped transaction is the current one; once the stack is empty, manager.flush is called to send the message tree to the server for processing.
@Override
public void complete() {
try {
if (isCompleted()) {
// complete() was called more than once
DefaultEvent event = new DefaultEvent("cat", "BadInstrument");
event.setStatus("TransactionAlreadyCompleted");
event.complete();
addChild(event);
} else {
if (m_durationInMicro == -1) {
m_durationInMicro = (System.nanoTime() - m_durationStart) / 1000L;
}
setCompleted(true);
if (m_manager != null) {
m_manager.end(this);
}
}
} catch (Exception e) {
// ignore
}
}
@Override
public void end(Transaction transaction) {
Context ctx = getContext();
if (ctx != null && transaction.isStandalone()) {
if (ctx.end(this, transaction)) {
m_context.remove();
}
}
}
public boolean end(DefaultMessageManager manager, Transaction transaction) {
if (!m_stack.isEmpty()) {
Transaction current = m_stack.pop();
if (transaction == current) {
m_validator.validate(m_stack.isEmpty() ? null : m_stack.peek(), current);
} else {
while (transaction != current && !m_stack.empty()) {
m_validator.validate(m_stack.peek(), current);
current = m_stack.pop();
}
}
if (m_stack.isEmpty()) {
MessageTree tree = m_tree.copy();
m_tree.setMessageId(null);
m_tree.setMessage(null);
if (m_totalDurationInMicros > 0) {
adjustForTruncatedTransaction((Transaction) tree.getMessage());
}
manager.flush(tree, true);
return true;
}
}
return false;
}
Next is manager.flush: it obtains the MessageSender (a TcpSocketSender) via m_transportManager.getSender and calls its send method to send the message tree.
public void flush(MessageTree tree, boolean clearContext) {
MessageSender sender = m_transportManager.getSender();
if (sender != null && isMessageEnabled()) {
sender.send(tree);
if (clearContext) {
reset();
}
} else {
m_throttleTimes++;
if (m_throttleTimes % 10000 == 0 || m_throttleTimes == 1) {
m_logger.info("Cat Message is throttled! Times:" + m_throttleTimes);
}
}
}
TcpSocketSender's send method checks whether the message tree was sampled out or can be discarded; if it should be sent, offer is called, which simply puts the message tree onto a queue and returns.
@Override
public void send(MessageTree tree) {
if (!m_configManager.isBlock()) {
double sampleRatio = m_configManager.getSampleRatio();
if (tree.canDiscard() && sampleRatio < 1.0 && (!tree.isHitSample())) {
processTreeInClient(tree);
} else {
offer(tree);
}
}
}
private void offer(MessageTree tree) {
if (m_configManager.isAtomicMessage(tree)) {
boolean result = m_atomicQueue.offer(tree);
if (!result) {
logQueueFullInfo(tree);
}
} else {
boolean result = m_queue.offer(tree);
if (!result) {
logQueueFullInfo(tree);
}
}
}
A dedicated thread runs processNormalMessage, which keeps taking message trees off the queue, serializes them, and calls writeAndFlush to send them to the server.
private void processNormalMessage() {
while (true) {
ChannelFuture channel = m_channelManager.channel();
if (channel != null) {
try {
MessageTree tree = m_queue.poll();
if (tree != null) {
sendInternal(channel, tree);
tree.setMessage(null);
} else {
try {
Thread.sleep(5);
} catch (Exception e) {
m_active = false;
}
break;
}
} catch (Throwable t) {
m_logger.error("Error when sending message over TCP socket!", t);
}
} else {
try {
Thread.sleep(5);
} catch (Exception e) {
m_active = false;
}
}
}
}
public void sendInternal(ChannelFuture channel, MessageTree tree) {
if (tree.getMessageId() == null) {
tree.setMessageId(m_factory.getNextId());
}
ByteBuf buf = m_codec.encode(tree);
int size = buf.readableBytes();
channel.channel().writeAndFlush(buf);
if (m_statistics != null) {
m_statistics.onBytes(size);
}
}
Server
On the server side, the counterpart of the client's TcpSocketSender is TcpSocketReceiver. When a message tree arrives at the server, it is parsed in the decode method, which then calls m_handler.handle to process it.
protected void decode(ChannelHandlerContext ctx, ByteBuf buffer, List<Object> out) throws Exception {
    // ... reads one length-prefixed frame, decodes it into a MessageTree and calls m_handler.handle(tree)
}
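The body above is elided; a rough sketch of how such a length-prefixed Netty decoder usually looks (the framing details and the m_codec field are assumptions for illustration, not CAT's exact code):
// Rough sketch only: each frame on the wire is assumed to be
// "4-byte length + encoded message tree".
@Override
protected void decode(ChannelHandlerContext ctx, ByteBuf buffer, List<Object> out) throws Exception {
    if (buffer.readableBytes() < 4) {
        return;                               // length header not fully arrived yet
    }
    buffer.markReaderIndex();
    int length = buffer.readInt();
    if (buffer.readableBytes() < length) {
        buffer.resetReaderIndex();            // wait until the whole frame is available
        return;
    }
    ByteBuf frame = buffer.readBytes(length);
    MessageTree tree = m_codec.decode(frame); // m_codec: assumed codec field
    m_handler.handle(tree);                   // hand the decoded tree to the consumer chain
}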
In DefaultMessageHandler's handle method, if m_consumer has not been initialized yet it is obtained via lookup, and then its consume method is called to continue processing.
@Override
public void handle(MessageTree tree) {
if (m_consumer == null) {
m_consumer = lookup(MessageConsumer.class);
}
try {
m_consumer.consume(tree);
} catch (Throwable e) {
m_logger.error("Error when consuming message in " + m_consumer + "! tree: " + tree, e);
}
}
RealtimeConsumer's consume method uses the message tree's timestamp to find the corresponding Period via m_periodManager.findPeriod, and then calls that Period's distribute method to process the tree.
@Override
public void consume(MessageTree tree) {
long timestamp = tree.getMessage().getTimestamp();
Period period = m_periodManager.findPeriod(timestamp);
if (period != null) {
period.distribute(tree);
} else {
m_serverStateManager.addNetworkTimeError(1);
}
}
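findPeriod is not shown here; conceptually it just scans the currently active periods for the one whose time window covers the message timestamp, returning null when the message is too old or too far in the future. A minimal sketch:
// Illustrative sketch: pick the Period whose [startTime, endTime) window
// contains the message timestamp; null means the timestamp is out of range
// and the caller counts it as a network-time error.
public Period findPeriod(long timestamp) {
    for (Period period : m_periods) {
        if (period.isIn(timestamp)) {
            return period;
        }
    }
    return null;
}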
Worth highlighting here is how PeriodManager manages Periods: it continuously checks the current time to decide whether the next Period should be pre-loaded, and whether the previous Period, which has already ended, should be finished.
@Override
public void run() {
while (m_active) {
try {
long now = System.currentTimeMillis();
long value = m_strategy.next(now);
if (value > 0) {
startPeriod(value);
} else if (value < 0) {
// last period is over,make it asynchronous
Threads.forGroup("cat").start(new EndTaskThread(-value));
}
} catch (Throwable e) {
Cat.logError(e);
}
try {
Thread.sleep(1000L);
} catch (InterruptedException e) {
break;
}
}
}
private void startPeriod(long startTime) {
long endTime = startTime + m_strategy.getDuration();
Period period = new Period(startTime, endTime, m_analyzerManager, m_serverStateManager, m_logger);
m_periods.add(period);
period.start();
}
private void endPeriod(long startTime) {
int len = m_periods.size();
for (int i = 0; i < len; i++) {
Period period = m_periods.get(i);
if (period.isIn(startTime)) {
period.finish();
m_periods.remove(i);
break;
}
}
}
Back to the main flow: period.distribute iterates over the PeriodTasks of every analyzer type; when there are multiple tasks of the same type, it picks one of them by hashing the domain and calls its enqueue to process the tree. enqueue simply puts the message tree into m_queue.
public void distribute(MessageTree tree) {
m_serverStateManager.addMessageTotal(tree.getDomain(), 1);
boolean success = true;
String domain = tree.getDomain();
for (Entry<String, List<PeriodTask>> entry : m_tasks.entrySet()) {
List<PeriodTask> tasks = entry.getValue();
int length = tasks.size();
int index = 0;
boolean manyTasks = length > 1;
if (manyTasks) {
index = Math.abs(domain.hashCode()) % length;
}
PeriodTask task = tasks.get(index);
boolean enqueue = task.enqueue(tree);
if (!enqueue) {
if (manyTasks) {
task = tasks.get((index + 1) % length);
enqueue = task.enqueue(tree);
if (!enqueue) {
success = false;
}
} else {
success = false;
}
}
}
if ((!success) && (!tree.isProcessLoss())) {
m_serverStateManager.addMessageTotalLoss(tree.getDomain(), 1);
tree.setProcessLoss(true);
}
}
public boolean enqueue(MessageTree tree) {
if (m_analyzer.isEligable(tree)) {
boolean result = m_queue.offer(tree);
if (!result) { // trace queue overflow
m_queueOverflow++;
if (m_queueOverflow % (10 * CatConstants.ERROR_COUNT) == 0) {
String date = new SimpleDateFormat("yyyy-MM-dd HH:mm").format(new Date(m_analyzer.getStartTime()));
m_logger.warn(m_analyzer.getClass().getSimpleName() + " queue overflow number " + m_queueOverflow + " analyzer time:" + date);
}
}
return result;
} else {
return true;
}
}
PeriodTask keeps calling m_analyzer.analyze to process the message trees in m_queue.
@Override
public void run() {
try {
m_analyzer.analyze(m_queue);
} catch (Exception e) {
Cat.logError(e);
}
}
AbstractMessageAnalyzer's analyze method calls the abstract process method to continue processing.
@Override
public void analyze(MessageQueue queue) {
while (!isTimeout() && isActive()) {
MessageTree tree = queue.poll();
if (tree != null) {
try {
process(tree);
} catch (Throwable e) {
m_errors++;
if (m_errors == 1 || m_errors % 10000 == 0) {
Cat.logError(e);
}
}
}
}
while (true) {
MessageTree tree = queue.poll();
if (tree != null) {
try {
process(tree);
} catch (Throwable e) {
m_errors++;
if (m_errors == 1 || m_errors % 10000 == 0) {
Cat.logError(e);
}
}
} else {
break;
}
}
}
We will focus on two analyzers, TransactionAnalyzer and DumpAnalyzer, which are the core analyzers for reports and for logview respectively.
TransactionAnalyzer's process method fetches the report for the current hour, then calls processTransaction or processBatchTransaction to handle each transaction.
@Override
public void process(MessageTree tree) {
String domain = tree.getDomain();
TransactionReport report = m_reportManager.getHourlyReport(getStartTime(), domain, true);
List<Transaction> transactions = tree.findOrCreateTransactions();
for (Transaction t : transactions) {
String data = String.valueOf(t.getData());
if (data.length() > 0 && data.charAt(0) == CatConstants.BATCH_FLAG) {
processBatchTransaction(tree, report, t, data);
} else {
processTransaction(report, tree, t);
}
}
if (System.currentTimeMillis() > m_nextClearTime) {
m_nextClearTime = m_nextClearTime + TimeHelper.ONE_MINUTE;
Threads.forGroup("cat").start(new Runnable() {
@Override
public void run() {
cleanUpReports();
}
});
}
}
Taking processTransaction as the example: it first finds the corresponding Machine via findOrCreateMachine, then the corresponding TransactionType via findOrCreateType, then the corresponding TransactionName via findOrCreateName, and finally calls processTypeAndName to continue.
private void processTransaction(TransactionReport report, MessageTree tree, Transaction t) {
String type = t.getType();
String name = t.getName();
if (!m_filterConfigManager.discardTransaction(type, name)) {
boolean valid = checkForTruncatedMessage(tree, t);
if (valid) {
String ip = tree.getIpAddress();
TransactionType transactionType = findOrCreateType(report.findOrCreateMachine(ip), type);
TransactionName transactionName = findOrCreateName(transactionType, name, report.getDomain());
processTypeAndName(t, transactionType, transactionName, tree, t.getDurationInMillis());
}
}
}
First, the three methods findOrCreateMachine, findOrCreateType and findOrCreateName. The containment relationship is machine -> type -> name.
public Machine findOrCreateMachine(String ip) {
Machine machine = m_machines.get(ip);
if (machine == null) {
synchronized (m_machines) {
machine = m_machines.get(ip);
if (machine == null) {
machine = new Machine(ip);
m_machines.put(ip, machine);
}
}
}
return machine;
}
public TransactionType findOrCreateType(String id) {
TransactionType type = m_types.get(id);
if (type == null) {
synchronized (m_types) {
type = m_types.get(id);
if (type == null) {
type = new TransactionType(id);
m_types.put(id, type);
}
}
}
return type;
}
public TransactionName findOrCreateName(String id) {
TransactionName name = m_names.get(id);
if (name == null) {
synchronized (m_names) {
name = m_names.get(id);
if (name == null) {
name = new TransactionName(id);
m_names.put(id, name);
}
}
}
return name;
}
Next, processTypeAndName: it aggregates the overall metrics of the transaction report, including total count, failure count, longest/shortest duration, total duration, and the count per duration bucket, and then calls processNameGraph and processTypeRange for further processing.
private void processTypeAndName(Transaction t, TransactionType type, TransactionName name, MessageTree tree,
double duration) {
String messageId = tree.getMessageId();
type.incTotalCount();
name.incTotalCount();
type.setSuccessMessageUrl(messageId);
name.setSuccessMessageUrl(messageId);
if (!t.isSuccess()) {
type.incFailCount();
name.incFailCount();
String statusCode = formatStatus(t.getStatus());
findOrCreateStatusCode(name, statusCode).incCount();
}
int allDuration = DurationComputer.computeDuration((int) duration);
double sum = duration * duration;
if (type.getMax() <= duration) {
type.setLongestMessageUrl(messageId);
}
if (name.getMax() <= duration) {
name.setLongestMessageUrl(messageId);
}
name.setMax(Math.max(name.getMax(), duration));
name.setMin(Math.min(name.getMin(), duration));
name.setSum(name.getSum() + duration);
name.setSum2(name.getSum2() + sum);
name.findOrCreateAllDuration(allDuration).incCount();
type.setMax(Math.max(type.getMax(), duration));
type.setMin(Math.min(type.getMin(), duration));
type.setSum(type.getSum() + duration);
type.setSum2(type.getSum2() + sum);
type.findOrCreateAllDuration(allDuration).incCount();
long current = t.getTimestamp() / 1000 / 60;
int min = (int) (current % (60));
boolean statistic = m_statisticManager.shouldStatistic(type.getId(), tree.getDomain());
processNameGraph(t, name, min, duration, statistic, allDuration);
processTypeRange(t, type, min, duration, statistic, allDuration);
}
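Note that sum2 accumulates the square of every duration. Keeping only count, sum and sum2 is enough to later derive the average and the standard deviation shown in the report, without storing individual samples. A small illustrative helper (not part of CAT) makes the math explicit:
// Illustrative only: derive avg and std from the aggregated count/sum/sum2
// fields maintained in processTypeAndName.
static double[] avgAndStd(long count, double sum, double sum2) {
    double avg = sum / count;
    double variance = sum2 / count - avg * avg; // E[x^2] - (E[x])^2
    return new double[] { avg, Math.sqrt(Math.max(variance, 0)) };
}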
processNameGraph and processTypeRange mainly aggregate per-minute metrics: total count, failure count, total duration, max duration, min duration and the count per duration bucket.
private void processNameGraph(Transaction t, TransactionName name, int min, double d, boolean statistic,
int allDuration) {
int dk = formatDurationDistribute(d);
Duration duration = name.findOrCreateDuration(dk);
Range range = name.findOrCreateRange(min);
duration.incCount();
range.incCount();
if (!t.isSuccess()) {
range.incFails();
}
range.setSum(range.getSum() + d);
range.setMax(Math.max(range.getMax(), d));
range.setMin(Math.min(range.getMin(), d));
if (statistic) {
range.findOrCreateAllDuration(allDuration).incCount();
}
}
private void processTypeRange(Transaction t, TransactionType type, int min, double d, boolean statistic,
int allDuration) {
Range2 range = type.findOrCreateRange2(min);
if (!t.isSuccess()) {
range.incFails();
}
range.incCount();
range.setSum(range.getSum() + d);
range.setMax(Math.max(range.getMax(), d));
range.setMin(Math.min(range.getMin(), d));
if (statistic) {
range.findOrCreateAllDuration(allDuration).incCount();
}
}
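The per-duration counts rely on DurationComputer.computeDuration and formatDurationDistribute to collapse raw millisecond values into a small set of buckets, so a report only keeps one counter per bucket instead of every sample. The exact rules live in those helpers; the idea is roughly like the following sketch (illustrative bucketing, not CAT's actual formula):
// Illustrative only: round a duration (ms) up to the nearest power of two so
// that reports keep a bounded number of distinct duration keys.
static int toDurationBucket(int durationInMillis) {
    int bucket = 1;
    while (bucket < durationInMillis) {
        bucket <<= 1;
    }
    return bucket; // e.g. 0-1 -> 1, 2 -> 2, 3-4 -> 4, 5-8 -> 8, ...
}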
A note on report storage: when a PeriodTask finishes, m_analyzer.doCheckpoint is called to persist the reports to file and to the database.
public void finish() {
try {
m_analyzer.doCheckpoint(true);
m_analyzer.destroy();
} catch (Exception e) {
Cat.logError(e);
}
}
In TransactionAnalyzer's doCheckpoint, if this is the final checkpoint and we are not in local mode, the reports are stored both to file and to the database; otherwise they are stored to file only.
@Override
public synchronized void doCheckpoint(boolean atEnd) {
if (atEnd && !isLocalMode()) {
m_reportManager.storeHourlyReports(getStartTime(), StoragePolicy.FILE_AND_DB, m_index);
} else {
m_reportManager.storeHourlyReports(getStartTime(), StoragePolicy.FILE, m_index);
}
}
Continuing with DefaultReportManager's storeHourlyReports: besides running hooks such as m_reportDelegate.beforeSave, it calls storeFile to persist the reports to file and storeDatabase to persist them to the database.
@Override
public void storeHourlyReports(long startTime, StoragePolicy policy, int index) {
Transaction t = Cat.newTransaction("Checkpoint", m_name);
Map<String, T> reports = m_reports.get(startTime);
ReportBucket bucket = null;
try {
t.addData("reports", reports == null ? 0 : reports.size());
if (reports != null) {
Set<String> errorDomains = new HashSet<String>();
for (String domain : reports.keySet()) {
if (!m_validator.validate(domain)) {
errorDomains.add(domain);
}
}
for (String domain : errorDomains) {
reports.remove(domain);
}
if (!errorDomains.isEmpty()) {
m_logger.info("error domain:" + errorDomains);
}
m_reportDelegate.beforeSave(reports);
if (policy.forFile()) {
bucket = m_bucketManager.getReportBucket(startTime, m_name, index);
try {
storeFile(reports, bucket);
} finally {
m_bucketManager.closeBucket(bucket);
}
}
if (policy.forDatabase()) {
storeDatabase(startTime, reports);
}
}
t.setStatus(Message.SUCCESS);
} catch (Throwable e) {
Cat.logError(e);
t.setStatus(e);
m_logger.error(String.format("Error when storing %s reports of %s!", m_name, new Date(startTime)), e);
} finally {
cleanup(startTime);
t.complete();
if (bucket != null) {
m_bucketManager.closeBucket(bucket);
}
}
}
storeFile serializes each report to XML, writes it to m_writeDataFile, and writes the corresponding index entry to m_writeIndexFile.
private void storeFile(Map<String, T> reports, ReportBucket bucket) {
for (T report : reports.values()) {
try {
String domain = m_reportDelegate.getDomain(report);
String xml = m_reportDelegate.buildXml(report);
bucket.storeById(domain, xml);
} catch (Exception e) {
Cat.logError(e);
}
}
}
@Override
public boolean storeById(String id, String report) throws IOException {
byte[] content = report.getBytes("utf-8");
int length = content.length;
byte[] num = String.valueOf(length).getBytes("utf-8");
m_writeLock.lock();
try {
m_writeDataFile.write(num);
m_writeDataFile.write('\n');
m_writeDataFile.write(content);
m_writeDataFile.write('\n');
m_writeDataFile.flush();
long offset = m_writeDataFileLength;
String line = id + '\t' + offset + '\n';
byte[] data = line.getBytes("utf-8");
m_writeDataFileLength += num.length + 1 + length + 1;
m_writeIndexFile.write(data);
m_writeIndexFile.flush();
m_idToOffsets.put(id, offset);
return true;
} finally {
m_writeLock.unlock();
}
}
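Since storeById writes each report as a length line followed by the XML content into the data file, and records "id TAB offset" in the index file, reading a report back only needs the offset recorded for its domain id. A hedged sketch of the reverse path (m_dataFilePath is an assumed field for illustration; the real bucket keeps its own read handles):
// Illustrative sketch of the read path (not CAT's exact code): look up the
// offset for the domain id, seek there, read the length line, then read that
// many bytes of XML.
public String findById(String id) throws IOException {
    Long offset = m_idToOffsets.get(id);
    if (offset == null) {
        return null;
    }
    try (RandomAccessFile dataFile = new RandomAccessFile(m_dataFilePath, "r")) {
        dataFile.seek(offset);
        int length = Integer.parseInt(dataFile.readLine()); // the "length" line
        byte[] content = new byte[length];
        dataFile.readFully(content);                        // the XML payload
        return new String(content, "utf-8");
    }
}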
storeDatabase first writes the HourlyReport row to the database, then writes the report content as an HourlyReportContent row, and then calls m_reportDelegate.createHourlyTask to create the tasks that will later build the weekly, monthly and other aggregated reports.
private void storeDatabase(long startTime, Map<String, T> reports) {
Date period = new Date(startTime);
String ip = NetworkInterfaceManager.INSTANCE.getLocalHostAddress();
for (T report : reports.values()) {
try {
String domain = m_reportDelegate.getDomain(report);
HourlyReport r = m_reportDao.createLocal();
r.setName(m_name);
r.setDomain(domain);
r.setPeriod(period);
r.setIp(ip);
r.setType(1);
m_reportDao.insert(r);
int id = r.getId();
byte[] binaryContent = m_reportDelegate.buildBinary(report);
HourlyReportContent content = m_reportContentDao.createLocal();
content.setReportId(id);
content.setContent(binaryContent);
content.setPeriod(period);
m_reportContentDao.insert(content);
m_reportDelegate.createHourlyTask(report);
} catch (Throwable e) {
Cat.getProducer().logError(e);
}
}
}
Continuing with createHourlyTask: it creates tasks for the different periods and stores them in the database, where worker threads will later pick them up for processing.
@Override
public boolean createHourlyTask(TransactionReport report) {
String domain = report.getDomain();
if (domain.equals(Constants.ALL) || m_configManager.validateDomain(domain)) {
return m_taskManager.createTask(report.getStartTime(), domain, TransactionAnalyzer.ID,
TaskProlicy.ALL_EXCLUED_HOURLY);
} else {
return true;
}
}
public boolean createTask(Date period, String domain, String name, TaskCreationPolicy prolicy) {
try {
if (prolicy.shouldCreateHourlyTask()) {
insertToDatabase(period, domain, name, REPORT_HOUR);
}
Calendar cal = Calendar.getInstance();
cal.setTime(period);
int hour = cal.get(Calendar.HOUR_OF_DAY);
cal.add(Calendar.HOUR_OF_DAY, -hour);
Date currentDay = cal.getTime();
if (prolicy.shouldCreateDailyTask()) {
insertToDatabase(new Date(currentDay.getTime() - ONE_DAY), domain, name, REPORT_DAILY);
}
if (prolicy.shouldCreateWeeklyTask()) {
int dayOfWeek = cal.get(Calendar.DAY_OF_WEEK);
if (dayOfWeek == 7) {
insertToDatabase(new Date(currentDay.getTime() - 7 * ONE_DAY), domain, name, REPORT_WEEK);
}
}
if (prolicy.shouldCreateMonthTask()) {
int dayOfMonth = cal.get(Calendar.DAY_OF_MONTH);
if (dayOfMonth == 1) {
cal.add(Calendar.MONTH, -1);
insertToDatabase(cal.getTime(), domain, name, REPORT_MONTH);
}
}
return true;
} catch (DalException e) {
Cat.logError(e);
return false;
}
}
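A quick worked example with an illustrative date: for an hourly period of 2024-03-01 01:00, currentDay becomes 2024-03-01 00:00. With a policy that allows every task type, an hourly task would be inserted for that hour (for transaction reports the policy is TaskProlicy.ALL_EXCLUED_HOURLY, so that branch is skipped); a daily task is inserted for 2024-02-29; the weekly branch only fires on Saturdays (Calendar.DAY_OF_WEEK == 7) and 2024-03-01 is a Friday, so no weekly task; and since dayOfMonth is 1, a monthly task is inserted for 2024-02-01.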
protected void insertToDatabase(Date period, String domain, String name, int reportType) throws DalException {
Task task = m_taskDao.createLocal();
task.setCreationDate(new Date());
task.setProducer(NetworkInterfaceManager.INSTANCE.getLocalHostAddress());
task.setReportDomain(domain);
task.setReportName(name);
task.setReportPeriod(period);
task.setStatus(STATUS_TODO);
task.setTaskType(reportType);
m_taskDao.insert(task);
}
TaskConsumer keeps fetching pending tasks and calls processTask to handle them.
@Override
public void run() {
String localIp = getLoaclIp();
while (m_running) {
try {
if (checkTime()) {
Task task = findDoingTask(localIp);
if (task == null) {
task = findTodoTask();
}
boolean again = false;
if (task != null) {
try {
task.setConsumer(localIp);
if (task.getStatus() == TaskConsumer.STATUS_DOING || updateTodoToDoing(task)) {
int retryTimes = 0;
while (!processTask(task)) {
retryTimes++;
if (retryTimes < MAX_TODO_RETRY_TIMES) {
taskRetryDuration();
} else {
updateDoingToFailure(task);
again = true;
break;
}
}
if (!again) {
updateDoingToDone(task);
}
}
} catch (Throwable e) {
Cat.logError(task.toString(), e);
}
} else {
taskNotFoundDuration();
}
} else {
try {
Thread.sleep(60 * 1000);
} catch (InterruptedException e) {
// Ignore
}
}
} catch (Throwable e) {
Cat.logError(e);
}
}
m_stopped = true;
}
processTask delegates to m_reportFacade.builderReport, which eventually dispatches to the different buildHourlyTask/buildDailyTask/buildWeeklyTask/buildMonthlyTask methods.
@Override
protected boolean processTask(Task doing) {
boolean result = false;
Transaction t = Cat.newTransaction("Task", doing.getReportName());
t.addData(doing.toString());
try {
result = m_reportFacade.builderReport(doing);
t.setStatus(Transaction.SUCCESS);
} catch (Throwable e) {
Cat.logError(e);
t.setStatus(e);
} finally {
t.complete();
}
return result;
}
public boolean builderReport(Task task) {
try {
if (task == null) {
return false;
}
int type = task.getTaskType();
String reportName = task.getReportName();
String reportDomain = task.getReportDomain();
Date reportPeriod = task.getReportPeriod();
TaskBuilder reportBuilder = getReportBuilder(reportName);
if (reportBuilder == null) {
Cat.logError(new RuntimeException("no report builder for type:" + " " + reportName));
return false;
} else {
boolean result = false;
if (type == TaskManager.REPORT_HOUR) {
result = reportBuilder.buildHourlyTask(reportName, reportDomain, reportPeriod);
} else if (type == TaskManager.REPORT_DAILY) {
result = reportBuilder.buildDailyTask(reportName, reportDomain, reportPeriod);
} else if (type == TaskManager.REPORT_WEEK) {
result = reportBuilder.buildWeeklyTask(reportName, reportDomain, reportPeriod);
} else if (type == TaskManager.REPORT_MONTH) {
result = reportBuilder.buildMonthlyTask(reportName, reportDomain, reportPeriod);
}
if (result) {
return result;
} else {
m_logger.error(task.toString());
}
}
} catch (Exception e) {
m_logger.error("Error when building report," + e.getMessage(), e);
Cat.logError(e);
return false;
}
return false;
}
Look at TransactionReportBuilder's buildMonthlyTask: it mainly calls queryDailyReportsByDuration to obtain the monthly aggregated report, then builds a MonthlyReport and writes it to the database.
@Override
public boolean buildMonthlyTask(String name, String domain, Date period) {
Date end = null;
if (period.equals(TimeHelper.getCurrentMonth())) {
end = TimeHelper.getCurrentDay();
} else {
end = TaskHelper.nextMonthStart(period);
}
TransactionReport transactionReport = queryDailyReportsByDuration(domain, period, end);
MonthlyReport report = new MonthlyReport();
report.setCreationDate(new Date());
report.setDomain(domain);
report.setIp(NetworkInterfaceManager.INSTANCE.getLocalHostAddress());
report.setName(name);
report.setPeriod(period);
report.setType(1);
byte[] binaryContent = DefaultNativeBuilder.build(transactionReport);
return m_reportService.insertMonthlyReport(report, binaryContent);
}
Inside queryDailyReportsByDuration, TransactionReportDailyGraphCreator builds the GraphTrend (the per-day trend graph), HistoryTransactionReportMerger merges the daily reports into a per-machine monthly TransactionReport, and TransactionReportCountFilter aggregates the overall metrics across all machines. The visitor pattern is used heavily here; a minimal sketch of the idea follows the code below, and interested readers can dig into the source for the full picture.
private TransactionReport queryDailyReportsByDuration(String domain, Date start, Date end) {
long startTime = start.getTime();
long endTime = end.getTime();
double duration = (end.getTime() - start.getTime()) * 1.0 / TimeHelper.ONE_DAY;
HistoryTransactionReportMerger merger = new HistoryTransactionReportMerger(new TransactionReport(domain)).setDuration(duration);
TransactionReport transactionReport = merger.getTransactionReport();
TransactionReportDailyGraphCreator creator = new TransactionReportDailyGraphCreator(transactionReport, (int) duration, start);
for (; startTime < endTime; startTime += TimeHelper.ONE_DAY) {
try {
TransactionReport reportModel = m_reportService.queryReport(domain, new Date(startTime), new Date(startTime + TimeHelper.ONE_DAY));
creator.createGraph(reportModel);
reportModel.accept(merger);
} catch (Exception e) {
Cat.logError(e);
}
}
transactionReport.setStartTime(start);
transactionReport.setEndTime(end);
new TransactionReportCountFilter(m_serverConfigManager.getMaxTypeThreshold(),
m_atomicMessageConfigManager.getMaxNameThreshold(domain), m_serverConfigManager.getTypeNameLengthLimit())
.visitTransactionReport(transactionReport);
return transactionReport;
}
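The "visitor pattern" here means each report model node exposes an accept(visitor) hook, and classes like HistoryTransactionReportMerger or TransactionReportCountFilter walk the report tree (report -> machine -> type -> name -> range), overriding only the nodes they care about. A minimal, illustrative visitor (the BaseVisitor name and its hook methods are assumptions; only TransactionReport/TransactionType come from the code above):
// Illustrative sketch of a report visitor: count transaction types whose
// slowest call exceeded one second. BaseVisitor/visitType are assumed hooks.
public class SlowTypeCounter extends BaseVisitor {
    private int m_slowTypes;

    @Override
    public void visitType(TransactionType type) {
        if (type.getMax() > 1000) { // max duration in ms, kept by processTypeAndName
            m_slowTypes++;
        }
        super.visitType(type);      // keep walking into names/ranges
    }

    public int getSlowTypes() {
        return m_slowTypes;
    }
}
Such a visitor would be applied the same way TransactionReportCountFilter is above, by calling visitTransactionReport(report) on it.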
Home
Home is CAT's console module, where you can view reports and logviews. We will look at how it handles queries for the current (hourly) report, historical reports, and logview respectively.
First, the Handler.handleOutbound method under the transaction page: it dispatches to different handling based on the requested action and then renders the result. Querying the current report corresponds to HOURLY_REPORT, and querying historical reports corresponds to HISTORY_REPORT.
@Override
@OutboundActionMeta(name = "t")
public void handleOutbound(Context ctx) throws ServletException, IOException {
Cat.logMetricForCount("http-request-transaction");
Model model = new Model(ctx);
Payload payload = ctx.getPayload();
normalize(model, payload);
String domain = payload.getDomain();
Action action = payload.getAction();
String ipAddress = payload.getIpAddress();
String group = payload.getGroup();
String type = payload.getType();
String name = payload.getName();
String ip = payload.getIpAddress();
Date start = payload.getHistoryStartDate();
Date end = payload.getHistoryEndDate();
if (StringUtils.isEmpty(group)) {
group = m_configManager.queryDefaultGroup(domain);
payload.setGroup(group);
}
model.setGroupIps(m_configManager.queryIpByDomainAndGroup(domain, group));
model.setGroups(m_configManager.queryDomainGroup(payload.getDomain()));
switch (action) {
case HOURLY_REPORT:
TransactionReport report = getHourlyReport(payload);
report = m_mergeHelper.mergeAllMachines(report, ipAddress);
if (report != null) {
model.setReport(report);
buildTransactionMetaInfo(model, payload, report);
}
break;
case HISTORY_REPORT:
report = m_reportService.queryReport(domain, payload.getHistoryStartDate(), payload.getHistoryEndDate());
report = m_mergeHelper.mergeAllMachines(report, ipAddress);
if (report != null) {
model.setReport(report);
buildTransactionMetaInfo(model, payload, report);
}
break;
case HISTORY_GRAPH:
report = m_reportService.queryReport(domain, start, end);
if (Constants.ALL.equalsIgnoreCase(ip)) {
buildDistributionInfo(model, type, name, report);
}
report = m_mergeHelper.mergeAllMachines(report, ip);
new TransactionTrendGraphBuilder().buildTrendGraph(model, payload, report);
break;
case GRAPHS:
report = getHourlyGraphReport(model, payload);
if (Constants.ALL.equalsIgnoreCase(ipAddress)) {
buildDistributionInfo(model, type, name, report);
}
if (name == null || name.length() == 0) {
name = Constants.ALL;
}
report = m_mergeHelper.mergeAllNames(report, ip, name);
model.setReport(report);
buildTransactionNameGraph(model, report, type, name, ip);
break;
case HOURLY_GROUP_REPORT:
report = getHourlyReport(payload);
report = filterReportByGroup(report, domain, group);
report = m_mergeHelper.mergeAllMachines(report, ipAddress);
if (report != null) {
model.setReport(report);
buildTransactionMetaInfo(model, payload, report);
}
break;
case HISTORY_GROUP_REPORT:
report = m_reportService.queryReport(domain, payload.getHistoryStartDate(), payload.getHistoryEndDate());
report = filterReportByGroup(report, domain, group);
report = m_mergeHelper.mergeAllMachines(report, ipAddress);
if (report != null) {
model.setReport(report);
buildTransactionMetaInfo(model, payload, report);
}
break;
case GROUP_GRAPHS:
report = getHourlyGraphReport(model, payload);
report = filterReportByGroup(report, domain, group);
buildDistributionInfo(model, type, name, report);
if (name == null || name.length() == 0) {
name = Constants.ALL;
}
report = m_mergeHelper.mergeAllNames(report, ip, name);
model.setReport(report);
buildTransactionNameGraph(model, report, type, name, ip);
break;
case HISTORY_GROUP_GRAPH:
report = m_reportService.queryReport(domain, start, end);
report = filterReportByGroup(report, domain, group);
buildDistributionInfo(model, type, name, report);
report = m_mergeHelper.mergeAllMachines(report, ip);
new TransactionTrendGraphBuilder().buildTrendGraph(model, payload, report);
break;
}
if (payload.isXml()) {
m_xmlViewer.view(ctx, model);
} else {
m_jspViewer.view(ctx, model);
}
}
For a HOURLY_REPORT request, getHourlyReport is called first; it queries each server for that machine's real-time report and merges the responses into a single TransactionReport.
private TransactionReport getHourlyReport(Payload payload) {
String domain = payload.getDomain();
String ipAddress = payload.getIpAddress();
ModelRequest request = new ModelRequest(domain, payload.getDate()).setProperty("type", payload.getType())
.setProperty("ip", ipAddress);
if (m_service.isEligable(request)) {
ModelResponse response = m_service.invoke(request);
TransactionReport report = response.getModel();
return report;
} else {
throw new RuntimeException("Internal error: no eligable transaction service registered for " + request + "!");
}
}
@Override
public ModelResponse invoke(final ModelRequest request) {
int requireSize = 0;
final List<ModelResponse<T>> responses = Collections.synchronizedList(new ArrayList<ModelResponse<T>>());
final Semaphore semaphore = new Semaphore(0);
final Transaction t = Cat.getProducer().newTransaction("ModelService", getClass().getSimpleName());
int count = 0;
t.setStatus(Message.SUCCESS);
t.addData("request", request);
t.addData("thread", Thread.currentThread());
for (final ModelService service : m_allServices) {
if (!service.isEligable(request)) {
continue;
}
// save current transaction so that child thread can access it
if (service instanceof ModelServiceWithCalSupport) {
((ModelServiceWithCalSupport) service).setParentTransaction(t);
}
requireSize++;
m_configManager.getModelServiceExecutorService().submit(new Runnable() {
@Override
public void run() {
try {
ModelResponse response = service.invoke(request);
if (response.getException() != null) {
logError(response.getException());
}
if (response != null && response.getModel() != null) {
responses.add(response);
}
} catch (Exception e) {
logError(e);
t.setStatus(e);
} finally {
semaphore.release();
}
}
});
count++;
}
try {
semaphore.tryAcquire(count, 10000, TimeUnit.MILLISECONDS); // 10 seconds timeout
} catch (InterruptedException e) {
// ignore it
t.setStatus(e);
} finally {
t.complete();
}
String requireAll = request.getProperty("requireAll");
if (requireAll != null && responses.size() != requireSize) {
String data = "require:" + requireSize + " actual:" + responses.size();
Cat.logEvent("FetchReportError:" + this.getClass().getSimpleName(), request.getDomain(), Event.SUCCESS, data);
return null;
}
ModelResponse aggregated = new ModelResponse();
T report = merge(request, responses);
aggregated.setModel(report);
return aggregated;
}
If the query is for the ALL dimension, TransactionMergeHelper's mergeAllMachines is then called to further aggregate the metrics of all machines.
public TransactionReport mergeAllMachines(TransactionReport report, String ipAddress) {
if (StringUtils.isEmpty(ipAddress) || Constants.ALL.equalsIgnoreCase(ipAddress)) {
AllMachineMerger all = new AllMachineMerger();
all.visitTransactionReport(report);
report = all.getReport();
}
return report;
}
A HISTORY_REPORT request first calls m_reportService.queryReport to load the corresponding historical report from the database; take the monthly report as an example. After the report is loaded, it again decides whether all machines' metrics need to be aggregated, which we won't detail here.
@Override
public TransactionReport queryMonthlyReport(String domain, Date start) {
TransactionReport transactionReport = new TransactionReport(domain);
try {
MonthlyReport entity = m_monthlyReportDao
.findReportByDomainNamePeriod(start, domain, TransactionAnalyzer.ID, MonthlyReportEntity.READSET_FULL);
transactionReport = queryFromMonthlyBinary(entity.getId(), domain);
} catch (DalNotFoundException e) {
// ignore
} catch (Exception e) {
Cat.logError(e);
}
return convert(transactionReport);
}
The Handler under the logview module handles logview queries: handleOutbound validates the message id and delegates to getLogView.
@Override
@OutboundActionMeta(name = "m")
public void handleOutbound(Context ctx) throws ServletException, IOException {
Model model = new Model(ctx);
Payload payload = ctx.getPayload();
model.setAction(payload.getAction());
model.setPage(ReportPage.LOGVIEW);
model.setDomain(payload.getDomain());
model.setDate(payload.getDate());
String messageId = getMessageId(payload);
String logView = null;
MessageId msgId = MessageId.parse(messageId);
if (checkStorageTime(msgId)) {
logView = getLogView(messageId, payload.isWaterfall());
if (logView == null || logView.length() == 0) {
Cat.logEvent("Logview", msgId.getDomain() + ":Fail", Event.SUCCESS, messageId);
} else {
Cat.logEvent("Logview", "Success", Event.SUCCESS, messageId);
}
} else {
Cat.logEvent("Logview", "OldMessage", Event.SUCCESS, messageId);
}
switch (payload.getAction()) {
case VIEW:
model.setTable(logView);
break;
}
m_jspViewer.view(ctx, model);
}
private String getLogView(String messageId, boolean waterfall) {
try {
if (messageId != null) {
MessageId id = MessageId.parse(messageId);
long timestamp = id.getTimestamp();
ModelRequest request = new ModelRequest(id.getDomain(), timestamp) //
.setProperty("messageId", messageId) //
.setProperty("waterfall", String.valueOf(waterfall)) //
.setProperty("timestamp", String.valueOf(timestamp));
if (m_service.isEligable(request)) {
ModelResponse response = m_service.invoke(request);
String logview = response.getModel();
return logview;
} else {
throw new RuntimeException("Internal error: no eligible logview service registered for " + request + "!");
}
}
} catch (Exception e) {
Cat.logError(e);
return null;
}
return null;
}
In the same way, every server is queried for the logview stored on its local disk: the MessageId is used to locate the index entry, and the index is used to read the data, which is then returned. The response that actually contains a result is selected. Logview writing was analyzed earlier, and reading is essentially the reverse of writing, so we won't go into detail; a sketch of the message-id layout that makes this lookup possible follows the merge code below.
@Override
protected String merge(ModelRequest request, List<ModelResponse<String>> responses) {
for (ModelResponse<String> response : responses) {
if (response != null) {
String model = response.getModel();
if (model != null) {
return model;
}
}
}
return null;
}
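What makes this per-server lookup possible is the structure of the message id itself: it encodes the domain, the client ip, the hour bucket and a per-hour sequence index, so MessageId.parse can tell each server which hour's index/data files to look in. A rough, simplified sketch of that layout (the real rules live in MessageId/MessageIdFactory):
// Illustrative only: decompose a (simplified) CAT message id of the form
// "domain-ipInHex-hoursSinceEpoch-index", e.g. "myapp-0a01020b-473521-1024".
static void describeMessageId(String messageId) {
    String[] parts = new String[4];
    String rest = messageId;
    for (int i = 3; i > 0; i--) {      // split from the right: the domain may contain '-'
        int pos = rest.lastIndexOf('-');
        parts[i] = rest.substring(pos + 1);
        rest = rest.substring(0, pos);
    }
    parts[0] = rest;
    long hour = Long.parseLong(parts[2]);
    System.out.println("domain=" + parts[0] + ", ip(hex)=" + parts[1]
        + ", hourStart=" + new java.util.Date(hour * 3600 * 1000L) + ", index=" + parts[3]);
}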
Closing notes
1.https://zhuanlan.zhihu.com/p/114718897
Ctrip's follow-up optimization of CAT, which points out CAT's shortcomings; the optimization ideas there are well worth learning from.