Google Testing Blog:Optimal Logging

谷歌工程师总结的关于日志记录的最佳实践经验

Google Testing Blog:Optimal Logging(日志记录最佳实践)
原文链接:http://club.oneapm.com/t/optimal-logging/561/2
译自Optimal Logging,by Anthony Vallone,Google Testing Blog

要找到一个系统问题的根本原因,你需要多长时间?5分钟?还是5天?如果你的答案接近5分钟,很大可能是因为你的生产环境和测试环境使用了非常好的日志记录。更常见的情况是,诸如日志、异常处理、甚至测试这类非核心的工作,被当作一种出现问题后的补救方式。同异常处理和测试一样,日志记录真的也需要策略,无论是生产环境还是测试环境。永远不要低估日志的作用。有了使用得当的日志,你甚至可以说debug不是必需的。下面是多年来对我非常有用的日志记录指导原则。

保持适度

切勿记录过多。大量的磁盘空间被日志占用说明你没有想过应该记录什么。如果记录了太多,你就还需要设计出复杂的方法来减少磁盘访问、保留历史记录、归档大量数据、以及在这些数据中查询。最关键的是,你将发现在这么多垃圾中找到有用信息是多么的困难。

唯一一个比记录过多日志还差的事是,记录的过少。日志通常有两个主要目的:定位问题和事件确认。如果你的日志不能明确一个bug的原因,或者某个事务是否执行,你就记录的过少了。

适合记录的:

  • 重要启动配置
  • 错误
  • 警告
  • 持久性数据的更改
  • 主要系统组件间的请求和响应
  • 重要的状态变化
  • 用户交互
  • 有已知失败风险的调用
  • 较长时间的等待
  • 长期运行任务的周期性进度
  • 重要的逻辑分支和导向分支的条件
  • 从高层方法来汇总处理流程的步骤和事件——避免在底层方法中记录复杂流程的每一步骤

不适合记录的:

  • 方法入口——不要记录方法入口,除非它非常重要或者是日志处于调试级别。
  • 循环中的数据——避免在循环的多次迭代中记录日志。如果是小循环,或者是间歇性的记录,倒也无妨。
  • 大消息或者文件的内容——截断或者使用一些有利于调试的方式进行汇总。
  • 良性错误——那些不是真正错误的错误会令日志的读者感到困惑。当异常处理是执行成功的一部分时,有时会遇到这种情况。
  • 重复的错误——不要重复的记录相同或者相似的错误。这样可能会快速的填满日志,并且隐藏掉真正的问题。各种类型错误发生的频率最好有监视器(monitor)处理。日志只需捕获问题的详细信息。

多个日志级别

不要把所有信息都记录在同一个日志级别中。绝大多数的日志库都提供多个级别,系统启动时可以进行指定。这样可以很方便的控制日志详尽程度。

典型的级别有:

  • 调试——最详细,但是只有在开发或者调试时适用。
  • 信息——最常用的级别。
  • 警告——奇怪的或者不在预期之内的一些状态,但是可接受。
  • 错误——有错误发生,但是流程不受影响。
  • 严重——流程无法继续,系统将关闭或者重启。

从实际使用来讲,只需要两种级别的日志配置:

  • 生产环境——除了调试级别,其他全开。如果生产环境发生了问题,日志应该能够指明原因。
  • 开发和调试——编写新代码或是尝试复现问题时,打开全部级别。

测试日志同样重要

日志的质量对于测试代码和产品代码同样重要。当一次测试运行失败时,日志应当明确的指出这个错误是来自测试本身还是生产系统。如果做不到这一点,那么测试的日志是有问题的。

测试日志应该必需包括:

  • 测试执行环境
  • 初始状态
  • 准备步骤
  • 用例步骤
  • 与系统的交互
  • 期望的结果
  • 实际的结果
  • 清理步骤

利用临时日志队列实现条件性的详细信息控制

发生错误时,日志应当包含大量的详细信息。但不幸的是,当遇到一个错误时,导致这个错误发生的详细信息可能已经无法获得了。如果你听从了“不要记录过多”的建议,在错误日志之前的那些日志可能无法提供足够的细节。解决这个问题的一个好的方式是,在内存中创建临时的日志队列。在事务的处理过程中,将每一步的详尽信息追加到队列中。如果事务成功完成,丢弃这个队列,只记录一个汇总。如果发生了错误,就把错误和队列里的全部内容记录下来。这一方法对于系统交互的日志尤其有效。

问题是机遇

当生产环境出现问题时,你必将集中精力寻找并且修复问题,但是也不要忘记考虑一下日志。如果你费了很大力气才找到问题的原因,这将是个非常好的机会来改善你的日志。修复问题前,先修复你的日志记录,使其可以清楚的指明问题原因。如果这个问题又一次发生,将会很容易辨认。

如果无法复现问题,或者测试结果不确定,改进日志以便可以在问题再次发生时将其记录下来。

在整个开发的生命周期内,都应该持续的利用问题来改进日志。写新代码时,试着少用debug,只使用日志。这些日志是否能够说明发生了什么?如果不能,日志就是不充分的。

最好也记录性能数据

记录时间数据可以用来帮助定位性能问题。例如,要找到一个大型系统的超时原因是很困难的,除非你能够追踪每一个重要处理步骤的耗时情况。这是很容易做到的,只需记录那些可能会比较耗时的调用的开始和结束时间即可:

  • 重要的系统调用
  • 网络请求
  • CPU密集运算
  • 连接设备的交互
  • 事务
  • 在多线程和多进程中追踪痕迹

在涉及到多线程或多进程的处理时,要为事务创建独一的标识。事务初始化时创建ID,将它传入每一个为此事务工作的部分。当记录关于此事务的日志时,每一个部分都应该记录下这个ID。这样,在多个事务并行执行时,追踪一个特定的事务会容易很多。

监控和日志相互完善

一个生产服务应该既有日志也有监控。监控提供了一种实时的对于系统状态的统计汇总。它可以提醒你,是否一定比例的某个类型请求失败了,是否系统正在经受不正常的流量访问,性能是否在下降,或者其他的一些异常。在某些情况下,只是这些信息就可以为找到问题原因提供线索。不过,大多数情况下,监控警报只是为了简单的触发你的调查。监控将问题的症状展现给我们。日志则针对各个事务提供了详细的信息和状态,这样你才能全面的理解问题的原因。


英语原文:

by Anthony Vallone

How long does it take to find the root cause of a failure in your system? Five minutes? Five days? If you answered close to five minutes, it’s very likely that your production system and tests have great logging. All too often, seemingly unessential features like logging, exception handling, and (dare I say it) testing are an implementation afterthought. Like exception handling and testing, you really need to have a strategy for logging in both your systems and your tests. Never underestimate the power of logging. With optimal logging, you can even eliminate the necessity for debuggers. Below are some guidelines that have been useful to me over the years.

Channeling Goldilocks

Never log too much. Massive, disk-quota burning logs are a clear indicator that little thought was put in to logging. If you log too much, you’ll need to devise complex approaches to minimize disk access, maintain log history, archive large quantities of data, and query these large sets of data. More importantly, you’ll make it very difficult to find valuable information in all the chatter.

The only thing worse than logging too much is logging too little. There are normally two main goals of logging: help with bug investigation and event confirmation. If your log can’t explain the cause of a bug or whether a certain transaction took place, you are logging too little.

Good things to log:

  • Important startup configuration
  • Errors
  • Warnings
  • Changes to persistent data
  • Requests and responses between major system components
  • Significant state changes
  • User interactions
  • Calls with a known risk of failure
  • Waits on conditions that could take measurable time to satisfy
  • Periodic progress during long-running tasks
  • Significant branch points of logic and conditions that led to the branch
  • Summaries of processing steps or events from high level functions - Avoid logging every step of a complex process in low-level functions.

Bad things to log:

  • Function entry - Don’t log a function entry unless it is significant or logged at the debug level. Data within a loop - Avoid logging from many iterations of a loop. It is OK to log from iterations of small loops or to log periodically from large loops.
  • Content of large messages or files - Truncate or summarize the data in some way that will be useful to debugging.
  • Benign errors - Errors that are not really errors can confuse the log reader. This sometimes happens when exception handling is part of successful execution flow.
  • Repetitive errors - Do not repetitively log the same or similar error. This can quickly fill a log and hide the actual cause. Frequency of error types is best handled by monitoring. Logs only need to capture detail for some of those errors.

There is More Than One Level

Don't log everything at the same log level. Most logging libraries offer several log levels, and you can enable certain levels at system startup. This provides a convenient control for log verbosity.

The classic levels are:

  • Debug - verbose and only useful while developing and/or debugging.
  • Info - the most popular level.
  • Warning - strange or unexpected states that are acceptable.
  • Error - something went wrong, but the process can recover.
  • Critical - the process cannot recover, and it will shutdown or restart.

Practically speaking, only two log configurations are needed:

  • Production - Every level is enabled except debug. If something goes wrong in production, the logs should reveal the cause.
  • Development & Debug - While developing new code or trying to reproduce a production issue, enable all levels.

Test Logs Are Important To

Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or production system. If it doesn't, then test logging is broken.

Test logs should always contain:

  • Test execution environment
  • Initial state
  • Setup steps
  • Test case steps
  • Interactions with the system
  • Expected results
  • Actual results
  • Teardown steps

Conditional Verbosity With Temporary Log Queues

When errors occur, the log should contain a lot of detail. Unfortunately, detail that led to an error is often unavailable once the error is encountered. Also, if you’ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions.

Failures and Flakiness Are Opportunities

When production problems occur, you’ll obviously be focused on finding and correcting the problem, but you should also think about the logs. If you have a hard time determining the cause of an error, it's a great opportunity to improve your logging. Before fixing the problem, fix your logging so that the logs clearly show the cause. If this problem ever happens again, it’ll be much easier to identify.

If you cannot reproduce the problem, or you have a flaky test, enhance the logs so that the problem can be tracked down when it happens again.

Using failures to improve logging should be used throughout the development process. While writing new code, try to refrain from using debuggers and only use the logs. Do the logs describe what is going on? If not, the logging is insufficient.

Might As Well Log Performance Data

Logged timing data can help debug performance issues. For example, it can be very difficult to determine the cause of a timeout in a large system, unless you can trace the time spent on every significant processing step. This can be easily accomplished by logging the start and finish times of calls that can take measurable time:

  • Significant system calls
  • Network requests
  • CPU intensive operations
  • Connected device interactions
  • Transactions

Following the Trail Through Many Threads and Processes

You should create unique identifiers for transactions that involve processing across many threads and/or processes. The initiator of the transaction should create the ID, and it should be passed to every component that performs work for the transaction. This ID should be logged by each component when logging information about the transaction. This makes it much easier to trace a specific transaction when many transactions are being processed concurrently.

Monitoring and Logging Complement Each Other

A production service should have both logging and monitoring. Monitoring provides a real-time statistical summary of the system state. It can alert you if a percentage of certain request types are failing, it is experiencing unusual traffic patterns, performance is degrading, or other anomalies occur. In some cases, this information alone will clue you to the cause of a problem. However, in most cases, a monitoring alert is simply a trigger for you to start an investigation. Monitoring shows the symptoms of problems. Logs provide details and state on individual transactions, so you can fully understand the cause of problems.

你可能感兴趣的:(Google Testing Blog:Optimal Logging)