Best practices for logging in REST APIs
In a large-scale distributed environment, log data may be the only information
available to the developer for debugging an issue. Auditing and logging, if done
right, can help tremendously in figuring out such production issues and in replaying
the sequence of steps that occurred before the issue. The following sections list a few
best practices for logging that help in understanding system behavior and in reasoning
about performance and other issues.
Including a detailed consistent pattern across service logs
It is a good practice for a logging pattern to at least include the following:
• Date and current time
• Logging level
• The name of the thread
• The simple logger name
• The detailed message
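A minimal sketch of such a pattern is shown below, assuming SLF4J with Logback as the backing implementation; the class and message are made up for illustration:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logback pattern covering the five items listed above (configured in logback.xml):
// <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
public class OrderResource {
    private static final Logger LOGGER = LoggerFactory.getLogger(OrderResource.class);

    public void createOrder(String orderId) {
        // Emits, for example:
        // 2015-03-14 10:22:31.045 [http-nio-8080-exec-3] INFO  OrderResource - Creating order 42
        LOGGER.info("Creating order {}", orderId);
    }
}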
Obfuscating sensitive data
It is very important to mask or obfuscate sensitive data in production logs to avoid
the risk of compromising confidential and critical customer information. Password
obfuscators can be used in the logging filter to mask passwords, credit card numbers,
and so on in the logs. Personally identifiable information (PII) is information that
can be used by itself or along with other information to identify a person. Examples
of PII are a person's name, e-mail address, credit card number, and so on. Data
representing PII should be masked using techniques such as substitution, shuffling,
and encryption.
For more details, check http://en.wikipedia.org/wiki/Data_masking.
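As an illustration, here is a minimal sketch of a hypothetical obfuscation helper that could be called from a logging filter; the field names and regular expressions are assumptions and would need to match the actual payload formats in use:

import java.util.regex.Pattern;

public final class LogObfuscator {

    // Mask anything that looks like a "password" JSON field value.
    private static final Pattern PASSWORD =
            Pattern.compile("(\"password\"\\s*:\\s*\")[^\"]*(\")");

    // Mask all but the last four digits of a 13-16 digit card number.
    private static final Pattern CARD = Pattern.compile("\\b\\d{9,12}(\\d{4})\\b");

    private LogObfuscator() {
    }

    public static String mask(String message) {
        String masked = PASSWORD.matcher(message).replaceAll("$1*****$2");
        return CARD.matcher(masked).replaceAll("************$1");
    }
}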
Identifying the caller or the initiator as part of the logs
It is a good practice to identify the initiator of the call in the logs. The API may
be called by a variety of clients, for example, mobile applications, the Web, or other
services. Adding a way to identify the caller can help debug issues when the problems
are specific to a particular client.
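A minimal sketch of such caller identification, assuming JAX-RS 2.0 and SLF4J's MDC, is shown below; the X-Client-Id header name is an assumption, and any agreed-upon client identifier would do:

import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.ext.Provider;
import org.slf4j.MDC;

@Provider
public class CallerIdentificationFilter implements ContainerRequestFilter {

    @Override
    public void filter(ContainerRequestContext requestContext) {
        String clientId = requestContext.getHeaderString("X-Client-Id");
        // Make the caller available to every log statement in this request,
        // for example via a %X{clientId} token in the logging pattern.
        MDC.put("clientId", clientId != null ? clientId : "unknown");
    }
}

The MDC entry should be cleared again in a matching response filter so that pooled threads do not carry stale values into the next request.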
Do not log payloads by default
Provide a configurable option to log payloads so that no payload is logged by default.
This ensures that, for resources dealing with sensitive data, payloads are not logged
unless logging is explicitly enabled.
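One minimal way to sketch such a switch is with a system property; the property name api.log.payloads is made up here, and leaving it unset keeps payload logging off:

import org.slf4j.Logger;

public final class PayloadLogging {

    // false unless the JVM is started with -Dapi.log.payloads=true
    private static final boolean LOG_PAYLOADS = Boolean.getBoolean("api.log.payloads");

    private PayloadLogging() {
    }

    public static void logPayload(Logger logger, String payload) {
        if (LOG_PAYLOADS) {
            logger.debug("Request payload: {}", payload);
        }
    }
}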
Identifying meta-information related to the request
Every request should be logged with details on how long it took to execute, the status
of the request, and the size of the request. This will help identify latency issues as
well as any other performance issues that may come up with large messages.
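A minimal sketch of capturing this meta-information with a pair of JAX-RS 2.0 filters is shown below; the property key and the log format are assumptions:

import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.container.ContainerResponseContext;
import javax.ws.rs.container.ContainerResponseFilter;
import javax.ws.rs.ext.Provider;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Provider
public class RequestMetricsFilter implements ContainerRequestFilter, ContainerResponseFilter {

    private static final Logger LOGGER = LoggerFactory.getLogger(RequestMetricsFilter.class);
    private static final String START = "request-start-nanos";

    @Override
    public void filter(ContainerRequestContext request) {
        request.setProperty(START, System.nanoTime());
    }

    @Override
    public void filter(ContainerRequestContext request, ContainerResponseContext response) {
        Object start = request.getProperty(START);
        long elapsedMs = start == null ? -1 : (System.nanoTime() - (Long) start) / 1_000_000;
        // getLength() returns -1 when the container has not set a Content-Length yet.
        LOGGER.info("{} {} -> status={} timeMs={} length={}",
                request.getMethod(), request.getUriInfo().getPath(),
                response.getStatus(), elapsedMs, response.getLength());
    }
}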
Tying the logging system with a monitoring system
Ensure the data from the logs can also be tied to a monitoring system, which can
collect data related to SLA metrics and other statistics in the background.
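As one possible illustration, the sketch below uses the Dropwizard Metrics library (an assumption; any metrics library would serve) so that the same timer that backs the log data also feeds a reporter running in the background:

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class MonitoredResource {

    private static final MetricRegistry REGISTRY = new MetricRegistry();
    private static final Timer GET_ORDERS = REGISTRY.timer("orders.get.latency");

    static {
        // In production this would typically be a Graphite or JMX reporter instead.
        ConsoleReporter.forRegistry(REGISTRY)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build()
                .start(1, TimeUnit.MINUTES);
    }

    public String getOrders() {
        try (Timer.Context context = GET_ORDERS.time()) {
            return "[]"; // fetch and return the orders here
        }
    }
}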
Case studies of logging frameworks in distributed environments across various platforms
Facebook has developed a homegrown solution called Scribe, which
is a server for aggregating streaming log data. It can handle a
large number of requests per day across globally distributed servers.
The servers send data that can be processed, diagnosed, indexed,
summarized, or aggregated. Scribe is designed to scale to a very
large number of nodes and to be robust enough to survive network
and node failures. A Scribe server runs on every node in the system
and is configured to aggregate messages and send them in larger
groups to a central Scribe server. If the central Scribe server goes
down, the local Scribe server writes messages to a file on the local
disk and resends them when the central server recovers. For more
details, check https://github.com/facebookarchive/scribe.
Dapper is Google's tracing system, which samples data from
thousands of requests and provides sufficient information to trace them.
Traces are collected in local logfiles and then pulled into Google's
BigTable database. Google has found that sampling sufficient information
for common cases is enough to trace the details. For more details, check
http://research.google.com/pubs/pub36356.html.