1. Tools Method: A tools-oriented method is as follows:
1) List available performance tools (optionally, install or purchase more).
2) For each tool, list the useful metrics it provides.
3) For each metric, list possible rules for interpretation.
The result of this is a prescriptive checklist showing which tool to run, which metrics to read, and how to interpret them. While this can be fairly effective, it relies exclusively on available (or known) tools, which can provide an incomplete view of the system, similar to the streetlight anti-method.
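As a rough illustration, the resulting checklist can be held as a nested mapping of tool, metric, and rule. This is a minimal sketch: `vmstat` and `iostat -x` are real Linux tools and `r` and `%util` are real columns in their output, but the rule wording and the names `TOOLS_CHECKLIST` and `checklist_lines` are assumptions made for this example.

```python
# Tools-method checklist: tool -> metric -> interpretation rule.
# The interpretation rules below are illustrative assumptions only.
TOOLS_CHECKLIST = {
    "vmstat": {
        "r": "run-queue length; sustained values above the CPU count "
             "suggest CPU saturation",
    },
    "iostat -x": {
        "%util": "device busy percent; sustained high values suggest "
                 "a busy disk",
    },
}


def checklist_lines(checklist):
    """Return one line per (tool, metric) pair, in insertion order."""
    lines = []
    for tool, metrics in checklist.items():
        for metric, rule in metrics.items():
            lines.append(f"run {tool}: read {metric}: {rule}")
    return lines


if __name__ == "__main__":
    for line in checklist_lines(TOOLS_CHECKLIST):
        print(line)
```

The checklist can only cover tools that appear as keys in the mapping, which is the limitation described above.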
2. The USE Method: The utilization, saturation, and errors (USE) method should be used early in a performance investigation to identify systemic bottlenecks. It can be summarized this way:
For every resource, check utilization, saturation, and errors.
These terms are defined as follows:
a) Resource: all physical server functional components (CPUs, buses, ...). Some software resources can also be examined, provided the metrics make sense.
b) Utilization: for a set time interval, the percentage of time that the resource was busy servicing work. While busy, the resource may still be able to accept more work; the degree to which it cannot do so is identified by saturation.
c) Saturation: the degree to which the resource has extra work that it can't service, often waiting on a queue.
d) Errors: the count of error events.
In contrast with the tools method, the USE method iterates over system resources instead of tools. This helps you create a complete list of questions to ask, and only then do you search for tools to answer them. The USE method also directs analysis to a limited number of key metrics, so that all system resources are checked as quickly as possible.
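The iteration over resources can be sketched as a small question generator. This is a minimal sketch: the resource list and the names `RESOURCES`, `METRIC_TYPES`, and `use_questions` are hypothetical, and a real list would be drawn from the system under study.

```python
# Hypothetical resource list, used only for illustration.
RESOURCES = ["CPUs", "memory", "network interfaces", "storage devices"]
METRIC_TYPES = ["utilization", "saturation", "errors"]


def use_questions(resources, metric_types=METRIC_TYPES):
    """Return one question per (resource, metric type) pair.

    Questions are grouped by resource, so every resource gets all
    three checks before moving on to the next one.
    """
    return [
        f"Check {metric} for {resource}"
        for resource in resources
        for metric in metric_types
    ]


if __name__ == "__main__":
    for question in use_questions(RESOURCES):
        print(question)
```

The questions are generated before any tool is chosen, so a question with no tool to answer it shows up as a known gap rather than being silently skipped.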
3. The USE method metrics are usually expressed as follows:
a) Utilization: as a percent over a time interval (e.g., "One CPU is running at 90% utilization").
b) Saturation: as a wait-queue length (e.g., "The CPUs have an average queue length of four").
c) Errors: as the number of errors reported (e.g., "This network interface has had 50 late collisions").
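A minimal sketch of how the three values might be derived from raw measurements, assuming the resource exposes busy time, sampled queue lengths, and a cumulative error counter. The function names and sample inputs are hypothetical; they are chosen to reproduce the figures in the examples above.

```python
def utilization_percent(busy_seconds, interval_seconds):
    """Percent of the interval during which the resource was busy."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * busy_seconds / interval_seconds


def average_queue_length(queue_samples):
    """Mean wait-queue length over evenly spaced samples."""
    if not queue_samples:
        raise ValueError("need at least one sample")
    return sum(queue_samples) / len(queue_samples)


def errors_in_interval(counter_start, counter_end):
    """Errors reported during the interval, from a cumulative counter."""
    return counter_end - counter_start


if __name__ == "__main__":
    print(utilization_percent(9.0, 10.0))      # busy 9 s out of 10 s
    print(average_queue_length([3, 5, 4, 4]))  # four queue-length samples
    print(errors_in_interval(120, 170))        # counter went from 120 to 170
```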
Here are some general suggestions for interpreting the metric types:
a) Utilization: utilization at 100% is usually a sign of a bottleneck (check saturation and its effect to confirm). Utilization beyond 60% can be a problem for a couple of reasons: depending on the interval, it can hide short bursts of 100% utilization. Also, some resources such as disks (but not CPUs) usually cannot be interrupted during an operation, even for higher-priority work. As utilization increases, queueing delays become more frequent and noticeable.
b) Saturation: any degree of saturation (any nonzero value) can be a problem. It may be measured as the length of a wait queue, or as time spent waiting on the queue.
c) Errors: nonzero error counters are worth investigating, especially if they are increasing while performance is poor.
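These suggestions can be sketched as a simple rule function, followed by a small demonstration of how an interval average can hide a burst. This is a minimal sketch: the function name `interpret`, the finding messages, and the per-second samples are assumptions for illustration, while the 100% and 60% thresholds come from the suggestions above.

```python
def interpret(utilization_pct, saturation, error_delta):
    """Apply the general USE suggestions and return a list of findings.

    error_delta is the increase in the error counter over the interval.
    """
    findings = []
    if utilization_pct >= 100:
        findings.append("utilization at 100%: likely bottleneck, "
                        "confirm with saturation")
    elif utilization_pct > 60:
        findings.append("utilization above 60%: may hide bursts "
                        "or cause queueing delays")
    if saturation > 0:
        findings.append("nonzero saturation: work is waiting")
    if error_delta > 0:
        findings.append("errors increasing: investigate")
    return findings


if __name__ == "__main__":
    # Hypothetical per-second utilization samples over a 10-second interval:
    # three seconds fully busy, seven seconds at 40%.
    per_second = [100, 100, 100, 40, 40, 40, 40, 40, 40, 40]
    average = sum(per_second) / len(per_second)
    print(average, max(per_second))
    print(interpret(average, saturation=0, error_delta=0))
    print(interpret(100, saturation=2, error_delta=0))
```

With these samples the 10-second average stays below 60% even though the resource was fully busy for three of those seconds, which is why the measurement interval matters when reading utilization.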