You might be familiar with Linux load averages already. Load averages are the three numbers shown with the uptime
and top
commands - they look like this:
Most people have an inkling of what the load averages mean: the three numbers represent averages over progressively longer periods of time (one, five, and fifteen minute averages), and that lower numbers are better. Higher numbers represent a problem or an overloaded machine. But, what's the the threshold? What constitutes "good" and "bad" load average values? When should you be concerned over a load average value, and when should you scramble to fix it ASAP?
First, a little background on what the load average values mean. We'll start out with the simplest case: a machine with one single-core processor.
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator ... sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they're in for delays.
So, Bridge Operator, what numbering system are you going to use? How about:
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages