Finding Performance Bottlenecks in Linux

There isn't a computer professional who hasn't wondered, at some point, whether their systems are slow because of legitimate load or because of inefficiency. The beauty is there's no real reason to sit and wonder. In the case of Linux (and many other operating systems), all of the information you need is at your fingertips. You just have to know how to find it.

Computing bottlenecks occur in four basic areas: CPU, RAM, network, and disk I/O. Linux offers a huge collection of tools for collecting and viewing information about each. Let's take a look at some useful techniques, and at some of the easier solutions for each area if you find problems.

CPU Performance Inspection

Most new computers today come with multiple CPUs, or some approximation thereof. Some tools allow you to view the individual performance of each of these. However, since the goal here is to measure overall performance, this article focuses on working with a single CPU value. See the man page for each command to find out whether it offers flags to go further.

One excellent tool for monitoring CPU performance is sar. This program may not be installed by default on your system; if it isn't, look for the sysstat utilities package for your distribution. Typing sar without any arguments gives you something similar to what you'll see in Figure 1.

Figure 1: An example of default sar output.

From left to right, sar gives you the time the measurement was taken, which CPU it's reporting on (or in our case, all as a collective whole), and then the percentage of CPU in use at that time for:

  • %user - User space (non-kernel programs)
  • %nice - Programs whose priority had been altered with the nice or renice commands
  • %system - Kernel space (the kernel itself plus modules)
  • %iowait - Waiting to fulfill a disk I/O request
  • %steal - Forced to wait for the hypervisor to finish servicing another virtual CPU, in the case of virtual machines
  • %idle - Waiting for new instructions

While all of these columns are interesting, the one that quickly lets you determine if you're CPU-bound is %idle. In the case of Figure 1, this CPU (or bank of CPUs) is practically at the beach on vacation. If %idle were consistently close to zero instead, you would need to consider upgrading the CPU, stopping unnecessary processes, or moving some of the services off of this computer and onto another to relieve the CPU load.
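If you'd rather not eyeball the %idle column, a few lines of Python can pull it out of saved sar output. This is only a minimal sketch: the sample rows are made-up illustrative values (not the contents of Figure 1), the function names are hypothetical, and the 10 percent idle floor is an assumption you should tune for your own machines.

```python
# Made-up sample in the layout of default `sar` output; NOT real measurements.
SAMPLE_SAR = """\
12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      0.25      0.00      0.18      0.05      0.00     99.52
12:20:01 AM     all      0.31      0.00      0.20      0.03      0.00     99.46
Average:        all      0.28      0.00      0.19      0.04      0.00     99.49
"""


def idle_samples(sar_text):
    """Return the %idle value from every data row (the Average row is skipped)."""
    idle_pos = None
    samples = []
    for line in sar_text.splitlines():
        fields = line.split()
        if "%idle" in fields:
            # Count from the end of the row so that a one-token (24-hour)
            # or two-token (AM/PM) timestamp does not shift the column.
            idle_pos = fields.index("%idle") - len(fields)
            continue
        if idle_pos is None or not fields or fields[0].startswith("Average"):
            continue
        try:
            samples.append(float(fields[idle_pos]))
        except (ValueError, IndexError):
            continue  # banner lines, restarts, or anything else non-numeric
    return samples


def looks_cpu_bound(samples, idle_floor=10.0):
    """True when the average %idle is below idle_floor (an assumed threshold)."""
    return bool(samples) and sum(samples) / len(samples) < idle_floor


if __name__ == "__main__":
    idle = idle_samples(SAMPLE_SAR)
    print("idle samples:", idle)
    print("CPU-bound?", looks_cpu_bound(idle))
```

To try it on your own data, save the output of sar to a file and pass the file's contents to idle_samples() in place of the sample text.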

RAM Performance Inspection

The nice thing about sar is that you can also use it to look at your memory. When invoked as sar -r, you see something similar to Figure 2.

Figure 2: An example of sar memory output, invoked with sar -r.

From left to right, this output tells us the time the sample was taken, and then:

  • kbmemfree - Unused memory in KB
  • kbmemused - Amount of memory utilized by user space applications in KB
  • %memused - The percentage of your RAM currently in use
  • kbbuffers - Amount of memory in KB that your kernel is using to buffer data
  • kbcached - Amount of memory in KB that the kernel is using to cache data
  • kbswpfree - Unused swap space in KB
  • kbswpused - Used swap space in KB
  • %swpused - The percentage of your swap space currently in use
  • kbswpcad - Amount of cached swap in KB

Again, while all of these columns are useful, two give you a quick picture of whether your problem is with memory: %memused and %swpused. While Figure 1 showed a CPU that was sunning itself in Aruba, %memused shows that this computer is consistently operating at the edge of its RAM capacity. The %swpused column, on the other hand, tells us that the machine isn't being pushed so hard that it has to move data from RAM into swap space on the hard drive. For the timespan shown in the measurements, then, you aren't experiencing poor performance.
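The reasoning above (RAM nearly full, swap barely touched, so no bottleneck yet) can be written down as a small check. Treat this as a sketch under stated assumptions: it expects the column layout listed above, the sample rows are invented for illustration, the function names are hypothetical, and the 90 and 25 percent thresholds are arbitrary starting points rather than recommended values.

```python
# Invented sample in the `sar -r` column layout described in this article.
SAMPLE_SAR_R = """\
12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
12:10:01 AM     10240   1013760     99.00     81920    614400   2097000       152      0.01         0
12:20:01 AM     12288   1011712     98.80     81920    612352   2097000       152      0.01         0
Average:        11264   1012736     98.90     81920    613376   2097000       152      0.01         0
"""


def memory_rows(sar_text):
    """Return a (%memused, %swpused) pair for each data row."""
    positions = None
    rows = []
    for line in sar_text.splitlines():
        fields = line.split()
        if "%memused" in fields and "%swpused" in fields:
            # Positions counted from the end of the row, so the timestamp
            # format (one token or two) does not matter.
            positions = tuple(fields.index(name) - len(fields)
                              for name in ("%memused", "%swpused"))
            continue
        if positions is None or not fields or fields[0].startswith("Average"):
            continue
        try:
            rows.append(tuple(float(fields[p]) for p in positions))
        except (ValueError, IndexError):
            continue
    return rows


def memory_verdict(memused, swpused, mem_high=90.0, swap_high=25.0):
    """Classify one sample; both thresholds are assumptions, not rules."""
    if swpused >= swap_high:
        return "swap in heavy use: likely short on RAM"
    if memused >= mem_high:
        return "RAM nearly full but swap quiet: keep watching"
    return "memory looks comfortable"


if __name__ == "__main__":
    for memused, swpused in memory_rows(SAMPLE_SAR_R):
        print(memused, swpused, memory_verdict(memused, swpused))
```

If your version of sar prints a different set of columns for -r, memory_rows() simply returns an empty list, which is your cue to adjust the column names.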

However, don't be alarmed by the fact that this machine looks like it's one step from having to push things into swap. Linux deliberately puts otherwise idle RAM to work as buffers and cache (the kbbuffers and kbcached columns) and hands it back when applications need it, so a high %memused is normal. When memory does run short, the kernel's memory manager keeps the most recently used pages in physical RAM and moves the least recently used ones into swap, so the raw percentages of how much RAM and swap you're using don't show the whole picture. Typing ps aux will let you see how many processes at a particular time are running (R in ps's STAT column or top's S column) or sleeping (S), and what percentage of memory (and CPU) each is using. A sleeping process is just waiting for something to do; it isn't necessarily in swap, but its memory is the most likely candidate to be moved there. Knowing how much RAM, how much swap, and how many processes are sleeping, along with how much RAM these processes are using, will help you better understand if you're having RAM bottlenecks. Factors such as shared memory can also make it look like you're using more RAM than you really are.
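Counting sleeping processes by hand in ps aux output gets old quickly. The following sketch tallies processes and their combined %MEM by the first letter of the STAT column. The sample listing is made up for illustration (user names, PIDs, and values are all hypothetical), and memory_by_state is a name chosen for this example. Keep in mind that %MEM counts shared memory once per process, so the totals can overstate real usage.

```python
from collections import defaultdict

# Made-up `ps aux` listing for illustration only.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  10368   712 ?        Ss   Jan01   0:01 init [3]
mysql     2210  1.2 22.4 812340 230112 ?       Sl   Jan01  12:44 /usr/sbin/mysqld
apache    3150  0.3  4.0 250112 41020 ?        S    09:12   0:02 /usr/sbin/httpd
alice     4021 93.0  0.5  12044  5120 pts/0    R+   10:30   0:09 gzip bigfile
"""


def memory_by_state(ps_text):
    """Map each process state letter to (process count, summed %MEM)."""
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    mem_col = header.index("%MEM")
    stat_col = header.index("STAT")
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines[1:]:
        # Limit the split so a command containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue
        state = fields[stat_col][0]  # "Ss" and "Sl" both count as S
        totals[state][0] += 1
        totals[state][1] += float(fields[mem_col])
    return {state: (count, round(mem, 1))
            for state, (count, mem) in totals.items()}


if __name__ == "__main__":
    for state, (count, mem) in sorted(memory_by_state(SAMPLE_PS).items()):
        print(f"state {state}: {count} processes, {mem}% of RAM")
```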

The solutions for improving RAM performance are similar to those for CPU: add more RAM, stop unnecessary programs, or move some of your services off onto another machine. It's also possible that you're suffering memory leaks or that something you're running is very RAM-inefficient. These topics bear further discussion in another article.

Disk I/O Performance Inspection

Yet another reason to use sar is that this Swiss army knife of performance information tools can also tell you how your drives are doing. Type sar -dp and you'll see something like what's shown in Figure 3.

Figure 3: The beginning of sar I/O output, invoked with sar -dp.

This combination of flags shows you information per device, as seeing just the summary information (sar -b) doesn't give you any real reference points at a glance. From left to right, this output gives you the time the measurement was taken, as well as:

  • DEV - The physical device in question
  • rd_sec/s - Number of sectors (1 sector = 512 bytes) read per second
  • wr_sec/s - Number of sectors written per second
  • avgrq-sz - Average size, in sectors, of the requests issued to the device
  • avgqu-sz - Average queue length of requests issued to the device
  • await - Average number of milliseconds I/O requests for this device had to wait before being handled, including how long it took to handle them
  • svctm - Average number of milliseconds the device took to service each I/O request, not counting time spent waiting in the queue
  • %util - Percentage of time during which I/O requests were being issued to the device; values close to 100% mean the device is saturated

Notice that in this case the percentage is not the most interesting value. avgqu-sz and await are the two most useful values for determining if you have an I/O-bound machine. The longer the queue, the more requests are piling up before they're serviced. The longer they have to wait before being serviced, the slower everything gets.
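Since the per-device output can run long, here is one way to pick out the devices whose queue length or wait time stands out. It is a rough sketch: the sample rows are invented, io_hotspots is a hypothetical name, and the limits (a queue longer than 1 request, waits above 20 milliseconds) are assumptions rather than universal rules. Real sar -d output usually carries a tps column as well, which the sample includes; the parser looks columns up by name, so extra columns do no harm.

```python
# Invented sample in the layout of `sar -dp` output; values are illustrative.
SAMPLE_SAR_D = """\
12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM       sda      4.10      0.00     65.60     16.00      0.01      2.40      1.10      0.45
12:10:01 AM       sdb    310.00   1200.00  18400.00     63.23      6.80     41.50      3.10     96.10
12:20:01 AM       sda      3.90      0.00     62.40     16.00      0.01      2.10      1.00      0.39
12:20:01 AM       sdb    295.00    900.00  17700.00     63.05      5.20     35.00      3.20     94.40
"""


def io_hotspots(sar_text, queue_limit=1.0, await_limit_ms=20.0):
    """Return {device: (worst avgqu-sz, worst await)} for devices over a limit."""
    cols = {}
    worst = {}
    for line in sar_text.splitlines():
        fields = line.split()
        if "DEV" in fields and "avgqu-sz" in fields and "await" in fields:
            # Negative positions keep working whatever the timestamp format.
            cols = {name: fields.index(name) - len(fields)
                    for name in ("DEV", "avgqu-sz", "await")}
            continue
        if (not cols or len(fields) < -min(cols.values())
                or fields[0].startswith("Average")):
            continue
        try:
            queue = float(fields[cols["avgqu-sz"]])
            wait = float(fields[cols["await"]])
        except ValueError:
            continue
        device = fields[cols["DEV"]]
        old_queue, old_wait = worst.get(device, (0.0, 0.0))
        worst[device] = (max(old_queue, queue), max(old_wait, wait))
    return {device: values for device, values in worst.items()
            if values[0] > queue_limit or values[1] > await_limit_ms}


if __name__ == "__main__":
    for device, (queue, wait) in io_hotspots(SAMPLE_SAR_D).items():
        print(f"{device}: queue up to {queue}, waits up to {wait} ms")
```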

On an I/O-bound machine, solutions include faster drives (including RAID arrays and other remote storage), organizing your partitions so that I/O-heavy programs aren't all trying to write to the same physical drive, and of course splitting off services onto other machines to spread the load. Very high disk I/O values could in fact mean that you're using a lot of swap.

Network Performance Inspection

While sar (as sar -n ALL) can also show you network performance data, in this case it's a bit of overkill. A quick ifconfig (you may need to include the path) can give you some basic information for a quick visual inspection, as shown in Figure 4.

Figure 4: Network information displayed with /sbin/ifconfig.

The key to understanding this output for performance monitoring purposes is to know that TX stands for transmit and RX stands for receive. If you see values greater than zero for errors, dropped, overruns, or collisions, then you may very well have a network bottleneck problem. The first thing to do is check all of your connections, and equipment such as switches and hubs. Also, check at a few different times and see if the problem is persistent. If it continues, it bears further investigation.
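If you have more than a couple of interfaces, scanning for nonzero counters is easy to script. The sketch below works on saved ifconfig output. The sample text is fabricated for illustration (addresses and counts are not from a real machine), nonzero_counters is a hypothetical name, and the regular expression is an assumption meant to cope with both the older "errors:0" and the newer "errors 0" styles of ifconfig output.

```python
import re

# Fabricated ifconfig output in the older net-tools layout.
SAMPLE_IFCONFIG = """\
eth0      Link encap:Ethernet  HWaddr 00:16:3E:00:00:01
          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1283402 errors:0 dropped:0 overruns:0 frame:0
          TX packets:983211 errors:12 dropped:3 overruns:0 carrier:0
          collisions:47 txqueuelen:1000

lo        Link encap:Local Loopback
          RX packets:5120 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5120 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
"""

COUNTER = re.compile(r"\b(errors|dropped|overruns|collisions)[:\s]\s*(\d+)")


def nonzero_counters(ifconfig_text):
    """Return {interface: {counter name: value}} for counters above zero."""
    problems = {}
    iface = None
    for line in ifconfig_text.splitlines():
        if not line.strip():
            continue
        tokens = line.split()
        if not line[0].isspace():
            # An unindented line starts a new interface block.
            iface = tokens[0].rstrip(":")
        direction = tokens[0] if tokens[0] in ("RX", "TX") else ""
        for name, value in COUNTER.findall(line):
            if iface is None or int(value) == 0:
                continue
            key = name if name == "collisions" else f"{direction} {name}".strip()
            problems.setdefault(iface, {})[key] = int(value)
    return problems


if __name__ == "__main__":
    for iface, counters in nonzero_counters(SAMPLE_IFCONFIG).items():
        print(iface, counters)
```

An interface that never appears in the result has nothing but zeros in those four counters, which is what you want to see.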

In the case of all four of these issues, this article just skims the surface of both investigation techniques and solutions. In general, you'll want to take these measurements multiple times to see if the problems are persistent or come and go. You might even want to set up cron jobs to take these measurements on an automatic basis.
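As a starting point for automated measurements, the sketch below runs one command and appends its output, with a timestamp, to a log file. Everything specific here is an assumption made for illustration: the function name record_snapshot, the log path, and the choice of sar -r 1 3 (three memory samples, one second apart) in the commented-out call. Note too that the sysstat package normally sets up its own scheduled collection, which is where the history shown by a plain sar comes from, so check what you already have before adding more.

```python
import subprocess
from datetime import datetime


def record_snapshot(command, log_path):
    """Run command once and append its timestamped output to log_path.

    Returns the command's exit status. Raises FileNotFoundError if the
    program is not installed or not on the PATH.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} {' '.join(command)} "
                  f"(exit {result.returncode}) ===\n")
        log.write(result.stdout)
    return result.returncode


# Hypothetical usage; the log path is only an example:
# record_snapshot(["sar", "-r", "1", "3"], "/tmp/perf-snapshots.log")
```

Saved as a script with the last line uncommented, this could be run from a cron job at whatever interval suits you, and the resulting log gives you the "multiple times" view that a single measurement can't.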

Further installments will address the larger issues of monitoring performance over time, making tweaks that don't involve having to upgrade hardware, and things developers can do to address performance issues with their own software.

Dee-Ann LeBlanc is a freelance writer, editor, trainer, course developer, and journalist essentially specializing in helping people better understand Linux and open source.
