If you’re old enough to remember floppy drives, you’ve heard the symptoms of a disk I/O bottleneck. For example, while Oregon Trail loaded the next scene, you’d hear the drive grinding away, reading data from the disk. The CPU would sit idle during this time, twiddling its thumbs while it waited for data. If that floppy drive had been faster, you’d be running the Columbia River rapids by now.
It’s more difficult to detect an I/O bottleneck if the disk isn’t on your desktop. I’ll look at four important disk I/O questions for web apps:
Disk I/O encompasses the input/output operations on a physical disk. If you’re reading data from a file on a disk, the processor needs to wait for the file to be read (the same goes for writing).
The killer when working with a disk? Access time. This is the time required for a computer to process a data request from the processor and then retrieve the required data from the storage device. Since hard disks are mechanical, you need to wait for the disk to rotate to the required disk sector.
Disk latency is around 13 ms, though it varies with the quality and rotational speed of the hard drive. RAM latency is around 83 nanoseconds. How big is the difference? If RAM were an F-18 Hornet with a max speed of 1,190 mph (more than 1.5x the speed of sound), disk access would be a banana slug with a top speed of 0.007 mph.
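If you want to check the math, the analogy holds up to a quick sanity check. Here’s a small, purely illustrative Python snippet using the figures quoted above:

```python
# Sanity-check the analogy using the latency figures quoted above.
disk_latency_ns = 13_000_000   # ~13 ms
ram_latency_ns = 83            # ~83 ns
print(f"Disk is ~{disk_latency_ns / ram_latency_ns:,.0f}x slower than RAM")   # ~156,627x

# The F-18 vs. banana slug comparison lands in the same ballpark.
print(f"An F-18 is ~{1190 / 0.007:,.0f}x faster than a banana slug")          # ~170,000x
```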
This is why caching data in memory is so important for performance – the difference in latency between RAM and a hard drive is enormous.
Your I/O wait measurement is the canary for an I/O bottleneck: it’s the percentage of time your processors spend waiting on the disk.
For example, let’s say it takes 1 second to grab 10,000 rows from MySQL and perform some operations on those rows.
The disk is being accessed while the rows are retrieved. During this time the processor is idle: it’s waiting on the disk. If disk access accounts for 700 ms of that second, I/O wait is 70%.
You can check your I/O wait percentage via top, a command available on every flavor of Linux.
If your I/O wait percentage is greater than 1/(number of CPU cores), then your CPUs are waiting a significant amount of time for the disk subsystem to catch up.
For example, suppose top reports an I/O wait of 12.1% on a server with 8 cores (you can count cores via cat /proc/cpuinfo). That’s very close to the threshold of 1/8 = 0.125, or 12.5%. If I/O wait is consistently hovering around that level, disk access may be slowing the application down.
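If you’d rather not eyeball top, here’s a minimal Python sketch (assuming a Linux host, since it reads /proc/stat) that samples the CPU counters over a few seconds, computes the I/O wait percentage, and compares it to the 1/(number of cores) rule of thumb. It’s illustrative only, not a monitoring tool:

```python
# Minimal sketch: compute the I/O wait percentage from /proc/stat and compare
# it to the 1/(number of cores) rule of thumb. Linux-only, illustrative.
import os
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        return [float(x) for x in f.readline().split()[1:]]

def iowait_percent(interval=5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0   # 5th field is iowait

wait = iowait_percent()
cores = os.cpu_count() or 1
threshold = 100.0 / cores   # e.g. 12.5% on an 8-core server
print(f"I/O wait: {wait:.1f}%  (rule-of-thumb threshold: {threshold:.1f}% for {cores} cores)")
if wait >= threshold:
    print("Your CPUs are spending a significant amount of time waiting on the disk.")
```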
For random disk access (a database, mail server, file server, etc.), you should focus on how many input/output operations can be performed per second (IOPS).
Four primary factors impact IOPS:
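However those factors combine, the physics of a spinning platter put a hard ceiling on random IOPS: each random I/O pays an average seek plus, on average, half a rotation of the platter. Here’s a back-of-the-envelope sketch; the drive numbers are assumptions for a typical 7200 RPM SATA disk, not figures measured for this article:

```python
# Back-of-the-envelope IOPS estimate for a spinning disk: each random I/O
# pays an average seek plus (on average) half a rotation of the platter.
def estimated_iops(rpm, avg_seek_ms):
    rotational_latency_ms = (60_000 / rpm) / 2   # half a rotation, in ms
    return 1000 / (avg_seek_ms + rotational_latency_ms)

# Assumed specs for a typical 7200 RPM SATA drive (illustrative only):
# ~9 ms average seek + ~4.2 ms rotational latency -- roughly the ~13 ms
# access time mentioned earlier -- works out to about 75 IOPS.
print(f"~{estimated_iops(7200, 9.0):.0f} IOPS")
```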
Even if a banana slug follows all of the tips in The 4 Hour Body, it will never be as fast as an F-18 Hornet. Likewise, you can tune your disk hardware for better performance, but it’s complicated and will not approach the speed of RAM.
If you’re hitting a disk I/O bottleneck now, tuning your hardware likely isn’t the fastest remedy: hardware changes typically involve significant testing, data migration, and coordination between application developers and sysadmins.
When we see I/O bottlenecks at the Blue Box Group, we first try to tweak the service that’s using the most I/O and cache more of its data in RAM. For example, we usually configure our database servers to have as much RAM as possible (up to 64 GB or so), and then have MySQL cache as much as possible in memory.
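For MySQL, the usual knob for “cache as much as possible in memory” is InnoDB’s buffer pool (the innodb_buffer_pool_size setting). Here’s a rough sketch, assuming InnoDB and a host dedicated to MySQL, that suggests a buffer pool of about 75% of RAM; the percentage is a common rule of thumb, not a hard rule:

```python
# Rough sketch: suggest an InnoDB buffer pool size of ~75% of total RAM on a
# host dedicated to MySQL (a common rule of thumb, not a universal rule).
def suggested_buffer_pool_bytes(fraction=0.75):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                total_kb = int(line.split()[1])   # value is reported in kB
                return int(total_kb * 1024 * fraction)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

# On a 64 GB server this prints roughly 48 GB; you'd then set that value as
# innodb_buffer_pool_size in my.cnf.
print(suggested_buffer_pool_bytes())
```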
It’s important to measure disk performance on data-heavy servers so you can judge how changes in your application impact it over time. Ad-hoc readings of top output don’t give much context: is what you’re seeing normal? Is it just a momentary peak? How did I/O wait look two months ago?
Scout has two key plugins for measuring your disk performance.