tglg

RAID on Linux: tuning and troubleshooting, v. 2.65

This work is licensed under a Creative Commons Attribution 2.5 License.

Other texts

Abstract

In order to animate a database server (MySQL and PostgreSQL) in an OLTP (mostly read requests, data occupying approx 800GB) context I bought (November 2006) a 3ware disk controller connected to a SuperMicro mainboard, bi-Xeon bi-core (hyperthreading OFF) 3.2 GHz, model 6, stepping 4, 2 MB L0 cache, 12 GB RAM, running Debian (etch) Linux AMD64 (2.6.18, now 2.6.24.2, no patches).

On the RAID side I compared 3ware integrated functions to the 'md' Linux driver (beware: 'md' is not 'dm' (which is ATAPI RAID)).

This induced major performance-related problems. This document describes this process and offers hints, especially useful to the beginner, when it comes to measuring or enhancing disk performance under Linux.

Interested? Concerned? Comment or suggestion? Drop me a note!

Please let me know which problem you were able to fix, especially if it is not listed here.

Context

The first controller used was a 9550SX-12.

It managed 6 spindles IBM/Hitachi Desktar HDS725050KLA360 (7200 rpm, 16 Mo cache, average seek time: 8.2 ms)

It was replaced by a 9650SE-16ML and we added 6 drives: Western Digital Caviar GP 1 TB WD10EACS (5400 rpm(?), 8.9 ms(?)).

In this document any information not starting with "9650" refers to the 9550SX-12.

System layout

Adopted disk subsystem layout:

A hardware RAID1 for:
- swap partition. Rationale: if a swap drive dies the system may crash, therefore I put it on a RAID1. I don't parallelize I/O (stripe) the swap because it will only be used to store rarely-used pages, not to sustain memory overcoming.
- root filesystem
- journal of the data filesystem
A software (Linux 'md') RAID10 for the data.

The RAID1 is done and I now test the 10 others drives. This setup is currently tested (mainly RAID10 on 3are and on Linux 'md'), if you want me to run a given (random I/O) test/benchmark tool please describe it to me, along with the reason why it is pertinent. Disks available for tests: 500 GB Hitachi (6 spindles), 1 TB Western Digital (4 spindles)

Objective

I'm mainly interested in small (~1k to ~512k, ~26k on average) random reads and marginally (10%, by amount of request and data size) in small random writes, while preserving decent sequential throughput.

Assessment

In vanilla condition the RAID5 array (6 Hitachi) managed by the 3ware 9550 gave ~55 random read/second (IOPS, or IO/s), just like the disk used by my Thinkpad bought in 2003, then ~140 after tweaking. That's ridiculous because everyone credits any modern disk (7200 rpm, don't neglect rotational delay) with at least 80 random reads/second. On a 6-spindle RAID5 there are always 5 drives potentially useful (at least 400 fully random IOPS), therefore I expect at the very least 300 random IOPS (64KB blocks).

I'm disappointed (this is mitigated by the 3ware support service patent will to help) and, for now, can not recommend any 3ware controller. The controller may not be the culprit but if some thingy (mainboard, cable, hard disk...) does not work at best why is this high-end controller unable to warn about it? Why is it able to obtain fairly good sequential performance?

Terminology

A glossary proposed in 3ware's user's guide (revision March 2007, page 270) contains:

Stripe size. The size of the data written to each disk drive in RAID unit levels that support striping.

'man md' (version 2.5.6) uses the term 'stripe' and explains:

The RAID0 driver assigns the first chunk of the array to the first device, the second chunk to the second device, and so on until all drives have been assigned one chunk. This collection of chunks forms a stripe. ((...)) When any block in a RAID4 array is modified the parity block for that stripe (i.e. the block in the parity device at the same device offset as the stripe) is also modified so that the parity block always contains the "parity" for the whole stripe.

... and so on. For 'md' a 'chunk' resides on a single spindle and a 'stripe' is formed by taking a chunk per spindle, while for 3ware the 'stripe size' is the size of what 'md' calls a 'chunk', which is part of what 'md' calls a 'stripe'.

Problems

The default 3ware RAID5 (hardware) parameters on 6 spindles ensured:

fair sequential read throughput (about 200 MB/s),
bad sequential write (50 MB/s)
awful random (max 55 IOPS when randomly reading 64K blocks)

Moreover many intensive I/Os suck the commanding processor into 'iowait' state. This is annoying, not critical.

After tweaking I obtained ~140 random read IOPS (iozone) on 64k blocks and no more iowaits pits. Replacing the RAID5 by a RAID10 (on a 9650, but I bet that a 9550 offers the same) offered a tremendous improvement: ~310 IOPS (randomio mix), 300 MB/s sequential read, ~105 MB sequential write.

I now test 'md' on top of isolated disks ('single', in 3ware terminology: RAID is not done by the controller), obtaining with randomio mix (same parameters as above) a realistic ~450 (randomio, raw device) and ~320 (filesystem level) mixed (20% write) IOPS. With 4 more drives this RAID10 'md' made of 10 heterogeneous disks realistically (all requests served in less than 1/2 second, average ~27 ms) tops at ~730 IOPS (randomio mix, XFS).

Pertinent documents

Here are interesting pieces about similar or related problems:

Why prefer Linux software RAID?
Linux RAID Smackdown: Crush RAID 5 with RAID 10
Random read on 3ware 9550
Too many years of awful 3ware performance(...),
iowait: use a recent kernel (2.6.24 and up), and reduce (dychotomy!) nr_requests (echo 256 > /sys/block/DeviceName/queue/nr_requests) while keeping an eye upon performances. Read Kernel Bug Tracker Bug 7372 (some, albeit often obsolete, interesting stuff on (RedHat) Bugzilla Bug 121434: Extremely high iowait with 3Ware array and moderate disk activity)
3Ware SATA Raid 9550SX Performance. Check the conclusion
3ware + RAID5 + xfs performance (gbakos, 2005)
ZFS, XFS, and EXT4 filesystems compared
the invaluable 'Linux RAID' mailing-list and wiki
Unix (Linux): optimization
The Storage Brick (Harry Mangalam)
CentOS and disk optimization

Discussion

Software RAID (Linux MD) allows spindle hot-swap (with the 3Ware SATA controller in JBOD setup). In addition, Linux MD allows you to say that you KNOW that even though a RAID appears bad, it's okay. That is, you can reinsert the failed drive, and if you KNOW that the data hasn't changed on any drives, you can ask MD to just pretend that nothing happened. (Many thanks to Jan Ingvoldstad for this information!).

The amount of interrupts/context switches may induce CPU load, while bus load is not neglectable because, on a software RAID1, every block written has to be sent to at least two spindles, while the RAID hardware controller needs only a single copy then writes it to as many spindles as necessary.

A 3ware controller is AFAIK optimized for streaming, but I fail to understand how this bias can alleviate random I/O performance. The bottomline, according to some, is that 3ware was sold (in 2006) to AMCC which may be less interested in the Linux market.

RAID5 slows write operations down because it must write the 'parity' block, which is, as far as I know, an XOR of all data blocks contents (or is it a Reed-Solomon code?). In order to maintain it, even a single block write must be preceeded by reading the old block and the existing parity block (this is the "small write problem"). It seems OK to me because my application mainly does read operations, moreover RAID5 eats up a smaller amount of disk space (for 'parity' information) than RAID1 (mirroring), leaving more disk space for data.

RAID5 may not be a clever choice, many experts prefer RAID10 which can simultaneously serve up to a request per spindle and performs better in degraded mode (albeit some wrote that RAID5 already does that). The XFS filesystem is also often judged more adequate than ext3 for huge filesystems. Any information about this, especially about how the 3ware 9550 manages this, will be welcome.

AFAIK the larger the stripe size, up to the average size (or harmonic mean?) of data read per a single request, the better it is for random access because the average number of spindles mobilized during a single read operation will be low, meaning that many spindles will be able to simultaneously seek for data. Allocating a too lenghty stripe size may lower IOPS since uselessly large blocks are then moved around.

Is it that the 3ware controller checks parity on each access, leading to reading at least a block on each disk (a 'stripe' + parity) wherever a single-block read is sufficient? Mobilizing all disks for each read operation leads to an access time equal to the access time of the slowest drive (the 'mechanically slowest' or the one doing the longest seek), which may explain our results. I don't think so as there is no hint about this in the user's manual, moreover the corresponding checking (tw_cli'show verify', 'start verify'... commands) would be only useful for sleeping data, as such faults will be detected on-the-fly during each read, but it can be a bug side-effect. A reply from the kind 3ware customer support service ("if any one of your disks in RAID has slower performance compared to other, it will result in poor performance") seems weird... I then asked if "does any read ... lead to a full-stripe read, even if the amount of data needed is contained in a single disk block? In such a case will switching to RAID10 lead to major gain in random read performance (IOPS)?" and the reply was "the above mentioned statement is common(not only for RAID5)".

At this point it appears that some bug forbids in any/most cases, at least with my (pretty common) setup, reading simultaneously different blocks on different spindles. I asked the support service: "Is the 9550 able, in RAID5 mode, to sort its request queue containing mainly single-disk-block read requests in order to do as many simultaneous accesses to different spindles as possible? If not, can it do that in RAID10 mode?". (Is anyone happy with the random I/O performance of a 3ware 9xxx under FreeBSD or MS-Windows? Which RAID level and hard disks do you use?)

The controller seems able to do much better on a RAID10 array, but my tests did not confirm this.

Writing on a RAID5

Some say that, in order to update (write into an already written stripe) on a RAID5, all blocks composing the stripe must be read in order to calculate the new 'parity'. This is false, the new parity is equal to "(replaced data) XOR (new data) XOR (existing parity)", therefore the logic only has to read the old block and the parity block, calculate then write the new blocks (data and parity).

The existing data block was often already read by an application request, leaving it in a cache. But the common write-postponing strategy (for example when using a BBU: battery-backed unit, which 'commits' into battery-backed memory instead of disk, leaving the real disk-write to a write-back cache) may induce a delay leading to it to vanish from the cache, implying a new disk read.

Moreover I wonder if some efficient approaches are used by software or hardware RAID stacks.

Context

Please read the abstract.

Here are some details, as seen by Linux with the 9550 and 6 Hitachi on 3ware's RAID5:

dmesg
lspci
mtrr
cpuinfo
CLI: show all c0 (9650)
CLI: show all c0u0 (9650)
CLI: show all c0p0
CLI: show all c0p1 (9650)
CLI: show all c0p2
CLI: show all c0p3
CLI: show all c0p4
CLI: show all c0p5

As for the IO queue scheduler ("I/O elevator"), I prefer using CFQ because it theoritically enables the box to do database serving at high priority along with some low-priority batches also grinding the RAID (but ONLY when the database does not use it). Beware: use ionice if there are other processes requesting disk IO, for example by invoking your critical software with 'ionice -c1 -n7 ...', then check by 'ionice -p ((PID))'. Keep a root shell (for example under 'screen') at 'ionice -c1 -n4 ...', just to have a way to control things if a process in the realtime I/O scheduler class goes havoc.

Testing

Monitoring

To appreciate system behavior and measure performances with various parameters I used, by always letting at least one of them run in a dedicated window:

dstat

Easy to grasp

iostat

Part of the 'sysstat' package. Useful to peek at system (kernel and disk) activity.

Use, for example, with '-mx DeviceName 1'. All columns are useful (check '%util'!), read the manpage.

top

Check the 'wa' column (CPU time used to wait for IO)

Approach

Define your first objective: high sequential read throughput? high sequential write? fast random (small) read? fast random (small) write? Do you need a respected absolute maximal service time per request or the highest agregated throughput (at the cost of some requests requiring much more time)? Do you want low CPU load during IO (if your job is IO bound you may want to accellerate IO even if it loads the CPU)?... Devise those with respect to the needs (application), then chose a second objective, a third, and so on...

Devise your standard set of tests, with respect to your objectives. It can be a shell script invoking the test tools, do the housekeeping work: dumping all current disk subsystem parameters, emptying the caches, keep the results and a timestamp in a file... (not on the tested volume!). It may accept an argument disabling, for the current run, some types of tests.

You will run it then tweak just a single parameter then launch it and assess the results, then tweak again, assess, tweak, assess... From time to time emitting an hypothesis and trying to devise a sub test validating it.

Context

Jan Ingvoldstad wrote: In all the test cases I see on the net comparing software RAID (typically Linux MD) with hardware RAID, there seems to be so much attention on running a "clean" system to get the raw performance figures that people forget what they're going to do with all those disks. The system will most likely be doing something _else_ than just filesystem I/O. It will either be an integrated server running applications locally, or it will be serving file systems remotely via CIFS or NFS. This introduces additional performance hits, both in terms of CPU and memory usage, which are not negligible.

We don't want any other use of the disk subsystem during the test, therefore:

test in 'single-user' mode, without any application software running, is recommended. This is impossible if the machine is a remote one
stop all non-system or needed daemons, especially those reading or writing frequently or massively to any filesystem, is necessary. Modify the system startup procedure in order to forbid them from starting upon boot. Don't forget to re-enable them after the tweakings.

The test must reflect your need, in other terms it must simulate the real activity of your application (software mix). Therefore the ideal test is done with real applications and data, but it can be pretty long if you test every combination of the potential values of every parameter, leading us to eliminate most combinations by first using dedicated (empirical) testing software, delivering an interval of useful values (or even a single value) for every system parameter, then selecting a combination by testing at the application level those limited intervals of values.

Linux always uses the RAM not used by programs in order to maintain a disk cache. To shorten the running time of dedicated testing tools we may reduce the buffercache size because we are interested in DISK performance, not in buffercache efficiency (overall system optimization is better done at application level), therefore the system has to have as few RAM as possible. I use the kernel boot parameter "mem=192M" (a reboot is needed. To check memory usage invoke 'free'). "192M" is adequate for me, but the adequate value depends upon many parameters, establish it empirically by checking that, when all filesystems to be tested are mounted and without any test running, your system does not use the swap and that "free" shows that 'free', 'buffers' and 'cached' cumulate at least 20 MB. Beware: fiddling with this can hang your system or put it into a thrashing nightmare, so start high then gradually test lower values. When not enough RAM is available during boot some daemons launched by init may lead to swap-intensive activity (thrashing), to the point of apparent crash. Play it secure by disabling all memory consuming daemons then testing with "mem" set at 90% of your RAM, rebooting and checking, then reducing again and testing, and so on.
Don't shrink accessible memory it to the point of forbidding the system to use some buffercache for metadata (invoke 'ls' two times in a row, if the disk is physically accessed each time there is not enough RAM!). Moreover be aware that such constrained memory may distort benchmarks (this is for example true for the 3ware controller), so only use it to somewhat reduce (reasonably!) available RAM.

The '/sbin/raw' command (Debian package 'util-linux') offers a way to bypass the buffercache (and therefore the read-ahead logic), but is not flexible and can be dangerous. I didn't use it for benchmarking purposes. Using the O_DIRECT flag for open(2) is much more efficient.

Disable the swap ('swapoff -a'), in order to avoid letting Linux use it to store pages allocated but rarely used (it does this in order to obtain memory space for the buffercache, and we don't want it!).

Disabling the controller cache (example with tw_cli, for the controller 0 - unit 1: "/c0/u1 set cache=off") is not realistic becauses its effect, especially for write operations (it groups the blocks), is of paramount importance.

Seting up

use the most recent stable kernel
using the manufacturer specific software tool or SMART ('smartctl', begin with something along with '--device=3ware,0 --all /dev/twa0' line):
- disable all 'acoustic' ("quiet") on the drive. This is especially useful if the drive disconnects from time to time
- enable the controller's write cache (tw_cli: "set cache=on")
- disable the disk look ahead (warning: this is the low-level 'read-ahead', done by the disk itself, not the proper 'read-ahead' which is done by the operating system, at filesystem or block-device level)
- check the protocol level (SATA 1? SATA 2?), set SATA-2 if the disk is labelled
- check the max throughput of the disk-to-controller link (often by default 1.5 Gbps instead of 3.0)
- temporarily disable APM
- disable LBA (drive 'geometry' remapping) in favor of CHS (Cylinder Head Sector) which enables the controller to know the physical disk geometry. This is highly experimental, not alway even possible and if you try it check the available disk space afterwards
upgrade each hard disk firmware and check each parameter. Right now the last reply to me from the 3ware customer support service is: "if any one of your disks in RAID has slower performance compared to other, it will result in poor performance. Kindly run the disk drive diagnostics on the drives using this article.". I have a problem with that, because IMHO the 3ware controller must detect (thru performance analysis/SMART/...) and report such problems and mishaps, even thanks to a dedicated test procedure which has to be used remotely as (many 3ware-equiped machines are hosted servers)
search in the disk manufacturer's documents (knowledge base/FAQ/Troubleshooting guide...)
check SETUP ('BIOS') parameters. Especially:
- enable interleaved memory ("memory bank interleaving", in fact). It may only be possible with a sufficient amount of installed RAM sticks. This may boost write performance
- disable 'spread spectrum'. AFAIK it lowers electromagnetic interferences ('EMI') by varying frequencies. This can interfer with rapidly clocked signals. Some mainboard/controllers/disks seem ready for it, others may not be. Test! In case of doubt: disable. Beware: a 3ware controller always honours it when the drive requests it, so try disabling it at the drive level
- disable CPU frequency scaling (seems sometimes preferable but will raise temperature, accelerating hardware wear. Test!)
'cat /proc/mtrr' It is OK for anything to be non-cachable?
update 3ware's 'tw_cli' and the controller's firmware (3ware software downloadable from their website, see the 3ware website, "Service and Support/Software download")
using 'tw_cli', tweak various parameters (beware! read the manual, which is a PDF published on the 3Ware website): 'Storsave policy' to 'performance', then 'qpolicy' (NCQ on drive) to 'ON' or 'OFF' ("/c0/u0 set qpolicy=on"), then ensure (cat /proc/sys/fs/aio-nr) that your applications and system (some systems only emulate it) do as much asynchronous I/O (aio) as possible. Don't forget that most RDBMS optimizations are only possible with multiple concurrent queries and not available for a single one, although some are working on it. Moreover enabling NCQ may, on some systems and disks, disable any look-ahead implicitly done by the disk or the controller, which may benefit to random read performance but will probably hurt performance on sequential read.
(TODO) are the drives and controller doing 'multi-initiator' transfers? read dox then ask Hitachi / Western Digital / 3Ware

A potential culprit, at least for slow write operations, lies in Q12546. I played with setpci in order to enable this ''Memory Write and Invalidate' (lspci now shows that the 3Ware controller 9550 is in 'MemWINV+' instead of 'MemWINV-' mode), maybe enhancing write throughput. The 9650 is in 'MemWINV-' mode. This seems somewhat frequent with SuperMicro mainboards, check this with your system integrator or SuperMicro support service. This may be somewhat tied to the "interleaved memory". With the 9650 my PCI parameters are as follows:

03:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID (rev 01)
        Subsystem: 3ware Inc 9650SE SATA-II RAID
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- <128ns, L1 <2us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM- Suprise- LLActRep+ BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
        Capabilities: [100] Advanced Error Reporting 
        Kernel driver in use: 3w-9xxx
        Kernel modules: 3w-9xxx

Avoid IRQ sharing, the 3Ware card must have its own IRQ line. Check it with 'cat /proc/interrupts'. Dont let the card share a PCI I/O controller with any other heavy-I/O card. Insert it in other slots.
(test) enable/disable IRQ balancing: envvar 'CONFIG_IRQBALANCE' and 'irqbalance' daemon (or use IRQBALANCE_BANNED_INTERRUPTS, check 'irqbalance' manpage)
I did not tweak the 'SWIOTLB' bounce-buffer because in case of slow random I/O it seems irrelevant. Moreover the problems were patent even with only 196 MB RAM taken into account by the kernel (albeit the fact that the RAM circuits were in place may somewhat let this SWIOTLB thing act, but there was no sign of this in kernel's log when booting with "mem=196M"). Read /usr/src/linux/Documentation/x86_64/boot-options.txt. Try iommu=memaper=3 ; swiotlb=65536 (not tested by me)
Potential SMP and/or ACPI-related problem: test various kernel parameters combinations: "maxcpus=1" (no SMP), "noapic", "noacpi", "acpi=off", "pci=noacpi", "acpi=noirq"

Miquel van Smoorenburg's well-informed hints:

You need to make sure that the nr_requests (kernel request queue)
is at least twice queue_depth (hardware requests queue).
Also the deadline or cfq i/o schedulers work a bit better for
database-like workloads.

Try something like this, replacing sda with the device name
of your 3ware controller.

	# Limit queue depth somewhat
	echo 128 > /sys/block/sda/device/queue_depth
	# Increase nr_requests
	echo 256 > /sys/block/sda/queue/nr_requests
	# Don't use as for database-like loads
	echo deadline > /sys/block/sda/queue/scheduler
CFQ seems to like larger nr_requests, so if you use CFQ, try 254
(maximum hardware size) for queue_depth and 512 or 1024 for nr_requests.

Oh, remember, if you have just created a RAID array on
the disks, wait with testing until the whole array has been
rebuild..

He also wrote:

There are a couple of things you should do:

1. Use the CFQ I/O scheduler, and increase nr-requests:
echo cfq > /sys/block/hda/queue/scheduler
echo 1024 > /sys/block/hda/queue/nr_requests

2. Make sure that your filesystem knows about the stripe size
and number of disks in the array. E.g. for a raid5 array
with a stripe size of 64K and 6 disks (effectively 5,
because in every stripe-set there is on disk doing parity):

# ext3 fs, 5 disks, 64K stripe, units in 4K blocks
mkfs -text3 -E stride=$((64/4))

# xfs, 5 disks, 64K stripe, units in 512 bytes
mkfs -txfs -d sunit=$((64*2)) -d swidth=$((5*64*2))

3. Don't use partitions. Partions do not start on a multiple of
the (stripe_size * nr_disks), so your I/O will be misaligned
and the settings in (2) will have no or an adverse effect.
If you must use partitions, either build them manually
with sfdisk so that partitions do start on that multiple,
or use LVM. ((editor's note: beware!))

4. Reconsider your stripe size for streaming large files.
If you have say 4 disks, and a 64K
stripe size, then a read of a block of 256K will busy all 4 disks.
Many simultaneous threads reading blocks of 256K will result in
trashings disks as they all want to read from all 4 disks .. so
in that case, using a stripesize of 256K will make things better.
One read of 256K (in the ideal, aligned case) will just keep one
disk busy. 4 reads can happen in parallel without trashing. Esp.
in this case, you need the alignment I talked about in (3).

5. Defragment the files.
If the files are written sequentially, they will not be fragmented.
But if they were stored by writing to thousands of them appending
a few K at a time in round-robin fashion, you need to defragment..
in the case of XFS, run xfs_fsr every so often.
-=-=-=

the internal queue size of some 3ware controllers (queue_depth) is larger
than the I/O schedulers nr_requests so that the I/O scheduler doesn't get
much chance to properly order and merge the requests.

I've sent patches to 3ware a couple of times to make queue_depth
writable so that you can tune that as well, but they were refused
for no good reason AFAICS. Very unfortunate - if you have 8 JBOD
disks attached, you want to set queue_depth for each of them to
(max_controller_queue_depth / 8) to prevent one disk from starving
the other ones, but oh well.

Read the Linux Hardware RAID Howto

(matthias platzer) very low nr_requests (like 32 or 64), That will limit the iowait times for the cpus ((ed. note: this may be related to NCQ, which is often limited to 64 simultaneous requests, albeit our test did not confirm this hypothesis))
(matthias platzer) try _not_ putting the system disks on the 3ware card, because additionally the 3ware driver/card gives writes priority. People suggested the unresponsive system behaviour is because the cpu hanging in iowait for writing and then reading the system binaries won't happen till the writes are done, so the binaries should be on another io path ((editor's note: if this proves true one may, for the same reason, try to place the swap on a specific disk/array))
avoid LVM on 3ware (Red Hat LVM list), albeit there may be ways to alleviate the burden.
XFS: Dave Pratt kindly wrote to me about a way to maintain stripe alignment on XFS. He thinks that:

The 'su' parameter in XFS is equivalent to the "stripe" setting in 3ware's BIOS and tw_cli. Such that a "stripe" setting in the 3ware setup of "64KB" is matched with "-d su=64k" in the mkfs.xfs command line. The 'sw' parameter is then determined based on the number of data disks within the RAID group multiplied by the stripe size. Such that in a RAID5 with a 64KB stripe 'sw' would be set like (n-1)*64 and with RAID6 it would be (n-2)*64 or with RAID10 ((n/2)*64).
The man page for mkfs.xfs is a bit confusing as to how these setting work exactly and the 'xfs_info' output is further confusing as it uses a different block size for displaying its units.
fs mounted with 'noatime' option. Maybe also "commit=X" (with X pretty high, for example 50000). See 'man mount'
if read throughput is poor during mixed operations (many writes!) try using deadline and its parameters: /sys/block/sd?/queue/iosched/read_expire, /sys/block/sd?/queue/iosched/write_expire and /sys/block/sd?/queue/iosched/writes_starved . Read Documentation/block/deadline-iosched.txt about all this.
try soft raid (Linux: 'md') on 'Single Disk', see for example this testimony (fs-oriented)
the filesystem journal setup can have a major impact (no journal, journal on main disk array, journal on a dedicated array...). Beware: some setups may add to fsck duration.
check performance on raw RAID (w/o any LVM, partition or filesystem)
use blktrace (along with a kernel in debug mode)
check that your mainboard does not use any 'PCI riser' adapter
borrow new cables (from controller to disks)
use another power supply and outlet (some disks behave strangely when badly feeded)
write performance: use a BBU (this enhances write operations, especially sync'ed ones, when ones does not want to risk losing data considered commited by the kernel)
write performance: use a separate spindle dedicated to the intensive and sequential operations (especially write), for example the fs journal or the MySQL log
describe your problem to all support service (3ware and disk manufacturer)

Exploration

We now have a test procedure, enabling us to efficiently assess the effect of any single tweak. Here are some ways to modify the system setup. Tweak one parameter at a time, then test. The next section proposes a few testing tools.

Most hardware-oriented hints published below are geared towards the 3ware (and probably any similar hardware)?

Let's proceed with the tweaking session.

Adequate procedure and setup

To avoid confusion create an explicit name for each device then use only this long name. Example:

# cd /dev
# ln -s md1 md_RAID10_6Hitachi

Caution: some tests write on the disk, therefore any device storing precious data must be preserved. Do a backup, verify it then (it is not on a volume that must be writable for the system to run) protect the device using 'blockdev --setro /dev/DeviceName', then mount it in read-only mode: 'mount -o ro ...'.

Periodically

Right at the beginning, then periodically, you have to ensure that no other software uses the disk subsystem. At the end of a test temporarily cease testing then check activity using 'top' and 'iostat DeviceName 1' (the 'tps' column must contain almost only '0') or 'dstat' (the '-dsk/total-' column must reveal that there is nearly no I/O). In a word: monitor

Beware: those tools alone are not adequate for hardware RAID (3ware) because they only show what Linux knows, and an intelligent controller act behind the curtains. Therefore, at least when using hardware RAID, check the arrays status (output of "./tw_cli show", the column 'NotOpt' must contain 0, if it contains anything else gather intelligence using 'show all') in order to be quite sure that there is no housekeeping ('init', 'rebuild', 'verify'...) running (the INIT phase of a RAID5 made of six 500 GB spindles (2.3 TB available), doing nothing else, lasts ~7 hours).

If you use the 'md' software RAID check that all spindles are OK and that there is no rebuilding in action, for example with 'mdadm -D /dev/md0'. The best way do do it is to verify that all the drives blink when used, then to stay in front of the machine during all the testing session and stop testing from time to time, but this is pretty inconvenient.

Performance evaluation ('testing') with dedicated tools

This is a delicate field, especially because most testing tools are not adequate (some don't even offer, at least as an option, to check that what they write was effectively written).

Disk subsystem saturation on random I/O

Do not test for random I/O with strictly classic fs-oriented tools, for example 'cp' or 'tar', often are unable to reflect what a dedicated software (for example a database server) will obtain. Some standard tools, for example, won't write as they read (using async io / threads).

No data is read of written during a disk head move, therefore head movement reduces throughput. Serving immediately each request often leads to frequent head back-and-forth movements. An adequate disk subsystem therefore reduces this by retaining for a short period all pending requests in order to sort them, grouping requests concerning the same disk physical zone. In an ideal world requests will be retained for a very long time then the heads will do a single sweep through the platters surface, collecting data at full speed. Retaining requests while the disk is used raises the global throughput, however retaining them for too long add service latency (the user is waiting!), and not enough retention leads to a disk often spinning without reading or writing data because its heads are in movement (this is the worst case: a few requests will be very quickly served while most will wait for quite a long time (head movements), moreover the global throughput will be bad). An adequate disk subsystem balances all this with respect to the current disk activity and nature of the requests (urgent ones are served as fast as possible, most are served reasonably fast and grouped, while some are only done if there is nothing else waiting).

Note that, for all this to work as efficiently as possible, some pieces of software must know the real disk geometry along with head location. The operating system elevator often can't access to those informations.

Some applications are not able to saturate a given disk subsystem, for example because all needed data is always in some cache, because their architecture limits their requests throughput (they send a read request, wait for it to complete, send another one, wait... and so on). Most recent code (for example database-managing code), however, serve multiple concurrent requests thanks to multiprocessing (for example through forking a subprocess for each new connected client) or multithreading (which, for our present concerns, behaves like multi-processing) or aio (asynchronous I/O). They put the disk subsystem on intense pressure.

When tuning for such applications we need to simulate this very type of system load. Read "Random I/O: saturating the disk subsystem" in the next session.

Sample data size

Beware: define an adequate value for the amount of data used during a test ('sample set'), in order to avoid the caches, by adding up the RAM available to the kernel ("mem" parameter, this is an absolute max because the kernel will never be able to use all this as buffercache), to the sizes of the controller cache (another absolute max because it is almost never only used as a data cache as it often contains, among other things, some firmware microcode copy) and to the sizes of all disk-integrated caches (another absolute max because it also contains some microcode). This sum is a minimal sample set size, multiply it by 2 to keep a margin. The value reflecting real whole system performance (including buffercache efficiency) is established by multiplying this max amount of total potential cache by the ratio (total amount of application data DIVIDED BY total RAM, without the "mem" kernel parameter).

Example

System RAM: 12 GB (this is expressed in base 2 but diskspace measures are often expressed in base 10 (giga/gibi) let's max it up and take 12900 MB

My application manages approx 800 GB (800000 MB)
The ratio is (800000.0 / 12900): approx 62

During tests: total core memory limited to 96 MB, controller cache 224 MB, 6 disk with 16 MB each: 516 MB total. One may also disable as much caches as possible:

the on-disk cache (the one used by the hard drive itself, made of RAM inside the disk unit), thanks to some low-level software tool delivered by the disk manufacturer. Beware that the controller may re-enable it
the controller cache thanks to a 3Ware software tool (for example './tw_cli'), which may enable the on-disk cache
the Linux buffercache can not be avoided, it must be physically limited. However one can flush it between tests using 'sync ; echo 3 > /proc/sys/vm/drop_caches' (see 'man proc')

Theoritically I must use a minimal sample size of 1032 MB (* 516 2) and a 'real-life' size of approx 32GB (* 516 62).

Find the proper size for any testing programs to be used. The 'iozone' test software, for example, offers a '-s' parameter. For a test to deliver a sound result it must run at the very least least for a wallclock minute, in order to alleviate various small perturbations (mainly from the system and internal disk housekeeping). Use various values until they stabilize accross multiple runs:

#RAM available to Linux reduced thanks to the "mem" kernel parameter
$ free
             total       used       free     shared    buffers     cached
Mem:         92420      62812      29608          0       1524      16980
-/+ buffers/cache:      44308      48112
Swap:            0          0          0

Note: this particular test does not need multi-processing or multi-threading as a single thread simulates adequately the load and saturates the disk subsystem

#Using a 4 GB testfile:
$ time iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 4g
 ((...))
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         4194304      64     738    1231                       114     145

real    19m24.447s
user    0m5.876s
sys     0m42.395s

#This is realistic and repeatedly running it leads to very similar
#results. This will be our reference values. Let's see if we can reduce
#test duration while preserving results quality


#Using a 10 MB testfile:
$ iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 10m
 ((...))
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
           10240      64     788     775                     13790     493

#Comment: WAY too different from the reference, this server just cannot
#(physically) give 13000 random read/second.  The write performance is
#probably boosted by the controller disk cache ('-o' can't disable it), the
#read performance by the buffercache


#Using a 400 MB testfile:
$ time iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 400m
 ((...))
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
          409600      64     751    1300                       138     173                                                          

real    1m37.455s
user    0m0.652s
sys     0m4.392s

#The test did not run long enough

#Using a 800 MB testfile:
$ time iozone -f iozone.tmp -w -i0 -i2 -p -e -o -r64k -m -O -S 2048 -s 800m
 ((...))
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
          819200      64     704    1252                       116     152                                                          

real    3m42.477s
user    0m1.220s
sys     0m9.393s

#Seems good.

#Another run produced:
          819200      64     729    1293                       114     155                                                          
real    3m42.234s
user    0m1.180s
sys     0m9.397s

#Another one:
          819200      64     711    1283                       113     153                                                          
real    3m44.281s
user    0m1.120s
sys     0m9.629s

#800 MB is OK

Here are some dedicated tools:

iozone

Using iozone version 3_283 I invoke "time iozone -f iozone.tmp -w TEST -p -e -o -r64k -m -T -s 4g -O -S 2048", replacing 'TEST' with '-i0' for a first run (creating the testfile) then using the adequate set of tests (I ran '-i2' because I'm interested in random I/O). Replace '-r64k' with the average size of an I/O operation made by your main application, and '-s 4g' with at least 4 times the amount of RAM seen by Linux: use 'free' to display it, it is the 'Mem total' number, in the first cell (upper left corner) of the table and '-S 2048' by the amount of KB RAM in your processor's cache. Check the other iozone parameters by invoking 'iozone -h'.

Beware: on a RAID always use the '-t' parameter, in order to saturate the disk subsystem. Moreover iozone is picky about its arguments format. Check that he understood your command line arguments by reading its output header (summary of the operation mode).

One may try using iozone's '-I' option in order to forbid using the OS buffercache but I'm not sure it works as intended because it seems to forbid some parallelization.

In order to preserve the parameters values I frontalized it with a stupid shellscript 'test_io' which gathers the parameters then runs the test, enabling me to redirect its output to a file which will store the results. It was later automated with a lame Perl script automating testing of all combinations of a given set of parameters values: 'test_io_go.pl'. Please do only use them if you understand what they do and how. Note that they only run under an account authorized to 'sudo' anything without typing a password.

Here are the best results of a 3ware RAID5 (6 Hitachi), slightly edited for clarity: one, two, three. Here is one of the less convincing.

(TODO) (read-ahead parameter: 'test_io' queries it through /sbin/blockdev, but 'test_io_go.pl' sets it by writing into /sys/block/sda/queue/read_ahead_kb. They use different units! Adapt 'test_io')

(TODO) As MySQL may do aio (async I/O) and pwrite/pread let's try the corresponding iozone options:

           -H #  Use POSIX async I/O with # async operations
           -k #  Use POSIX async I/O (no bcopy) with # async operations

... but which one? I droped the idea as while using them Linux did not see any aio (cat /proc/sys/fs/aio-nr)! Moreover it reduced the IOPS per approx 25%.

randomio

It is a very fine (albeit somewhat discreete and as nearly poorly presented as this document) tool

Seeker

Evaluates average access time to the raw device (only reading tiny amounts of data to ensure that the disk heads moved), without any request parallelization (therefore only useful on a single device).

Published along with the How fast is your disk? LinuxInsight article.

I had to modify it, mainly to enable testing of huge devices and in order to enhance random coverage. Here are the diff and the source.

Compile then run:

$ gcc -O6 -Wall seekerNat.c -o seekerNat
$ sudo ./seekerNat /dev/sda

Results: best: 14.7ms, average 20, up to 25. The drives data sheet states that its average acces time is 8.2 ms, therefore there is a quirk.

fio

(TODO)

Sysbench

(TODO)

XDD

(TODO)

tiobench

(TODO)

fsbench

(TODO)

postmark

(TODO)

OSDLDBT

(TODO)

SuperSmack

(TODO)

During a test session don't forget the periodical checks.

3ware and Linux 'md': a benchmark

I pitted the 3ware 9650 RAID logic against 'md' (through the same controller, in 'single' mode). All this section's tests were by default done with the 3ware 9650, without any LVM, partition or filesystem.

Random I/O: saturating the disk subsystem

Let's find the adequate amount of simultaneous threads.

Reduce the amount of RAM available to Linux ("mem" kernel parameter):

# free
             total       used       free     shared    buffers     cached
Mem:        254012     149756     104256          0       1652      86052
-/+ buffers/cache:      62052     191960
Swap:      7815580        120    7815460

Use a representative file (create it with dd):

# ls -lh testfile
-rw-rw---- 1 root root 278G 2008-02-27 12:24 testfile

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              2.3T  279G  2.1T  12% /mnt/md0

We will use the 'randomio' software tool, varying the amount of threads and the volume of data for each I/O request ("I/O size"). You may also check, while the test runs (in another terminal), a continuously running 'iostat -mx 1', especially the last ('%util') column: when most disks are saturated (100%) there is no more room for parallelization. Example: "for t in 1 16 64 128 256 512 1024 2048 ; do sync ; echo 3 > /proc/sys/vm/drop_caches ; echo $t; randomio testfile $t 0.2 0.1 $((65536*10)) 60 1 ; done"

The template used here is "randomio testfile ((Threads)) 0.2 0.1 ((I/O Size)) 60 1" (20% write, 10% write sync'ed, 60 seconds/run, 1 run). Check the colums "total iops" along with "latency avg", "latency max" and "sdev" (std deviation): the disk subsystem, as intended, maintains throughput by trading latency (instead of trying hard to keep latency low: this is physically impossible). Operational research to the rescue :-)

'md10fs' is a 'md' RAID-10, 64K stripe, built on 10 spindles, on filesystem level: mkfs.xfs -l version=2 -i attr=2; mounted noatime,logbufs=8,logbsize=256; CFQ
'3w10' results are from a 9650 with the very same 10 spindles, in RAID10 mode, raw level.
'3w10fs': '3w10' but at filesystem level. mkfs.xfs -l version=2 -i attr=2 -d sunit=$((64*2)) -d swidth=$((10*64*2)), mounted noatime,logbufs=8,logbsize=256; CFQ
'3w5fs': RAID5 (initialized) at filesystem level. mkfs.xfs -l version=2 -i attr=2 -d sunit=$((64*2)) -d swidth=$((9*64*2)) /dev/sdb, CFQ

Beware: 'latency' is to be understood as 'during the test'. More explicitly: at the end of each test ALL requests have been satisfied, therefore the winner, IMHO, has the highest IOPS (the lowest average latency is just a bonus, and a low sdev is a bonus of a bonus).

         T
         h
         r
         e
         a
         d      I/O    total |  read:         latency (ms)       |  write:        latency (ms)
         s      size    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
-----------------------------+-----------------------------------+----------------------------------
md10fs   1      512    112.3 |  89.1    1.8   11.2  270.9    7.5 |   23.2   0.1    0.2   17.7    0.5
3w10     1      512     93.4 |  74.1    3.5   13.4  517.6   10.1 |   19.3   0.1    0.1    1.3    0.1
3w10fs   1      512    131.0 | 104.4    0.2    9.5  498.0   10.6 |   26.6   0.1    0.2   38.6    1.0
3w5fs    1      512    133.1 | 106.1    0.2    9.4  102.7    5.3 |   27.0   0.1    0.2    7.4    0.4

#3ware dominates. Best throughput, lowest latency. Sadly, this test hardly
#simulates server-class load

md10fs   8      512    539.1 |  431.3   0.2   18.4  505.5   13.4 |  107.8   0.1    0.5   81.0    2.8
3w10fs   8      512    635.0 |  508.1   0.3   14.8  185.7    8.4 |  126.9   0.1    4.0  178.8    8.5

md10fs   16     512    792.3 |  634.4   0.2   24.8  384.2   20.5 |  157.9   0.1    1.6  131.1    7.6
3w10     16     512    670.4 |  536.3   2.7   27.8  380.6   22.1 |  134.0   0.1    8.3  412.4   29.3
3w10fs   16     512    811.0 |  649.5   1.7   22.3  126.1   13.8 |  161.5   0.1    9.2  101.0   14.2
3w5fs    16     512    525.9 |  421.0   0.2   33.1  343.6   27.3 |  104.9   0.1   19.5  328.8   30.6

#Light load. Max throughput is far ahead

md10fs   24     512    846.8 |  678.5   1.8   34.3  656.8   40.5 |  168.4   0.1    4.2  407.6   16.5
3w10fs   24     512    847.4 |  678.9   2.1   31.6  523.3   21.4 |  168.4   0.1   15.0  526.3   22.2

#Up to this point the disk subsystem is at ease

md10fs   32     512    877.6 |  702.7   0.2   43.9 1218.7   60.9 |  174.8   0.1    6.6 1144.6   34.2
3w10fs   32     512    872.7 |  698.8   0.3   40.7  180.0   23.8 |  173.8   0.1   20.5  149.0   23.4

#md: some rare requests need 1.2 seconds (average and std deviation remain low)

md10fs   48     512    928.5 |  743.4   0.2   61.5 1175.0   86.3 |  185.1   0.1   12.0  680.9   40.8
3w10fs   48     512    897.4 |  718.7   2.1   58.8  212.7   31.2 |  178.7   0.1   32.0  186.3   30.3

md10fs   64     512    968.2 |  774.8   0.9   78.0 1247.2  104.5 |  193.3   0.1   18.0  857.9   61.1
3w10     64     512    825.3 |  661.1   3.0   95.5 1798.8  131.0 |  164.2   0.1    4.4  711.4   45.1
3w10fs   64     512    913.7 |  731.6   3.2   77.0  269.0   36.9 |  182.0   0.1   41.7  264.1   35.6
3w5fs    64     512    615.6 |  492.7   1.9  112.8  426.6   58.4 |  122.9   0.1   68.0  473.6   57.6

md10fs   128    512   1039.1 |  830.7   2.0  142.4 1584.5  181.5 |  208.4   0.1   44.5 1214.8  118.8
3w10     128    512    891.0 |  715.3   2.9  166.1 2367.9  216.5 |  175.7   0.1   42.8 1693.1  210.1
3w10fs   128    512    968.6 |  774.6   3.7  147.7  410.0   60.4 |  194.0   0.1   68.6  374.0   57.6
3w5fs    128    512    653.5 |  522.5   3.5  217.7  644.3   95.5 |  131.0   0.1  106.6  641.1   91.3

#3ware superbly manages latency, but the winner here is
#'md10fs' because it serves more requests in a given timeframe

md10fs   164    512   1032.2 |  824.6   0.2  173.4 1587.0  171.1 |  207.6   0.1   99.6 1195.1  167.7
md10fs   180    512   1079.1 |  862.9   1.9  185.7 1476.2  184.8 |  216.2   0.1   89.7 1073.8  147.0
md10fs   196    512   1097.8 |  877.8   2.5  200.3 1923.6  202.1 |  220.0   0.1   90.3 1115.6  151.9
md10fs   212    512   1105.9 |  884.4   0.2  220.2 1749.0  228.9 |  221.5   0.1   71.7 1273.0  145.9
md10fs   228    512   1111.1 |  888.1   2.3  236.2 2062.3  243.8 |  223.0   0.1   77.3 1098.2  147.2
md10fs   244    512   1044.5 |  834.9   2.3  257.1 1960.2  233.2 |  209.6   0.1  136.4 1241.2  202.9

#Latency get worse as throughput culminates

md10fs   256    512   1100.9 |  880.4   3.0  259.0 1933.3  235.0 |  220.5   0.1  123.6 1254.5  188.0
3w10     256    512    908.1 |  726.6   3.7  316.8 2838.8  253.3 |  181.5   0.1  138.1  982.8  261.4
3w10fs   256    512    961.5 |  768.7   2.8  308.4  607.6   95.8 |  192.9   0.1   92.9  614.1  102.0
3w5fs    256    512    648.0 |  518.2   0.3  455.3  861.2  146.2 |  129.8   0.1  148.0  975.1  157.0

#3ware limits max-latency (maximum waiting time), lowering cumulated
#output. This is one way to cope with the usual trade-off: accepting more
#incomming requests and serve "on average" faster (with some requests
#waiting) or limiting both the amount of requests AND the worst experienced
#latency for each served request?

md10fs   384    512   1089.9 |  870.4   5.0  394.0 2642.9  267.5 |  219.5   0.3  181.4 1916.2  212.9
3w10fs   384    512    960.3 |  767.4   3.0  470.1  850.4  132.5 |  193.0   0.1  110.4  877.0  148.9
3w5fs    384    512    657.9 |  525.6   3.3  682.2 1271.0  201.2 |  132.4   0.1  176.5 1318.0  222.5

md10fs   512    512   1110.2 |  887.5   4.4  522.7 2701.5  300.0 |  222.7   0.4  202.3 1619.7  211.7
3w10     512    512    912.0 |  729.7   2.8  655.8 3388.7  400.4 |  182.2   0.1  164.4  976.8  283.8
3w10fs   512    512    975.9 |  779.2   3.1  620.6  932.0  168.6 |  196.7   0.1  129.1  996.7  188.6
3w5fs    512    512    656.7 |  524.2   4.6  917.2 1427.7  239.7 |  132.5   0.1  198.7 1437.6  281.9


#With 512 threads at work a request is, on average, served in approx
#1/2 second (with 1/3 s std deviation) and the slowest needs 2 seconds,
#this is unbearable in most OLTP contexts, and getting worse:

md10fs   1K     512   1075.2 |  857.7   3.6 1101.7 3187.2  413.2 |  217.5   0.2  298.2 2414.8  366.0
3w10     1K     512    905.7 |  722.5   2.9 1359.9 4097.0  870.3 |  183.2   0.1  160.1 1047.0  268.2
3w10fs   1K     512    999.9 |  796.6   4.1 1221.9 1694.8  318.1 |  203.2   0.1  186.4 1835.9  356.6
3w5fs    1K     512    662.1 |  527.0   3.3 1828.5 2544.8  468.8 |  135.1   0.1  307.7 2632.8  562.2

md10fs   2K     512   1074.1 |  853.4   4.8 2225.7 6274.2  689.4 |  220.7   0.3  470.2 4030.0  729.1
3w10fs   2K     512    978.0 |  776.3   2.7 2495.7 3207.4  633.2 |  201.7   0.1  313.3 3353.5  736.9
3w5fs    2K     512    667.1 |  527.6   3.9 3633.7 4533.4  912.0 |  139.5   0.1  496.4 4961.4 1113.5

#32K per data block

md10fs   1      32K    107.8 |   85.5   0.3   11.5   54.5    5.2 |   22.3   0.2    0.6  412.8   11.3
3w10     1      32K     90.5 |   71.7   2.6   13.9  413.5    7.8 |   18.8   0.1    0.1    1.1    0.1
3w10fs   1      32K    116.4 |   92.5   0.3   10.7  399.8    8.2 |   23.8   0.1    0.5  433.5   11.5
3w5fs    1      32K    115.4 |   91.7   0.3   10.9  381.2    8.0 |   23.7   0.1    0.2    4.9    0.4

md10fs   16     32K    735.1 |  588.1   0.3   26.7  459.4   22.0 |  147.0   0.2    2.0  160.6    9.0
3w10     16     32K    638.0 |  510.5   0.3   29.4  499.0   23.3 |  127.5   0.1    7.7  225.0   27.2
3w10fs   16     32K    613.3 |  490.8   3.1   29.1  260.8   19.1 |  122.4   0.1   13.8  259.9   22.5
3w5fs    16     32K    403.3 |  322.9   0.4   43.4  310.5   33.9 |   80.4   0.1   24.4  263.1   37.9

md10fs   64     32K    912.5 |  730.5   0.3   83.8 1448.7  117.9 |  181.9   0.2   14.8  828.1   55.9
3w10     64     32K    798.6 |  639.3   2.9   99.9 2132.5  131.6 |  159.3   0.1    0.1    8.9    0.2
3w10fs   64     32K    692.1 |  553.5   4.0  101.9  343.7   48.2 |  138.6   0.1   54.1  327.1   47.3
3w5fs    64     32K    483.5 |  387.0   2.0  144.3  473.7   70.3 |   96.5   0.1   83.8  428.9   70.7

md10fs   128    32K    994.5 |  795.3   0.4  150.6 1977.7  182.0 |  199.3   0.2   39.6  814.3   97.7
3w10     128    32K    849.1 |  680.0   4.4  173.5 2621.9  222.6 |  169.1   0.1   55.8 1655.2  243.4
3w10fs   128    32K    712.6 |  569.8   3.8  200.6  737.6   82.3 |  142.8   0.1   94.9  805.1   84.8
3w5fs    128    32K    504.8 |  404.1   3.7  283.4  698.9  114.2 |  100.7   0.1  131.9  755.6  115.9

md10fs   256    32K   1024.5 |  819.0   3.3  286.8 3253.0  290.4 |  205.5   0.2   98.6 1673.0  186.4
3w10     256    32K    855.6 |  686.3   4.1  336.0 2986.6  266.9 |  169.2   0.1  140.3 1129.2  285.5
3w10fs   256    32K    718.3 |  573.7   0.5  411.3  896.8  128.2 |  144.5   0.1  132.1  932.4  142.8
3w5fs    256    32K    488.6 |  391.0   4.4  601.9 1237.3  183.4 |   97.6   0.2  200.1 1355.1  216.9

md10fs   512    32K   1049.9 |  838.9   6.1  561.7 2463.6  352.4 |  211.0   0.3  169.6 2064.5  209.1
3w10     512    32K    867.9 |  693.5   5.0  669.5 3394.4  386.9 |  174.4   0.1  254.9  967.2  334.2
3w10fs   512    32K    716.1 |  571.5   4.5  846.6 1259.6  225.0 |  144.6   0.2  171.4 1368.3  261.1
3w5fs    512    32K    508.9 |  406.5   4.5 1182.8 1787.3  301.1 |  102.4   0.1  265.0 2161.9  379.1

md10fs   1K     32K   1062.1 |  846.4   5.3 1137.9 3138.1  442.1 |  215.7   0.4  235.6 2481.6  327.2
3w10     1K     32K    868.2 |  694.4   4.8 1372.3 4544.6  761.7 |  173.8   0.1  336.8 1774.4  384.2
3w10fs   1K     32K    713.6 |  568.1   5.5 1710.7 2418.6  433.0 |  145.5   0.1  257.0 2752.5  512.7
3w5fs    1K     32K    494.0 |  392.8   4.0 2444.1 3286.7  636.7 |  101.1   0.1  394.2 3865.4  761.6

#'md' throughput is barely reduced, albeit we now request 64 times more data,
#because our stripe size is larger than each set of data requested: disk
#head movements is the performance-limiting factor
#3ware throughput seems too severily reduced, especially in fs mode, there
#is something to explore here

#64K per data block (stripe size)
md10fs   1      64K    102.3 |   81.2   2.2   12.2  390.9   10.4 |   21.1   0.2    0.3    1.4    0.1
3w10     1      64K     85.3 |   67.7   4.4   14.7  396.5    9.6 |   17.6   0.1    0.2    0.8    0.1
3w10fs   1      64K    107.1 |   85.0   0.5   11.7  376.8    9.7 |   22.1   0.1    0.2    2.1    0.2

md10fs   16     64K    699.9 |  559.7   0.5   28.1  416.8   22.0 |  140.2   0.2    1.9  125.8    8.7
3w10     16     64K    609.5 |  487.8   0.5   31.0  570.5   24.4 |  121.7   0.1    7.3  238.3   26.2
3w10fs   16     64K    492.6 |  394.3   2.3   36.2  288.1   23.1 |   98.3   0.1   17.5  249.2   27.1

md10fs   64     64K    878.2 |  703.3   0.5   86.5 1442.0  117.0 |  175.0   0.2   17.5  940.1   61.3
3w10     64     64K    761.9 |  610.0   4.1  103.8 1821.1  138.8 |  151.9   0.1    3.2  785.7   39.8
3w10fs   64     64K    553.5 |  442.7   5.0  127.2  519.3   60.0 |  110.8   0.1   68.3  428.3   60.5

md10fs   128    64K    930.2 |  744.5   0.5  163.2 2039.0  214.2 |  185.6   0.2   32.8 1454.2  110.4
3w10     128    64K    803.0 |  644.7   0.5  176.3 2384.0  223.7 |  158.3   0.1   78.3 1806.3  288.0
3w10fs   128    64K    565.0 |  451.9   5.2  254.2  661.5   99.1 |  113.1   0.1  114.3  719.4  102.7

md10fs   256    64K    955.1 |  763.8   1.5  303.7 2120.4  292.2 |  191.2   0.2  120.0 1632.9  199.0
3w10     256    64K    823.1 |  660.4   5.2  292.1 2956.8  266.4 |  162.8   0.1  378.4 2178.7  414.4
3w10fs   256    64K    573.9 |  458.6   6.1  514.9  929.3  149.5 |  115.3   0.1  162.1 1139.2  179.8

md10fs   512    64K    964.4 |  770.5   6.9  617.0 2639.6  409.1 |  193.9   0.5  172.7 2145.4  220.7
3w10     512    64K    823.0 |  659.9   5.9  686.0 3788.9  342.6 |  163.1   0.1  342.9 1816.0  389.4
3w10fs   512    64K    569.9 |  454.7   6.5 1054.9 1559.7  262.8 |  115.2   0.2  238.1 1912.6  330.1

md10fs   1K     64K    980.6 |  781.4   7.2 1229.1 3404.8  501.3 |  199.2   0.5  252.9 2399.1  365.9
3w10     1K     64K    827.2 |  659.4   5.6 1408.5 3897.9  591.0 |  167.8   0.1  478.3 1809.4  391.0
3w10fs   1K     64K    565.1 |  449.5   5.8 2144.3 3013.3  542.8 |  115.6   0.2  343.0 3461.2  673.7

#3ware performance sunks, there is definitively something weird

#128K per data block (2 times the stripe size)

md10fs   1     128K    74.0 |   58.7   4.2   16.8  855.2   17.2 |   15.3   0.4    1.0  415.4   13.7
3w10     1     128K    72.2 |   57.1   6.7   17.4  544.7   13.6 |   15.0   0.2    0.3    6.3    0.2
3w10fs   1     128K    94.1 |   74.7   0.6   13.3  374.5    9.9 |   19.4   0.2    0.3    9.9    0.4
3w5fs    1     128K    94.7 |   75.2   0.5   13.2  126.8    7.6 |   19.5   0.2    0.3    8.8    0.7

md10fs   16    128K   387.5 |  310.1   1.3   49.8  527.5   43.7 |   77.4   0.4    7.2  203.1   19.8
3w10     16    128K   329.7 |  263.4   6.8   56.6  651.1   48.3 |   66.2   0.2   16.1  645.1   58.3
3w10fs   16    128K   329.2 |  263.0   1.0   54.2  387.3   36.3 |   66.2   0.2   26.2  381.6   42.2
3w5fs    16    128K   259.5 |  207.3   0.6   69.0  656.6   47.0 |   52.3   0.2   32.4  335.2   52.8

md10fs   64    128K   448.1 |  359.0   5.5  162.3 1543.8  173.9 |   89.1   0.4   63.2  879.1  131.0
3w10     64    128K   389.1 |  311.2   8.3  205.0 2550.3  213.2 |   77.9   0.2    0.3    2.6    0.2
3w10fs   64    128K   353.8 |  282.9   7.1  200.0  621.2   87.7 |   70.9   0.2  103.2  562.1   91.5
3w5fs    64    128K   309.7 |  247.3   6.7  229.5  811.2   98.8 |   62.4   0.2  113.6  710.3  103.3

md10fs   128   128K   461.0 |  369.0   7.0  315.4 2334.6  294.0 |   92.1   0.4  118.7 1862.9  238.8
3w10     128   128K   414.4 |  332.8  10.7  345.0 2632.5  294.5 |   81.6   0.2  139.4 3194.3  520.5
3w10fs   128   128K   378.0 |  302.2  10.2  380.7  927.8  134.0 |   75.8   0.2  168.0 1025.2  151.5
3w5fs    128   128K   324.4 |  259.0  10.3  445.5 1013.4  158.4 |   65.3   0.2  186.7 1083.0  173.9

md10fs   256   128K   486.5 |  389.2  16.3  603.8 2689.6  426.6 |   97.4   0.4  196.6 1602.6  246.9
3w10     256   128K   431.3 |  346.7  11.8  483.5 3028.6  341.3 |   84.6   0.2 1014.5 2214.9  710.5
3w10fs   256   128K   366.6 |  292.5  10.1  806.7 1653.0  234.0 |   74.0   0.2  246.0 1613.7  275.6
3w5fs    256   128K   319.7 |  255.1  10.6  930.2 1589.8  245.1 |   64.6   0.3  263.9 1745.3  310.3

md10fs   512   128K   483.8 |  386.0  11.9 1218.7 4733.3  605.3 |   97.8   1.2  377.7 2961.9  397.6
3w10     512   128K   430.2 |  344.8  12.8 1249.7 3885.5  454.4 |   85.4   0.2  865.6 2648.8  758.4
3w10fs   512   128K   378.6 |  301.5   8.3 1580.2 2427.7  381.7 |   77.1   0.2  361.7 2491.6  500.4
3w5fs    512   128K   319.7 |  255.1  10.6  930.2 1589.8  245.1 |   64.6   0.3  263.9 1745.3  310.3

md10fs   1K    128K   479.6 |  381.5  20.4 2458.4 6277.9  799.1 |   98.1   1.5  647.9 4847.9  825.6
3w10     1K    128K   429.6 |  342.9  14.1 2638.7 5507.9  838.8 |   86.7   0.2 1108.8 2718.6  665.6
3w10fs   1K    128K   370.4 |  293.5   7.5 3246.2 4317.8  788.0 |   77.0   0.5  541.5 4642.4 1007.1
3w5fs    1K    128K   310.4 |  245.1   9.6 3877.3 5185.4  980.4 |   65.3   0.3  651.2 6024.0 1230.1

#'md' performance sunks. There may be (!) a relationship between the
#average size of requested blocks and stripe size

#640K (full stride, all ten disks mobilized by each request)

md10fs   1     640K    40.8 |   32.7  10.4   29.9   414.3  14.7 |    8.1   1.6    2.5    7.3    0.7
3w10     1     640K    48.3 |   38.4   9.5   25.8    65.4   8.2 |    9.9   0.7    1.1    2.4    0.6
3w10fs   1     640K    60.7 |   48.2   8.2   20.5   395.6   9.3 |   12.5   0.7    1.1    2.9    0.7
3w5fs    1     640K    60.2 |   47.7   8.5   20.6   297.9  12.2 |   12.5   0.7    1.3    6.4    1.2

md10fs   16    640K    84.5 |   67.1  24.6  192.8   913.4 110.3 |   17.4   1.5  174.4  933.5  187.5
3w10     16    640K    72.0 |   57.0  26.2  226.8   432.3  58.5 |   15.0   1.4  202.1  371.7   59.0
3w10fs   16    640K   103.9 |   82.5  16.0  161.3   516.6  64.3 |   21.4   0.7  124.7  381.4   82.2
3w5fs    16    640K    91.9 |   72.9  12.2  191.6   781.2  95.1 |   19.0   0.8  105.3  819.2  107.8

md10fs   64    640K    95.8 |   75.9  38.0  756.0 2688.8 335.0 |  19.9   1.5  313.0 2104.7  426.9
3w10     64    640K    71.9 |   56.9  32.8  996.9 1627.6 159.2 |  15.0   2.6  455.6 1090.4  131.6
3w10fs   64    640K   118.9 |   94.4  19.4  595.6 1328.0 182.3 |  24.5   0.9  301.5 1366.1  192.7
3w5fs    64    640K    92.1 |   72.9  22.8  769.9 1396.7 194.6 |  19.2   0.9  385.7 1399.3  241.1

md10fs   128   640K   100.1 |   79.2 195.6 1403.8 3796.1 481.6 |  20.9   4.9  750.1 3164.9  560.9
3w10fs   128   640K   126.8 |  100.7  28.4 1174.0 2353.6 278.0 |  26.1   2.7  339.3 1352.9  238.0
3w5fs    128   640K    97.6 |   77.2  65.7 1507.9 3046.4 302.2 |  20.4   0.8  480.1 2202.0  368.1

md10fs   256   640K   95.9 |   75.8 285.5 2766.4 9177.8 1118.6 | 20.2   3.9 2072.0 7038.4 1400.8
3w10fs   256   640K  127.1 |  100.7  29.4 2381.6 4947.3  487.4 | 26.5   1.7  390.2 4534.5  416.2
3w5fs    256   640K   96.4 |   75.7  39.6 3129.4 6538.3  615.5 | 20.7   1.1  592.5 4551.3  627.6

#1MB (1048576 bytes), 1.6 stride

md10fs   1     1M   36.0 |   28.7  10.2   33.8  390.0   14.3 |    7.2   2.7    4.1    8.2    0.7
3w10     1     1M   44.1 |   35.0  11.2   28.1  396.3   11.8 |    9.0   1.1    1.6    2.8    0.7
3w10fs   1     1M   55.2 |   43.8   9.9   22.3  239.6    7.5 |   11.4   1.1    1.8    3.3    0.8
3w5fs    1     1M   50.5 |   40.2  10.2   24.3  402.5   14.1 |   10.3   1.1    2.0   16.2    1.4

md10fs   16    1M   75.1 |   59.6  36.2  243.2  888.6   125.3 |   15.5   2.4   94.8  609.3  107.4
3w10     16    1M   66.0 |   52.3  46.1  248.2  399.9    59.7 |   13.7   2.8  217.3  343.9   58.6
3w10fs   16    1M   87.4 |   69.3  18.0  189.2  387.6    60.3 |   18.1   2.2  158.7  381.0   78.7
3w5fs    16    1M   69.6 |   55.1  16.4  243.0 1198.4   107.5 |   14.5   1.1  177.2  657.9  116.3

md10fs   64    1M   82.0 |   65.2  72.6  857.9 2250.5  303.5 |   16.9   2.4  450.3 2604.3  474.0
3w10     64    1M   66.1 |   52.3  20.1 1126.9 1621.1  145.5 |   13.8   1.3  320.3  830.7   95.8
3w10fs   64    1M   95.3 |   75.5  30.7  771.1 1607.2  165.8 |   19.8   4.0  271.1 1095.8  156.5
3w5fs    64    1M   73.7 |   58.3  30.6  995.0 2021.7  221.8 |   15.4   3.5  348.4 1311.2  220.1

md10fs   128   1M   85.5 |   68.0 221.8 1631.1 5353.9  605.5 |   17.5   2.5  885.9 3217.8  647.2
3w10     128   1M   66.1 |   52.1  37.1 2320.7 2900.5  333.0 |   14.0   1.4  329.5 2175.3  145.7
3w10fs   128   1M   95.1 |   75.1  59.0 1605.7 2485.2  267.3 |   20.0   1.5  279.8 1786.0  192.9
3w5fs    128   1M   73.6 |   58.0  45.2 2065.1 2894.8  421.6 |   15.6   6.7  383.9 2899.7  341.6

In the next tests we will use: BEWARE! The command below this line is DANGEROUS, it may ERASE DATA. BEWARE!
sync ; echo 3 > /proc/sys/vm/drop_caches ; time randomio /dev/DeviceName 48 0.2 0.1 65536 120 1
Meaning 48 simultaneous threads, 20% of operations are writes, 10% writes are fsync'ed, 64kB per record, test duration is 60 seconds, 1 single run

The benchmark used for sequential I/O is sdd
BEWARE! The 'write' command below this line is DANGEROUS, it may ERASE DATA. BEWARE!
read: sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -onull if=/dev/DeviceName bs=1m count=10000 -t
write: sync ; echo 3 > /proc/sys/vm/drop_caches ; time sdd -inull of=/dev/DeviceName bs=1m count=10000 -t

Caution: using a high 'bs' value may be impossible because sdd allocates memory for it, or it may trigger swap activity if you did not 'swapoff'.

IO scheduler used: deadline and CFQ (no major performance hit)

Device names order (ghijkb, ghjkbi, ghkbij, hbijkg, hbjkgi) does not affect performance.

Linux md:

I configured each spindle as a 'single disk' (no RAID done by the controller), using 1024 "nr_request"
During all runs there was no 'iowait' and the 'md' kernel thread (md1_raid10) eaten from ~1% of a single CPU (random) and 3 to 5% (sequential).

Legend:

5_3w_INIT: 3ware RAID5: "/c0 add type=raid5 disk=8:14:10:13:9:12 on Hitachis, storsave=perform" (note: same results with '8:9:10:12:13:14'), during 3ware 'INIT' phase
5_3w: 3ware RAID5: "/c0 add type=raid5 disk=8:14:10:13:9:12", on Hitachis, storsave=perform
1_3w: 3ware RAID1: "/c0 add type=raid1 disk=8:9", on Western Digital storsave=perform
10_3w 1: 3ware RAID10: "/c0 add type=raid10 disk=8:14:10:13:9:12 storsave=perform" on Hitachis
10_3w 2: 1/ and noqpolicy
10_3w 3: 1/ and nr_request=1024 (for i in `ls /sys/block/sd?/queue/nr_requests` ; do echo 1024 > $i ; done)
10_3w 4: 1/ and nr_request=1024,max_sectors_kb=64
10_md FAR: Linux md, RAID10 'far': mdadm --create /dev/md1 --auto=md --metadata=1.2 --level=raid10 --chunk=64 --raid-devices=6 --spare-devices=0 --layout=f2 --assume-clean /dev/sd[bghijk], on Hitachis
10_md OFF: Linux md, RAID10 'offset': mdadm --create /dev/md1 --auto=md --metadata=1.2 --level=raid10 --chunk=64 --raid-devices=6 --spare-devices=0 --layout=o2 --assume-clean /dev/sd[hbkjgi], on Hitachis
10_md NEAR: Linux md, RAID10 'near': mdadm --create /dev/md1 --auto=md --metadata=1.2 --level=raid10 --chunk=64, on Hitachis --raid-devices=6 --spare-devices=0 --layout=n2 --assume-clean /dev/sd[bghijk], on Hitachis
5_md: Linux md, RAID5: mdadm --create /dev/md1 --auto=md --metadata=1.2 --level=raid5 --chunk=64 --raid-devices=6 --spare-devices=0 --assume-clean /dev/sd[bghijk], on Hitachis

Test      |  total |  read:         latency (ms)       |  write:        latency (ms)
          |   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
----------+--------+-----------------------------------+----------------------------------
 1_3w     |  162.9 |  130.0  12.6  367.9 2713.5  305.9 |   32.8   0.1    0.7   280.5    9.7 real 2m1.135s, user 0m0.036s, sys 0m0.504s
----------+--------+-----------------------------------+-----------------------------------
 5_3w_INIT|   75.7 |   60.5   2.7  617.6 6126.2 1759.7 |   15.2   0.1  622.7  6032.0 1786.2 real 2m0.744s, user 0m0.016s, sys 0m0.288s
 5_3w     |   78.1 |   62.5   0.5  598.1 6096.5 1711.1 |   15.6   0.1  608.8  6056.1 1740.1 real 2m0.591s, user 0m0.024s, sys 0m0.300s
----------+--------+-----------------------------------+-----------------------------------
 5_md     |  231.3 |  185.4   3.4   54.3  524.1   54.6 |   45.9  30.6  822.7 16356.0  705.0 real 2m0.404s, user 0m0.072s, sys 0m4.788s
----------+--------+-----------------------------------+-----------------------------------
10_3w 1   |  439.5 |  351.9   3.3  136.2 1838.7  140.9 |   87.5   0.1    0.2     3.5    0.1 real 2m0.202s, user 0m0.096s, sys 0m1.404s
10_3w 2   |  400.6 |  320.7   4.5  116.6  728.2   85.6 |   79.9   0.1  132.4  1186.3  269.0 real 2m0.533s, user 0m0.080s, sys 0m1.484s
10_3w 3   |  440.2 |  352.6   3.5  135.7 1765.0  139.9 |   87.7   0.1    0.9   458.4   14.8 real 2m0.680s, user 0m0.084s, sys 0m1.488s
10_3w 4   |  440.1 |  352.4   4.7  136.0 2200.3  139.3 |   87.6   0.1    0.3   192.6    4.4 real 2m0.320s, user 0m0.076s, sys 0m1.420s
----------+--------+-----------------------------------+-----------------------------------
10_md FAR |  454.6 |  364.0   4.6  126.8 1151.3  135.0 |   90.5   0.2   19.8   542.6   47.7 real 2m0.426s, user 0m0.092s, sys 0m1.660s
10_md OFF |  443.5 |  355.3   3.0  130.7 1298.2  136.3 |   88.2   0.2   17.2   522.8   45.7 real 2m0.340s, user 0m0.100s, sys 0m1.676s
10_md NEAR|  442.1 |  354.1   0.5  130.5 1299.5  137.5 |   87.9   0.2   19.7   617.3   50.0 real 2m0.309s, user 0m0.132s, sys 0m1.548s

3ware efficiently uses its write cache, reducing the apparent latency. It doesn't boost the effective performance (IOPS), which is device-dependent. Sequential:

Type	Read throughput; CPU usage	Write throughput; CPU usage
3ware RAID1, mounted root fs on 2 Western Digital)	(read-ahead 16384): 84 MB/S; real 2m2.687s; user 0m0.060s; sys 0m13.597s	63 MB/s at fs-level; real 2m48.616s, user 0m0.056s, sys 0m17.061s
3ware RAID5 INIT	During INIT: 223 MB/s; real 0m50.939s, user 0m0.068s, sys 0m17.369s	During INIT: 264 MB/s; real 0m40.015s, user 0m0.020s, sys 0m19.969s
3ware RAID5	254 MB/s; real 0m40.774s, user 0m0.004s, sys 0m18.633s	268 MB/s; real 0m42.785s, user 0m0.020s, sys 0m17.193s
md RAID5	84 MB/s; real 2m2.019s, user 0m0.040s, sys 0m13.261s	129 MB/s; real 1m19.632s, user 0m0.020s, sys 0m27.430s
3ware RAID10	173 MB/s; real 0m59.907s, user 0m0.008s, sys 0m19.565s	188 MB/s; real 0m59.907s, user 0m0.008s, sys 0m19.565s (242.8 MB/s; real 19m37.522s, user 0m0.484s, sys 10m20.635s on XFS)
md RAID10 FAR	305 MB/s; real 0m30.548s, user 0m0.048s, sys 0m16.469s (read-ahead 0: 22 MB/s; real 7m45.582s, user 0m0.048s, sys 0m57.764s)	118 MB/s; real 1m26.857s, user 0m0.012s, sys 0m21.689s
md RAID10 OFFSET	185 MB/s; real 0m59.742s, user 0m0.056s, sys 0m20.197s	156 MB/s; real 1m5.735s, user 0m0.024s, sys 0m22.585s
md RAID10 NEAR	189 MB/s; real 0m59.046s, user 0m0.036s, sys 0m20.461s	156 MB/s; real 1m6.124s, user 0m0.012s, sys 0m22.513s

A 4-drives (Western Digital) 'md' RAID10 delivers (18 GB read, fs mounted):

At disk beginning: 282.7 MB/s; real 1m12.640s, user 0m0.000s, sys 0m35.018s
At disk end: 256.6 MB/s; real 0m45.226s, user 0m0.048s, sys 0m19.781s

Simultaneous load: the 2 aforementionned 2 RAID10 (no fs mounted) and the RAID1 (ext3 root fs mounted):

CFQ:
- RAID1 2-drives WD:
- RAID10 4 drives WD:
- RAID10 6-drives Hitachi:

RAID1 3ware, 2 WD

Note: CFQ benchmarks ran at 'realtime 4' priority.

nr_request=128

Test     | total | read:         latency (ms)        | write:        latency (ms)
         |  iops |  iops    min    avg    max   sdev |  iops   min    avg    max   sdev
---------+-------+-----------------------------------+---------------------------------
CFQ      | 155.8 |  124.5   6.5  348.5 1790.3  283.3 |   31.3   0.2  144.0 1490.1  278.5
noop     | 155.0 |  123.8   8.0  350.0 1519.1  260.8 |   31.2   0.2  137.5 1583.8  308.4
deadline | 152.7 |  122.0   5.6  346.1 1594.8  278.8 |   30.7   0.2  180.1 1686.5  314.9

nr_request=4

Test | total | read:         latency (ms)        | write:        latency (ms)
     |  iops |  iops    min    avg    max   sdev |  iops   min    avg    max   sdev
-----+-------+-----------------------------------+---------------------------------
noop | 128.6 |  102.7   4.5  222.5 1262.2  201.9 |   25.9   0.2  962.6 2205.7  489.4
CFQ  | 127.6 |  102.0   5.9  316.6 1381.5  279.6 |   25.6   0.2  590.8 1808.3  490.4

nr_request=1024

Test | total | read:         latency (ms)        | write:        latency (ms)
     |  iops |  iops    min    avg    max   sdev |  iops   min    avg    max   sdev
-----+-------+-----------------------------------+---------------------------------
noop | 154.7 |  123.5   6.9  343.7 1524.4  281.2 |   31.1   0.2  170.8 1901.4  330.5
CFQ  | 154.4 |  123.3   6.3  363.1 1705.7  289.2 |   31.1   0.2  100.9 1217.8  220.4

nr_request=1024,queue_depth=1

Test | total | read:         latency (ms)        | write:        latency (ms)
     |  iops |  iops    min    avg    max   sdev |  iops   min    avg    max   sdev
-----+-------+-----------------------------------+---------------------------------
noop | 151.9 |  121.4   4.8  339.2 1585.6  298.3 |   30.5   0.2  208.2 1869.3  363.1

'md' and XFS

XFS atop the 'md' volume built on 6 Hitachi, controlled by the 3ware in 'SINGLE' mode (which is standlone, JBOD):

$ mdadm -D /dev/md1

/dev/md1:
        Version : 01.02.03
  Creation Time : Wed Feb 20 01:17:06 2008
     Raid Level : raid10
     Array Size : 1464811776 (1396.95 GiB 1499.97 GB)
    Device Size : 976541184 (465.65 GiB 499.99 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Feb 20 18:48:04 2008
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=1, far=2
     Chunk Size : 64K

           Name : diderot:500GB  (local to host diderot)
           UUID : 194106df:617845ca:ccffda8c:8e2af56a
         Events : 2

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       96        1      active sync   /dev/sdg
       2       8      112        2      active sync   /dev/sdh
       3       8      128        3      active sync   /dev/sdi
       4       8      144        4      active sync   /dev/sdj
       5       8      160        5      active sync   /dev/sdk

Note that the filesystem code is smart:

'immediately' creating a file by invoking 'dd bs=1 seek=800G if=/dev/null of=testfile' produces absurd randomio (3500 IOPS) results
'truncating' (using the 'truncate' system call), for example with truncate.c, leads to similarly absurd results

However normally 'growing' a file through 'time sdd -inull of=testfile bs=IncrementSize oseek=CurrentSize count=IncrementAmount -t' (example: 'time sdd -inull of=testfile bs=1g oseek=240g count=80 -t') is adequate.

Read-ahead tweaking

In my opinion the read-ahead must be set in order to fill the caches without any disk head movement nor disk exclusive use. More explicitly the disk always spins, therefore let's not ignore what passes under the head and keep it in a cache, just in case. It costs nothing IF it does not imply a head displacement nor unavailability of the disk logic (leading to waiting I/O requests). This is a tricky matter, because many stacked levels (kernel code for fs and block device, but also logics of the disk interface and disk...)

Let's find the adequate read-ahead by running some tests, just after a reboot (default parameters).

Read throughput from the first blocks:

# sync ; echo 3 > /proc/sys/vm/drop_caches ;  time sdd -onull if=/dev/md1 bs=1g count=1 -t
sdd: Read  1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 3.468sec (302270 kBytes/sec)

real    0m4.276s
user    0m0.000s
sys     0m2.676s

Last volume blocks:

# sync ; echo 3 > /proc/sys/vm/drop_caches ;  time sdd -onull if=/dev/md1 bs=1g iseek=1300g count=1 -t
sdd: Read  1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 3.489sec (300537 kBytes/sec)

real    0m4.076s
user    0m0.000s
sys     0m2.680s

Sequential read of the block device is 3 times slower on the raw underlying device when a filesystem created on it is mounted. Maybe because such operation needs updates between the fs and the block device(?TODO: test with the raw in read-only mode). That's not a problem because I will only use it at the fs level:

# mkfs.xfs -l version=2 -i attr=2 /dev/md1
# mount -t xfs -onoatime,logbufs=8,logbsize=256k /dev/md1 /mnt/md1

# sync ; echo 3 > /proc/sys/vm/drop_caches ;  time sdd -onull if=/dev/md1 bs=1g count=1 -t
sdd: Read  1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 11.109sec (94389 kBytes/sec)

real    0m11.359s
user    0m0.000s
sys     0m7.376s

XFS pumps faster than a direct access to the raw device, maybe thanks to its multithreaded code.

# sync ; echo 3 > /proc/sys/vm/drop_caches ;  time sdd -onull if=testfile bs=1g count=24 -t
sdd: Read  24 records + 0 bytes (total of 25769803776 bytes = 25165824.00k).
sdd: Total time 73.936sec (340368 kBytes/sec)

real    1m14.036s
user    0m0.000s
sys     0m31.602s

Reading at the very end of the filesystem (which is AFAIK in fact in the middle of the platters, with 'md' far(?)) is slower but still OK

# sync ; echo 3 > /proc/sys/vm/drop_caches ;  time sdd -onull if=iozone.tmp bs=1g count=1 -t
sdd: Read  1 records + 0 bytes (total of 1073741824 bytes = 1048576.00k).
sdd: Total time 3.901sec (268727 kBytes/sec)

real    0m4.000s
user    0m0.000s
sys     0m1.648s

Read-ahead is, by default, (on those 6 spindles) 256 for each underlying device and the cumulated value (1536) for /dev/md1 IOPS are pretty good:

# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              1.4T  1.3T  145G  90% /mnt/md1

# ls -lh testfile
-rw-r--r-- 1 root root 1,2T 2008-02-24 18:34 testfile

# sync ; echo 3 > /proc/sys/vm/drop_caches ; time randomio testfile 48 0.2 0.1 65536 120 1
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  261.5 |  209.3   6.1  218.7 1458.5  169.0 |   52.2   0.2   41.2  843.9   86.0

real    2m0.839s
user    0m0.072s
sys     0m1.540s

On one of very first files stored:

# ls -lh iozone.DUMMY.4
-rw-r----- 1 root root 8,0G 2008-02-23 19:58 iozone.DUMMY.4

# sync ; echo 3 > /proc/sys/vm/drop_caches ; time randomio iozone.DUMMY.4 48 0.2 0.1 65536 120 1
           total |  read:         latency (ms)       |  write:        latency (ms)
            iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
---------+-------+-----------------------------------+----------------------------------
deadline | 312.2 |  249.8   3.7  184.1 1058.4  139.4 |   62.4   0.2   31.8  498.0   62.0
noop     | 311.5 |  249.2   3.2  183.8  987.5  139.8 |   62.3   0.2   34.4  509.0   67.0
CFQ      | 310.2 |  248.2   5.3  185.0 1113.5  144.6 |   62.0   0.2   32.2  879.9   71.0

Without NCQ ('qpolicy=off' for all underlying devices)
(those results are consistent with the raw-level tests):
noop     |  280.5 |  224.5   6.5  200.8  971.5  126.1 |   56.1   0.2   51.5 1138.8  103.9
deadline |  278.0 |  222.4   3.0  205.2 1002.0  136.8 |   55.6   0.2   41.4  833.8   83.7
CFQ      |  273.2 |  218.5   2.4  208.0 1194.9  144.5 |   54.7   0.2   45.1  845.3   93.2

Let's use CFQ, which enables us to 'nice' I/O and doesn't cost much. Three processes will run, each in a different 'ionice' class, using different files. Here is the model:
# sync ; echo 3 > /proc/sys/vm/drop_caches ; time ionice -c1 randomio iozone.DUMMY.4 48 0.2 0.1 65536 180 1
Class 1 is 'realtime', 2 is 'best-effort', 3 is 'idle' CFQ disabled:

   total |  read:         latency (ms)       |  write:        latency (ms)
    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
---------+-----------------------------------+----------------------------------
1  128.6 |  102.8  11.7  440.1 1530.2  256.3 |   25.7   0.2  103.4 1134.2  156.9 real 3m0.669s, user 0m0.076s, sys 0m0.944s
2  127.6 |  102.0   9.3  443.7 1684.2  258.7 |   25.6   0.2  103.9 1102.8  155.0 real 3m0.585s, user 0m0.076s, sys 0m0.836s
3   31.7 |   25.2  56.9 1880.9 8166.9 1052.1 |    6.5   0.3   86.9  909.8  135.3 real 3m0.190s, user 0m0.008s, sys 0m0.232s
Total: 287.9

CFQ enabled:

   total |  read:         latency (ms)       |  write:        latency (ms)
    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
---------+-----------------------------------+----------------------------------
1  141.4 |  113.1   7.7  401.3 1350.0  240.9 |   28.3   0.2   89.1  919.1  127.3 real 3m1.529s, user 0m0.060s, sys 0m0.952s
2  138.3 |  110.6  11.8  410.2 1416.7  243.4 |   27.7   0.2   92.9  986.0  133.5 real 3m0.417s, user 0m0.040s, sys 0m0.924s
3   35.1 |   27.8  20.7 1702.1 7588.5  948.1 |    7.3   0.3   85.8  920.1  120.3 real 3m0.238s, user 0m0.004s, sys 0m0.192s
Total: 314.8

CFQ is efficient on those Hitachi drives, I enable it.

Ionices 1 and 2 simultaneously active:
   total |  read:         latency (ms)       |  write:        latency (ms)
    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
---------+-----------------------------------+----------------------------------
1  160.8 |  128.7   6.8  361.7 1502.9  241.1 |   32.0   0.2   42.9 1028.2  109.3 real 3m1.143s, user 0m0.088s, sys 0m1.340s
2  160.4 |  128.4   6.1  362.6 1357.5  239.3 |   32.0   0.2   42.0 1001.3  102.2 real 3m0.749s, user 0m0.080s, sys 0m1.352s
Total: 321.2

1 and 3:
1  160.4 |  128.4   6.1  362.6 1357.5  239.3 |   32.0   0.2   42.0 1001.3  102.2 real 3m0.749s, user 0m0.080s, sys 0m1.352s
3   52.4 |   41.9  29.9 1128.5 5246.7  617.6 |   10.5   0.3   67.7  935.0   82.2 real 3m1.935s, user 0m0.032s, sys 0m0.400s
Total: 212.8

2 and 3:
2  256.2 |  205.1   7.3  213.9 1089.3  138.4 |   51.1   0.2   79.9  626.9   83.0 real 3m0.540s, user 0m0.076s, sys 0m2.136s
3   50.1 |   40.0  19.7 1176.0 5236.8  647.5 |   10.1   0.3   67.4  575.9   73.2 real 3m1.120s, user 0m0.048s, sys 0m0.292s
Total: 306.3

1 and 2, with this last one at -n7:
1  156.9 |  125.5   6.6  367.4 1479.0  246.3 |   31.3   0.2   58.1 1291.8  147.9 real 3m0.489s, user 0m0.032s, sys 0m1.356s
2  158.3 |  126.6   6.3  365.0 1422.9  247.7 |   31.7   0.2   54.2 1334.7  143.8 real 3m0.565s, user 0m0.064s, sys 0m1.324s
Total: 315.2

1 and 2, with this last one at -n0:
1  157.5 |  126.0   5.7  365.6 1681.5  239.3 |   31.5   0.2   58.8 1727.4  143.9 real 3m0.748s, user 0m0.080s, sys 0m1.412s
2  160.6 |  128.6   6.8  359.9 1440.6  237.7 |   32.0   0.2   51.3 1067.3  128.3 real 3m0.748s, user 0m0.080s, sys 0m1.412s
Total: 318.1

2 (best effort) -n0 (high priority) and 2 -n7:
-n0  161.4 |  129.2   6.4  358.9 1474.1  233.8 |   32.2   0.2   49.7 1101.9  120.7 real 3m1.348s, user 0m0.100s, sys 0m1.260s
-n7  159.6 |  127.7   6.2  363.0 1477.4  233.7 |   31.9   0.2   49.9 1150.0  116.9 real 3m1.867s, user 0m0.068s, sys 0m1.380s
Total: 321

Sequential read and random I/O, done simultaneously and on distant files:

# time ionice -c2 -n7 sdd -onull if=testfile bs=1g iseek=1t count=18 -t
along with
# time ionice -c2 -n0 randomio iozone.DUMMY.2 48 0.2 0.1 65536 180 1
sdd      86.8 MB/s; real 3m37.902s, user 0m0.000s, sys 0m22.353s
randomio  231.2 |  184.9   5.8  237.9 1270.8  186.5 |   46.3   0.2   85.2  956.0  104.0 real 3m0.560s, user 0m0.128s, sys 0m2.060s

Reciprocally (-n7 for randomio and -n0 for sdd):
sdd      95.9 MB/s; real 3m16.912s, user 0m0.000s, sys 0m22.121s
randomio  197.3 |  157.9   4.4  265.2 1412.6  210.4 |   39.4   0.2  154.5 1178.7  159.5 real 3m0.800s, user 0m0.076s, sys 0m1.728s

sdd under nice, randomio in default mode
sdd      88.5 MB/s; real 3m33.447s, user 0m0.000s, sys 0m22.441s
randomio  222.6 |  178.1   4.2  243.9 1397.3  188.1 |   44.5   0.2  101.2 1239.2  122.1 real 3m0.277s, user 0m0.088s, sys 0m2.196s

sdd under nice, randomio in default mode (deadline)
sdd       89.7 MB/s; real 3m31.237s, user 0m0.000s, sys 0m21.613s
randomio   220.7 |  176.6   5.7  263.8 1485.4  200.7 |   44.1   0.2   30.9  952.7  107.5 real 3m2.273s, user 0m0.132s, sys 0m3.176s

The CFQ 'realtime' class seems useless, maybe because the underlying XFS has no 'realtime section'. I did not declared one because the underlying and device has no such stuff (a dedicated track associated to a fixed head? A solid state disk?).

The CFQ 'class data' ('-n' argument) seems nearly useless.

'md' RAID made of various devices

By creating a 500 GB first partition on each WD drive I was able to create a RAID10 made of 10 spindles. Here are its performances, compared to 3ware volumes (built on the same drives, all with 'storsave=perform'), a RAID10 and to a RAID6 (beware: this last one was INITIALIZING during the tests). Note that the 3Ware cannot use a partition and only uses on each drive the space offered by the smallest drive of the set, therefore it neglected half of each 1 TB drive diskspace.

mdadm --create /dev/md0 --auto=md --metadata=1.2 --level=raid10 --chunk=64 /
 --raid-devices=10 --spare-devices=0 --layout=f2 --assume-clean /dev/sdc1 /
 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sd[bghijk]

# mdadm -D /dev/md0                                                                                                                                                                                        
/dev/md0:
        Version : 01.02.03
  Creation Time : Tue Feb 26 14:26:21 2008
     Raid Level : raid10
     Array Size : 2441352960 (2328.26 GiB 2499.95 GB)
    Device Size : 976541184 (465.65 GiB 499.99 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Feb 26 14:26:21 2008
          State : clean
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=1, far=2
     Chunk Size : 64K

           Name : diderot:0  (local to host diderot)
           UUID : 51c89f70:70af0287:42c3f2cf:13cd20fd
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       16        4      active sync   /dev/sdb
       5       8       96        5      active sync   /dev/sdg
       6       8      112        6      active sync   /dev/sdh
       7       8      128        7      active sync   /dev/sdi
       8       8      144        8      active sync   /dev/sdj
       9       8      160        9      active sync   /dev/sdk


# time randomio /dev/sdb 48 0.2 0.1 65536 180 1
      total |  read:         latency (ms)       |  write:        latency (ms)
       iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
------------+-----------------------------------+----------------------------------
md    793.8 |  634.6   1.8   73.6 2710.1  114.0 |  159.2   0.1    8.0  673.3   29.9
3w10  726.0 |  580.3   0.5   82.2 1733.3  104.6 |  145.7   0.1    2.0  586.0   23.8; real 3m0.355s, user 0m0.272s, sys 0m3.944s
3w6   327.5 |  262.0   3.2  131.7 1967.8  145.9 |   65.5   0.1  204.5 1911.6  423.3; real 3m0.365s, user 0m0.144s, sys 0m1.860s

However the throughput on sequential read was mediocre: 342.7 MB/s; real 1m0.674s, user 0m0.000s, sys 0m40.571s (offset2 gave 278 MB/s, near2 206 MB/s)

3ware's RAID10 was 302.3 MB/s (real 1m8.464s, user 0m0.000s, sys 0m31.722s) and RAID6 (while INITIALIZING!) was 121.4 MB/s (real 2m41.076s, user 0m0.000s, sys 0m31.786s)

Testing at application level

iostat

Using a MySQL 'optimize' of a huge INNODB table to evaluate I/O schedulers ('CFQ', 'deadline' or 'noop') on 3ware 9550 RAID5 made of 6 Hitachi.

Legend (from 'man iostat'):

rrqm/s: number of read requests merged per second that were queued to the device
r/s: number of read requests that were issued to the device per second
avgrq-sz: average size (in sectors) of the requests that were issued to the device.
avgqu-sz: average queue length of the requests that were issued to the device.
await: average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
svctm: average service time (in milliseconds) for I/O requests that were issued to the device.
%util: percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.'

rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
  0.00    14.85 102.97 82.18     2.29     2.18    49.41     1.67    9.01   4.92  91.09
  1.98    86.14 58.42 84.16     1.14     1.80    42.28     3.00   18.86   6.75  96.24
  0.00    25.00 84.00 91.00     1.83     2.24    47.59     1.38    9.71   5.33  93.20
  0.00    66.34 69.31 157.43     1.64     6.17    70.53     1.78    7.81   3.97  89.90
  0.00    62.38 91.09 101.98     2.13     2.47    48.82     1.71    8.70   4.88  94.26
  0.00     4.00 25.00 147.00     0.47     3.27    44.56     2.55   15.02   5.86 100.80
  5.94    44.55 103.96 77.23     2.41     2.11    51.15     1.52    8.31   4.98  90.30
  0.00    75.25 59.41 69.31     1.21     1.70    46.28     3.75   25.42   7.35  94.65
  0.00     0.00 49.50 100.00     0.96     2.13    42.33     1.72   14.81   6.23  93.07
  0.00    50.51 61.62 98.99     1.45     2.15    45.89     1.96   12.28   5.96  95.76
  1.98    44.55 85.15 86.14     2.00     2.02    48.00     1.74    9.99   5.27  90.30
  0.00    66.34 51.49 123.76     1.36     2.44    44.47     2.12   12.27   5.06  88.71
  5.88     7.84 53.92 81.37     1.15     2.28    51.88     1.16    8.61   6.20  83.92
  0.00    58.00 73.00 81.00     1.56     1.63    42.49     2.38   15.30   5.97  92.00
  5.94     4.95 13.86 111.88     0.36     2.17    41.07     1.89   15.12   7.87  99.01
 10.00    59.00 82.00 73.00     1.55     2.16    48.98     1.31    8.49   5.32  82.40
  0.00    62.00 82.00 126.00     1.94     3.05    49.12     2.14   10.37   4.44  92.40
  0.00    36.63 81.19 81.19     1.86     1.86    46.93     2.12   12.80   5.78  93.86
  9.90    64.36 83.17 100.00     1.75     2.88    51.72     1.42    7.91   4.80  87.92
  0.00    59.60 79.80 70.71     1.63     1.85    47.30     1.29    7.89   6.07  91.31
  1.96     1.96 89.22 69.61     1.98     2.05    51.90     1.49    9.88   5.63  89.41
  0.00    72.00 12.00 129.00     0.36     2.27    38.18     2.93   20.88   7.09 100.00
 10.00    55.00 76.00 68.00     1.45     1.82    46.50     1.60   10.75   6.11  88.00
  0.00     4.00 38.00 109.00     0.92     1.89    39.13     2.64   18.39   6.59  96.80

The '%util' (last) column content seems pretty abnormal to me, given the modest workload.

As a sidenote: MySQL (which, under the 'CFQ' scheduler, is invoked here with 'ionice -c1 -n7') grinds this (in fact this 'optimize' is an 'ALTER TABLE', therefore reading, somewhat reorg, then writing) with less than 2% of a single CPU...). 'show innodb status' reveals:

FILE I/O
 87.91 reads/s, 24576 avg bytes/read, 93.91 writes/s
BUFFER POOL AND MEMORY
 133.87 reads/s, 8.99 creates/s, 164.84 writes/s
 Buffer pool hit rate 994 / 1000

'rrqm/s' with the 'noop' I/O scheduler are similar other schedulers results. How comes? Is it to say that 'noop' merges requests, therefore that it somewhat buffers them? Or that it uses a merge approach inducing no latency, maybe by reserving it to sequential requests saturating the device's queue? Or that 'CFQ' and 'deadline' (even when /sys/block/sda/queue/iosched/read_expire contains 10000, offering up to 10 seconds of latency to group read requests) are not able to merge efficiently, perhaps due to some queue saturation?

Time series

In order to gather intelligence about IO schedulers efficiency we may analyze time series.

During a continuous and stable (homogeneous) usage scenario (I used the loooong MySQL 'optimize' run, even if it is not representative as it does much more writes than read) launch as root:

#!/bin/sh

#name of the monitored device
DEVICENAME=/dev/sda

#pause duration, after establishing the IO scheduler
SLP=10

while (true) ; do

 COLLECT_TIME=$(($RANDOM%3600 + 30))

 case $(($RANDOM%3)) in

  0)
   echo cfq > /sys/block/sda/queue/scheduler
   sleep $SLP
   iostat -mx $DEVICENAME 1 $COLLECT_TIME |grep sda|cut -c10- >> cfq_sched.results
  ;;

  1)
   echo deadline > /sys/block/sda/queue/scheduler
   sleep $SLP
   iostat -mx 1 $COLLECT_TIME |grep sda|cut -c10- >> deadline_sched.results
  ;;

  2)
   echo noop > /sys/block/sda/queue/scheduler
   sleep $SLP
   iostat -mx 1 $COLLECT_TIME |grep sda|cut -c10- >> noop_sched.results
  ;;

 esac

done

Suppress some lines in some series (by using 'head -#OfLines FileName > NewFileName'), for all of them to contain the same number of samples (check: 'wc -l wc -l *_sched.results').

Create a MySQL database and table for the time series:

create database ioperf;
use ioperf;
create table iostat ( scheduler ENUM('cfq', 'noop', 'deadline','anticipatory') NOT NULL,
rrqmps float, wrqmps float, rps float, wps float, rmbps float,wmbps float,avgrqsz float,
avgqusz float,await float,svctm float, percentutil float,index id_sche (scheduler));

Inject the data:

$ perl -pe '$_ =~ tr/ //t/s; print "noop";' < noop_sched.results > iostat.txt
$ perl -pe '$_ =~ tr/ //t/s; print "deadline";' < deadline_sched.results >> iostat.txt
$ perl -pe '$_ =~ tr/ //t/s; print "cfq";' < cfq_sched.results >> iostat.txt
$ mysqlimport --local ioperf iostat.txt

Here is a way to conclude, using an arithmetic mean (provided by SQL 'avg'). It seems to be sufficient albeit an harmonic mean may be an useful addition. My main concerns are the first columns: rps, rmbps, await...

# check that all time series are of same size
select scheduler,count(*) from iostat group by scheduler;
+-----------+----------+
| scheduler | count(*) |
+-----------+----------+
| cfq       |    12000 | 
| noop      |    12000 | 
| deadline  |    12000 | 
+-----------+----------+

#major performance-related arithmetic means
select scheduler,avg(rps),avg(rmbps),avg(await),avg(wps),avg(wmbps),avg(percentutil) from iostat group by scheduler;
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+
| scheduler | avg(rps)        | avg(rmbps)      | avg(await)      | avg(wps)        | avg(wmbps)      | avg(percentutil) |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+
| cfq       | 85.109504183431 | 1.9109841637902 | 13.676339167714 | 89.733314177394 | 2.2845941611628 |  95.672404343605 | 
| noop      | 86.285021674693 | 1.9277283308546 | 12.093321669618 |  89.79774667867 | 2.2659291594663 |  96.018881834666 | 
| deadline  | 86.597292512933 |  1.931128330753 |  12.50979499948 | 89.036291648348 | 2.2486824922271 |   96.12482605044 | 
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+

#(less important) informations about schedulers housekeeping, useful to
#spot weird differences
select scheduler,avg(svctm),avg(rrqmps),avg(wrqmps),avg(avgrqsz),avg(avgqusz) from iostat group by scheduler;
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
| scheduler | avg(svctm)      | avg(rrqmps)     | avg(wrqmps)     | avg(avgrqsz)    | avg(avgqusz)    |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
| cfq       | 6.1323758338888 | 2.2023866616189 | 34.736365837912 | 48.764866647402 | 2.0277108325611 | 
| noop      | 6.0292158377171 | 2.3308641605874 | 32.569892516136 | 48.432940823873 |  1.897218332231 | 
| deadline  | 6.1189483366609 | 2.2282899945279 | 32.127461672723 | 48.399182485898 | 1.9355808324416 | 
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+

Right now 'deadline' dominates. (TODO: update those results, refresh them under a more adequate and controlled load (iozone?))

Application testing

Hints

Try hard to put the right hardware setup for any given mission. This may imply to maintain several different spindle sets and filesystems on a given server.

For example partition your databases (MySQL, see also the tech-resources) in order to scatter, accross the fastest spindles, each and every table often-simultaneously random-read.

We badly need HSM.

End of testing

Re-enable Linux to use all available RAM. Re-enable all needed daemons disabled before our testing period. Re-enable the swap. If needed re-enable APM (maybe at disk level) and CPU frequency scaling. Reboot and check (available RAM, available disk space, swap, daemons, performance (BEWARE! Some tests can erase data!...).

Modify the parameters of the last set of tests used, which delivered a performance judged adequate, with respect to the new amount of RAM available. Try to match your real-world application. Then re-run the test in order to check that the performance gain. Then select the most adequate set of parameters by using all tests offered by your applications or devised by yourself.

Check SMART status periodically (some daemons just do this, for example 'smartclt' and 'mdadm' in 'monitor' mode) and automagically in order to detect and replace battered drives but keep on mind that this is not a magic bullet: it may only warn about very few potential failures.

3ware 9550 RAID5: my tweakings and their results

This iozone results set obtained after some tweaking reveals the problem: 131 random read/second (most other tests results, done with original and various other parameters, are much worse). Note that the box ran this with 12 GB RAM, and that some tests were avoided (no figures in the table).

Setups

Here is a setup leading, with the 9550 RAID5 and 6 Hitachis drives, to 140 random read/s and 195 random write/second), as a shell script named '/etc/rc.local':

# http://www.3ware.com/kb/article.aspx?id=11050k
echo 64 > /sys/block/sda/queue/max_sectors_kb

#Objective: saturating the HD's integrated cache by reading ahead
#during the period used by the kernel to prepare I/O.
#It may put in cache data which will be requested by the next read.
#Read-ahead'ing too much may kill random I/O on huge files
#if it uses potentially useful drive time or loads data beyond caches.
#As soon as the caches are saturated the hit ratio is proportional to
#(data volume/cache size) therefore beware if the data volume is
#much bigger than cumulated useful (not overlaping) caches sizes
/sbin/blockdev --setra 4096 /dev/sda

#Requests in controller's queue. 3ware max is 254(?)
echo 254 > /sys/block/sda/device/queue_depth

#requests in OS queue
#May slow operation down, if too low of high
#Is it per disk? Per controller? Grand total for all controllers?
#In which way does is interact with queue_depth?
#Beware of CPU usage (iowait) if too high
#Miquel van Smoorenburg wrote: CFQ seems to like larger nr_requests,
#so if you use it, try 254 (maximum hardware size) for queue_depth
#and 512 or 1024 for nr_requests.
echo 1024 > /sys/block/sda/queue/nr_requests

#Theoritically 'noop' is better with a smart RAID controller because
#Linux knows nothing about (physical) disks geometry, therefore it
#can be efficient to let the controller, well aware of disk geometry
#and servos locations, handle the requests as soon as possible.
#But 'deadline' and 'CFQ' seem to enhance performance
#even during random I/O. Go figure.
echo cfq > /sys/block/sda/queue/scheduler

#iff deadline
#echo  256 > /sys/block/sda/queue/iosched/fifo_batch
#echo    1 > /sys/block/sda/queue/iosched/front_merges
#echo  400 > /sys/block/sda/queue/iosched/read_expire
#echo 3000 > /sys/block/sda/queue/iosched/write_expire
#echo    2 > /sys/block/sda/queue/iosched/writes_starved

#avoid swapping, better recycle buffercache memory
echo 10 > /proc/sys/vm/swappiness

#64k (sector size) per I/O operation in the swap
echo 16 > /proc/sys/vm/page-cluster

#See also http://hep.kbfi.ee/index.php/IT/KernelTuning
#echo 500 > /proc/sys/vm/dirty_expire_centisecs
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 60 > /proc/sys/vm/dirty_ratio
#echo 1 > /proc/sys/vm/vfs_cache_pressure

#export IRQBALANCE_BANNED_INTERRUPTS=24
#/etc/init.d/irqbalance stop
#/usr/bin/killall irqbalance
#/etc/init.d/irqbalance start

'/etc/sysctl.conf' contains (after Debian-established lines):

kernel.shmmax = 4294967295
kernel.shmall = 268435456

net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=262143
net.core.wmem_default=262143
net.ipv4.tcp_rmem=8192 1048576 16777216
net.ipv4.tcp_wmem=4096 1048576 16777216
vm.min_free_kbytes=65536
net.ipv4.tcp_max_syn_backlog=4096

In a hurry?

Here is how quickly assess performances, for example in order to compare various setups (disks, RAID types, LVM/no LVM, filesystems types or parameters, kernel parameters...)

Before running the tests launch a dedicated terminal and invoke 'iostat -mx 5' to track performances variations, disk utilization percentage and iowait, and read them alternatively during the tests.

Measure 'random access' performance thanks to randomio:

sync
echo 3 > /proc/sys/vm/drop_caches
randomio WHERE THREADS WR_PROPORTION 0.1 AVG_RND_BLOCK_SIZE 60 1 ; done

For sequential read performance, use sdd:

sync
echo 3 > /proc/sys/vm/drop_caches
time sdd -onull if=WHERE bs=1m count=10000 -t

For sequential write performance:
BEWARE! The next command may DIRECTLY WRITE ON THE DISK! DANGER!

sync
echo 3 > /proc/sys/vm/drop_caches
time sdd -inull of=WHERE bs=1m count=10000 -t

Replacing:

WHERE by the name of a file or device (danger: by using a device name (in the '/dev/SOMETHING' form) you may write directly on the disk, bypassing the filesystem. This means that you will probably LOSE all data stored on the disk. BEWARE. In case of doubt, for example if you test an already useful RAID array, do a filesystem-level test by providing a file name, not a device name, after having created the file thanks to 'dd' or 'sdd': the file size must be superior to 4 times the amount of RAM recognized by the kernel, and at least 20% of the diskspace (no need to test if the disk heads don't travel). Do NOT use an existing file as the test will WRITE in it!
THREADS by a value letting you obtain the highest performance. It is is higly context-dependant (AVG_RND_BLOCK_SIZE value, amount of spindles and disk types, stripe size, host CPU and bus, OS, libraries...). An approximative value is OK. Start with 20 times the amount of spindles, then 30 times and raise this way, up to the value lowering performance. Then an dichotomic approach will do. On most boxes the adequate value lays between 128 and 512.
WR_PROPORTION with the fraction of fsync'ed writes requested by your critical application (in case of doubt use 0.1)
AVG_RND_BLOCK_SIZE by the amount of bytes read or written, on (harmonic) average, by your critical application

My setup

I retained a md RAID10, f2 on 9 spindles, with a spare.

Conclusions and ideas

Chosing a controller

In many cases a dedicated RAID controller is not necessary: 'md' will do the job. Beware of:

hot (no server stop!) disk-swapping capability
some mediocre controllers or mainboards, unable to simultaneously offer enough bandwith to all connected disks
a way to cut the power of all spare disk and to have it automatically restored when needed (the disk's APM function may do the trick)

Attributing spindles to arrays

If you know that some critical parts (database tables?) of your dataset will be much more oftenly accessed you may dedicate a RAID to it, with a finely tuned chunk size. This is often impossible as the common load profile do vary, often in unexpected ways, justifying to 'throw' all available spindles in a single array.

Needing random I/O?

Avoid RAID5 at all costs.

When using actors (application, operating system and controller) unable to parallelize accesses (beware: a database server used by only one request at a time falls in this category) one may buy less diskspace from faster devices (rotation speed: 15k and up, less 'rotational' latency) with better average access time (probably smaller form factor, reducing mechanical latency).

If parallelization is effective, as with most contemporary server usages and softwares, buying spindles for a 'md' RAID10 is the way to go.

Defining chunk size

To optimally use all spindles define the chunk size superior to most sizes of data blocks requested by a single I/O operation. In a pragmatic way you can take the arithmetic mean of sizes of blocks requested by a sample work session, bump it according to the standard deviation. Don't bump it too much if many requests read/write much smaller data segments (this is often the case with database index).

Needing sequential I/O?

For sequential I/O (streaming...) and disk space at a bargain go for RAID5 managed by a 3ware. If you use a fair amount of disks be aware of bandwidth limitations (the system may not be able to pump data in/out the drives at their full cumulated throughputs) and consider RAID6 for increased safety.

RAID5 is good at sequential access while preserving the ratio (usable space/total disk space). However it is (especially the 3ware implementation!) awful at random access. 'md' RAID5 is good at random... for a RAID5 (in a word: mediocre)

BBU

Here is Albert Graham's advice about a battery backup unit (which lets the controller cache writes without sacrificing coherency when various types of incidents happen): prefer the model coming on a dedicated extension card. He wrote "The battery can be mounted in a number of different positions and more importantly it is not piggy-backed on the card itself. This battery gets really hot!"

RAID: overall

'md' RAID10 (especially in 'far' mode) is the most impressive and generic performer overall, cumulating the best performance at read (both on random and sequential), random write and throughput. Random access performance remains stable and high, degraded performance is theoritically good (not tested).

Its CPU cost is neglectable on such a machine: at most ~28% over 3ware (the most stressing test (sequential read on a RAID10 made of 10 drives, 18 GB read in ~1 minute) costed 40.571s 'system' load ('md') and 31.722s (3ware)). TODO: test bus contention, by simultaneously running the disk test and some memory-intensive benchmark.

Caveats:

if an average single I/O request size is near (or superior to) the stride size 3ware seems more efficient
under high load 'md' latency max and latencies standard deviation are much higher than 3ware's (but throughput is better)
RAID10 costs disk space (which is now cheap) and degraded mode divides by 2 the sequential read throughput (as summarized by P. Grandi, using Wikipedia)
'md' (RAID10,f2) sequential write performance is mediocre (but constant)

NCQ

NCQ (3ware 'qpolicy') is efficient (~9%) on Hitachi Desktar 500 GB drives

CFQ

CFQ is useful and costs at max 2% with my workload, albeit there is no perceptible difference between the 'realtime' and 'best effort' classes. It offers a real 'idle' class and may boost the aggregated throughput. Benchmark it!

Beware! There is some controversy (see also here) about this, especially when nearly all data are in the cache (nearly no disk read) while the database writes nearly all the time.

LVM

LVM2 induces a performance hit. Avoid it if you don't need snapshot of logical management.

Linux block devices parameters

There is no system parameter acting as a magic bullet, albeit read ahead ('/sbin/blockdev --setra') and 'nr_request' are sometimes efficient

Read ahead

Increasing readahead does not boost performance after a given point, maybe because it is only useful when using some incompressible latency between requests batches sent from the controller to the drive, maybe somewhat composed with the drive's integrated cache. The optimal value is context (hardware and software)-dependent.

Filesystem types

XFS is the best performer, maybe (partly) because it has a correct lecture of the array geometry. Some think that is is not robust enough, AFAIK it is only exposed to loss after a disk crash, therefore is OK on good RAID.

See Sander Marechal's benchmark.

你可能感兴趣的:(存储性能测试)

探索NebulaGraph：一个开源分布式图数据库的技术解析一休哥助手数据库分布式系统开源分布式数据库
1.介绍NebulaGraph的定位和用途NebulaGraph是一款开源的分布式图数据库，专注于存储和处理大规模图数据。它的主要定位是为了解决图数据存储和分析的问题，能够处理节点和边数量巨大、结构复杂的图结构数据。NebulaGraph被设计用来应对各种领域的图数据挑战，包括社交网络分析、推荐系统、网络安全监测等。无论是从数据量还是计算复杂度上，NebulaGraph都能够应对各种挑战，为用户提
整形在内存中的存储（例题逐个解析）祁同伟. #C语言 c语言
目录一.相关知识点1.截断：2.整形提升：3.如何截断，整型提升？（1）负数（2）正数（3）无符号整型，高位补0注意：提升后得到的是补码。要根据打印类型，判断是否有符号位；有效数字二.例题1.2.3.4.疑问：不应该算数转换为unsignedint吗？5.6.一.相关知识点1.截断：直接保留低位的二进制位2.整形提升：表达式中的字符(char)和短整形(short)操作数在使用之前被转换为普通整型
MySQL 中，分库分表机制和分表分库策略小赖同学啊 java mysql oracle 数据库
在MySQL中，分库分表是一种常见的数据库水平扩展方案，用于解决单库单表数据量过大导致的性能瓶颈问题。通过将数据分散到多个数据库或表中，可以提高系统的并发处理能力、降低单点故障风险，并提升查询性能。一、分库分表的作用提升性能：分散数据存储和查询压力，避免单库单表的性能瓶颈。提高并发能力：多个数据库或表可以并行处理请求，提高系统吞吐量。降低单点故障风险：数据分散存储，单个数据库或表故障不会影响整个系
Python中Requests的Cookies的简单使用北条苒茗殇 python 开发语言 Requests
概述Python的Requests库中有一个cookies，是用于管理HTTPCookie的工具，可以像字典一样操作Cookie，支持自动处理作用域（域名、路径）和持久化，cookies是一个RequestsCookieJar的类型。一、概念1.作用自动存储服务器返回的Cookie根据请求域名和路径进行自动发送匹配的Cookie支持手动添加、修改、删除Cookie2.RequestsCookieJ
解决前后端分离跨域产生的session丢失问题 luckilyil BUG java servlet
目录前言存储用户信息的方式Cookies：Token（令牌）：LocalStorage/SessionStorage：Session：Redis：OAuth/OIDC：本篇文章主要讲使用session会话来存储信息会话机制1.何为一次会话，会话从什么时候开始，从什么时候结束？2.cookies如何保持会话，它的工作流程？3.什么是Session？Session的工作原理：问题出现解决方法总结前言现
31天Python入门——第11天:挑战一口气把闭包·装饰器讲明白安然无虞 Python手把手教程 python 开发语言后端 pyqt
你好，我是安然无虞。文章目录1.闭包扩展知识:闭包的自由变量是如何存储的2.装饰器装饰器的应用场景3.补充练习1.闭包闭包是指在一个函数内部定义的函数，并且这个内部函数可以访问外部函数的变量、参数.换句话说，闭包是一个包含了函数及其相关引用环境的组合体.在Python中，当一个函数返回了内部函数的引用时，这个内部函数可以访问并操作外部函数的局部变量，它就创建了一个闭包,即使外部函数已经执行完毕，它
蓝桥杯web备赛----html篇菥菥爱嘻嘻蓝桥杯备赛前端蓝桥杯 html
1、html写在前面，html相对简单，主要会考基础标签、html5新特性、html5本地存储、但是目前我还没有做到本地存储的题目1.1基础标签(1)、链接标签a:访问Examplehref:链接target：定义链接的打开方式。_blank:在新窗口或新标签页中打开链接。_self:在当前窗口或标签页中打开链接（默认）。_parent:在父框架中打开链接。_top:在整个窗口中打开链接，取消任何
【ZYNQ开发】使用xilfs库实现对于SD卡的读写辣个蓝人QEX ZYNQ MPSoC ZYNQ MPSoC arm开发 Xilffs FATFS 文件IO
文章目录1Xiliffs库2BSP配置3文件IO操作4一些重要的细节5总体测试代码1Xiliffs库Xilinx的Xilffs库是一个用于嵌入式系统的FAT文件系统库。它支持FAT12、FAT16和FAT32乃至exFAT系统文件系统，适用于SD卡、eMMC等存储设备。Xilffs库与Xilinx的硬件IP核（如SD/SDIO控制器）紧密集成，提供文件读写、目录管理等基本操作。该库具有轻量级、可配
C#进阶之路：揭秘反序列化漏洞与解决方案计算机学长开发工具 C#web安全网络 c#
一、引言在现代软件开发中，数据的持久化和传输是至关重要的环节。C#作为一种广泛使用的编程语言，其序列化与反序列化机制在这两个环节中扮演着不可或缺的角色。序列化，是将对象的状态信息转换为可以存储或传输的形式的过程，比如将对象转换为字节流、JSON字符串或者XML格式。而反序列化则是将这些序列化后的数据重新转换回原始对象的过程。在实际应用中，当我们需要将对象保存到文件系统、数据库，或者通过网络在不同的
数据库设计20条军规：血泪教训换来的实战指南潘多编程数据库
优秀的数据库设计不是炫技，而是用最低的成本规避最痛的坑。在经历过数百次深夜故障复盘后，我总结了这些真正经得起生产环境考验的铁律：一、基础生存法则第三范式是起点不是终点订单表里的收货地址必须拆成独立地址表？先看业务场景：日均10万订单的电商系统，拆分会带来3表关联查询，不拆可能存储冗余。实战解法：高频查询字段适当冗余，低频字段严格范式化。命名规范要强制执行user_order_2023比tbl_us
Spring Cloud Config 快速介绍与实例 oscar999 Spring Boot实战开发大全 Spring Boot Cloud Config
SpringCloudConfig是什么？SpringCloudConfig是一个用于分布式系统的配置管理工具，提供集中化的外部配置支持。它适用于微服务架构，能够将各个服务的配置集中存储在服务端（如Git仓库），客户端按需动态获取配置，解决了配置分散、环境切换复杂等问题。SpringCloudConfig核心概念ConfigServer：配置中心服务端，统一管理配置，支持Git、本地文件等存储方式
HTML5！进击2025web蓝桥杯复习之路 Deepsleep. html5 前端 html
#HTML5全面解析##目录1.[HTML5简介](#1-html5-简介)2.[基本标签](#2-基本标签)3.[新特性](#3-新特性)4.[本地存储](#4-本地存储)5.[总结](#5-总结)---##1.HTML5简介HTML5是HTML的第五个主要版本，2014年由W3C正式发布。主要特性包括：-语义化标签-多媒体支持-图形绘制（Canvas/SVG）-本地存储能力-WebWorker
Appdata\Local Roaming LocalLow文件夹 ynchyong 系统运维 local Roaming LocalLow
自Vista及Win7开始，微软更改了原有的应用程序存储目录结构，（XP是ApplicationData）C\用户\用户名\Appdata,并分为Roaming,Local,及LocalLow三个文件夹.更改原因如下:优化登录速度根据使用安全级别分别访问不同文件夹Windows使用Local及LocalLow文件夹存放非漫游的应用程序数据（类似注册表Local_machine）及一些空间占用大无法
亿级流量架构网关设计思路，常用网关对比，写得太好了。。 wadfdhsajd java 后端框架大数据
什么是网关网关,很多地方将网关比如成门,没什么问题,但是需要区分网关与网桥的区别,网桥工作在数据链路层，在不同或相同类型的LAN之间存储并转发数据帧，必要时进行链路层上的协议转换。可连接两个或多个网络，在其中传送信息包。网关是一个大概念，不具体特指一类产品，只要连接两个不同的网络都可以叫网关,网桥一般只转发信息,而网关可能进行包装。网关通俗理解根据网关的特性,举个例子:假如你要去找集团老板(这儿只
运维面试题（七） a_j58 运维
1.statefulset用来管理有状态的应用程序，有状态是什么意思？每一个pod都有一个固定的网络标识符，在整个生命周期中不会改变。每个实例都可以拥有自己的持久化存储卷，即使容器被删除并重新创建，存储卷仍然存在。StatefulSet确保了Pod按照顺序启动、更新和终止。2.主键是什么，它与索引有什么关系？主键确保表中每一行数据都可以被唯一标识，避免数据重复。主键通常会自动创建一个唯一索引，加快
SAP-ABAP：ABAP内存和SAP内存详细对比爱喝水的鱼丶 VIP详情查看专栏 SAP-ABAP开发基础详解 ABAP开发之必须知道的 SAP 运维 ABAP ERP
在SAPABAP中，内存数据（MemoryData）是一种临时存储机制，允许在同一会话或程序之间共享数据。内存数据存储在ABAP内存（ABAPMemory）或SAP内存（SAPMemory）中，具体取决于数据的生命周期和共享范围。以下是关于如何在SAP中保存和使用内存数据的详细说明：—##1.ABAP内存vsSAP内存###ABAP内存-作用范围:仅在当前内部会话（InternalSession）
Linux------Redis(软件安装，Linux下和Windows下)，NoSQL（简单了解） .墨迹. Linux redis 大数据 java
文章目录NoSql1.历史1.单机MySql2.Memcached(缓存)+MySql+垂直拆分(读写分离)3.分库分表+水平拆分+MySql集群4.如今最近的年代5.为什么要使用NoSQL2.什么是NoSQL1.NOSQL2.特点3.3v+3高3.NoSQL的四大分类1.kv键值对：2.文档型数据库（bson和json一样）：3.列存储数据库：4.图关系型数据库Redis1.初始redis1.简
基于跳表实现的轻量级KV存储引擎项目总结码云笔记后端 KV存储
项目介绍KV存储引擎众所周知，非关系型数据库redis，以及levedb，rockdb其核心存储引擎的数据结构就是跳表。本项目就是基于跳表实现的轻量级键值型存储引擎，使用C++实现。插入数据、删除数据、查询数据、数据展示、数据落盘、文件加载数据，以及数据库大小显示。在随机写读情况下，该项目每秒可处理啊请求数（QPS）:24.39w，每秒可处理读请求数（QPS）:18.41w项目存储文件main.c
从零实现KV存储项目实战程序员老舅 C++Linux后端 c++c++存储 kv存储分布式存储后端项目 c++项目 cpp项目
本项目是从零实现一个完整的、兼容Redis协议的KV数据库项目。通过每一行代码的编写。你会对整个系统了如指拿，这样对自己基本功的锻炼、对编程能力的提升都是很大的项目提供完整的视频教程+代码下面是关于KV存储项目的技术大纲：如果你在学习的过程当中，遇到有任何问题，都可以在项目社群提出了，有专人给大家答疑的。适用人群这个KV存储项目对以下同学应该都非常的合适,包括但不限于:●想入门数据库的同学，存储对
硬核项目 KV 存储，轻松拿捏面试官！程序员老舅 C++Linux后端 KV存储 C++C++后端开发 Redis 内存索引 C++数据结构
硬核项目KV存储，轻松拿捏面试官！在简历上如何写这个项目？项目概述基于Bitcask模型，兼容Redis数据结构和协议的高性能KV存储引擎设计细节采用Key/Value的数据模型，实现数据存储和检索的快速、稳定、高效存储模型：采用Bitcask存储模型，具备高吞吐量和低读写放大的特征持久化：实现了数据的持久化，确保数据的可靠性和可恢复性索引：多种内存索引结构，高效、快速数据访问并发控制：使用锁机制
华为云计算产品系列 | 云上迁移工具RainBow实战详解降世神童云计算技术专栏华为华为云云计算
华为云计算产品系列|云上迁移工具RainBow实战详解1.迁移方案2.迁移流程3.迁移实验3.1.Windows系统迁移3.2.Linux系统迁移3.3.存储层迁移1.迁移方案 RainBow可以将物理机或者虚拟机上的业务迁移到华为的虚拟化平台和私有云平台（6.5.1以上支持），还可以实现低版本私有云迁移到高版本私有云。 Rainbow是华为自研迁移工具，支持X86架构下主流的Linux、Wi
链接-简介 zhubo_1117 深入理解计算机系统
链接是将代码和数据合成一个文件的一个过程，生成的文件可以直接拷贝到存储器中并且执行。链接可以在程序编译时，加载时，甚至运行时执行。1.编译器的驱动程序编译器系统中包含编译驱动程序，驱动程序主要包含：预处理器，编译器，汇编器和连接器。处理过程如下：预处理器编译器汇编器main.c------------------>main.i----------------------->main.s------
Linux 内核数据结构解析--哈希链表 Black8Mamba24 Linux内核数据结构
一、Hash表的基本定义1.1Hash的概念散列表（Hashtable，也叫哈希表）,是一种数据结构，可以用于存储Key-Value键值对。也就是说，通过Key来映射到具体的Value。通常用于查找。将Key映射到Value的函数叫做Hash函数，而存储Key-Value的表叫做Hash表。Hasn表常用数组来存储。1.2常用的Hash函数1.3常用的处理碰撞的方法如果说存储空间是无线的，那只要定
使用Couchbase实现高效的AI应用缓存与数据存储 scaFHIO 人工智能缓存 python
在当今AI应用的开发中，除了模型本身的性能，数据存储和缓存的效率也至关重要。Couchbase作为一款分布式NoSQL云数据库，其性能、可扩展性以及对AI、边缘计算应用的支持能力，使其成为优秀的选择。在本文中，我们将探讨如何通过Couchbase来实现高效的数据存储与缓存，尤其是在AI应用中。技术背景介绍随着AI应用规模的扩大和复杂度的增加，我们需要可靠的数据存储解决方案来满足实时性要求，同时减少
深度剖析linux内核万能--双向链表,Hash链表模版 Engineer-Bruce_Yang C语言-算法与数据结构编程 C语言在开发中的应用
我们都知道，链表是数据结构中用得最广泛的一种数据结构，对于数据结构，有顺序存储，数组就是一种。有链式存储，链表算一种。当然还有索引式的，散列式的，各种风格的说法，叫法层出不穷，但是万变不离其中，只要知道什么场合用什么样的数据结构，那就行了。那么，标题说的内核万能链表，其实就是内核链表，它到底和我们平常大学学的数据结构的链表有什么不同呢？？内核链表，是在linux内核里的一种普遍存在的数据结构，比如
数据结构——链表专项 seven——seven linux mailbox之线程邮箱数据结构链表算法
数据结构的总结1.定义一组用来保存一种或者多种特定关系的数据的集合（组织和存储数据）程序的设计：将现实中大量而复杂的问题以特定的数据类型和特定的存储结构存储在内存中，并在此基础上实现某个特定的功能的操作；程序=数据结构+算法高内聚，低耦合2.数据与数据之间的关系数据的逻辑结构：数据元素与元素之间的关系集合：关系平等线性结构：元素之间一对一的关系（表，队列。栈。。。）树型结构：元素之间一对多的关系（
数据结构-----队列磨十三数据结构算法 linux
顺序队列（Queue）一、队列核心概念1.基本特性先进先出（FIFO）：最早入队的元素最先出队操作限制：队尾（Rear）：唯一允许插入的位置队头（Front）：唯一允许删除的位置2.顺序队列结构typedefintDATATYPE;typedefstructqueue{DATATYPE*ptr;//存储空间基地址inttlen;//队列总容量inthead;//队头索引inttail;//队尾索引
多级缓存设计实践 MClink 架构缓存
缓存是什么？缓存技术是一种用于加速数据访问的优化策略。它通过将频繁访问的数据存储在高速存储介质（如内存）中，减少对慢速存储设备（如硬盘或远程服务器）的访问次数，从而提升系统的响应速度和性能。缓存的基本原理是：当某个数据被请求时，系统首先检查缓存中是否已存储该数据。如果缓存中存在，则直接返回缓存中的数据，称为“缓存命中”；如果缓存中没有该数据，则从源数据存储（如数据库或远程服务器）中获取数据，并将其
Graylog日志系统超详细部署和配置 kim_liao123 部署 elasticsearch docker
Graylog日志系统部署和配置1.软件介绍：Graylog是一个开源的日志聚合、分析、审计、展现和预警工具。功能上和ELK类似，但又比ELK要简单，依靠着更加简洁，高效，部署使用简单；官方文档：https://docs.graylog.org/en/3.3/pages/users_and_roles.html以下所有部署方式都来源与官方文档2.软件准备：服务端：Mongo：存储graylog的一
Java高频面试之SE-23 牛马baby java 面试 windows
hello啊，各位观众姥爷们！！！本baby今天又来了！哈哈哈哈哈嗝Java中的Stream是Java8引入的一种全新的数据处理方式，它基于函数式编程思想，提供了一种高效、简洁且灵活的方式来操作集合数据。Stream的核心思想是声明式编程（告诉程序“做什么”，而不是“怎么做”）。1.Stream的核心特点无存储：Stream不存储数据，只是对数据源的视图（如集合、数组、I/O通道等）。函数式操作：
Enum用法不懂事的小屁孩 enum
以前的时候知道enum，但是真心不怎么用，在实际开发中，经常会用到以下代码: protected final static String XJ = "XJ"; protected final static String YHK = "YHK"; protected final static String PQ = "PQ";
【Spark九十七】RDD API之aggregateByKey bit1129 spark
1. aggregateByKey的运行机制 /** * Aggregate the values of each key, using given combine functions and a neutral "zero value". * This function can return a different result type
hive创建表是报错： Specified key was too long; max key length is 767 bytes daizj hive
今天在hive客户端创建表时报错，具体操作如下 hive> create table test2(id string); FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataSto
Map 与 JavaBean之间的转换周凡杨 java 自省转换反射
最近项目里需要一个工具类，它的功能是传入一个Map后可以返回一个JavaBean对象。很喜欢写这样的Java服务，首先我想到的是要通过Java 的反射去实现匿名类的方法调用，这样才可以把Map里的值set 到JavaBean里。其实这里用Java的自省会更方便，下面两个方法就是一个通过反射，一个通过自省来实现本功能。 1：JavaBean类 1 &nb
java连接ftp下载 g21121 java
有的时候需要用到java连接ftp服务器下载，上传一些操作，下面写了一个小例子。 /** ftp服务器地址 */ private String ftpHost; /** ftp服务器用户名 */ private String ftpName; /** ftp服务器密码 */ private String ftpPass; /** ftp根目录 */ private String f
web报表工具FineReport使用中遇到的常见报错及解决办法（二）老A不折腾 finereport web报表 java报表总结
抛砖引玉，希望大家能把自己整理的问题及解决方法晾出来，Mark一下，利人利己。出现问题先搜一下文档上有没有，再看看度娘有没有，再看看论坛有没有。有报错要看日志。下面简单罗列下常见的问题，大多文档上都有提到的。 1、没有返回数据集：在存储过程中的操作语句之前加上set nocount on 或者在数据集exec调用存储过程的前面加上这句。当S
linux 系统cpu 内存等信息查看墙头上一根草 cpu 内存 liunx
1 查看CPU 　　1.1 查看CPU个数　　# cat /proc/cpuinfo | grep "physical id" | uniq | wc -l 　　2 　　**uniq命令：删除重复行;wc –l命令：统计行数** 　　1.2 查看CPU核数　　# cat /proc/cpuinfo | grep "cpu cores" | u
Spring中的AOP aijuans spring AOP
Spring中的AOP Written by Tony Jiang @ 2012-1-18 （转）何为AOP AOP，面向切面编程。在不改动代码的前提下，灵活的在现有代码的执行顺序前后，添加进新规机能。来一个简单的Sample: 目标类： [java] view plain copy print ? package&nb
placeholder(HTML 5) IE 兼容插件 alxw4616 JavaScript jquery jQuery插件
placeholder 这个属性被越来越频繁的使用. 但为做HTML 5 特性IE没能实现这东西. 以下的jQuery插件就是用来在IE上实现该属性的. /** * [placeholder(HTML 5) IE 实现.IE9以下通过测试.] * v 1.0 by oTwo 2014年7月31日 11:45:29 */ $.fn.placeholder = function
Object类,值域,泛型等总结(适合有基础的人看) 百合不是茶泛型的继承和通配符变量的值域 Object类转换
java的作用域在编程的时候经常会遇到,而我经常会搞不清楚这个问题,所以在家的这几天回忆一下过去不知道的每个小知识点变量的值域; package 基础; /** * 作用域的范围 * * @author Administrator * */ public class zuoyongyu { public static vo
JDK1.5 Condition接口 bijian1013 java thread Condition java多线程
Condition 将 Object 监视器方法（wait、notify和 notifyAll）分解成截然不同的对象，以便通过将这些对象与任意 Lock 实现组合使用，为每个对象提供多个等待 set （wait-set）。其中，Lock 替代了 synchronized 方法和语句的使用，Condition 替代了 Object 监视器方法的使用。条件（也称为条件队列或条件变量）为线程提供了一
开源中国OSC源创会记录 bijian1013 hadoop spark MemSQL
一.Strata+Hadoop World（SHW）大会是全世界最大的大数据大会之一。SHW大会为各种技术提供了深度交流的机会，还会看到最领先的大数据技术、最广泛的应用场景、最有趣的用例教学以及最全面的大数据行业和趋势探讨。二.Hadoop &nbs
【Java范型七】范型消除 bit1129 java
范型是Java1.5引入的语言特性，它是编译时的一个语法现象，也就是说，对于一个类，不管是范型类还是非范型类，编译得到的字节码是一样的，差别仅在于通过范型这种语法来进行编译时的类型检查，在运行时是没有范型或者类型参数这个说法的。范型跟反射刚好相反，反射是一种运行时行为，所以编译时不能访问的变量或者方法(比如private)，在运行时通过反射是可以访问的，也就是说，可见性也是一种编译时的行为，在
【Spark九十四】spark-sql工具的使用 bit1129 spark
spark-sql是Spark bin目录下的一个可执行脚本，它的目的是通过这个脚本执行Hive的命令，即原来通过 hive>输入的指令可以通过spark-sql>输入的指令来完成。 spark-sql可以使用内置的Hive metadata-store，也可以使用已经独立安装的Hive的metadata store 关于Hive build into Spark
js做的各种倒计时 ronin47 js 倒计时
第一种：精确到秒的javascript倒计时代码 HTML代码: <form name="form1"> <div align="center" align="middle"
java-37.有n 个长为m+1 的字符串，如果某个字符串的最后m 个字符与某个字符串的前m 个字符匹配，则两个字符串可以联接 bylijinnan java
public class MaxCatenate { /* * Q.37 有n 个长为m+1 的字符串，如果某个字符串的最后m 个字符与某个字符串的前m 个字符匹配，则两个字符串可以联接， * 问这n 个字符串最多可以连成一个多长的字符串，如果出现循环，则返回错误。 */ public static void main(String[] args){
mongoDB安装开窍的石头 mongodb安装基本操作
mongoDB的安装 1:mongoDB下载 https://www.mongodb.org/downloads 2:下载mongoDB下载后解压
[开源项目]引擎的关键意义 comsci 开源项目
一个系统，最核心的东西就是引擎。。。。。而要设计和制造出引擎，最关键的是要坚持。。。。。。现在最先进的引擎技术，也是从莱特兄弟那里出现的，但是中间一直没有断过研发的
软件度量的一些方法 cuiyadll 方法
软件度量的一些方法http://cuiyingfeng.blog.51cto.com/43841/6775/在前面我们已介绍了组成软件度量的几个方面。在这里我们将先给出关于这几个方面的一个纲要介绍。在后面我们还会作进一步具体的阐述。当我们不从高层次的概念级来看软件度量及其目标的时候，我们很容易把这些活动看成是不同而且毫不相干的。我们现在希望表明他们是怎样恰如其分地嵌入我们的框架的。也就是我们度量的
XSD中的targetNameSpace解释 darrenzhu xml namespace xsd targetnamespace
参考链接: http://blog.csdn.net/colin1014/article/details/357694 xsd文件中定义了一个targetNameSpace后，其内部定义的元素，属性，类型等都属于该targetNameSpace,其自身或外部xsd文件使用这些元素，属性等都必须从定义的targetNameSpace中找：例如：以下xsd文件，就出现了该错误，即便是在一
什么是RAID0、RAID1、RAID0+1、RAID5，等磁盘阵列模式? dcj3sjt126com raid
RAID 1又称为Mirror或Mirroring，它的宗旨是最大限度的保证用户数据的可用性和可修复性。 RAID 1的操作方式是把用户写入硬盘的数据百分之百地自动复制到另外一个硬盘上。由于对存储的数据进行百分之百的备份，在所有RAID级别中，RAID 1提供最高的数据安全保障。同样，由于数据的百分之百备份，备份数据占了总存储空间的一半，因而，Mirror的磁盘空间利用率低，存储成本高。 Mir
yii2 restful web服务快速入门 dcj3sjt126com PHP yii2
快速入门 Yii 提供了一整套用来简化实现 RESTful 风格的 Web Service 服务的 API。特别是，Yii 支持以下关于 RESTful 风格的 API：支持 Active Record 类的通用API的快速原型涉及的响应格式（在默认情况下支持 JSON 和 XML) 支持可选输出字段的定制对象序列化适当的格式的数据采集和验证错误
MongoDB查询(3)——内嵌文档查询（七） eksliang MongoDB查询内嵌文档 MongoDB查询内嵌数组
MongoDB查询内嵌文档转载请出自出处：http://eksliang.iteye.com/blog/2177301 一、概述有两种方法可以查询内嵌文档：查询整个文档；针对键值对进行查询。这两种方式是不同的，下面我通过例子进行分别说明。二、查询整个文档例如:有如下文档 db.emp.insert({ &qu
android4.4从系统图库无法加载图片的问题 gundumw100 android
典型的使用场景就是要设置一个头像，头像需要从系统图库或者拍照获得，在android4.4之前，我用的代码没问题，但是今天使用android4.4的时候突然发现不灵了。baidu了一圈，终于解决了。下面是解决方案： private String[] items = new String[] { "图库","拍照" }; /* 头像名称 */
网页特效大全 jQuery等 ini JavaScript jquery css html5 ini
HTML5和CSS3知识和特效 asp.net ajax jquery实例分享一个下雪的特效 jQuery倾斜的动画导航菜单选美大赛示例你会选谁 jQuery实现HTML5时钟功能强大的滚动播放插件JQ-Slide 万圣节快乐！！！向上弹出菜单jQuery插件 htm5视差动画 jquery将列表倒转顺序推荐一个jQuery分页插件 jquery animate
swift objc_setAssociatedObject block(version1.2 xcode6.4) 啸笑天 version
import UIKit class LSObjectWrapper: NSObject { let value: ((barButton: UIButton?) -> Void)? init(value: (barButton: UIButton?) -> Void) { self.value = value
Aegis 默认的 Xfire 绑定方式，将 XML 映射为 POJO MagicMa_007 java POJO xml Aegis xfire
Aegis 是一个默认的 Xfire 绑定方式，它将 XML 映射为 POJO, 支持代码先行的开发.你开发服务类与 POJO,它为你生成 XML schema/wsdl XML 和注解映射概览默认情况下，你的 POJO 类被是基于他们的名字与命名空间被序列化。如果
js get max value in (json) Array qiaolevip 每天进步一点点学习永无止境 max 纵观千象
// Max value in Array var arr = [1,2,3,5,3,2];Math.max.apply(null, arr); // 5 // Max value in Jaon Array var arr = [{"x":"8/11/2009","y":0.026572007},{"x"
XMLhttpRequest 请求 XML,JSON ,POJO 数据 Luob. POJO json Ajax xml XMLhttpREquest
在使用XMlhttpRequest对象发送请求和响应之前，必须首先使用javaScript对象创建一个XMLHttpRquest对象。 var xmlhttp； function getXMLHttpRequest(){ if(window.ActiveXObject){ xmlhttp:new ActiveXObject("Microsoft.XMLHTTP
jquery wuai jquery
以下防止文档在完全加载之前运行Jquery代码，否则会出现试图隐藏一个不存在的元素、获得未完全加载的图像的大小等等 $(document).ready(function(){ jquery代码; }); <script type="text/javascript" src="c:/scripts/jquery-1.4.2.min.js&quo