MySQL File System Fragmentation Benchmarks

A few days ago I wrote about testing writes to many files and how this affects sequential read performance. I was very interested to see how it shows itself with real tables, so I got the script and ran tests for MyISAM and Innodb tables on an ext3 filesystem. Here is what I found:

The fragmentation we speak of in this article is filesystem fragmentation, or internal table fragmentation, which affects the performance of full table scans. Not all queries are going to be affected the same way; for example a point select reading a single page should not be significantly affected – i.e. you may not be affected as badly as we show here.

Benchmarks were done using the benchmark.php script.
The benchmark was run with the following simple shell loop:

[root@DB10 ~]# for i in 1 10 100 1000 10000; do ./benchmark.php $i 10000000; mysql -e'drop database test1'; mysql -e'create database test1'; done;
tables: 1; total records: 10000000; write rows per sec: 9498.478176443 , reads rows per sec: 45142.234447809sec.
tables: 10; total records: 10000000; write rows per sec: 8401.0627704619 , reads rows per sec: 18970.855770078sec.
tables: 100; total records: 10000000; write rows per sec: 6689.2612428044 , reads rows per sec: 2212.7170957877sec.
tables: 1000; total records: 10000000; write rows per sec: 5180.4069362984 , reads rows per sec: 1346.7226581156sec.
tables: 10000; total records: 10000000; write rows per sec: 262.6496819245 , reads rows per sec: 1169.4009695919sec.

The script creates the specified number of tables and does the specified number of inserts into randomly chosen tables. I used default MySQL settings for MyISAM (table_cache=64) and set innodb_buffer_pool_size=8G innodb_flush_log_at_trx_commit=2 innodb_log_file_size=256M innodb_flush_method=O_DIRECT for Innodb.
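
The original benchmark.php was linked from the post rather than included inline. As a rough illustration of the methodology only, a minimal sketch of what such a script does could look like the following – the table layout, column names and the mysqli calls are my assumptions (the real script targeted PHP 4 and likely used the old mysql_* functions):

<?php
// Rough sketch of benchmark.php based on the description in the post.
// Usage: php benchmark.php <number_of_tables> <total_records>
// Assumptions: database test1 exists; tables t0..tN-1 with a CHAR(32) column.

$number_of_tables  = (int)$argv[1];
$number_of_records = (int)$argv[2];

$db = mysqli_connect('localhost', 'root', '', 'test1');

// Create the requested number of identical tables.
for ($i = 0; $i < $number_of_tables; $i++) {
    mysqli_query($db, "CREATE TABLE t$i (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, val CHAR(32) NOT NULL)");
}

// Write phase: single-row inserts into randomly chosen tables, so all
// tables grow concurrently and compete for filesystem extents.
$t_temp = microtime(1);
for ($i = 0; $i < $number_of_records; $i++) {
    $tbl = mt_rand(0, $number_of_tables - 1);
    mysqli_query($db, "INSERT INTO t$tbl (val) VALUES ('" . md5($i) . "')");
}
$write_rate = $number_of_records / (microtime(1) - $t_temp);

// Read phase: full scan of every table - this is what fragmentation hurts.
$t_temp = microtime(1);
for ($i = 0; $i < $number_of_tables; $i++) {
    mysqli_query($db, "SELECT count(*), sum(length(val)) FROM t$i");
}
$read_rate = $number_of_records / (microtime(1) - $t_temp);

printf("tables: %d; total records: %d; write rows per sec: %s , reads rows per sec: %ssec.\n",
    $number_of_tables, $number_of_records, $write_rate, $read_rate);

Note the read rate here divides the total record count by the total scan time, which is also the calculation questioned in comment 19 below.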

The tables were sized so they are considerably larger than the amount of memory in the box, so a full table scan will be IO bound.

As you can see from the MyISAM results (above), insert speed does not degrade that badly until going from 1000 to 10000 tables, even though table_cache was just 64. I expect this is because updating the index header (the most complex part of opening and closing a MyISAM table) can happen in the background via the OS, and flushing 1000 pages every 30 seconds is not a big overhead for this server configuration.

Going to 10000 tables, however, insert speed dropped 20 times. This could be because ext3 does not like so many files in a directory, or because random updates to 10000 distinct pages for index header updates – not to mention modification time updates – are a lot of overhead. During this last test the box felt really sluggish, taking 10+ seconds to respond to a command as simple as “ls” even though loadavg was about 1. In the process list I could see some single-value insert statements taking over 5 seconds… So it does not work very well.

Note: As I checked later, contrary to my expectation, this filesystem was created without the dir_index option, which should add significant overhead for inserts with many tables.

The read performance, which is the main measurement for this benchmark, suffered as expected – with 10000 tables it was 40 times worse than with a single table! Looking at IOSTAT I could see an average read size of just 4K, which means ext3 does a horrible job of extent allocation in this case. Note however that even 100 tables are enough to drop performance 20 times.

Innodb in single tablespace mode showed the following results:

[root@DB10 ~]# for i in 1 10 100 1000 10000; do ./benchmark.php $i 10000000; mysql -e'drop database test1'; mysql -e'create database test1'; done;
tables: 1; total records: 10000000; write rows per sec: 4919.7214223134 , reads rows per sec: 25408.766711241sec.
tables: 10; total records: 10000000; write rows per sec: 4887.5507251885 , reads rows per sec: 11848.973747839sec.
tables: 100; total records: 10000000; write rows per sec: 4007.1215976416 , reads rows per sec: 11826.941546043sec.
tables: 1000; total records: 10000000; write rows per sec: 2838.9678814081 , reads rows per sec: 13758.602641499sec.
tables: 10000; total records: 10000000; write rows per sec: 803.46939763369 , reads rows per sec: 3629.3610005806sec.

As you can see, insert speed starts lower but degrades less, even though the drop from 1000 to 10000 tables is dramatic as well. The read speed is also slower (expected, as the table was larger for the same number of rows), though it drops at a different rate. Interestingly enough, it dropped just 2 times and was about the same for 10, 100 and 1000 tables, which could be because of extent allocation for rather large tables. For 10000 tables we had just 1000 4K rows per table, which caused too much space to be allocated as single pages. I expect that if we used a larger number of rows, read performance for 10000 tables would be close.

Innodb with innodb_file_per_table=1 had the following results:

[root@DB10 ~]# for i in 1 10 100 1000 10000; do ./benchmark.php $i 10000000; mysql -e'drop database test1'; mysql -e'create database test1'; done;
tables: 1; total records: 10000000; write rows per sec: 4479.3690015631 , reads rows per sec: 25554.477094788sec.
tables: 10; total records: 10000000; write rows per sec: 4279.1557765714 , reads rows per sec: 16787.265656296sec.
tables: 100; total records: 10000000; write rows per sec: 3609.974742019 , reads rows per sec: 16525.06580466sec.
tables: 1000; total records: 10000000; write rows per sec: 2130.8515988384 , reads rows per sec: 11602.826401997sec.
tables: 10000; total records: 10000000; write rows per sec: 434.330528194 , reads rows per sec: 5157.1389149296sec.

Insert performance is close; the difference is perhaps explained by the fact that files needed to be constantly extended (metadata updates) and reopened for more than 100 tables. Read performance starts close but degrades less for 10 and 100 tables, and is better again for 10000 tables. I can’t explain why it is a bit worse for 1000 tables, though as I did only one run (it took more than 24 hours) it could also be some activity spike.

The somewhat better performance in this case can perhaps be explained by the larger increment by which per-table tablespaces are allocated, compared to internal allocation from the single tablespace.

Summary: There are a few basic things we can learn from these results:
– Concurrent growth of many tables causes data fragmentation and hurts table scan performance badly
– MyISAM suffers worse than Innodb
– Innodb extent allocation works (and would perhaps be a good option for MyISAM as well)
– Innodb suffers less from fragmentation if it stores different tables in different files.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Simetrical  says:

    Were you using dir_index for the filesystems?

  2. Diego  says:

    was this a typo?
    >>4K which means ext2 does horrible job<< (ext2 or was it ext3?)

  3. PaulM  says:

    Nice article Peter,

    We had a similar issue with the number of files per directory on a migration project I worked on in the past. A perl script was taking articles from a legacy db and dumping each as individual files onto a linux FS (ext3). After a fast start, as you found it slowed down dramatically.
    Our solution was to add an additional step to the dump article script to split the files into 1000 per directory. The performance was then stable throughout the process. The funny thing was I offered a slab of beer to the IT portion of the company for someone who could solve it (got many more people interested).

    Maybe you should add another recommendation:
    No more than 1000 files per directory on ext3

    Have Fun
    Paul

  4. peter  says:

    Simetrical,

    I just checked and it was not used… I thought it was already set by default on the CentOS4 box. So yes, the lack of dir_index should be one of the things which impacted insert speed.

  5. peter  says:

    Diego,

    Thanks, fixed. I only tested with ext3 – you’re welcome to test with ext2 and let us know 

  6. peter  says:

    PaulM,

    This is not really about the number of files per directory, though I have not tried hashing it. The point is that files get fragmented and file read speed becomes low – while scanning large tables, the penalty of locating and opening the file is not huge.

  7. Kevin Burton  says:

    I might be wrong but if you have table_cache set to a large value then dir_index won’t really make much difference once the file is opened.

    This will be amortized over the entire length of the DB server.

  8. peter  says:

    Kevin,

    table_cache is not the only setting you should be looking at – if you’re using innodb_file_per_table, the innodb_open_files setting also needs to be set high enough so no reopens are required.

    I specifically kept those at their defaults so we get the open/reopen overhead factored in.

  9. Kevin Burton  says:

    Peter,

    If you’re interested in testing the performance of open() then you should do this in a dedicated benchmark.

    If you test two things you’re going to get different results for different filesystems on different OSes.

    If you just test fragmentation you would get more similar results.

    Fragmentation with MyISAM or InnoDB can happen with just two tables, each taking INSERT load in round robin fashion.

    Kevin

  10. peter  says:

    Kevin,

    Of course in pure science you could spend a lot of time on this and test different things… Though I mainly tested what had practical interest for me at that point.

    Regarding fragmentation – you can’t really assume 2 tables are enough without knowing the internal implementation of the OS and tables. First, it is quite possible to design a filesystem which is able to handle a small number of growing files well but not a large number of files.

    But what is even more important is not all fragmentations are same.

    Consider the worst case of 2 files with blocks laid out as 12121212121212 – the drive/raid/os or the storage engine itself will do read-ahead, which fetches a few blocks with a single read. A 1MB read will contain only 512K of data for the given table, but that is still much better than getting a single row with a random read 

  11. Kevin Burton  says:

    Peter.

    I agree with your worst case scenario. This is what I tried to point out in my previous comment. Though maybe I didn’t do a good job expressing myself 

    The point I was trying to make is that in that situation there’s not much the filesystem CAN do.

    It could TRY to pre-allocate both files in larger chunks but then you’d have angular velocity kick in on the HDDs.

    InnoDB’s grow factor (which by default is 8M I believe) is a good balance.

  12. peter  says:

    Kevin,

    There ARE things the filesystem can do. For example you could allocate space in extents, sized relative to the file size and filesystem size. For example if you have a 50MB file, allocating in 1MB blocks would still cause no more than 2% space waste. For a 10K image of course you do not want that.

    Another optimization which can be done is called delayed allocation. When you perform writes you can actually allocate space only when you flush data to disk; this way you can accumulate larger fragments.

    It is not the Innodb single tablespace allocation that is important here, as it is a single growing file anyway, but how Innodb allocates data internally – which is done in 1MB extents after the first few pages.

    I’m not sure what the default grow increment is for innodb_file_per_table tablespaces – this one is important as many files grow at the same time.

  13. Apachez  says:

    How was the partition created and which flags were used for mounting it?

    Things like dir_index etc but also things like noatime.

    Could a new test be performed on the same box using, for example, noatime on the mount to see how (if at all) it changes things?

  14. peter  says:

    Apachez,

    Absolutely. I specifically published the benchmark script so everyone can easily repeat the run with the options they like 

    There are a lot of variables you can play with 

  15. paul  says:

    Hi,
    you should retry this benchmark with XFS (noatime) and ReiserFS (noatime, notail).
    My finding when I tried it with our workload was that XFS gets really slow as soon as you have very many files, while ReiserFS works great for such a workload.
    Well, I didn’t check MySQL performance at all, just created folders with empty files in them to see how much the filesystem affected performance with a lot of folders and a fixed file count in them.

    It would be nice to see if the difference shows up as clearly with actual data in the files 

  16. peter  says:

    Thanks Paul,

    I may run it on XFS when I have a chance. ReiserFS’s future is kind of uncertain now, and as it has been dropped as the default filesystem by SuSE and is not supported by RedHat, I do not see customers eager to use it in production.

    Also note that handling many files and handling many growing files are different things. ReiserFS indeed works well with small files – I remember creating 10,000,000 100-byte files in a directory and it still worked fine.

  17. paul  says:

    We use it in production, as XFS slowed down on us and MySQL became just slow because we had a lot of users on it. The speed difference was like the difference between O(e^n) and O(n), but as we have a quite uncommon workload it is somewhat our own problem. We have a lot of databases which are quite tiny.
    You might still be interested in testing it, as my experience shows that a query on an idle server with xfs and 100K databases takes way longer than a query on the same server with reiserfs.

  18. Nate  says:

    I am a newbie to MySQL and am not getting the high insert throughput shown in this benchmark. Could you post the my.cnf file and the computer hardware specs that produced this benchmark? I am interested in the general, innodb, and myisam settings and the installation configuration of the machine. I am interested in how many processors and the processor type, processor cache, RAM size, front side bus speed, hard drive RPMs, hard drive max write speed, and operating system. Also, are you doing multiple inserts per transaction?

    here is data from my tests:
    computer 1:

    My.ini
    [client]
    port=3306
    [mysql]
    default-character-set=latin1
    [mysqld]
    port=3306
    basedir=”C:/Program Files/MySQL/MySQL Server 5.0/”
    datadir=”C:/Program Files/MySQL/MySQL Server 5.0/Data/”
    default-character-set=latin1
    default-storage-engine=INNODB
    #default-storage-engine=MyISAM
    sql-mode=”STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION”
    max_connections=100
    query_cache_size=0
    table_cache=256
    tmp_table_size=93M
    thread_cache_size=8
    #*** MyISAM Specific options
    myisam_max_sort_file_size=100G
    myisam_max_extra_sort_file_size=100G
    myisam_sort_buffer_size=185M
    key_buffer_size=157M
    read_buffer_size=64K
    read_rnd_buffer_size=256K
    sort_buffer_size=256K
    #*** INNODB Specific options ***
    innodb_additional_mem_pool_size=20M
    innodb_flush_log_at_trx_commit=0
    innodb_log_buffer_size=4M
    innodb_buffer_pool_size=304M
    innodb_log_file_size=152M
    innodb_thread_concurrency=8

    RAM: 1GB
    Processor: Intel P4 2.253GHz
    Cache size: 512KB
    Front side bus: 530 MHz
    Harddrive: WDC WD400BB-75DEA0
    Max Burn Rate: 100MB/s tested 8MB/s
    RPMs: 7200

    using multiple inserts per transaction for 1 innodb table (25 inserts, or .2 seconds of data to be inserted) I got 1121, 1133, and 1166 inserts per second into that one table.
    using autocommit for innodb inserts I got 1125 and 1158 inserts per second into the table
    using myisam I got 1186 and 1177 inserts per second into the table
    the average data length was 231B. It seemed the writing of the insert is what was taking all the time. Data was coming in much faster than it was able to be inserted. The data queue was getting quite long. I ran the test for 5 minutes. The inserts per second were pretty constant; the little variation arose from how fast the data was coming in, which was no more than 3000 packets of data per second. The processor would go up to about 80% during tests.

    please give me feedback. I am trying to get the data to be inserted faster than it is coming in. Please post the my.cnf file with which you were able to get 9000 inserts per second, and the computer specs for that test.

    Thanks.

  19. Shai  says:

    From the sample code, your read statistic is bogus.
    You do many more SELECTs on many tables than on one table so, of course, you will get a slower time, but in reality, if you did your calculation correctly, you would get a much faster read when you have more tables with less data in each table. Here is the problem:

    instead of
    $number_of_records/(microtime(1)-$t_temp);
    do this:
    $number_of_tables/(microtime(1)-$t_temp);

    and if you do want to keep it based on number_of_records, try this
    ($number_of_records*$number_of_tables)/(microtime(1)-$t_temp);

  20. Gregor  says:

    This script is really dangerous!

    Because there is a “bug” in MySQL InnoDB which doesn’t shrink the ibdata1 file after a DROP DATABASE without dropping the tables first – if you have the default settings and innodb_file_per_table is not enabled.

    For more Information see here:
    http://bugs.mysql.com/bug.php?id=15748

    http://bugs.mysql.com/bug.php?id=1287
    http://bugs.mysql.com/bug.php?id=1341
    http://bugs.mysql.com/bug.php?id=36943

    Here is a solution, but a complete reimport is not that much fun for a production environment….
    http://crazytoon.com/2007/04/03/mysql-ibdata-files-do-not-shrink-on-database-deletion-innodb/

    best regards gregor

  21. peter  says:

    Gregor,

    There is no bug here. Things work as designed – ibdata1 can’t be shrunk either by deleting tables or deleting databases.
    You may not like such a design but that does not make it a bug.

    innodb_file_per_table was created exactly as a workaround, and I doubt there will ever be an ibdata1 shrinker implemented.

    Regarding the script – why is it dangerous? Have you run it on a production server and run out of space? That is dangerous… but not the script.

  22. Gregor  says:

    Peter,

    no, I didn’t run it on production. But there is no note about this scary design – that is just the point I wanted to mention.
    And it wouldn’t happen if you deleted each table with its own DROP TABLE – it only happens when you drop the database with the data inside.

    – it’s always good to know…

    best regards

  23. peter  says:

    Gregor,

    What are you speaking about? Whether you DROP TABLE or DROP DATABASE, ibdata1 never shrinks.

  24. Gregor  says:

    Peter,

    sorry my fault – damn.
    I wrote the drop database by myself…*arg*

  25. KevG  says:

    Sorry if this is a bit late, but is there a newer version of this script? I seem to be getting negative numbers for the values.

    Trial 1:

    tables: 1; total records: 10000000; write rows per sec: 19153382.199996 , reads rows per sec: -
    13280177.210685sec.
    Content-type: text/html
    X-Powered-By: PHP/4.3.9

    tables: 10; total records: 10000000; write rows per sec: 53662175.142607 , reads rows per sec:
    -10424734.977175sec.
    Content-type: text/html
    X-Powered-By: PHP/4.3.9

    tables: 100; total records: 10000000; write rows per sec: 31758735.240128 , reads rows per sec:
    28556416.055559sec.
    Content-type: text/html
    X-Powered-By: PHP/4.3.9

    tables: 1000; total records: 10000000; write rows per sec: 58727492.688427 , reads rows per sec
    : -37967090.126279sec.
    Content-type: text/html
    X-Powered-By: PHP/4.3.9

    tables: 10000; total records: 10000000; write rows per sec: -18732414.94547 , reads rows per sec: 37017842.600133sec.

    Thanks

  26. peter  says:

    There is probably some kind of bug somewhere. You should not get negative numbers, of course.

Working with many files and file system fragmentation

Working on a performance optimization project (not directly MySQL related) we did a test – creating 100 files, writing 4K to a random file for some time, and then checking the read speed of the files we ended up with, compared to just writing a file sequentially and reading it back.

The performance difference was huge – we could read a single file at 80MB/sec while the fragmented files only delivered about 2MB/sec – a massive difference.

The test was done on EXT3 and it looks like it does not do a very good job of preventing file fragmentation for a large number of growing files.

It would be interesting to see how other filesystems deal with this problem; for example XFS, with its delayed allocation, may be doing a better job.

I would also like to repeat the test with MySQL MyISAM tables and see how bad the difference would be for MySQL, but I would expect something along those lines.

Interestingly enough, it should not be that hard to fix this issue – one could optionally preallocate MyISAM tables in chunks (say 1MB) so there is less fragmentation. Though it would be interesting to benchmark how much such an approach would generally help.
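
For illustration, here is a minimal sketch of the chunked preallocation idea at the file level. This is not an existing MyISAM option – the file name, chunk size and helper function are made up. Note that simply ftruncate()-ing a file larger would only create a sparse hole, so the sketch writes real zero blocks to force block allocation:

<?php
// Sketch: grow a data file in fixed 1MB steps instead of a few hundred bytes
// per write, so the filesystem can hand out larger contiguous extents.

define('CHUNK', 1048576); // 1MB preallocation increment (assumed value)

function append_with_prealloc($fh, $data, &$used, &$allocated)
{
    if ($used + strlen($data) > $allocated) {
        // Preallocate one more chunk of real (zero-filled) blocks at the end.
        fseek($fh, $allocated);
        fwrite($fh, str_repeat("\0", CHUNK));
        $allocated += CHUNK;
    }
    // Write the payload into the already-allocated region.
    fseek($fh, $used);
    fwrite($fh, $data);
    $used += strlen($data);
}

$fh = fopen('/tmp/prealloc_demo.dat', 'w+'); // hypothetical demo file
$used = 0;
$allocated = 0;
for ($i = 0; $i < 100000; $i++) {
    append_with_prealloc($fh, str_repeat('x', 200), $used, $allocated); // ~200 byte "rows"
}
fclose($fh);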

Until we have this feature, reduced fragmentation is one more benefit we get from batching. For example, instead of inserting rows one by one into a large number of tables, rows can be buffered in memory (in the application or a MyISAM memory table) and flushed to the actual tables in bulk.
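
A minimal sketch of the application-side flavor of this batching, assuming hypothetical tables t0..t99 with a single val column and a made-up flush threshold:

<?php
// Buffer rows per table in memory and flush them as multi-row INSERTs,
// so each table grows in larger steps instead of many tiny appends.

$db = mysqli_connect('localhost', 'root', '', 'test1');
$buffer = array();          // table name => list of "(value)" tuples
$flush_threshold = 1000;    // rows buffered per table before flushing (assumption)

function buffered_insert(&$buffer, $db, $table, $value, $threshold)
{
    $buffer[$table][] = "('" . mysqli_real_escape_string($db, $value) . "')";
    if (count($buffer[$table]) >= $threshold) {
        mysqli_query($db, "INSERT INTO $table (val) VALUES " . implode(',', $buffer[$table]));
        $buffer[$table] = array();
    }
}

// Usage: rows destined for many tables accumulate in memory first.
for ($i = 0; $i < 100000; $i++) {
    buffered_insert($buffer, $db, 't' . mt_rand(0, 99), md5($i), $flush_threshold);
}

// Flush whatever is left at the end.
foreach ($buffer as $table => $rows) {
    if ($rows) {
        mysqli_query($db, "INSERT INTO $table (val) VALUES " . implode(',', $rows));
    }
}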

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Artem Russakovskii  says:

    I’ve been working with xfs a lot on the storage servers and fragmentation has been a huge problem with it too, even though xfs prides itself on not needing defragmentation (at least not as much as some other filesystems). Until fragmentation was sorted out, the load on the box would spike to 100+ during an otherwise reasonable load. I’ve used xfs_fsr to defrag and ended up cronning it to run for 30 seconds every 15 minutes. I suspect the issue may be related to nfs3 writing to it in relatively small chunks (as it’s a stateless protocol) and not to xfs itself.

    As far as the benchmarks above, I’m curious to see a similar one with reiser, if you find time, and with ext4 when it comes out.

  2. Bill  says:

    It would also be interesting to repeat the test with the files on those new solid state drives, to see how much of an improvement you could get.

  3. peter  says:

    Oh yes. SSDs reduce the seek penalty dramatically, so fragmentation should be much less of an issue for them.

    This is a multi-terabyte data storage project so it is not for SSDs yet.

  4. brian  says:

    Are you sure the speed difference is caused by fragmentation of the filesystem on the disk, and not due to the single large file being in the buffer cache? It might be interesting to use the following to see what’s in the buffer cache:

    http://net.doit.wisc.edu/~plonka/fincore/

    http://insights.oetiker.ch/linux/fadvise.html

  5. tgabi  says:

    I noticed this a long time ago. I’m fighting it using different measures: a. small tables are optimized periodically b. big tables that are affected by fragmentation can be arranged as 1 file per filesystem c. recently MySQL 5.1 table partitioning allowed using SSD for the most recent data (most used) and disk for historical data (less used). Index files are the most prone to fragmentation – since 1 record inserted/updated can trigger updates on several indices. No filesystem is immune to fragmentation unfortunately; the only way to reduce it is to have some table options that will pre-allocate space in large chunks (like 1GB data, 1GB index).

  6. peter  says:

    Thanks tgabi,

    If you have updates you may be suffering from internal fragmentation in addition to file system fragmentation – for example rows may get split into more than one piece in MySQL, sequential pages in index order may happen to be in different locations, etc.

    Good that you agree on the preallocation option, though I think even much smaller values such as 1MB or 4MB would make things much better.

  7. peter  says:

    Brian,

    Thanks for the heads up, I’ve written another blog post on those tools.

    I’m quite sure that was not caching, because the numbers from VMSTAT match these quite closely.

  8. Markus  says:

    Be aware that if you use XFS/ReiserFS on PC hardware, you risk loss of data in case of power failure:

    http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc

  9. Daniel Schneller  says:

    Some time ago (2006) I noticed that even on a fully defragmented NTFS drive, newly created InnoDB data files got scattered around the whole partition.
    http://jroller.com/dschneller/entry/strange_innodb_fragmentation

  10. peter  says:

    Daniel,

    I’m not quite sure from your post – did you create the tablespace out of many files so it was already badly fragmented, or was it autoextending during some workload run?

    In any case, we can learn that filesystems may not be as optimal as we would like them to be 
