14.10 InnoDB Disk I/O and File Space Management

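MySQL 5.6 official documentation
The notes that follow each English paragraph are my own understanding from reading; where they are ambiguous, the English text takes precedence.
Link to the Word version: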
http://note.youdao.com/share/?id=091e903e9309a75b02983864f84df0ca&type=note#/


14.10 InnoDB Disk I/O and File Space Management

14.10.1 InnoDB Disk I/O

14.10.2 File Space Management

14.10.3 InnoDB Checkpoints

14.10.4 Defragmenting a Table

14.10.5 Reclaiming Disk Space with TRUNCATE TABLE

As a DBA, you must manage disk I/O to keep the I/O subsystem from becoming saturated, and manage disk space to avoid filling up storage devices. The ACID design model requires a certain amount of I/O that might seem redundant, but helps to ensure data reliability. Within these constraints, InnoDB tries to optimize the database work and the organization of disk files to minimize the amount of disk I/O. Sometimes, I/O is postponed until the database is not busy, or until everything needs to be brought to a consistent state, such as during a database restart after a fast shutdown.

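As a DBA, you have to manage disk I/O so that the I/O subsystem does not saturate, and manage disk space so that storage does not fill up. The ACID design model demands some I/O that looks redundant, but that I/O is what guarantees data reliability. Within those constraints, InnoDB tries to optimize database work and organize disk files so as to minimize disk I/O. Some I/O is deferred until the database is idle, or until everything must be brought to a consistent state (for example, on restart after a fast shutdown).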

This section discusses the main considerations for I/O and disk space with the default kind of MySQL tables (also known as InnoDB tables):

This section covers the main I/O and disk space considerations for InnoDB tables:

  • Controlling the amount of background I/O used to improve query performance.

  • Controlling the amount of background I/O in order to improve query performance.

  • Enabling or disabling features that provide extra durability at the expense of additional I/O.

  • Trading additional I/O for extra durability, by enabling or disabling the relevant features.

  • Organizing tables into many small files, a few larger files, or a combination of both.

  • Spreading tables across many small files, consolidating them into a few larger files, or mixing both approaches.

  • Balancing the size of redo log files against the I/O activity that occurs when the log files become full.

  • Weighing the size of the redo log files against the I/O activity triggered when the log files fill up.

  • How to reorganize a table for optimal query performance.

  • How to reorganize a table to optimize query performance.

14.10.1 InnoDB Disk I/O

InnoDB uses asynchronous disk I/O where possible, by creating a number of threads to handle I/O operations, while permitting other database operations to proceed while the I/O is still in progress. On Linux and Windows platforms, InnoDB uses the available OS and library functions to perform “native” asynchronous I/O. On other platforms, InnoDB still uses I/O threads, but the threads may actually wait for I/O requests to complete; this technique is known as “simulated” asynchronous I/O.

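InnoDB creates a number of threads to handle disk I/O asynchronously, so other database operations can continue while I/O is in flight. On Linux and Windows, InnoDB uses the available operating system and library functions to perform “native” asynchronous I/O. On other platforms InnoDB still uses I/O threads, but those threads may actually wait for each I/O request to finish; this is called “simulated” asynchronous I/O.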

Read-Ahead

If InnoDB can determine there is a high probability that data might be needed soon, it performs read-ahead operations to bring that data into the buffer pool so that it is available in memory. Making a few large read requests for contiguous data can be more efficient than making several small, spread-out requests. There are two read-ahead heuristics in InnoDB:

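If InnoDB determines that some data is very likely to be needed soon, it performs a read-ahead to bring that data into the buffer pool in memory. A few large reads of contiguous data are more efficient than many small, scattered reads. InnoDB has two read-ahead heuristics: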

  • In sequential read-ahead, if InnoDB notices that the access pattern to a segment in the tablespace is sequential, it posts in advance a batch of reads of database pages to the I/O system.

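  • Sequential read-ahead: if InnoDB notices that a segment in the tablespace is being accessed sequentially, it submits a batch of database page reads to the I/O system in advance.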

  • In random read-ahead, if InnoDB notices that some area in a tablespace seems to be in the process of being fully read into the buffer pool, it posts the remaining reads to the I/O system.

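  • Random read-ahead: if InnoDB notices that some area of a tablespace appears to be in the process of being read completely into the buffer pool, it submits the remaining reads for that area to the I/O system.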
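
The sequential heuristic above can be pictured with a small Python sketch. This is a toy model, not InnoDB's actual algorithm: the function name and the way sequential accesses are counted are assumptions made for illustration. The default threshold of 56 mirrors the default of the innodb_read_ahead_threshold system variable, which this section does not discuss.

```python
PAGES_PER_EXTENT = 64  # 1MB extent / 16KB pages


def should_prefetch_next_extent(accessed_pages, threshold=56):
    """Toy check (hypothetical helper): count the accesses that directly
    follow the previous page number and compare the count to a threshold."""
    if not accessed_pages:
        return False
    sequential = 1  # the first access starts the count
    for prev, cur in zip(accessed_pages, accessed_pages[1:]):
        if cur == prev + 1:
            sequential += 1
    return sequential >= threshold


full_scan = list(range(PAGES_PER_EXTENT))
print(should_prefetch_next_extent(full_scan))     # True
print(should_prefetch_next_extent([5, 3, 9, 1]))  # False
```

A scattered access pattern never reaches the threshold, so in this model no batch read is posted for it.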

Doublewrite Buffer

InnoDB uses a novel file flush technique involving a structure called the doublewrite buffer, which is enabled by default (innodb_doublewrite=ON). It adds safety to recovery following a crash or power outage, and improves performance on most varieties of Unix by reducing the need for fsync() operations.

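InnoDB flushes files using a technique built around a structure called the doublewrite buffer, which is enabled by default (innodb_doublewrite=ON). It makes recovery after a crash or power failure safer, and it improves performance on most Unix variants by reducing the number of fsync() calls.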

Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer has completed does InnoDB write the pages to their proper positions in the data file. If there is an operating system, storage subsystem, or mysqld process crash in the middle of a page write (causing a torn page condition), InnoDB can later find a good copy of the page from the doublewrite buffer during recovery.

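Before writing pages to a data file, InnoDB first writes them to a contiguous area of the tablespace called the doublewrite buffer. Only after the write and flush to the doublewrite buffer have completed does InnoDB write the pages to their proper positions in the data file. If the operating system, the storage subsystem, or the mysqld process crashes in the middle of a page write (leaving a torn page), InnoDB can find an intact copy of the page in the doublewrite buffer during recovery.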
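
To make the write ordering concrete, here is a minimal Python simulation of the idea. It is only a sketch under stated assumptions: ToyStore, its methods, and the use of CRC32 as the page checksum are all hypothetical, and the checksum is kept next to the page rather than inside it as a real page format would.

```python
import zlib


class ToyStore:
    """Hypothetical in-memory model of a data file plus a doublewrite area."""

    def __init__(self):
        self.doublewrite = {}  # page_no -> (data, checksum)
        self.datafile = {}     # page_no -> (data, checksum)

    def write_page(self, page_no, data, crash_mid_write=False):
        good_sum = zlib.crc32(data)
        # Step 1: the full page goes to the doublewrite area first.
        self.doublewrite[page_no] = (data, good_sum)
        # Step 2: only then is the page written to its final position.
        if crash_mid_write:
            # Simulate a torn page: only the first half reaches the file.
            half = len(data) // 2
            data = data[:half] + b"\x00" * (len(data) - half)
        self.datafile[page_no] = (data, good_sum)

    def recover(self):
        """Replace every page whose checksum does not match its contents
        with the intact copy from the doublewrite area."""
        repaired = []
        for page_no, (data, stored_sum) in list(self.datafile.items()):
            if zlib.crc32(data) != stored_sum:
                self.datafile[page_no] = self.doublewrite[page_no]
                repaired.append(page_no)
        return repaired


store = ToyStore()
store.write_page(1, b"AAAA")
store.write_page(2, b"BBBB", crash_mid_write=True)
print(store.recover())  # [2]
```

The point of the ordering is that at any moment at least one of the two copies of a page is complete, so a torn write in either place can be detected and repaired.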

14.10.2 File Space Management

The data files that you define in the configuration file form the InnoDB system tablespace. The files are logically concatenated to form the tablespace. There is no striping in use. You cannot define where within the tablespace your tables are allocated. In a newly created tablespace, InnoDB allocates space starting from the first data file.

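The data files defined in the configuration file make up the InnoDB system tablespace. The files are logically concatenated; no striping is used. You cannot control where inside the tablespace your tables are placed. In a newly created tablespace, InnoDB allocates space starting from the first data file.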

To avoid the issues that come with storing all tables and indexes inside the system tablespace, you can turn on the innodb_file_per_table configuration option, which stores each newly created table in a separate tablespace file (with extension .ibd). For tables stored this way, there is less fragmentation within the disk file, and when the table is truncated, the space is returned to the operating system rather than still being reserved by InnoDB within the system tablespace.

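To avoid the problems of keeping every table and index in the system tablespace, you can enable the innodb_file_per_table option, so that each newly created table is stored in its own tablespace file (an .ibd file). Tables stored this way suffer less fragmentation within the disk file, and when such a table is truncated the space goes back to the operating system instead of staying reserved inside the system tablespace.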

Pages, Extents, Segments, and Tablespaces

Each tablespace consists of database pages. Every tablespace in a MySQL instance has the same page size. By default, all tablespaces have a page size of 16KB; you can reduce the page size to 8KB or 4KB by specifying the innodb_page_size option when you create the MySQL instance.

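Every tablespace is made up of database pages, and every tablespace in a MySQL instance has the same page size. The default page size is 16KB; you can reduce it to 8KB or 4KB by setting innodb_page_size when the MySQL instance is created.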

The pages are grouped into extents of size 1MB (64 consecutive 16KB pages, or 128 8KB pages, or 256 4KB pages). The “files” inside a tablespace are called segments in InnoDB. (These segments are different from the rollback segment, which actually contains many tablespace segments.)

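Pages are grouped into extents of 1MB (64 consecutive 16KB pages, 128 8KB pages, or 256 4KB pages). The “files” inside a tablespace are called segments in InnoDB. (These segments are not the same thing as the rollback segment, which actually contains many tablespace segments.)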
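
The page counts per extent follow directly from dividing the 1MB extent size by the page size. A short Python check of that arithmetic (the function name is only illustrative):

```python
EXTENT_BYTES = 1024 * 1024  # extents are 1MB


def pages_per_extent(page_size_kb):
    """Number of pages that fit in one 1MB extent for a given page size."""
    return EXTENT_BYTES // (page_size_kb * 1024)


for size_kb in (16, 8, 4):
    print(size_kb, pages_per_extent(size_kb))
# 16 64
# 8 128
# 4 256
```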

When a segment grows inside the tablespace, InnoDB allocates the first 32 pages to it one at a time. After that, InnoDB starts to allocate whole extents to the segment. InnoDB can add up to 4 extents at a time to a large segment to ensure good sequentiality of data.

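When a segment grows inside the tablespace, InnoDB allocates its first 32 pages one page at a time. After that, InnoDB allocates whole extents to the segment. For a large segment, InnoDB can add up to 4 extents at a time to keep the data sequential.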
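
The allocation rule can be sketched as a small calculation: the first 32 pages are counted individually, and everything beyond that is rounded up to whole extents. This is a simplified model assuming 16KB pages (64 pages per extent); the function name is hypothetical, and the sketch ignores the detail that up to 4 extents may be added at once.

```python
import math


def allocate(pages_needed, pages_per_extent=64, single_page_limit=32):
    """Return (individually allocated pages, whole extents) for a segment
    that needs `pages_needed` pages in total."""
    singles = min(pages_needed, single_page_limit)
    remaining = pages_needed - singles
    extents = math.ceil(remaining / pages_per_extent)
    return singles, extents


print(allocate(10))   # (10, 0)
print(allocate(33))   # (32, 1)
print(allocate(160))  # (32, 2)
```

Note how a segment that needs just one page more than 32 already receives a full extent in this model.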

Two segments are allocated for each index in InnoDB. One is for nonleaf nodes of the B-tree, the other is for the leaf nodes. Keeping the leaf nodes contiguous on disk enables better sequential I/O operations, because these leaf nodes contain the actual table data.

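InnoDB allocates two segments for each index: one for the nonleaf nodes of the B-tree and one for the leaf nodes. Keeping the leaf nodes contiguous on disk makes sequential I/O more efficient, because the leaf nodes hold the actual table data.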

Some pages in the tablespace contain bitmaps of other pages, and therefore a few extents in an InnoDB tablespace cannot be allocated to segments as a whole, but only as individual pages.

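Some pages in the tablespace hold bitmaps describing other pages, so a few extents in an InnoDB tablespace cannot be allocated to a segment as a whole; they can only be handed out as individual pages.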

When you ask for available free space in the tablespace by issuing a SHOW TABLE STATUS statement, InnoDB reports the extents that are definitely free in the tablespace. InnoDB always reserves some extents for cleanup and other internal purposes; these reserved extents are not included in the free space.

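When you run SHOW TABLE STATUS to check the free space available in the tablespace, InnoDB reports the extents that are definitely free. InnoDB also always reserves some extents for cleanup and other internal purposes; those reserved extents are not counted as free space.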

When you delete data from a table, InnoDB contracts the corresponding B-tree indexes. Whether the freed space becomes available for other users depends on whether the pattern of deletes frees individual pages or extents to the tablespace. Dropping a table or deleting all rows from it is guaranteed to release the space to other users, but remember that deleted rows are physically removed only by the purge operation, which happens automatically some time after they are no longer needed for transaction rollbacks or consistent reads. (See Section 14.2.2, “InnoDB Multi-Versioning”.)

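When you delete data from a table, InnoDB contracts the corresponding B-tree indexes. Whether the freed space becomes available to other users depends on whether the pattern of deletes frees individual pages or whole extents to the tablespace. Dropping a table or deleting all of its rows is guaranteed to release the space to other users, but deleted rows are physically removed only by the purge operation, which runs automatically some time after the rows are no longer needed for transaction rollback or consistent reads. (See Section 14.2.2, “InnoDB Multi-Versioning”.)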

To see information about the tablespace, use the Tablespace Monitor. See Section 14.15, “InnoDB Monitors”.

For more information about the tablespace, use the Tablespace Monitor described in Section 14.15, “InnoDB Monitors”.

How Pages Relate to Table Rows

The maximum row length, except for variable-length columns (VARBINARY, VARCHAR, BLOB and TEXT), is slightly less than half of a database page. For example, the maximum row length is about 8000 bytes for the default 16KB innodb_page_size setting. LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXT columns, must be less than 4GB.

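Excluding variable-length columns (VARBINARY, VARCHAR, BLOB and TEXT), the maximum row length is slightly less than half a database page. For example, with the default 16KB innodb_page_size the maximum row length is about 8000 bytes. LONGBLOB and LONGTEXT columns must be smaller than 4GB, and the total row length, including BLOB and TEXT columns, must also be smaller than 4GB.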

If a row is less than half a page long, all of it is stored locally within the page. If it exceeds half a page, variable-length columns are chosen for external off-page storage until the row fits within half a page. For a column chosen for off-page storage, InnoDB stores the first 768 bytes locally in the row, and the rest externally into overflow pages. Each such column has its own list of overflow pages. The 768-byte prefix is accompanied by a 20-byte value that stores the true length of the column and points into the overflow list where the rest of the value is stored.

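If a row is shorter than half a page, all of it is stored within the page. If it is longer than half a page, variable-length columns are moved to external off-page storage, one after another, until the row fits within half a page. For each column moved off-page, InnoDB keeps the first 768 bytes of the column in the row and stores the remainder in overflow pages. Each such column has its own list of overflow pages. The 768-byte prefix is followed by a 20-byte value that records the true length of the column and points to the overflow list holding the rest of the value.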
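
The off-page rule lends itself to a worked example. The Python sketch below is a rough model only: it uses the approximate 8000-byte limit quoted above for 16KB pages, and it assumes that the longest variable-length columns are moved first, which this section does not state. The function and constant names are hypothetical.

```python
LOCAL_PREFIX = 768  # bytes of an off-page column kept in the row
POINTER = 20        # bytes for the true length and overflow pointer


def place_row(fixed_bytes, var_cols, limit=8000):
    """Return (in-page row length, indexes of columns moved off-page).

    `var_cols` lists the byte lengths of the variable-length columns.
    Assumption: the longest columns are chosen for off-page storage first.
    """
    local = list(var_cols)  # bytes each column occupies inside the page
    off_page = []
    by_length = sorted(range(len(var_cols)),
                       key=lambda i: var_cols[i], reverse=True)
    for idx in by_length:
        if fixed_bytes + sum(local) <= limit:
            break  # the row already fits
        if var_cols[idx] <= LOCAL_PREFIX + POINTER:
            break  # moving this column would not shrink the row
        local[idx] = LOCAL_PREFIX + POINTER
        off_page.append(idx)
    return fixed_bytes + sum(local), off_page


print(place_row(100, [500, 10000, 3000]))  # (4388, [1])
print(place_row(100, [200, 300]))          # (600, [])
```

In the first call only the 10000-byte column has to move: its in-page footprint drops to 768 + 20 = 788 bytes, and 100 + 500 + 788 + 3000 = 4388 bytes fits under the limit.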

14.10.3 InnoDB Checkpoints

Making your log files very large may reduce disk I/O during checkpointing. It often makes sense to set the total size of the log files as large as the buffer pool or even larger. Although in the past large log files could make crash recovery take excessive time, starting with MySQL 5.5, performance enhancements to crash recovery make it possible to use large log files with fast startup after a crash. (Strictly speaking, this performance improvement is available for MySQL 5.1 with the InnoDB Plugin 1.0.7 and higher. It is with MySQL 5.5 that this improvement is available in the default InnoDB storage engine.)

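Larger log files can reduce disk I/O during checkpointing, so it often makes sense to set the total size of the log files to the size of the buffer pool or even larger. Large log files used to make crash recovery slow, but starting with MySQL 5.5 crash recovery is fast enough that the server starts quickly after a crash even with large log files. (Strictly speaking, the improvement first appeared in MySQL 5.1 with InnoDB Plugin 1.0.7 and higher; MySQL 5.5 is the first release in which it is part of the default InnoDB storage engine.)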

How Checkpoint Processing Works

InnoDB implements a checkpoint mechanism known as fuzzy checkpointing. InnoDB flushes modified database pages from the buffer pool in small batches. There is no need to flush the buffer pool in one single batch, which would disrupt processing of user SQL statements during the checkpointing process.

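InnoDB's checkpoint mechanism is known as fuzzy checkpointing. InnoDB flushes modified database pages from the buffer pool in small batches. It does not need to flush the whole buffer pool in a single batch, which would disrupt the processing of user SQL statements while the checkpoint runs.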

During crash recovery, InnoDB looks for a checkpoint label written to the log files. It knows that all modifications to the database before the label are present in the disk image of the database. Then InnoDB scans the log files forward from the checkpoint, applying the logged modifications to the database.

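During crash recovery, InnoDB looks for the checkpoint label written to the log files. Every modification made before that label is known to be present in the disk image of the database. InnoDB then scans the log files forward from the checkpoint (that is, toward newer records) and applies the logged modifications to the database.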
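
The recovery scan can be illustrated with a toy redo log in Python. This is a deliberately simplified sketch: the record layout (sequence number, page number, new value) and the function name are assumptions for illustration, and the real redo log format and apply logic are far more involved.

```python
def recover(disk_pages, redo_log, checkpoint_lsn):
    """Replay every log record newer than the checkpoint onto the disk image.

    `redo_log` is a list of (lsn, page_no, value) tuples in ascending lsn
    order (a hypothetical format). Records at or before `checkpoint_lsn`
    are skipped because the disk image already contains their changes.
    """
    pages = dict(disk_pages)
    for lsn, page_no, value in redo_log:
        if lsn > checkpoint_lsn:
            pages[page_no] = value
    return pages


disk = {1: "a1", 2: "b0"}  # the change with lsn 10 is already on disk
log = [(10, 1, "a1"), (20, 2, "b1"), (30, 1, "a2")]
print(recover(disk, log, checkpoint_lsn=10))  # {1: 'a2', 2: 'b1'}
```

A more recent checkpoint means fewer records to replay, which is why the position of the checkpoint determines how much work recovery has to do.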

14.10.4 Defragmenting a Table

Random insertions into or deletions from a secondary index can cause the index to become fragmented. Fragmentation means that the physical ordering of the index pages on the disk is not close to the index ordering of the records on the pages, or that there are many unused pages in the 64-page blocks that were allocated to the index.

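Random inserts into and deletes from a secondary index can fragment the index. Fragmentation means that the physical order of the index pages on disk drifts away from the index order of the records on those pages, or that the 64-page blocks allocated to the index contain many unused pages (which wastes space and bloats the index).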

One symptom of fragmentation is that a table takes more space than it “should” take. How much that is exactly, is difficult to determine. All InnoDB data and indexes are stored in B-trees, and their fill factor may vary from 50% to 100%. Another symptom of fragmentation is that a table scan such as this takes more time than it “should” take:

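One symptom of fragmentation is that a table takes more space than it normally would. Exactly how much more is hard to determine. InnoDB data and indexes are all stored in B-trees, whose fill factor can range from 50% to 100%. Another symptom is that a table scan such as the following takes longer than it should: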

SELECT COUNT(*) FROM t WHERE non_indexed_column <> 12345;

The preceding query requires MySQL to perform a full table scan, the slowest type of query for a large table.

The query above performs a full table scan, which is the slowest kind of query on a large table.

To speed up index scans, you can periodically perform a “null” ALTER TABLE operation, which causes MySQL to rebuild the table:

To speed up index scans, you can periodically run a “null” ALTER TABLE operation, which makes MySQL rebuild the table (and greatly reduces fragmentation):

ALTER TABLE tbl_name ENGINE=INNODB;

As of MySQL 5.6.3, you can also use ALTER TABLE tbl_name FORCE to perform a “null” alter operation that rebuilds the table. Previously the FORCE option was recognized but ignored.

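As of MySQL 5.6.3, you can also use ALTER TABLE tbl_name FORCE to perform a “null” ALTER TABLE operation that rebuilds the table. Before 5.6.3, the FORCE option was recognized but ignored.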

As of MySQL 5.6.17, both ALTER TABLE tbl_name ENGINE=INNODB and ALTER TABLE tbl_name FORCE use online DDL (ALGORITHM=INPLACE). For more information, see Section 14.11.1, “Overview of Online DDL”.

As of MySQL 5.6.17, both ALTER TABLE tbl_name ENGINE=INNODB and ALTER TABLE tbl_name FORCE are performed as online DDL (ALGORITHM=INPLACE). For details, see Section 14.11.1, “Overview of Online DDL”.

Another way to perform a defragmentation operation is to use mysqldump to dump the table to a text file, drop the table, and reload it from the dump file.

Another way to defragment is to use mysqldump to dump the table to a text file, drop the table, and reload it from the dump file.

If the insertions into an index are always ascending and records are deleted only from the end, the InnoDB filespace management algorithm guarantees that fragmentation in the index does not occur.

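If inserts into an index are always in ascending order and records are deleted only from the end, the InnoDB file space management algorithm guarantees that the index does not become fragmented.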

14.10.5 Reclaiming Disk Space with TRUNCATE TABLE

To reclaim operating system disk space when truncating an InnoDB table, the table must be stored in its own .ibd file. For a table to be stored in its own .ibd file, innodb_file_per_table must be enabled when the table is created. Additionally, there cannot be a foreign key constraint between the table being truncated and other tables, otherwise the TRUNCATE TABLE operation fails. A foreign key constraint between two columns in the same table, however, is permitted.

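For the operating system to get disk space back when an InnoDB table is truncated, the table must be stored in its own .ibd file, which requires innodb_file_per_table to be enabled when the table is created. In addition, there must be no foreign key constraint between the table being truncated and any other table, otherwise TRUNCATE TABLE fails. A foreign key between two columns of the same table is allowed.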

When a table is truncated, it is dropped and re-created in a new .ibd file, and the freed space is returned to the operating system. This is in contrast to truncating InnoDB tables that are stored within the InnoDB system tablespace (tables created when innodb_file_per_table=OFF), where only InnoDB can use the freed space after the table is truncated.

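When such a table is truncated, it is dropped and re-created in a new .ibd file, so the freed space is returned to the operating system. By contrast, when a table stored in the InnoDB system tablespace (a table created with innodb_file_per_table=OFF) is truncated, the freed space can be reused only by InnoDB.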

The ability to truncate tables and return disk space to the operating system also means that physical backups can be smaller. Truncating tables that are stored in the system tablespace (tables created when innodb_file_per_table=OFF) leaves blocks of unused space in the system tablespace.

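Being able to truncate tables and return the space to the operating system also means that physical backups can be smaller. With innodb_file_per_table=OFF, truncation leaves the unused space inside the system tablespace (and a physical backup still has to copy it).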
