The relationship between Innodb Log checkpointing and dirty Buffer pool pages

Original article: http://www.mysqlperformanceblog.com/2012/02/17/the-relationship-between-innodb-log-checkpointing-and-dirty-buffer-pool-pages/

This is a time-honored topic, and there’s no shortage of articles about it on this blog. I wanted to write a post that condenses and clarifies those posts, as it has taken me a while to really understand this relationship.

Some basic facts   

  • Most of us know that writing into Innodb updates buffer pool pages in memory and records page operations in the transaction (redo) log.
  • Behind the scenes, those updated (dirty) buffer pool pages are flushed down to the tablespace.
  • If Innodb stops (read: crashes) with dirty buffer pool pages, Innodb recovery must be done to rebuild the last consistent picture of the database.
  • Recovery uses the transaction log by redoing (hence the name ‘redo log’) the page operations in the log that had not already been flushed to the tablespaces.

Ultimately this mechanism was an optimization for slow drives: if you can sequentially write all the changes into a log, doing so on the fly as transactions come in will be faster than trying to randomly write the changes across the tablespaces. Sequential IO trumps Random IO.

However, even today in our modern flash storage world, where random IO is significantly less expensive (from a latency perspective, not dollars), this is still an optimization, because the longer we delay updating the tablespace, the more IOPs we can potentially conserve, condense, merge, etc. This is because:

  • The same row may be written multiple times before the page is flushed
  • Multiple rows within the same page can be written before the page is flushed

Innodb Log Checkpointing

So, first of all, what can we see about Innodb log checkpointing and what does it tell us?

mysql> SHOW ENGINE INNODB STATUS\G
---
LOG
---
Log sequence number 9682004056
Log flushed up to   9682004056
Last checkpoint at  9682002296

This shows us the virtual head of our log (Log sequence number), the last place the log was flushed to disk (Log flushed up to), and our last checkpoint (Last checkpoint at). The LSN grows forever, while the actual locations inside the transaction logs are reused in a circular fashion. Based on these numbers, we can determine how many bytes back in the transaction log our oldest uncheckpointed transaction is by subtracting the ‘Last checkpoint at’ value from the ‘Log sequence number’; here that is 9682004056 - 9682002296 = 1760 bytes. More on what a Checkpoint is in a minute. If you use Percona Server, it does the math for you by including some more output:
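The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sample text is copied from the status output shown earlier, and the regular expressions assume each LSN is printed as a single decimal number (older versions may print it differently).

```python
import re

# Sample LOG section copied from the SHOW ENGINE INNODB STATUS output above.
STATUS_LOG_SECTION = """\
Log sequence number 9682004056
Log flushed up to   9682004056
Last checkpoint at  9682002296
"""

# Assumed line formats; adjust the patterns if your server prints them differently.
LOG_FIELD_PATTERNS = {
    "lsn": r"Log sequence number\s+(\d+)",
    "flushed": r"Log flushed up to\s+(\d+)",
    "checkpoint": r"Last checkpoint at\s+(\d+)",
}


def parse_log_section(text):
    """Return the three LSN values from the LOG section as integers."""
    values = {}
    for name, pattern in LOG_FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"missing field: {name}")
        values[name] = int(match.group(1))
    return values


def checkpoint_age(values):
    """Bytes of redo log written since the last checkpoint."""
    return values["lsn"] - values["checkpoint"]


log_values = parse_log_section(STATUS_LOG_SECTION)
print(checkpoint_age(log_values))  # prints 1760
```

On stock MySQL, which does not print a Checkpoint age line, something like this is one way to get the number for monitoring.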

---
LOG
---
Log sequence number 9682004056
Log flushed up to   9682004056
Last checkpoint at  9682002296
Max checkpoint age    108005254
Checkpoint age target 104630090
Modified age          1760
Checkpoint age        1760

Probably most interesting here is the Checkpoint age, which is the subtraction I described above. I think of the Max checkpoint age as roughly the furthest back Innodb will allow us to go in the transaction logs; our Checkpoint age cannot exceed this without blocking client operations in Innodb to flush dirty buffers. Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs, but I’m unsure if that’s always the case.

Remember our transaction logs are circular, and the checkpoint age represents how far back the oldest unflushed transaction is in the log. We cannot overwrite that part of the log without potentially losing data on a crash, so Innodb does not permit such an operation and will block incoming writes until space is available to continue (safely) writing in the log.
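To make the blocking condition concrete, here is a small sketch that compares the Checkpoint age against the Max checkpoint age, using the values from the Percona Server output above. The function names and the warning threshold are hypothetical, and the real decision inside Innodb is more involved than a single comparison.

```python
# Values taken from the Percona Server status output shown earlier.
MAX_CHECKPOINT_AGE = 108005254
CHECKPOINT_AGE = 1760

# Hypothetical alerting threshold: warn once 75% of the allowed age is used.
WARN_THRESHOLD = 0.75


def checkpoint_pressure(checkpoint_age, max_checkpoint_age):
    """Fraction of the allowed checkpoint age currently in use."""
    if max_checkpoint_age <= 0:
        raise ValueError("max_checkpoint_age must be positive")
    return checkpoint_age / max_checkpoint_age


def writes_would_stall(checkpoint_age, max_checkpoint_age):
    """Simplified model: writes block once the age reaches the maximum."""
    return checkpoint_age >= max_checkpoint_age


pressure = checkpoint_pressure(CHECKPOINT_AGE, MAX_CHECKPOINT_AGE)
if writes_would_stall(CHECKPOINT_AGE, MAX_CHECKPOINT_AGE):
    print("client writes are likely blocked on flushing")
elif pressure >= WARN_THRESHOLD:
    print("checkpoint age is approaching the maximum")
else:
    print("plenty of room left in the transaction log")
```

With the sample numbers the server is nowhere near the limit, so the last branch is taken.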

Dirty Buffer Pool Pages

On the other side, we have dirty buffers. These two numbers from the BUFFER POOL AND MEMORY section of SHOW ENGINE INNODB STATUS are relevant:

Database pages          65530
...
Modified db pages       3

So we have 3 pages that have modified data in them, and that (in this case) is a very small percentage of the total buffer pool. A page in Innodb contains rows, indexes, etc., while a transaction may modify 1 or millions of rows. Also realize that a single modified page in the buffer pool may contain modified data from multiple transactions in the transaction log.
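For reference, the "very small percentage" can be computed directly from those two lines. This is a minimal sketch with a hypothetical function name; it simply divides Modified db pages by Database pages, which may not be exactly the ratio Innodb uses internally for innodb_max_dirty_pages_pct.

```python
import re

# The two relevant lines from the BUFFER POOL AND MEMORY section above.
BUFFER_POOL_SECTION = """\
Database pages          65530
Modified db pages       3
"""


def dirty_page_percent(text):
    """Percentage of buffer pool pages that are currently dirty."""
    total_match = re.search(r"Database pages\s+(\d+)", text)
    dirty_match = re.search(r"Modified db pages\s+(\d+)", text)
    if total_match is None or dirty_match is None:
        raise ValueError("buffer pool counters not found")
    total = int(total_match.group(1))
    dirty = int(dirty_match.group(1))
    if total == 0:
        return 0.0
    return 100.0 * dirty / total


print(f"{dirty_page_percent(BUFFER_POOL_SECTION):.4f}% of pages are dirty")
```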

As I said before, dirty pages are flushed to disk in the background. The order in which they are flushed has little to nothing to do with the transaction they are associated with, nor with the position associated with their modification in the transaction log. The effect of this is that as the thread managing the dirty page flushing goes about its business, it is not necessarily flushing to optimize the Checkpoint age; it is flushing to try to optimize IO and to obey the LRU in the buffer pool.

Since buffers can and will be flushed out of order, it may be the case that there are a lot of transactions in the transaction log that are fully flushed to disk (i.e., all the pages associated with said transaction are clean), but there could still be older transactions that are not flushed. This, in essence, is what fuzzy checkpointing is.

The checkpoint process is really a logical operation. It occasionally (as chunks of dirty pages get flushed) has a look through the dirty pages in the buffer pool to find the one with the oldest LSN, and that’s the Checkpoint. Everything older must be fully flushed.
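The idea can be illustrated with a toy model. This is a sketch of the concept only, not Innodb's implementation: the dictionary of dirty pages, the page names, and the function name are all made up for the example. Each dirty page remembers the LSN of its oldest unflushed change, and the checkpoint is the smallest of those values.

```python
def checkpoint_lsn(dirty_pages, current_lsn):
    """Toy model of a fuzzy checkpoint.

    dirty_pages maps a page id to the LSN of that page's oldest unflushed
    change. With no dirty pages left, the checkpoint catches up to the
    current LSN.
    """
    if not dirty_pages:
        return current_lsn
    return min(dirty_pages.values())


current_lsn = 500
dirty = {"page_a": 100, "page_b": 250, "page_c": 400}
print(checkpoint_lsn(dirty, current_lsn))  # 100: page_a holds the checkpoint back

# Flush out of order: the two newer pages go first.
del dirty["page_c"]
del dirty["page_b"]
print(checkpoint_lsn(dirty, current_lsn))  # still 100, checkpoint age unchanged

# Only flushing the oldest dirty page lets the checkpoint advance.
del dirty["page_a"]
print(checkpoint_lsn(dirty, current_lsn))  # 500
```

The middle step is the interesting one: two thirds of the dirty pages were flushed, yet the checkpoint did not move, because the page with the oldest change was still dirty.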

The main reason this is important is that if the Checkpoint age is not a factor in dirty buffer flushing, it can get too big and cause stalls in client operations: the algorithm that decides which dirty pages to flush does not optimize well for this, and sometimes it is not good enough on its own.

So, how can we optimize here? The short of it is: make Innodb flush more dirty pages. However, I can’t help but wonder if some tweaks could be made to the page flushing algorithm to make it more effective at choosing older dirty pages. It is unclear how that algorithm works without reading the source code.

There are a lot of ways to tune this. Here is a list of the most significant, roughly ordered from oldest to newest, and simultaneously from least effective to most effective:

  • innodb_max_dirty_pages_pct: This attempts to keep the percentage of dirty pages under control, and before the Innodb plugin this was really the only way to tune dirty buffer flushing. However, I have seen servers with 3% dirty buffers that are hitting their max checkpoint age. The way this increases dirty buffer flushing also doesn’t scale well on high IO subsystems; it effectively just doubles the dirty buffer flushing per second when the percentage of dirty pages exceeds this amount.
  • innodb_io_capacity: This setting, in spite of all our grand hopes that it would allow Innodb to make better use of our IO in all operations, simply controls the amount of dirty page flushing per second (and other background tasks like read-ahead). Make this bigger, and you flush more per second. This does not adapt; it simply does that many IOPs every second if there are dirty buffers to flush. It will effectively eliminate any optimization of IO consolidation if you have a low enough write workload (that is, dirty pages get flushed almost immediately; we might be better off without a transaction log in this case). It can also quickly starve data reads and writes to the transaction log if you set it too high.
  • innodb_write_io_threads: Controls how many threads will have writes in progress to the disk. I’m not sure why this is still useful if you can use Linux native AIO. These threads can also be rendered useless by filesystems that don’t allow parallel writing to the same file by more than one thread (particularly if you have relatively few tables and/or use the global tablespaces) *cough ext3 cough*.
  • innodb_adaptive_flushing: An Innodb plugin/5.5 setting that tries to be smarter about flushing more aggressively based on the number of dirty pages and the rate of transaction log growth.
  • innodb_adaptive_flushing_method: (Percona Server only) This adds a few new algorithms, and the more effective ones adjust the amount of dirty page flushing using a formula that considers the Checkpoint age and the Checkpoint age target (something you can set in Percona Server; otherwise it is effectively just a hair under the Max checkpoint age). The two main methods here are ‘estimate’ (a good default) and ‘keep_average’, which is designed for SSD type storage. Running Percona Server and using this method is my go-to setting for managing Checkpoint age.

To be clear, the best we have today (IMO) is using innodb_adaptive_flushing_method in Percona Server.

In any case, if you run any kind of production MySQL server, you should be monitoring your Checkpoint age and your Innodb dirty pages, and trying to see the relationship between those values and your write operations on disk. The additional controls in 5.5 and Percona Server are excellent reasons to consider an upgrade.