This is a time-honored topic, and there’s no shortage of articles about it on this blog. I wanted to write a post that condenses and clarifies those posts, as it has taken me a while to really understand the relationship between Innodb log checkpointing and dirty buffer pool pages.
Ultimately this mechanism was an optimization for slow drives: if you can sequentially write all the changes into a log as transactions come in, that is faster than randomly writing the changes across the tablespaces. Sequential IO trumps random IO.
However, even in today’s flash storage world, where random IO is significantly less expensive (in latency, not dollars), this is still an optimization: the longer we delay updating the tablespace, the more IOPS we can potentially conserve, condense, merge, etc. This is because:
So, first of all, what can we see about Innodb log checkpointing, and what does it tell us?
mysql> SHOW ENGINE INNODB STATUS\G
---
LOG
---
Log sequence number 9682004056
Log flushed up to 9682004056
Last checkpoint at 9682002296
This shows us the virtual head of our log (Log sequence number), the last place the log was flushed to disk (Log flushed up to), and our last Checkpoint (Last checkpoint at). The LSN grows forever, while the actual locations inside the transaction logs are reused in a circular fashion. Based on these numbers, we can determine how many bytes back in the transaction log our oldest uncheckpointed transaction is by subtracting the ‘Last checkpoint at’ value from the ‘Log sequence number’; here that is 9682004056 - 9682002296 = 1760 bytes. More on what a Checkpoint is in a minute. If you use Percona Server, it does the math for you by including some more output:
---
LOG
---
Log sequence number 9682004056
Log flushed up to 9682004056
Last checkpoint at 9682002296
Max checkpoint age 108005254
Checkpoint age target 104630090
Modified age 1760
Checkpoint age 1760
Probably most interesting here is the Checkpoint age, which is the subtraction I described above. I think of the Max checkpoint age as roughly the furthest back Innodb will allow us to go in the transaction logs; our Checkpoint age cannot exceed this without Innodb blocking client operations in order to flush dirty buffers. Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs, but I’m unsure if that’s always the case.
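The subtraction is easy to script when the server does not report Checkpoint age itself. The following is a minimal sketch, assuming the status text has already been fetched as a string; the helper names `read_counter` and `checkpoint_age` are made up for illustration, and the sample text is the LOG output quoted above with one field per line.

```python
import re

# Sample values copied from the LOG section quoted in this post.
SAMPLE_STATUS = """\
---
LOG
---
Log sequence number 9682004056
Log flushed up to 9682004056
Last checkpoint at 9682002296
Max checkpoint age 108005254
"""


def read_counter(status_text, label):
    """Return the integer that follows `label` in the status text."""
    match = re.search(re.escape(label) + r"\s+(\d+)", status_text)
    if match is None:
        raise ValueError(f"label not found: {label}")
    return int(match.group(1))


def checkpoint_age(status_text):
    """Bytes of log between the last checkpoint and the head of the log."""
    lsn = read_counter(status_text, "Log sequence number")
    last_checkpoint = read_counter(status_text, "Last checkpoint at")
    return lsn - last_checkpoint


age = checkpoint_age(SAMPLE_STATUS)
max_age = read_counter(SAMPLE_STATUS, "Max checkpoint age")
print(age)                            # 1760
print(f"{100 * age / max_age:.4f}%")  # share of the allowed age in use
```

The "Max checkpoint age" line only appears in the Percona Server output shown above, so the last two lines of the sketch assume that field is present.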
Remember that our transaction logs are circular, and the Checkpoint age represents how far back in the log the oldest unflushed transaction is. We cannot overwrite that part of the log without potentially losing data on a crash, so Innodb does not permit such an operation and will block incoming writes until space is available to continue (safely) writing to the log.
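A toy model can make the circular log and the blocking behavior concrete. This is a sketch under simplifying assumptions, not how Innodb implements it: the class `CircularLog` and its methods are hypothetical, and the limit here is the full log capacity rather than a lower Max checkpoint age.

```python
class CircularLog:
    """Toy model: LSNs grow forever, physical offsets wrap around."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.lsn = 0             # head of the log, only ever increases
        self.checkpoint_lsn = 0  # everything older is flushed to the tablespace

    def checkpoint_age(self):
        return self.lsn - self.checkpoint_lsn

    def offset(self, lsn):
        """Physical position of an LSN inside the reused log space."""
        return lsn % self.capacity

    def try_write(self, nbytes):
        """Append nbytes, or return False if that would overwrite
        log records that are not yet covered by a checkpoint."""
        if self.checkpoint_age() + nbytes > self.capacity:
            return False  # a real server would stall the writer here
        self.lsn += nbytes
        return True

    def advance_checkpoint(self, new_checkpoint_lsn):
        # The checkpoint only moves forward and never past the head.
        capped = min(new_checkpoint_lsn, self.lsn)
        self.checkpoint_lsn = max(self.checkpoint_lsn, capped)


log = CircularLog(capacity_bytes=1000)
print(log.try_write(800))    # True
print(log.try_write(300))    # False: 800 + 300 exceeds the 1000 bytes available
log.advance_checkpoint(500)  # dirty pages up to LSN 500 were flushed
print(log.try_write(300))    # True: the age is now 300, so 300 more bytes fit
print(log.lsn, log.offset(log.lsn))  # 1100 100
```

The second write is refused until the checkpoint advances, which is the stall described above: the only way to free log space is to flush the dirty pages that hold the checkpoint back.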
On the other side, we have dirty buffers. These two numbers from the BUFFER POOL AND MEMORY section of SHOW ENGINE INNODB STATUS are relevant:
Database pages 65530
...
Modified db pages 3
So we have 3 pages with modified data in them, which (in this case) is a very small percentage of the total buffer pool. A page in Innodb contains rows, indexes, etc., while a transaction may modify one row or millions of rows. Also realize that a single modified page in the buffer pool may contain modified data from multiple transactions in the transaction log.
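To put a number on "very small", the two counters can be turned into a percentage. A minimal sketch, assuming the status text is available as a string; `dirty_page_percent` is a hypothetical helper, and the ratio is taken against the "Database pages" count quoted above.

```python
import re

# The two counters quoted from the BUFFER POOL AND MEMORY section.
SAMPLE_BUFFER_POOL = """\
Database pages 65530
Modified db pages 3
"""


def dirty_page_percent(status_text):
    """Percentage of database pages that are currently modified (dirty)."""
    total = int(re.search(r"Database pages\s+(\d+)", status_text).group(1))
    dirty = int(re.search(r"Modified db pages\s+(\d+)", status_text).group(1))
    return 100.0 * dirty / total


print(f"{dirty_page_percent(SAMPLE_BUFFER_POOL):.4f}%")  # 0.0046%
```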
As I said before, dirty pages are flushed to disk in the background. The order in which they are flushed has little to nothing to do with the transaction they are associated with, nor with the position of their modification in the transaction log. The effect is that the thread managing dirty page flushing is not necessarily flushing to optimize the Checkpoint age as it goes about its business; it is flushing to try to optimize IO and to obey the LRU in the buffer pool.
Since buffers can and will be flushed out of order, there may be a lot of transactions in the transaction log that are fully flushed to disk (i.e., all the pages associated with those transactions are clean) while older transactions are still not flushed. This, in essence, is what fuzzy checkpointing is.
The checkpoint process is really a logical operation. Occasionally (as chunks of dirty pages get flushed) it looks through the dirty pages in the buffer pool to find the one with the oldest LSN, and that’s the Checkpoint. Everything older must be fully flushed.
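The idea that the Checkpoint is simply the oldest LSN among the dirty pages can be sketched in a few lines. This is an illustrative model only: the `BufferPool` class, its methods, and the page names are hypothetical, and it assumes each dirty page remembers the LSN at which it first became dirty.

```python
class BufferPool:
    """Toy model of dirty-page tracking for fuzzy checkpointing."""

    def __init__(self):
        self.lsn = 0     # head of the log
        self.dirty = {}  # page_id -> LSN of the oldest unflushed change

    def modify(self, page_id, nbytes):
        # Remember the LSN at which the page first became dirty; later
        # changes to an already dirty page do not move that value.
        self.dirty.setdefault(page_id, self.lsn)
        self.lsn += nbytes

    def flush(self, page_id):
        # Writing the page to the tablespace makes it clean again.
        self.dirty.pop(page_id, None)

    def checkpoint(self):
        # Oldest LSN among dirty pages. With no dirty pages, everything
        # up to the head of the log has been flushed.
        return min(self.dirty.values(), default=self.lsn)


pool = BufferPool()
pool.modify("A", 100)  # A is dirty since LSN 0
pool.modify("B", 100)  # B is dirty since LSN 100
pool.modify("C", 100)  # C is dirty since LSN 200

# Flush out of order: the newer pages go first.
pool.flush("B")
pool.flush("C")
print(pool.checkpoint())  # 0, page A still holds the checkpoint back

pool.flush("A")
print(pool.checkpoint())  # 300, nothing is dirty any more
```

Flushing B and C does real IO but leaves the Checkpoint where it was, which is why flushing in an order that ignores the log position can let the Checkpoint age grow.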
This matters mainly because, if the Checkpoint age is not a factor in dirty buffer flushing, it can grow too large and cause stalls in client operations: the algorithm that decides which dirty pages to flush does not optimize well for Checkpoint age, and sometimes it is not good enough on its own.
So, how can we optimize here? The short of it is: make Innodb flush more dirty pages. However, I can’t help but wonder if some tweaks could be made to the page flushing algorithm so that it is more effective at choosing older dirty pages. It is unclear how that algorithm works without reading the source code.
There are a lot of ways to tune this. Here is a list of the most significant, roughly ordered from oldest to newest and, at the same time, from least effective to most effective: