Architecture
- Master Thread
Responsible for flushing in-memory data back to disk and keeping it consistent: flushing dirty pages, merging the insert buffer, reclaiming undo pages, and so on.
- IO Thread
InnoDB relies heavily on AIO (asynchronous I/O) to handle I/O requests. Since version 1.0.x there are 4 read threads and 4 write threads, configured via the innodb_read_io_threads and innodb_write_io_threads parameters (see the sketch after this list).
- Purge Thread
Cleans up undo pages that are no longer needed after their transactions have committed.
- Page Cleaner Thread
Flushes dirty pages.
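Neither I/O-thread parameter is dynamic, so here is a minimal sketch of checking them from the client, plus the configuration-file entries that would change them (the values shown are only illustrative assumptions, not recommendations):

```sql
-- Inspect the background I/O thread configuration; both variables are read-only
-- at runtime and can only be changed in the configuration file plus a restart.
SHOW VARIABLES LIKE 'innodb_%io_threads';

-- Illustrative my.cnf entries (values are an assumption, not a recommendation):
-- [mysqld]
-- innodb_read_io_threads  = 8
-- innodb_write_io_threads = 8
```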
Memory
InnoDB's memory consists of the following regions (their configured sizes can be checked as sketched after this list):
- Redo log buffer (redo_log_buffer)
- Additional memory pool (innodb_additional_mem_pool_size)
- Buffer pool (innodb_buffer_pool), which in turn holds:
  - Data pages
  - Insert buffer
  - Lock info
  - Index pages
  - Adaptive hash index
  - Data dictionary information
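A short sketch of checking the configured sizes from the client; note that innodb_additional_mem_pool_size was removed in MySQL 5.7.4, so that query may return an empty set on newer servers:

```sql
-- Configured sizes of the main InnoDB memory regions, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';          -- buffer pool
SHOW VARIABLES LIKE 'innodb_log_buffer_size';           -- redo log buffer
SHOW VARIABLES LIKE 'innodb_additional_mem_pool_size';  -- removed in MySQL 5.7.4
```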
Buffer pool management strategies
- LRU List
- Free List (pages that are not yet in use; when a new page is needed InnoDB takes one from here first, and only evicts from the tail of the LRU list once the free list is empty)

LRU stands for Least Recently Used: frequently and recently used pages sit near the front of the list, and pages are evicted from its tail. InnoDB refines the plain algorithm with a midpoint position: a newly read page is inserted at the midpoint rather than at the head, an approach known as the midpoint insertion strategy. The midpoint sits at 5/8 of the list by default and is controlled by innodb_old_blocks_pct.
The pages in front of the midpoint are usually called hot data. With a naive LRU, a query that has to scan the whole table would pull every page of that table into the LRU list even though each page is used only once, evicting the genuinely hot pages in the process.
By default InnoDB treats the front 5/8 of the list as hot data and the trailing 3/8 as transient data; if you can estimate your workload's access pattern, you can tune the parameter so that your hot pages are not evicted.
InnoDB also provides innodb_old_blocks_time (default 1000 ms) to protect the hot region: a page read into the old part of the list is promoted to the hot part only when it is accessed again after staying there for at least that many milliseconds. Both parameters can be adjusted at runtime, as sketched below.
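A minimal sketch of inspecting and changing the two parameters; the values set here are their documented defaults, shown only to illustrate the syntax:

```sql
-- Current values of the midpoint-related parameters.
SHOW VARIABLES LIKE 'innodb_old_blocks%';

-- Both are dynamic; the values below are the documented defaults.
SET GLOBAL innodb_old_blocks_pct  = 37;    -- old sublist is about 3/8 of the LRU list
SET GLOBAL innodb_old_blocks_time = 1000;  -- ms a page must stay "old" before promotion
```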
InnoDB's runtime state can be inspected with SHOW ENGINE INNODB STATUS or, in newer versions, with SELECT * FROM information_schema.INNODB_BUFFER_POOL_STATS. The following is an example of the former:
=====================================
2020-06-08 19:49:26 0x48a8 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 59 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 729 srv_active, 0 srv_shutdown, 26208 srv_idle
srv_master_thread log flush and writes: 26937
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 1468
OS WAIT ARRAY INFO: signal count 1465
RW-shared spins 0, rounds 2908, OS waits 1454
RW-excl spins 0, rounds 0, OS waits 0
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 2908.00 RW-shared, 0.00 RW-excl, 0.00 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 3785630
Purge done for trx's n:o < 3785630 undo n:o < 0 state: running but idle
History list length 10
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 283213221646960, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221646088, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221645216, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221644344, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221643472, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221642600, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221641728, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221640856, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 283213221639984, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
--------
FILE I/O
--------
I/O thread 0 state: wait Windows aio (insert buffer thread)
I/O thread 1 state: wait Windows aio (log thread)
I/O thread 2 state: wait Windows aio (read thread)
I/O thread 3 state: wait Windows aio (read thread)
I/O thread 4 state: wait Windows aio (read thread)
I/O thread 5 state: wait Windows aio (read thread)
I/O thread 6 state: wait Windows aio (write thread)
I/O thread 7 state: wait Windows aio (write thread)
I/O thread 8 state: wait Windows aio (write thread)
I/O thread 9 state: wait Windows aio (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
1684 OS file reads, 11510 OS file writes, 6523 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.41 writes/s, 0.31 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 51, seg size 53, 0 merges
merged operations:
insert 0, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 34679, node heap has 2 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 2 buffer(s)
Hash table size 34679, node heap has 1 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 0 buffer(s)
Hash table size 34679, node heap has 4 buffer(s)
1.15 hash searches/s, 1.07 non-hash searches/s
---
LOG
---
Log sequence number 820345990
Log flushed up to 820345990
Pages flushed up to 820345990
Last checkpoint at 820345981
0 pending log flushes, 0 pending chkp writes
4364 log i/o's done, 0.20 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137297920
Dictionary memory allocated 705343
Buffer pool size 8192
Free buffers 6976
Database pages 1207
Old database pages 461
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1118, created 89, written 6385
0.00 reads/s, 0.00 creates/s, 0.17 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1207, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Process ID=4284, Main thread ID=6668, state: sleeping
Number of rows inserted 1704, updated 1428, deleted 0, read 1006480
0.00 inserts/s, 0.07 updates/s, 0.00 deletes/s, 39.32 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
The Buffer pool size line shows 8192 pages in total, i.e. a buffer pool of 8192 * 16 KB = 128 MB. Note that Free buffers plus Database pages does not add up to Buffer pool size, because part of the buffer pool is also handed to the adaptive hash index, lock information, the insert buffer and so on, and those pages are not managed by the LRU list. Another figure of interest is Buffer pool hit rate, the buffer pool hit ratio; in this report it is 1000 / 1000, i.e. 100%, and as a rule of thumb it should not fall below 95%.
InnoDB has supported page compression since version 1.0.x (the unzip_LRU figures in the output above relate to compressed pages).
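The figures discussed above can also be read from information_schema.INNODB_BUFFER_POOL_STATS instead of parsing the monitor output; a minimal sketch, with column names as in MySQL 5.6+:

```sql
-- One row per buffer pool instance.
SELECT POOL_SIZE,                 -- total pages, here 8192 (8192 * 16 KB = 128 MB)
       FREE_BUFFERS,              -- pages on the free list
       DATABASE_PAGES,            -- pages on the LRU list
       OLD_DATABASE_PAGES,        -- pages in the old sublist
       MODIFIED_DATABASE_PAGES,   -- dirty pages
       HIT_RATE                   -- buffer pool hit rate
FROM information_schema.INNODB_BUFFER_POOL_STATS;
```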
- Flush List
Once a page on the LRU list has been modified it is called a dirty page: the version in the buffer pool no longer matches the one on disk, and the change has to be written back through the checkpoint mechanism. Every dirty page is put on the flush list while also remaining on the LRU list; Modified db pages in the output above is the current number of dirty pages (it can also be watched through a status counter, as sketched below).
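The dirty page count is exported as a status counter as well, so it can be monitored without parsing the report; a short sketch:

```sql
-- Current number of dirty pages and the cumulative number of pages flushed.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed';
```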
Redo log buffer
SHOW VARIABLES LIKE 'innodb_log_buffer_size'
shows the size of the redo log buffer. Its contents are flushed to the redo log files in the following cases (see the sketch after this list):
- by the master thread, roughly once per second
- when a transaction commits
- when less than half of the redo log buffer is free
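A short sketch of checking the buffer size from the client; innodb_flush_log_at_trx_commit is an adjacent parameter, not covered above, that controls the flush performed at commit:

```sql
-- Redo log buffer size in bytes, and the per-commit flush behaviour.
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';  -- 1 = write and fsync at every commit
```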
Additional memory pool
In InnoDB, this memory is managed in a heap-like fashion. For example, each frame buffer in the buffer pool (innodb_buffer_pool) has a corresponding buffer control block that records information such as LRU position, locks and waiters; the memory for these control structures is allocated from the additional memory pool.
The checkpoint mechanism
To bridge the speed gap between CPU and disk, page operations are performed in the buffer pool: after a DML (Data Manipulation Language) statement runs, the affected page becomes a dirty page that eventually has to be written to disk. If every change were written to disk immediately, performance would be poor whenever updates concentrate on a few records, and a failure during the write could still lose data. For this reason most transactional databases adopt a write-ahead log strategy: when a transaction commits, the redo log is written first and the data pages are modified afterwards, which guarantees durability.
A checkpoint exists to mark how much of the redo log may be overwritten and to write the corresponding in-memory pages to disk. A checkpoint is forced when the buffer pool runs short of space or when the redo log would otherwise become unusable. Exactly when a checkpoint fires, which dirty pages are chosen and how many of them are flushed is a rather involved decision.
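How far the checkpoint lags behind the current log sequence number can be read from the LOG section above; as a sketch, assuming the log_lsn_* counters of information_schema.INNODB_METRICS are available (they may need to be enabled first), the same figures can also be queried directly:

```sql
-- Enable the LSN-related counters (wildcards are accepted), then read them.
SET GLOBAL innodb_monitor_enable = 'log_lsn%';
SELECT NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME IN ('log_lsn_current', 'log_lsn_last_checkpoint');
-- In the LOG section above: 820345990 - 820345981 = 9 bytes of redo not yet checkpointed.
```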
In InnoDB there are two kinds of checkpoint:
- sharp checkpoint
- fuzzy checkpoint
A sharp checkpoint happens when the database is shut down: all dirty pages are flushed back to disk. This is the default shutdown behaviour, corresponding to innodb_fast_shutdown=1 (see the sketch below). While the database is running, however, InnoDB normally avoids flushing all dirty pages at once and instead flushes only a portion of them each time, which keeps performance up and is called a fuzzy checkpoint.
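innodb_fast_shutdown is dynamic, so the behaviour can be chosen right before shutting the server down; a minimal sketch:

```sql
-- 0 = slow shutdown (full purge and change buffer merge), 1 = default fast shutdown.
SHOW VARIABLES LIKE 'innodb_fast_shutdown';
SET GLOBAL innodb_fast_shutdown = 0;
```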
A fuzzy checkpoint occurs in one of four situations:
- master thread checkpoint
- flush_lru_list checkpoint
- async/sync flush checkpoint
- dirty page too much checkpoint
The master thread is implemented differently across versions, so we skip the details here. The master thread checkpoint flushes a portion of the dirty pages to disk asynchronously, roughly every second or every ten seconds, and other operations can still proceed while it runs.
flush_lru_list checkpoint: InnoDB needs to keep roughly 100 free pages available in the LRU list, so this checkpoint fires when free pages run short. Before MySQL 5.6 the check was performed in the user query thread and therefore blocked user operations; since 5.6 it is done by a dedicated page cleaner thread, and the number of pages to keep available is controlled by innodb_lru_scan_depth (default 1024, see the sketch below).
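A minimal sketch of inspecting and adjusting the page cleaner's scan depth; the value set here is just the documented default:

```sql
-- Number of LRU-list pages the page cleaner scans per buffer pool instance.
SHOW VARIABLES LIKE 'innodb_lru_scan_depth';
SET GLOBAL innodb_lru_scan_depth = 1024;
```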
async/sync flush checkpoint: fired when the redo log is about to become unusable; a portion of the dirty pages is flushed to disk.
dirty page too much checkpoint: fired when the share of dirty pages in the buffer pool exceeds innodb_max_dirty_pages_pct; a portion of the dirty pages is then flushed to disk. Before InnoDB 1.0.x the default for this parameter was 90, afterwards 75.
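A minimal sketch of checking and adjusting the threshold; the value set here is the post-1.0.x default mentioned above:

```sql
-- Dirty-page percentage at which InnoDB starts flushing more aggressively.
SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct%';
SET GLOBAL innodb_max_dirty_pages_pct = 75;
```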