PostgreSQL WAL summary: all the concepts and related operations in one article

Official documentation
https://www.postgresql.org/docs/11/wal-intro.html
https://www.postgresql.org/docs/11/wal-configuration.html
https://www.postgresql.org/docs/11/runtime-config-wal.html
https://www.postgresql.org/docs/11/runtime-config-preset.html
https://www.postgresql.org/docs/11/runtime-config-replication.html

https://www.postgresql.org/docs/11/wal-intro.html
WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage.
Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction. The log file is written sequentially, and so the cost of syncing the log is much less than the cost of flushing the data pages.

Note: flushing dirty pages to the data files is random I/O, which performs far worse than the sequential I/O of writing the log.


Summary
1. WAL stands for Write-Ahead Log. It is the counterpart of Oracle's redo log, with one difference: Oracle writes to a fixed set of redo log files in rotation, while PostgreSQL switches WAL files dynamically, moving on to a new segment when the current one fills up, so WAL segments are generated continuously.
2. A single WAL segment file is 16MB by default, controlled by wal_segment_size; in other words, PostgreSQL stores WAL as a series of segment files of 16MB (by default) each. The value is rarely changed. Before PostgreSQL 11 it could only be set when compiling PostgreSQL; from version 11 on it can be changed with initdb or pg_resetwal. Changing wal_segment_size via pg_resetwal loses no data across the restart, provided the server was shut down cleanly with pg_ctl stop beforehand. However, if you change this parameter on the primary of a streaming-replication setup and restart it, the primary's data is intact but replication breaks: the standby reports ERROR:  requested WAL segment 0000000100000000000000XX has already been removed, and the primary logs the same error. The errors stop once the standby is shut down, and resume on both sides as soon as the standby is started again.
3. The number of WAL files under the pg_wal directory is determined by min_wal_size, max_wal_size, and wal_keep_segments. wal_keep_segments is the minimum number of past WAL segments kept in pg_wal for standby replication, and in the usual case that roughly defines the WAL size range. If it is set to 0, the number of WAL files also depends on max_wal_size, which limits the amount of WAL produced between two checkpoints; WAL beyond that size is removed automatically. This is only a soft limit, and WAL size can exceed max_wal_size. Independently of max_wal_size, the wal_keep_segments + 1 most recent WAL files are always retained. In addition, when WAL archiving is enabled, old segments cannot be removed or recycled until they have been archived; if archiving cannot keep up with WAL generation, or archive_command fails repeatedly, old WAL files accumulate in pg_wal until the situation is resolved.
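As a rough, hedged sketch of how these limits translate into segment counts (this mimics the retention rules summarized above, it is not PostgreSQL code; the default 16MB wal_segment_size is assumed):

```python
# Sketch: approximate how many WAL segment files pg_wal retains,
# under the rules summarized above. Assumes default 16MB segments.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def min_segments_retained(wal_keep_segments: int) -> int:
    # Independently of max_wal_size, the wal_keep_segments + 1 most
    # recent WAL files are always kept.
    return wal_keep_segments + 1

def max_wal_soft_cap_segments(max_wal_size_mb: int) -> int:
    # max_wal_size is a soft cap on WAL between checkpoints,
    # expressed here as a count of 16MB segments.
    return max_wal_size_mb * 1024 * 1024 // WAL_SEGMENT_SIZE

print(min_segments_retained(32))        # 33 segments always kept
print(max_wal_soft_cap_segments(1024))  # default 1GB -> 64 segments
```

Remember these are bounds, not guarantees: archiving backlogs and stale replication slots (items 3 and 4) can push the real count well past the soft cap.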
4. If max_wal_size and wal_keep_segments seem to have no effect and WAL files keep accumulating in pg_wal instead of being removed, the main causes are heavy load, a failing archive_command, or wal_keep_segments set too high. A stale replication slot that is no longer in use (pg_replication_slots.active = f) also blocks WAL removal, because replication slots are an automated mechanism that keeps the primary from removing WAL segments until every standby has received them; WAL then grows past max_wal_size and the pg_wal directory keeps getting bigger.
5. Under normal operation the files in pg_wal are the online WAL and must not be deleted; if they are deleted and the database later restarts, data will be lost.
6. There is no retention-period parameter for pg_wal, i.e. nothing like MySQL's expire_logs_days. After a checkpoint, old WAL files that are no longer needed are removed or recycled (renamed, in numbered sequence, to become future segments).
7. Replication slots provide an automated way to ensure that the primary does not remove WAL segments until they have been received by all standbys.
8. Switching WAL manually with select pg_switch_wal() completes quickly but does not remove any WAL files from pg_wal. Running the checkpoint statement is noticeably slower, and afterwards some WAL files in pg_wal are removed automatically, because old WAL files are removed or recycled once they are no longer needed (verified by experiment: after running checkpoint, some WAL files were removed automatically).
9. What min_wal_size means: as long as WAL disk usage stays below this setting, old WAL files are recycled at checkpoint time for future use (reused by renaming them to new segment names) rather than removed outright. This ensures enough WAL space is reserved to handle spikes in WAL usage, for example while running large batch jobs. The default is 80 MB. In other words, it is the minimum disk space reserved for WAL files: files within this size are not deleted but overwritten in place. For example, if the value is 100MB and the WAL files already in pg_wal occupy 80MB, usage (80MB) is below the 100MB threshold, so all 80MB of WAL files are reused. Note that min_wal_size refers to WAL disk usage, not to the free space df reports for the filesystem; it does not mean WAL is reclaimed to protect against the physical disk itself filling up.
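A hedged example of how the parameters from items 3 and 9 might look together in postgresql.conf (the values are illustrative, not recommendations):

```ini
# Illustrative values only -- tune to your workload and disk capacity.
min_wal_size = 80MB         # WAL below this is recycled, not removed
max_wal_size = 1GB          # soft cap on WAL between checkpoints
wal_keep_segments = 32      # extra segments kept for streaming standbys
checkpoint_timeout = 5min   # max time between automatic checkpoints
```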
10. What triggers a checkpoint in PostgreSQL:
10.1. Running the CHECKPOINT command manually
10.2. Running commands that require a checkpoint (e.g. pg_start_backup, pg_ctl stop|restart, and so on)
10.3. Reaching the configured checkpoint interval (checkpoint_timeout)
10.4. max_wal_size being reached
10.5. A normal database shutdown
11. Ways to get WAL archived, assuming archive_mode=on and archive_command are already set:
11.1. select pg_switch_wal();
11.2. A WAL segment fills up and is archived automatically
11.3. The archive_timeout interval elapses
12. Archiving does not delete the original WAL files; it simply copies each segment to the archive location.
13. The command for cleaning up pg_wal manually is pg_archivecleanup; the following removes all WAL files older than 000000010000000000000005
pg_archivecleanup /postgresql/pgsql/data/pg_wal 000000010000000000000005
14. Most WAL configuration parameters can be left at their defaults; configuring min_wal_size, max_wal_size, wal_keep_segments, and checkpoint_timeout is usually enough.
15. The three wal_level settings (minimal, replica, logical) produce roughly similar WAL volumes. In this respect WAL resembles Oracle's redo and MySQL's redo log, not MySQL's binlog, where binlog_format=statement versus binlog_format=row makes a huge difference: with statement the binlog records only the SQL text, while with row it records the primary key and every other column of each affected row. If a delete statement removes 100,000 rows, statement format logs a single SQL statement taking a few dozen bytes, whereas row format writes all 100,000 rows into the binlog.
16. Under write-heavy workloads PostgreSQL generates large volumes of WAL, often far exceeding the amount of data actually updated; this is called "WAL write amplification" and has two main causes, listed below. As an aside, this was one of the reasons Uber switched from PostgreSQL to MySQL; see https://eng.uber.com/postgres-to-mysql-migration/
16.1. full page writes: PostgreSQL writes the entire content of each page to the WAL file on the first modification of that page after a checkpoint.
16.2. If an UPDATE moves a row to a new location, the index entries must change accordingly, and those changes are also logged to WAL. The index changes may in turn trigger full-page writes of index pages, further amplifying what gets written to WAL.
17. The PostgreSQL installation directory contains a command called pg_xlogdump (renamed pg_waldump, and the pg_xlog directory renamed pg_wal, in PostgreSQL 10) that can decode WAL files. (For comparison, MySQL's redo log can be inspected with strings and its binlog with mysqlbinlog; Oracle has no Linux command for reading redo logs directly.) For example:
pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b
18. Mapping between WAL file names and LSNs: the last two hex digits of the WAL file name equal the first two digits of the LSN (after the slash), and the byte offset within the WAL file is the decimal value of the last six hex digits of the LSN.
18.1. In the output of select pg_walfile_name(pg_current_wal_lsn()), the WAL file name 00000001000000010000004D ends in 4D, and the LSN 1/4D000140 begins with 4D right after the slash; that is, the last two digits of the file name are the first two digits of the LSN.
18.2. In the output of SELECT * FROM pg_walfile_name_offset(pg_current_wal_lsn()), the offset within the WAL file is 320, and the last six digits of LSN 1/4D000140 are 000140, whose decimal value is exactly 320; that is, the file offset is the decimal value of the LSN's last six hex digits.
For example, with the current WAL position as follows:
postgres=# select pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 1/4D000140
(1 row)

postgres=# select pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name
--------------------------
 00000001000000010000004D
(1 row)

postgres=# SELECT * FROM pg_walfile_name_offset(pg_current_wal_lsn());
        file_name         | file_offset
--------------------------+-------------
 00000001000000010000004D |         320
(1 row)

postgres=# select x'000140'::int;
 int4
------
  320
(1 row)
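The arithmetic in rule 18 can be reproduced outside the database. The sketch below mimics (it does not call) pg_walfile_name_offset, assuming the default 16MB segment size and timeline ID 1:

```python
# Sketch: decode a PostgreSQL LSN into a WAL file name and byte offset,
# assuming the default 16MB wal_segment_size and timeline 1.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024

def walfile_name_offset(lsn: str, timeline: int = 1):
    high_str, low_str = lsn.split("/")
    high = int(high_str, 16)          # upper 32 bits of the LSN
    low = int(low_str, 16)            # lower 32 bits of the LSN
    segno = low // WAL_SEGMENT_SIZE   # which 16MB segment within 'high'
    offset = low % WAL_SEGMENT_SIZE   # byte offset inside that segment
    # file name = timeline + high word + segment number, each 8 hex digits
    name = f"{timeline:08X}{high:08X}{segno:08X}"
    return name, offset

name, offset = walfile_name_offset("1/4D000140")
print(name)    # 00000001000000010000004D
print(offset)  # 320
```

This matches the pg_walfile_name_offset output above: segment 4D, offset 0x140 = 320.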

WAL-related parameters and their explanations:

wal_segment_size:
Reports the size of write ahead log segments. The default value is 16MB
The size of a single WAL segment file, 16MB by default. It is rarely changed, and before PostgreSQL 11 it could only be set when compiling PostgreSQL.

wal_level
Determines how much information is written to WAL. The default is replica, which writes enough data to support WAL archiving and replication, including running read-only queries on a standby server.
From lowest to highest, the three levels are minimal, replica, and logical. minimal removes all logging except the information required to recover from a crash or immediate shutdown; logical is replica plus the information needed for logical decoding.
In releases prior to 9.6 this parameter also allowed the values archive and hot_standby. These are still accepted but mapped to replica.

wal_keep_segments:
Specifies the minimum number of past log file segments kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. Each segment is normally 16 megabytes.
If set to 0, the number of WAL files depends on max_wal_size and the other parameters instead. The default is 0.

min_wal_size:
min_wal_size puts a minimum on the amount of WAL files recycled for future usage
https://www.postgresql.org/docs/11/runtime-config-wal.html
As long as WAL disk usage stays below this setting, old WAL files are always recycled for future use at a checkpoint, rather than removed. This can be used to ensure that enough WAL space is reserved to handle spikes in WAL usage, for example when running large batch jobs. The default is 80 MB. This parameter can only be set in the postgresql.conf file or on the server command line.
Here "recycled" means a segment is renamed to a future segment name and reused rather than deleted; min_wal_size can be thought of as the amount of disk reserved for WAL files, within which segments are overwritten rather than removed.
Note: for example, if the value were 100MB and the WAL files already in pg_wal occupied 80MB, usage (80MB) would be below the 100MB threshold, so all 80MB of WAL files would be reused; this can be used to ensure enough WAL space is reserved to handle spikes in WAL usage.

max_wal_size:
Maximum size to let the WAL grow to between automatic WAL checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances, like under heavy load, a failing archive_command, or a high wal_keep_segments setting. The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery.
https://www.postgresql.org/docs/11/warm-standby.html#STREAMING-REPLICATION-SLOTS
Replication slots provide an automated way to ensure that the master does not remove WAL segments until they have been received by all standbys
For example, if a stale replication slot that is no longer in use (pg_replication_slots.active = f) exists, WAL is not removed automatically; WAL then exceeds max_wal_size and the pg_wal directory keeps growing.

checkpoint_timeout
Maximum time between automatic WAL checkpoints, in seconds. The valid range is between 30 seconds and one day. The default is five minutes (5min). Increasing this parameter can increase the amount of time needed for crash recovery.
Note: if checkpoints are triggered less than 30s apart, the server log will suggest increasing max_wal_size.

checkpoint_completion_target
Specifies the target of checkpoint completion, as a fraction of total time between checkpoints. The default is 0.5.
The default is 0.5, meaning PostgreSQL is expected to finish each checkpoint in about half the time before the next checkpoint is due to start; in other words, each checkpoint should complete within 50% of the checkpoint interval.
Suppose checkpoint_timeout is 30 minutes and, with slow I/O and a large amount of dirty data to flush (say the cycle produced 100GB of WAL), a checkpoint takes a long time: a setting of 0.5 spreads the checkpoint over 15 minutes, while 0.8 allows 24 minutes. Raising this value reduces the performance impact of checkpoints, but the higher it is, the more exposed the database is if it crashes, since more WAL must be replayed during recovery.
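The time-budget arithmetic above is simply interval times target fraction; a trivial sketch:

```python
# Sketch: time budget a checkpoint gets under checkpoint_completion_target,
# i.e. the fraction of the checkpoint interval over which dirty-buffer
# writes are spread.

def checkpoint_write_budget_minutes(checkpoint_timeout_min: float,
                                    completion_target: float) -> float:
    return checkpoint_timeout_min * completion_target

print(checkpoint_write_budget_minutes(30, 0.5))  # ~15 minutes
print(checkpoint_write_budget_minutes(30, 0.8))  # ~24 minutes
```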

fsync: whether each commit forces the WAL to be flushed to disk. The default is on, which means committed transactions are guaranteed to be written to the WAL files. Turning fsync off often yields a performance gain, but a power failure or system crash can then cause unrecoverable data corruption. Similar to MySQL's innodb_flush_log_at_trx_commit parameter.
If you turn this parameter off, you should generally turn off full_page_writes as well.

synchronous_commit: specifies whether a transaction must wait for its WAL records to be flushed to disk before the command returns "success" to the client. The default is on. When set to off, there is a window (at most three times wal_writer_delay) between reporting success to the client and the transaction actually being safe against a server crash. Unlike fsync, setting this parameter to off does not create any risk of database inconsistency.

full_page_writes: controls whether the server writes the entire content of each page to WAL on the first modification of that page after a checkpoint. This is needed because a page write in progress during an operating-system crash may be only partially completed, leaving a mix of old and new data in the on-disk page; during crash recovery, the row-level change data normally stored in WAL is not enough to restore such a page. Storing the full page image guarantees the page can be restored correctly, at the cost of increasing the amount of data written to WAL. Since WAL replay always starts from a checkpoint, doing this on the first change of each page after a checkpoint is sufficient; one way to reduce full-page-write overhead is therefore to increase the checkpoint interval. Turning this parameter off speeds up normal operation, but after a system failure it may lead to unrecoverable or silent data corruption. The default is on.

wal_buffers: the amount of shared memory used for WAL data not yet written to disk. The default of -1 selects 1/32 (about 3%) of shared_buffers, but not less than 64kB and not more than one WAL segment. The value can be set manually if the automatic choice is too large or too small, but any positive value below 32kB is treated as 32kB. This parameter can only be set at server start. The WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide significant benefit; however, setting it to a few megabytes can improve write performance on a busy server where many clients commit at once. The autotuned default of -1 gives reasonable results in most cases.

wal_writer_delay: how often the WAL writer flushes WAL. After flushing it sleeps for wal_writer_delay milliseconds, unless woken earlier by an asynchronously committing transaction. If the last flush happened less than wal_writer_delay ms ago and less than wal_writer_flush_after bytes of WAL have been produced since, WAL is only written to the operating system, not flushed to disk. The default is 200 milliseconds (200ms). Informally, this is the WAL writer process's sleep interval, similar to Oracle flushing the redo buffer every 3 seconds: set it too long and the WAL buffers may run short of space; too short and the constant WAL writes put heavy pressure on disk I/O.

commit_delay: adds a time delay, in microseconds, before a WAL flush is initiated. If system load is high enough that additional transactions become ready to commit within that interval, group-commit throughput improves because more transactions commit via a single WAL flush; the trade-off is that WAL flush latency increases by up to commit_delay microseconds. Because the delay is wasted if no other transaction is ready to commit, it is applied only when at least commit_siblings other transactions are active at the moment a flush is about to be initiated, and not at all when fsync is disabled. The default commit_delay is zero (no delay). Informally, it is how long committed records may linger in the WAL buffers: a nonzero value means a transaction's commit is not written to WAL immediately, which during commit-heavy peaks can exhaust WAL buffer space and, on a crash, risks losing those commits; the benefit is batched writes that reduce system I/O and improve performance. Similar to MySQL's binlog_group_commit_sync_delay parameter.
    
commit_siblings: the minimum number of concurrent active transactions required for the commit_delay delay to apply; the default is 5. When a transaction requests a commit and more than commit_siblings transactions are currently executing, it waits for commit_delay before flushing; otherwise it writes to WAL immediately. Similar to MySQL's binlog_group_commit_sync_no_delay_count parameter.

archive_mode
When archive_mode is enabled, completed WAL segments are sent to archive storage via the archive_command setting. Besides off (disabled), there are two modes: on and always. They behave identically during normal operation, but with always the WAL archiver is also active during archive recovery and in standby mode, so every file restored from the archive or streamed in via replication is archived (again). The default is off.

archive_timeout
you can set archive_timeout to force the server to switch to a new WAL segment file periodically. When this parameter is greater than zero, the server will switch to a new segment file whenever this many seconds have elapsed since the last segment file switch, and there has been any database activity, including a single checkpoint (checkpoints are skipped if there is no database activity).
Any write activity on the instance, however small, forces a WAL file switch once archive_timeout is reached; too short an archive_timeout therefore produces many new WAL files and hence a large volume of archives. The default is 0.
archive_timeout is simply the interval for time-based archiving.
A value of 0 disables time-based archiving; instead, when a WAL file fills up and the server starts writing the next one, the finished file is archived via archive_command (with archive_mode=on).
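A minimal archiving setup might look like this in postgresql.conf (the destination path /archive/pg_wal is hypothetical; the test-and-copy pattern refuses to overwrite an existing archive file):

```ini
# Illustrative archiving setup; /archive/pg_wal is a hypothetical path.
archive_mode = on
# %p = path of the WAL file to archive, %f = its file name
archive_command = 'test ! -f /archive/pg_wal/%f && cp %p /archive/pg_wal/%f'
archive_timeout = 300   # force a segment switch at most every 5 minutes
```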


https://www.postgresql.org/docs/11/wal-configuration.html
Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. (The change records were previously flushed to the WAL files.) In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the log (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk. Hence, after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed. (When WAL archiving is being done, the log segments must be archived before being recycled or removed.)

The checkpoint requirement of flushing all dirty data pages to disk can cause a significant I/O load. For this reason, checkpoint activity is throttled so that I/O begins at checkpoint start and completes before the next checkpoint is due to start; this minimizes performance degradation during checkpoints.

The server's checkpointer process automatically performs a checkpoint every so often. A checkpoint is begun every checkpoint_timeout seconds, or if max_wal_size is about to be exceeded, whichever comes first. The default settings are 5 minutes and 1 GB, respectively. If no WAL has been written since the previous checkpoint, new checkpoints will be skipped even if checkpoint_timeout has passed. (If WAL archiving is being used and you want to put a lower limit on how often files are archived in order to bound potential data loss, you should adjust the archive_timeout parameter rather than the checkpoint parameters.) It is also possible to force a checkpoint by using the SQL command CHECKPOINT.

Reducing checkpoint_timeout and/or max_wal_size causes checkpoints to occur more often. This allows faster after-crash recovery, since less work will need to be redone. However, one must balance this against the increased cost of flushing dirty data pages more often. If full_page_writes is set (as is the default), there is another factor to consider. To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content. In that case, a smaller checkpoint interval increases the volume of output to the WAL log, partially negating the goal of using a smaller interval, and in any case causing more disk I/O.

Checkpoints are fairly expensive, first because they require writing out all currently dirty buffers, and second because they result in extra subsequent WAL traffic as discussed above. It is therefore wise to set the checkpointing parameters high enough so that checkpoints don't happen too often. As a simple sanity check on your checkpointing parameters, you can set the checkpoint_warning parameter. If checkpoints happen closer together than checkpoint_warning seconds, a message will be output to the server log recommending increasing max_wal_size. Occasional appearance of such a message is not cause for alarm, but if it appears often then the checkpoint control parameters should be increased. Bulk operations such as large COPY transfers might cause a number of such warnings to appear if you have not set max_wal_size high enough.

To avoid flooding the I/O system with a burst of page writes, writing dirty buffers during a checkpoint is spread over a period of time. That period is controlled by checkpoint_completion_target, which is given as a fraction of the checkpoint interval. The I/O rate is adjusted so that the checkpoint finishes when the given fraction of checkpoint_timeout seconds have elapsed, or before max_wal_size is exceeded, whichever is sooner. With the default value of 0.5, PostgreSQL can be expected to complete each checkpoint in about half the time before the next checkpoint starts. On a system that's very close to maximum I/O throughput during normal operation, you might want to increase checkpoint_completion_target to reduce the I/O load from checkpoints. The disadvantage of this is that prolonging checkpoints affects recovery time, because more WAL segments will need to be kept around for possible use in recovery. Although checkpoint_completion_target can be set as high as 1.0, it is best to keep it less than that (perhaps 0.9 at most) since checkpoints include some other activities besides writing dirty buffers. A setting of 1.0 is quite likely to result in checkpoints not being completed on time, which would result in performance loss due to unexpected variation in the number of WAL segments needed.

On Linux and POSIX platforms checkpoint_flush_after allows to force the OS that pages written by the checkpoint should be flushed to disk after a configurable number of bytes. Otherwise, these pages may be kept in the OS's page cache, inducing a stall when fsync is issued at the end of a checkpoint. This setting will often help to reduce transaction latency, but it also can have an adverse effect on performance; particularly for workloads that are bigger than shared_buffers, but smaller than the OS's page cache.

The number of WAL segment files in pg_wal directory depends on min_wal_size, max_wal_size and the amount of WAL generated in previous checkpoint cycles. When old log segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, max_wal_size is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest. The estimate is based on a moving average of the number of WAL files used in previous checkpoint cycles. The moving average is increased immediately if the actual usage exceeds the estimate, so it accommodates peak usage rather than average usage to some extent. min_wal_size puts a minimum on the amount of WAL files recycled for future usage; that much WAL is always recycled for future use, even if the system is idle and the WAL usage estimate suggests that little WAL is needed.

Independently of max_wal_size, wal_keep_segments + 1 most recent WAL files are kept at all times. Also, if WAL archiving is used, old segments can not be removed or recycled until they are archived. If WAL archiving cannot keep up with the pace that WAL is generated, or if archive_command fails repeatedly, old WAL files will accumulate in pg_wal until the situation is resolved. A slow or failed standby server that uses a replication slot will have the same effect (see Section 26.2.6).

In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old log segment files in the pg_wal directory. Restartpoints can't be performed more frequently than checkpoints in the master because restartpoints can only be performed at checkpoint records. A restartpoint is triggered when a checkpoint record is reached if at least checkpoint_timeout seconds have passed since the last restartpoint, or if WAL size is about to exceed max_wal_size. However, because of limitations on when a restartpoint can be performed, max_wal_size is often exceeded during recovery, by up to one checkpoint cycle's worth of WAL. (max_wal_size is never a hard limit anyway, so you should always leave plenty of headroom to avoid running out of disk space.)

There are two commonly used internal WAL functions: XLogInsertRecord and XLogFlush. XLogInsertRecord is used to place a new record into the WAL buffers in shared memory. If there is no space for the new record, XLogInsertRecord will have to write (move to kernel cache) a few filled WAL buffers. This is undesirable because XLogInsertRecord is used on every database low level modification (for example, row insertion) at a time when an exclusive lock is held on affected data pages, so the operation needs to be as fast as possible. What is worse, writing WAL buffers might also force the creation of a new log segment, which takes even more time. Normally, WAL buffers should be written and flushed by an XLogFlush request, which is made, for the most part, at transaction commit time to ensure that transaction records are flushed to permanent storage. On systems with high log output, XLogFlush requests might not occur often enough to prevent XLogInsertRecord from having to do writes. On such systems one should increase the number of WAL buffers by modifying the wal_buffers parameter. When full_page_writes is set and the system is very busy, setting wal_buffers higher will help smooth response times during the period immediately following each checkpoint.

The commit_delay parameter defines for how many microseconds a group commit leader process will sleep after acquiring a lock within XLogFlush, while group commit followers queue up behind the leader. This delay allows other server processes to add their commit records to the WAL buffers so that all of them will be flushed by the leader's eventual sync operation. No sleep will occur if fsync is not enabled, or if fewer than commit_siblings other sessions are currently in active transactions; this avoids sleeping when it's unlikely that any other session will commit soon. Note that on some platforms, the resolution of a sleep request is ten milliseconds, so that any nonzero commit_delay setting between 1 and 10000 microseconds would have the same effect. Note also that on some platforms, sleep operations may take slightly longer than requested by the parameter.

Since the purpose of commit_delay is to allow the cost of each flush operation to be amortized across concurrently committing transactions (potentially at the expense of transaction latency), it is necessary to quantify that cost before the setting can be chosen intelligently. The higher that cost is, the more effective commit_delay is expected to be in increasing transaction throughput, up to a point. The pg_test_fsync program can be used to measure the average time in microseconds that a single WAL flush operation takes. A value of half of the average time the program reports it takes to flush after a single 8kB write operation is often the most effective setting for commit_delay, so this value is recommended as the starting point to use when optimizing for a particular workload. While tuning commit_delay is particularly useful when the WAL log is stored on high-latency rotating disks, benefits can be significant even on storage media with very fast sync times, such as solid-state drives or RAID arrays with a battery-backed write cache; but this should definitely be tested against a representative workload. Higher values of commit_siblings should be used in such cases, whereas smaller commit_siblings values are often helpful on higher latency media. Note that it is quite possible that a setting of commit_delay that is too high can increase transaction latency by so much that total transaction throughput suffers.

When commit_delay is set to zero (the default), it is still possible for a form of group commit to occur, but each group will consist only of sessions that reach the point where they need to flush their commit records during the window in which the previous flush operation (if any) is occurring. At higher client counts a “gangway effect” tends to occur, so that the effects of group commit become significant even when commit_delay is zero, and thus explicitly setting commit_delay tends to help less. Setting commit_delay can only help when (1) there are some concurrently committing transactions, and (2) throughput is limited to some degree by commit rate; but with high rotational latency this setting can be effective in increasing transaction throughput with as few as two clients (that is, a single committing client with one sibling transaction).

The wal_sync_method parameter determines how PostgreSQL will ask the kernel to force WAL updates out to disk. All the options should be the same in terms of reliability, with the exception of fsync_writethrough, which can sometimes force a flush of the disk cache even when other options do not do so. However, it's quite platform-specific which one will be the fastest. You can test the speeds of different options using the pg_test_fsync program. Note that this parameter is irrelevant if fsync has been turned off.

Enabling the wal_debug configuration parameter (provided that PostgreSQL has been compiled with support for it) will result in each XLogInsertRecord and XLogFlush WAL call being logged to the server log. This option might be replaced by a more general mechanism in the future.
