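Fixing PG14 archive failures: archiver failed on <WAL file>

Case 1: the WAL segment file exists in pg_wal

Case 1 applies to the following scenarios:

  • The WAL segment file is present in pg_wal but missing from the archive directory
  • The WAL segment file is present in both pg_wal and the archive directory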

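Problem description

Last night the primary of a Repmgr + PG14 primary/standby pair ran out of disk space because WAL files had filled the disk. After deleting expired WAL files on the primary and rebuilding the standby, I checked both nodes this morning. The primary was streaming WAL to the standby normally, but its status showed one failed archive record: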
postgres: archiver failed on 000000010000006F00000086

  • Primary:

walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10    (walsender OK)
archiver failed on 000000010000006F00000086    (archiving failed)

  • Standby:

walreceiver streaming 77/9EB6A198    (walreceiver OK)

-- Check the database status on the primary
[root@pgmaster ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 53 (limit: 201967)
Memory: 19.0G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 postgres: logger
├─ 3710992 postgres: checkpointer
├─ 3710993 postgres: background writer
├─ 3710994 postgres: walwriter
├─ 3710995 postgres: archiver failed on 000000010000006F00000086
├─ 3710996 postgres: logical replication launcher
├─ 3711001 postgres: top_portal top_portal 172.28.32.18(41438) idle
├─ 3711003 postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle
├─ 3711009 postgres: repmgr repmgr 172.28.32.22(64096) idle
├─ 3711468 postgres: top_portal top_portal 172.28.32.18(41720) idle
├─ 3713807 postgres: top_portal top_portal 172.28.32.20(44492) idle
├─ 3723017 postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10    # WAL is being sent normally

-- Check the database status on the standby
[root@pgslave ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-10-13 00:12:19 CST; 12h ago
Process: 1931221 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 1931223 (postgres)
Tasks: 7 (limit: 201967)
Memory: 23.2G
CGroup: /system.slice/postgres.service
├─ 1931223 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 1931224 postgres: logger
├─ 1931225 postgres: startup recovering 00000001000000770000009E
├─ 1931226 postgres: checkpointer
├─ 1931227 postgres: background writer
├─ 1931230 postgres: walreceiver streaming 77/9EB6A198    # WAL is being received
└─ 1931430 postgres: repmgr repmgr 172.28.32.23(22956) idle

Oct 13 00:12:17 pgslave systemd[1]: Starting PostgreSQL database server...
Oct 13 00:12:17 pgslave pg_ctl[1931221]: waiting for server to start....
Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] LOG:  redirecting log output to logging collector process
Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] HINT:  Future log output will appear in directory "log".
Oct 13 00:12:19 pgslave pg_ctl[1931221]: . done
Oct 13 00:12:19 pgslave pg_ctl[1931221]: server started
Oct 13 00:12:19 pgslave systemd[1]: Started PostgreSQL database server.
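
The failing segment does not have to be read off the process list by eye. A minimal sketch, assuming the archiver title has one of the forms seen in this post (`archiver failed on <segment>` on PG14, `archiver process   failed on <segment>` on PG10); `failed_wal_from_titles` is a made-up helper name, not a PostgreSQL tool.

```shell
# Hypothetical helper: read process titles on stdin and print the WAL
# segment name that the archiver reports as failed, if there is one.
failed_wal_from_titles() {
    # A WAL segment name is 24 hex digits (timeline, log, segment).
    sed -nE 's/.*archiver( process)? +failed on ([0-9A-F]{24}).*/\2/p' | sort -u
}

# Example usage on a live system:
#   ps -eo args | failed_wal_from_titles
```

The same information is also available from SQL through the `pg_stat_archiver` view (`failed_count`, `last_failed_wal`).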

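Problem analysis

1. Check the database log

(Screenshot of the database log; the image is not available.)

2. Check the archive configuration parameters

The parameters are configured correctly, and the archive directory permissions are correct as well.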

postgres=# show archive_command;
                      archive_command                      
-----------------------------------------------------------
 /usr/bin/lz4 -q -z %p /server/data/pgdb/pg_archive/%f.lz4
(1 row)

postgres=# show archive_mode;
 archive_mode 
--------------
 on
(1 row)

-- Check the permissions of the archive directory
[postgres@pgmaster ~]$ ls -ld /server/data/pgdb/pg_archive
drwxr-x--- 2 postgres postgres 4214784 Oct 13 13:14 /server/data/pgdb/pg_archive
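
Beyond checking the setting and the directory, it helps to see the exact command line the server would run for the failing segment, so it can be tried by hand as the postgres user. A minimal sketch, assuming only the %p and %f placeholders matter; `expand_archive_command` is a made-up name.

```shell
# Hypothetical helper: expand the %p and %f placeholders of an
# archive_command template so the resulting command can be inspected.
# Assumptions: %% escapes are not handled, and the path must not contain
# '|' or '&' (both are special to the sed expressions used here).
expand_archive_command() {
    local template="$1" wal_path="$2"
    local wal_file="${wal_path##*/}"   # %f is the file name without its directory
    printf '%s\n' "$template" | sed -e "s|%p|${wal_path}|g" -e "s|%f|${wal_file}|g"
}

expand_archive_command \
    '/usr/bin/lz4 -q -z %p /server/data/pgdb/pg_archive/%f.lz4' \
    'pg_wal/000000010000006F00000086'
```

Running the printed command from the data directory, as the postgres user, shows the real exit code and error message of the archive step.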

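3. Force a WAL switch

The manual switch succeeded but did not fix the problem: the status still shows the archiver stuck on the same failed WAL segment.

-- Force a WAL switch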
top_portal=# select pg_switch_wal();
 pg_switch_wal 
---------------
 72/51C4CFD8
(1 row)

-- Check the database status on the primary
[root@pgmaster ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 53 (limit: 201967)
Memory: 19.0G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 postgres: logger
├─ 3710992 postgres: checkpointer
├─ 3710993 postgres: background writer
├─ 3710994 postgres: walwriter
├─ 3710995 postgres: archiver failed on 000000010000006F00000086
├─ 3710996 postgres: logical replication launcher
├─ 3711001 postgres: top_portal top_portal 172.28.32.18(41438) idle
├─ 3711003 postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle
├─ 3711009 postgres: repmgr repmgr 172.28.32.22(64096) idle
├─ 3711468 postgres: top_portal top_portal 172.28.32.18(41720) idle
├─ 3713807 postgres: top_portal top_portal 172.28.32.20(44492) idle
├─ 3723017 postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10    # WAL is being sent normally


-- Check the current WAL LSN
top_portal=# select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 72/52638F10
(1 row)

-- Find the WAL file that holds the current LSN
top_portal=# select pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name      
--------------------------
 000000010000007200000052
(1 row)
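
The mapping from LSN to file name is plain arithmetic, which is handy when only an LSN from a log or process title is at hand. A minimal sketch, assuming the default 16 MB segment size and a timeline ID given in decimal; `wal_file_name` is a made-up shell helper, not the server function.

```shell
# Hypothetical helper: derive the WAL segment file name from a timeline ID
# and an LSN written as HI/LO (both parts hexadecimal).
# Assumption: wal_segment_size is the default 16 MB.
wal_file_name() {
    local timeline="$1" lsn="$2"
    local hi="${lsn%%/*}" lo="${lsn##*/}"
    local seg_size=$((16 * 1024 * 1024))
    local seg=$((0x$lo / seg_size))
    # File name = timeline, high 32 bits of the LSN, segment number within them.
    printf '%08X%08X%08X\n' "$timeline" "0x$hi" "$seg"
}

wal_file_name 1 72/52638F10    # prints 000000010000007200000052
```

One difference to keep in mind: for an LSN that falls exactly on a segment boundary, `pg_walfile_name` reports the preceding segment, while this sketch simply rounds down.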

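-- Check the latest checkpoint. WAL files older than the latest checkpoint's REDO WAL file are no longer needed for crash recovery and can be removed, provided no standby, replication slot, or pending archive still needs them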
[postgres@pgmaster ~]$ pg_controldata $PGDATA
pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7268852449124462799
Database cluster state:               in production
pg_control last modified:             Fri 13 Oct 2023 10:07:35 AM CST  
Latest checkpoint location:           71/CDD2FF28
Latest checkpoint's REDO location:    71/CDD28F18
Latest checkpoint's REDO WAL file:    0000000100000071000000CD

-- Look for the WAL file named in the error (the pattern is quoted so the shell does not expand it against the file in pg_wal)
[postgres@pgmaster pg_wal]$ ls -l 000000010000006F00000086
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
[postgres@pgmaster pg_wal]$ find /server/data/pgdb/pg_archive -name '000000010000006F00000086*'
ls: cannot access '000000010000006F00000086': No such file or directory
[postgres@pgmaster pg_wal]$ find /server -name '000000010000006F00000086*'
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
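
Whether the failing segment is older than the latest checkpoint's REDO WAL file can be checked by comparing the two names. A minimal sketch, assuming both segments are on the same timeline so that string order equals WAL order; `wal_is_older` is a made-up name.

```shell
# Hypothetical helper: succeed if WAL segment $1 sorts strictly before $2.
# Assumption: both names are 24 hex digits on the same timeline.
wal_is_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)" = "$1" ]
}

if wal_is_older 000000010000006F00000086 0000000100000071000000CD; then
    echo "failing segment predates the latest checkpoint's REDO segment"
fi
```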

4. Check the files under $PGDATA/pg_wal/archive_status/

[postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
[postgres@pgmaster archive_status]$ ls -l *.ready
ls: cannot access '*.ready': No such file or directory

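So no file is waiting to be archived.

In this directory, a .ready file marks a segment that still needs to be archived, and a .done file marks a segment that has already been archived.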
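
Counting the status files with find avoids the "No such file or directory" noise from an unmatched glob and still works when there are very many files. A minimal sketch; `count_status_files` is a made-up name.

```shell
# Hypothetical helper: count the status files with the given suffix
# (ready or done) directly inside an archive_status directory.
count_status_files() {
    # $1 = archive_status directory, $2 = suffix without the dot
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Example usage (path taken from the session above):
#   count_status_files /server/data/pgdb/data/pg_wal/archive_status ready
```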

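Solution

1. Move the WAL file that failed to archive to /home/postgres as a backup (in production, if disk space allows, never rm it; mv it to a backup location instead)
2. Force a WAL switch with select pg_switch_wal();
3. Check the primary and standby status again

-- 1. Move the WAL file that failed to archive to /home/postgres as a backup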
[postgres@pgmaster pg_wal]$ mv 000000010000006F00000086 /home/postgres/000000010000006F00000086
[postgres@pgmaster pg_wal]$ ls -l /home/postgres/000000010000006F00000086
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 /home/postgres/000000010000006F00000086
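
Step 1 can be wrapped so that it refuses to run when the segment is missing or when a backup copy already exists. A minimal sketch of that guard; `set_aside_wal_segment` is a made-up name, and it does nothing beyond the mv shown above.

```shell
# Hypothetical helper: move one WAL segment out of pg_wal into a backup
# directory, refusing to overwrite an existing backup copy.
set_aside_wal_segment() {
    # $1 = pg_wal directory, $2 = segment name, $3 = backup directory
    if [ ! -f "$1/$2" ]; then
        echo "segment not found: $1/$2" >&2
        return 1
    fi
    if [ -e "$3/$2" ]; then
        echo "backup already exists: $3/$2" >&2
        return 1
    fi
    mv -- "$1/$2" "$3/$2"
}

# Example usage (paths from the session above):
#   set_aside_wal_segment /server/data/pgdb/data/pg_wal 000000010000006F00000086 /home/postgres
```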

-- 2. Force a WAL switch
postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 73/7EF502E0
(1 row)

-- 3. Check the primary again: the status is now normal
[root@pgmaster data]# systemctl status postgres
● postgres.service - PostgreSQL database server
     Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
    Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
   Main PID: 3710970 (postgres)
      Tasks: 50 (limit: 201967)
     Memory: 26.6G
     CGroup: /system.slice/postgres.service
             ├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
             ├─ 3710971 postgres: logger
             ├─ 3710992 postgres: checkpointer
             ├─ 3710993 postgres: background writer
             ├─ 3710994 postgres: walwriter
             ├─ 3710995 postgres: archiver archiving 000000010000007100000035
             ├─ 3710996 postgres: logical replication launcher
             ├─ 3711001 postgres: top_portal top_portal 172.28.32.18(41438) idle
             ├─ 3711003 postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle
             ├─ 3711009 postgres: repmgr repmgr 172.28.32.22(64096) idle
             ├─ 3711468 postgres: top_portal top_portal 172.28.32.18(41720) idle
             ├─ 3713807 postgres: top_portal top_portal 172.28.32.20(44492) idle
             ├─ 3723017 postgres: walsender repmgr 172.28.32.23(36122) streaming 73/7F000BD0

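Additional notes

What if $PGDATA/pg_wal/archive_status/ contains a large number of *.ready files?
A possible cause: if the server lost power suddenly, the archive command may not have run to completion, leaving an incomplete file in the archive directory. After the database restarts, archiving of that segment then fails, typically because the archive command refuses to overwrite the file that is already there. In that case, delete the incomplete file from the archive directory and the archiver will archive the segment again.
Also keep an eye on whether the command set in archive_command actually succeeds. If it fails, the archiver retries periodically. In the meantime existing WAL files are not recycled, and new WAL keeps consuming space under pg_wal until the disk that holds pg_wal is full and the database shuts down. Because changing wal_level or archive_mode requires a restart, you can enable both before the very first start after installation and set archive_command to something that always succeeds, for example /bin/true. When archiving is needed later, change archive_command and reload; no restart is required.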
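
The half-written file left behind by a power loss can be avoided by copying under a temporary name and renaming only after the copy has finished. A minimal sketch of such an archive step, written as a shell function for illustration; `archive_wal` and the script path in the comment are made up, and the sketch does not fsync the archived file.

```shell
# Hypothetical archive step: copy under a temporary name, then rename.
# An interrupted run never leaves a partial file under the final name,
# and an already archived segment is never overwritten.
archive_wal() {
    # $1 = path of the WAL segment (%p), $2 = archive directory
    if [ -e "$2/${1##*/}" ]; then
        return 1                      # target exists: report failure
    fi
    if ! cp -- "$1" "$2/${1##*/}.tmp"; then
        rm -f -- "$2/${1##*/}.tmp"    # clean up the partial copy
        return 1
    fi
    mv -- "$2/${1##*/}.tmp" "$2/${1##*/}"
}

# Possible wiring, assuming the function is saved in a script at the
# made-up path /usr/local/bin/archive_wal.sh that calls it with "$1" "$2":
#   archive_command = '/usr/local/bin/archive_wal.sh %p /server/data/pgdb/pg_archive'
```

The non-zero return on an existing target matters: the server treats only exit status 0 as a successful archive, so a refusal keeps the segment queued instead of silently replacing an archived file.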

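Case 2: the WAL segment file is in neither pg_wal nor the archive directory

Case 2 applies to the following scenario:

  • The WAL segment file is missing from both pg_wal and the archive directory

Problem description

The developers asked me to free up archive space for the PG10 database in the test environment. Checking the database state before the cleanup, I found that archiving was failing with archiver process failed on 000000010000000000000001. Further analysis showed that the file existed in neither pg_wal nor the archive directory, and going through several logs showed that archiving had been failing since 2022-12-31. It turned out that nobody had been maintaining this database.

-- The archive failure shows up in the process list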
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres
postgres  8780  8779  0 15:01 pts/0    00:00:00 -bash
root     10057  8888  0 15:17 pts/1    00:00:00 grep --color=auto postgres
postgres 16957  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58552) idle
postgres 16958  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58555) idle
postgres 16959  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58556) idle
postgres 16960  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58558) idle

Problem analysis

-- Check the archive parameters
-bash-4.2$ /usr/pgsql-10/bin/psql -p 54310
psql (10.22)
Type "help" for help.

postgres=# show archive_mode;
 archive_mode 
--------------
 on
(1 row)

postgres=# show archive_command;
               archive_command                
----------------------------------------------
 cp %p /dsg3/postgres/pg10_data/pg_archive/%f
(1 row)

postgres=# \q

-- Check the permissions of the archive directory
-bash-4.2$ ls -ld /dsg3/postgres/pg10_data/pg_archive
drwxr-xr-x 2 postgres postgres 4096 12月  1 13:59 /dsg3/postgres/pg10_data/pg_archive

-- Going through several logs shows that archiving had been failing since 2022-12-31
-bash-4.2$ tail -200f postgresql-2022-12-31_000000.log
cp: cannot stat 'pg_wal/000000010000000000000001': No such file or directory
2022-12-31 23:58:29.002 CST [1706] LOG:  archive command failed with exit code 1
2022-12-31 23:58:29.002 CST [1706] DETAIL:  The failed archive command was: cp pg_wal/000000010000000000000001 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000001
cp: cannot stat 'pg_wal/000000010000000000000001': No such file or directory
2022-12-31 23:58:30.012 CST [1706] LOG:  archive command failed with exit code 1
2022-12-31 23:58:30.012 CST [1706] DETAIL:  The failed archive command was: cp pg_wal/000000010000000000000001 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000001
2022-12-31 23:58:30.012 CST [1706] WARNING:  archiving write-ahead log file "000000010000000000000001" failed too many times, will try again later
2022-12-31 23:59:00.016 CST [23391] ERROR:  column "sysdate" does not exist at character 147
2022-12-31 23:59:00.016 CST [23391] STATEMENT:  select code,sum(1) as sum,sum(investorcount) as invsum from LOG_SYNCNAMEINFO where logtype = '成功' and date_trunc('day',logTime)= date_trunc('day',sysdate - interval '1 day') group by code order by code
2022-12-31 23:59:00.027 CST [23391] ERROR:  column "sysdate" does not exist at character 123
2022-12-31 23:59:00.027 CST [23391] STATEMENT:  select * from(select * from log_entopenplatformpush t where t.logtype in('失败','异常') and (t.nexttime is null or t.nexttime
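
Scanning many daily logs by hand is tedious; the archiver's "failed too many times" warning can be filtered out instead. A minimal sketch, assuming that warning appears in English as in the excerpt above; `failed_segments_in_log` is a made-up name.

```shell
# Hypothetical helper: read a server log on stdin and print every WAL
# segment named in a "failed too many times" warning, followed by how
# many such warnings mention it.
failed_segments_in_log() {
    grep -oE 'archiving write-ahead log file "[0-9A-F]{24}" failed' \
        | grep -oE '[0-9A-F]{24}' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage:
#   failed_segments_in_log < postgresql-2022-12-31_000000.log
```

Running it over each log file in date order shows on which day the failures started.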

Attempted fixes

1. Disable and re-enable archiving (did not help)

Disable archiving -> restart -> enable archiving -> restart. The same error was still reported:

-- Disable archiving: edit postgresql.conf and comment out the following parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
#archive_mode = on
#archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- Enable archiving: edit postgresql.conf and uncomment the following parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- The process list still shows the archive failure
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres

2. Clean up expired WAL files with pg_archivecleanup (did not help)

-- List the files under pg_wal
-bash-4.2$ ls -l /dsg3/postgres/pg10_data/pg_wal/
总用量 394736
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BC
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BD
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BE
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BF
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C0
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C1
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C2
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C3
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C4
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C5
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C6
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C7
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C8
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C9
-rw------- 1 postgres postgres 16777216 12月  1 14:00 000000010000005C000000CA
-rw------- 1 postgres postgres 16777216 12月  1 14:04 000000010000005C000000CB
-rw------- 1 postgres postgres 16777216 12月  1 16:17 000000010000005C000000CC
-rw------- 1 postgres postgres 16777216 12月  1 16:26 000000010000005C000000CD
-rw------- 1 postgres postgres 16777216 12月  1 16:39 000000010000005C000000CE

-- Check the current WAL LSN
postgres=# select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 5C/CC000098
(1 row)

-- Find the WAL file that holds the current LSN
postgres=# select pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name      
--------------------------
 000000010000005C000000CC
(1 row)

-- Remove WAL files older than the checkpoint
# WAL files in pg_wal before 000000010000005C000000CC can be deleted (before PG10 the directory was called pg_xlog)
[postgres@Server ~]$ pg_archivecleanup -d $PGDATA/pg_wal 000000010000005C000000C2
pg_archivecleanup: keep WAL file "/server/data/pgdb/data/pg_wal/000000010000005C000000C2" and later  
pg_archivecleanup: removing file "/server/data/pgdb/data/pg_wal/000000010000005C000000C1" 

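Even though this is a test environment, I kept some WAL files: instead of removing everything before the current WAL file 000000010000005C000000CC, I removed only the files before 000000010000005C000000C2.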
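
Before deleting anything it is worth listing what a given cutoff would remove. A minimal sketch that only prints names, assuming a single timeline and plain 24-hex-digit segment names; `list_segments_before` is a made-up name.

```shell
# Hypothetical helper: list the WAL segments in a directory that sort
# before the segment to keep. Nothing is deleted.
list_segments_before() {
    # $1 = directory to inspect, $2 = oldest segment to keep
    find "$1" -maxdepth 1 -type f \
        | sed 's|.*/||' \
        | grep -E '^[0-9A-F]{24}$' \
        | LC_ALL=C sort \
        | LC_ALL=C awk -v keep="$2" '($0 "") < (keep "")'
}

# Example usage:
#   list_segments_before /dsg3/postgres/pg10_data/pg_wal 000000010000005C000000C2
```

pg_archivecleanup itself has a dry-run switch, `-n`, which prints the files it would remove without removing them.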

-- Force a WAL switch
-bash-4.2$ /usr/pgsql-10/bin/psql -p 54310
psql (10.22)
Type "help" for help.

postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 5C/D10000E8
(1 row)

-- The process list still shows the archive failure
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres

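3. Create an empty file under $PGDATA/pg_wal (did not help)

-- Stop the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/

-- Create a file with the same name as the WAL segment in the error
cd /dsg3/postgres/pg10_data/pg_wal
touch 000000010000000000000001

-- Start the database
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- The process list still shows the archive failure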
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres

Final solution

-- Stop the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/

-- Back up the data directory (if disk space allows, always take a backup, just in case)
cd /dsg3/postgres/
cp -r pg10_data pg10_data_bak_20231201

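-- Change the following archive parameters in postgresql.conf.
-- archive_mode has to stay on, because the archiver process only runs when it is on
-- (the process list below shows it running). archive_command is swapped for a command
-- that always succeeds, so every pending .ready segment is marked .done without being copied.
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'ls -l /dsg3/postgres/pg10_data/pg_archive/'

-- Restart the database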
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- Check the database status
-bash-4.2$ ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:06 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
root     12967 12922  0 15:56 pts/0    00:00:00 su - postgres
postgres 12968 12967  0 15:56 pts/0    00:00:00 -bash
root     13392 13350  0 16:00 pts/1    00:00:00 su - postgres
postgres 13393 13392  0 16:00 pts/1    00:00:00 -bash
root     15935 15815  0 16:34 pts/2    00:00:00 su - postgres
postgres 15936 15935  0 16:34 pts/2    00:00:00 -bash
postgres 17190     1  3 16:49 pts/0    00:00:00 /usr/pgsql-10/bin/postgres -D /dsg3/postgres/pg10_data
postgres 17191 17190  0 16:49 ?        00:00:00 postgres: logger process   
postgres 17193 17190  0 16:49 ?        00:00:00 postgres: checkpointer process   
postgres 17194 17190  0 16:49 ?        00:00:00 postgres: writer process   
postgres 17195 17190  0 16:49 ?        00:00:00 postgres: wal writer process   
postgres 17196 17190  0 16:49 ?        00:00:00 postgres: autovacuum launcher process   
postgres 17197 17190 71 16:49 ?        00:00:04 postgres: archiver process   last was 000000010000000100000074
postgres 17198 17190  0 16:49 ?        00:00:00 postgres: stats collector process   
postgres 17199 17190  0 16:49 ?        00:00:00 postgres: bgworker: logical replication launcher   
postgres 17584 12968  0 16:49 pts/0    00:00:00 ps -ef
postgres 17585 12968  0 16:49 pts/0    00:00:00 grep --color=auto postgres

If you run ps -ef | grep postgres several times, the segment shown in
archiver process   last was 000000010000000100000074 keeps changing. This is normal, so don't panic.
Wait until it stops changing.

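-- Check the files under $PGDATA/pg_wal/archive_status/
[postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
[postgres@pgmaster archive_status]$ ls -l *.ready
[postgres@pgmaster archive_status]$ ls -l *.done
All the files that used to end in .ready now end in .done.

Note: a .ready file marks a segment that still needs to be archived, and a .done file marks a segment that has already been archived.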
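
Rather than rerunning ps until the archiver title stops changing, the directory can be polled until no .ready file is left. A minimal sketch; `wait_for_archive_backlog` and its default of 60 attempts at 5-second intervals are made up for illustration.

```shell
# Hypothetical helper: poll an archive_status directory until no .ready
# files remain, or give up after a number of attempts.
wait_for_archive_backlog() {
    local dir="$1" attempts="${2:-60}" delay="${3:-5}"
    local i=0 remaining
    while [ "$i" -lt "$attempts" ]; do
        remaining=$(find "$dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' ')
        if [ "$remaining" -eq 0 ]; then
            echo "archive backlog cleared"
            return 0
        fi
        echo "$remaining segment(s) still waiting"
        sleep "$delay"
        i=$((i + 1))
    done
    return 1    # backlog not cleared within the allowed attempts
}

# Example usage:
#   wait_for_archive_backlog /dsg3/postgres/pg10_data/pg_wal/archive_status
```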

-- Turn real archiving back on: edit postgresql.conf and restore the following archive parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- Check the database status
-bash-4.2$ ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:06 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:14 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:15 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
root      9783 16354  0 17:00 pts/3    00:00:00 su - postgres
postgres  9784  9783  0 17:00 pts/3    00:00:00 -bash
root     10888 10844  0 17:14 pts/4    00:00:00 su - postgres
postgres 10889 10888  0 17:14 pts/4    00:00:00 -bash
root     12967 12922  0 15:56 pts/0    00:00:00 su - postgres
postgres 12968 12967  0 15:56 pts/0    00:00:00 -bash
root     13392 13350  0 16:00 pts/1    00:00:00 su - postgres
postgres 13393 13392  0 16:00 pts/1    00:00:00 -bash
postgres 15098     1  0 18:16 pts/4    00:00:00 /usr/pgsql-10/bin/postgres -D /dsg3/postgres/pg10_data
postgres 15099 15098  0 18:16 ?        00:00:00 postgres: logger process   
postgres 15101 15098  0 18:16 ?        00:00:00 postgres: checkpointer process   
postgres 15102 15098  0 18:16 ?        00:00:00 postgres: writer process   
postgres 15103 15098  0 18:16 ?        00:00:00 postgres: wal writer process   
postgres 15104 15098  0 18:16 ?        00:00:00 postgres: autovacuum launcher process   
postgres 15105 15098  0 18:16 ?        00:00:00 postgres: archiver process   last was 000000010000005C000000D1
postgres 15106 15098  0 18:16 ?        00:00:00 postgres: stats collector process   
postgres 15107 15098  0 18:16 ?        00:00:00 postgres: bgworker: logical replication launcher   
postgres 15182 10889  0 18:17 pts/4    00:00:00 ps -ef
postgres 15183 10889  0 18:17 pts/4    00:00:00 grep --color=auto postgres
root     15935 15815  0 16:34 pts/2    00:00:00 su - postgres
postgres 15936 15935  0 16:34 pts/2    00:00:00 -bash

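The problem was finally solved. It was only a test database, but with 157 GB of data it still gave me quite a scare. Test or production, caution is always warranted, because lost data cannot be recreated.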
