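When running MogDB/openGauss, heavy business load or bulk data imports can make the number of log files under pg_xlog grow continuously. If xlog is generated faster than it can be cleaned up automatically, the pg_xlog directory can easily fill up. An operator who is not familiar with how xlog works may then mistakenly delete xlog files that have not been archived yet or, worse, delete xlog whose changes have not yet been written to the data files.
This article explains the states that xlog files under pg_xlog are normally in. It then summarizes several ways to resolve the recurring missing-xlog errors and archive failures that appear in the log after an xlog file whose data is already on disk, but which has not yet been archived, is deleted by mistake.
In general we do not recommend deleting files under pg_xlog by hand: xlog in pg_xlog is cleaned up automatically, and the cleanup rate can be tuned through configuration parameters as needed.
With archiving enabled, pg_xlog normally contains xlog files in the following three states: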
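State 1: the corresponding data has been flushed to disk and the file has been archived. Its status in pg_xlog/archive_status is .done
Such a file can be deleted by hand without affecting the database. This is still not recommended, because the automatic cleanup mechanism removes it anyway and the cleanup rate can be tuned through configuration parameters.
State 2: the corresponding data has been flushed to disk, but the file has not been archived. Its status in pg_xlog/archive_status is .ready
Deleting such a file from pg_xlog does not affect the data currently in the database, but PITR based on a full backup plus a continuous sequence of archived logs will be missing this file. In addition, archiving fails because the deleted xlog cannot be found, all subsequent archiving fails as well, and the normal automatic cleanup of pg_xlog is blocked. The database logs errors such as: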
DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000071 /data/om3/data/archivedir/000000010000000200000071 "
cp: cannot stat 'pg_xlog/000000010000000200000071': No such file or directory
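State 3: the corresponding data has not been flushed to disk and the file has not been archived.
The xlog has just been written but the data is not yet on disk. Deleting the xlog at this point can lose data, and the database service may fail and be unable to start. You may then need the pg_resetxlog tool to clear the xlog and reset some of the control information in the pg_control file so that the database can start. pg_resetxlog should only be used as a last resort for repairing a database. Even after the database has been repaired and started, partially committed transactions may leave the data inconsistent with its earlier state.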
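To tell whether a State 2 file has been deleted, you can compare the .ready markers in archive_status with the segment files actually present in pg_xlog. The following is a minimal sketch rather than an official tool; `find_missing_ready` is a hypothetical helper name, and the only thing it assumes is the directory layout described above (pg_xlog with an archive_status subdirectory).

```shell
# Hypothetical helper: print every segment that has a .ready marker in
# archive_status but whose xlog file is no longer present in pg_xlog.
find_missing_ready() {
    xlog_dir="$1"
    for marker in "$xlog_dir"/archive_status/*.ready; do
        [ -e "$marker" ] || continue          # the glob matched nothing
        seg=$(basename "$marker" .ready)      # strip the .ready suffix
        [ -e "$xlog_dir/$seg" ] || echo "$seg"
    done
}
```

Called as `find_missing_ready <path to pg_xlog>`, it prints nothing when every .ready segment is still on disk.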
The tests in this article mainly rely on the archive parameters archive_mode and archive_command. The database version is MogDB-3.0.5.
MogDB=# show archive_mode ;
archive_mode
--------------
on
(1 row)
MogDB=# show archive_command ;
archive_command
-------------------------------------
cp %p /data/om3/data/archivedir/%f
(1 row)
MogDB=# show archive_dest ;
archive_dest
--------------
(1 row)
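When archiving is driven by the archive_command parameter, one option is to change the archive command itself so that the database is tricked into believing archiving succeeded.
The environment below already reproduces the accidental deletion of an unarchived xlog file:
[om3@lmt0003 archive_status]$ rm ../000000010000000200000074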
[om3@lmt0003 archive_status]$ cat /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000074|more
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000074': No such file or directory2023-11-01 14:55:21.521 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 14:55:21.521 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/1/000000010000000200000074 "
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000074': No such file or directory2023-11-01 14:55:22.527 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 14:55:22.527 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/1/000000010000000200000074 "
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000074': No such file or directory2023-11-01 14:55:23.532 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 14:55:23.532 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/1/000000010000000200000074 "
2023-11-01 14:55:23.532 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] WARNING: xlog file "000000010000000200000074" could not be archived: too many failures
[om3@lmt0003 pg_xlog]$ cd archive_status/
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.ready
[om3@lmt0003 archive_status]$ gsql -r
gsql ((MogDB 3.0.5 build 76182eb6) compiled at 2023-07-20 16:53:13 commit 0 last mr 1801 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.
MogDB=# select pg_switch_xlog();
pg_switch_xlog
----------------
2/750019D8
(1 row)
MogDB=# \q
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done 000000010000000200000075.ready
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.ready
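The link between the LSN returned by pg_switch_xlog() and the new .ready file name can be worked out by hand. The sketch below assumes the default 16 MB segment size and timeline 1; neither is stated in this article, and `lsn_to_segment` is a hypothetical helper name.

```shell
# Hypothetical helper: map an LSN such as 2/750019D8 to its xlog segment file name.
# Assumes 16 MB segments (2^24 bytes), so the segment number is the low word >> 24.
lsn_to_segment() {
    timeline="$1"
    lsn="$2"
    hi="${lsn%%/*}"    # hex digits before the slash
    lo="${lsn##*/}"    # hex digits after the slash
    printf '%08X%08X%08X\n' "$timeline" "$((0x$hi))" "$(( 0x$lo >> 24 ))"
}

lsn_to_segment 1 2/750019D8   # prints 000000010000000200000075
```

This matches the listing above: after the switch at 2/750019D8, the segment 000000010000000200000075 shows up with a .ready marker.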
1. Modify postgresql.conf
[om3@lmt0003 archive_status]$ vi ../../postgresql.conf
archive_mode = on
#archive_command = 'cp %p /data/om3/data/archivedir/%f '
archive_command = 'ls -l /data/om3/data/ ' # Any other command works too, as long as it runs without error; the only goal is to trick the database.
2. Reload the configuration
[om3@lmt0003 archive_status]$ gs_ctl reload
3. No more error log entries are produced, and the status files in archive_status change to .done
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done 000000010000000200000075.ready
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.ready
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done 000000010000000200000075.ready
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.ready
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.done
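Before the change, the archive error was logged roughly once a minute, several lines each time. The last such entry below is from 15:00:18, and none have appeared by 15:13:43.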
[om3@lmt0003 archive_status]$ cat /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000074|tail -n 5
cp: cannot stat 'pg_xlog/000000010000000200000074': No such file or directory2023-11-01 15:00:17.297 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 15:00:17.297 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/000000010000000200000074 "
cp: cannot stat 'pg_xlog/000000010000000200000074': No such file or directory2023-11-01 15:00:18.302 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 15:00:18.302 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/000000010000000200000074 "
2023-11-01 15:00:18.302 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] WARNING: xlog file "000000010000000200000074" could not be archived: too many failures
[om3@lmt0003 archive_status]$ date
Wed Nov 1 15:13:43 CST 2023
4. Restore the normal settings in postgresql.conf.
archive_mode = on
archive_command = 'cp %p /data/om3/data/archivedir/%f '
#archive_command = 'ls -l /data/om3/data/ '
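Then reload the configuration, and everything is back to normal. The only loss is the deleted xlog plus any xlog produced while the archive command was tricking the database; logs generated after the parameter is restored are archived as usual. This also stops the continuous error messages in the log.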
[om3@lmt0003 archive_status]$ gs_ctl reload
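A narrower variant of the same idea is an archive wrapper that reports success only for a segment that is really missing, and copies every other segment as usual. This is not the method used above, just a sketch of an alternative; `archive_xlog` is a hypothetical function, and packaging it as a script and calling it from archive_command with %p and %f is an assumption left to the reader.

```shell
# Hypothetical wrapper: archive one xlog segment, but report success
# when the source file no longer exists so the archiver can move on.
#   $1 = path of the segment (the %p value)
#   $2 = archive directory
#   $3 = segment file name (the %f value)
archive_xlog() {
    src="$1"
    dest_dir="$2"
    name="$3"
    if [ ! -f "$src" ]; then
        echo "WARNING: $src is missing, skipping archive" >&2
        return 0        # pretend success; this segment is lost for PITR
    fi
    cp "$src" "$dest_dir/$name"
}
```

Compared with a blanket `ls -l`, segments that still exist keep being copied while the wrapper is in place, so only the deleted file is missing from the archive.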
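The environment below already reproduces the accidental deletion of an unarchived xlog file:
[om3@lmt0003 pg_xlog]$ rm 000000010000000200000071
The log shows the corresponding errors,
and subsequent xlog files cannot be archived either.
Checking the log shows that the error is printed once a minute, several lines each time.
Under archive_status, manually rename the status file of the missing xlog reported in the log from xxx.ready to xxx.done:
[om3@lmt0003 archive_status]$ cp 000000010000000200000071.ready 000000010000000200000071.ready_bak
[om3@lmt0003 archive_status]$ mv 000000010000000200000071.ready 000000010000000200000071.done
The log no longer reports errors, and apart from the lost xlog, subsequent logs are archived normally.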
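The rename can be wrapped in a small guard so that it is only applied to a segment that is really gone. This is a sketch under the directory layout used in this article; `mark_segment_done` is a hypothetical helper, not a MogDB tool.

```shell
# Hypothetical helper: back up <seg>.ready and rename it to <seg>.done,
# but only if the xlog segment itself is missing from pg_xlog.
#   $1 = pg_xlog directory, $2 = segment file name
mark_segment_done() {
    xlog_dir="$1"
    seg="$2"
    status_dir="$xlog_dir/archive_status"
    if [ -e "$xlog_dir/$seg" ]; then
        echo "$seg still exists in pg_xlog, leaving it to the archiver" >&2
        return 1
    fi
    if [ ! -e "$status_dir/$seg.ready" ]; then
        echo "no .ready marker for $seg" >&2
        return 1
    fi
    cp "$status_dir/$seg.ready" "$status_dir/$seg.ready_bak" &&
        mv "$status_dir/$seg.ready" "$status_dir/$seg.done"
}
```

The guard matters because renaming the marker of a segment that still exists would skip archiving a file that could have been archived normally.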
Reproduce the accidental deletion of an unarchived xlog file:
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.done
[om3@lmt0003 archive_status]$ gsql -r
gsql ((MogDB 3.0.5 build 76182eb6) compiled at 2023-07-20 16:53:13 commit 0 last mr 1801 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.
MogDB=# select pg_switch_xlog();
pg_switch_xlog
----------------
2/7600BAC0
(1 row)
MogDB=# \q
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000071.done 000000010000000200000073.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000072.done 000000010000000200000074.done 000000010000000200000076.ready
[om3@lmt0003 archive_status]$ rm ../000000010000000200000076
[om3@lmt0003 archive_status]$ cat /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000076|tail -n 5
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000076': No such file or directory2023-11-01 15:25:37.642 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 15:25:37.642 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000076 /data/om3/data/archivedir/1/000000010000000200000076 "
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000076': No such file or directory2023-11-01 15:25:38.647 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG: archive command failed with exit code 1
2023-11-01 15:25:38.647 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL: The failed archive command was: "cp pg_xlog/000000010000000200000076 /data/om3/data/archivedir/1/000000010000000200000076 "
2023-11-01 15:25:38.647 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] WARNING: xlog file "000000010000000200000076" could not be archived: too many failures
MogDB=# select pg_switch_xlog();
pg_switch_xlog
----------------
2/77001AC8
(1 row)
MogDB=# \q
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000072.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000073.done 000000010000000200000076.ready
000000010000000200000071.done 000000010000000200000074.done 000000010000000200000077.ready
Under pg_xlog/archive_status, delete the xxx.ready status file that corresponds to the missing xlog:
00000001000000020000006F.done 000000010000000200000072.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000073.done 000000010000000200000076.ready
000000010000000200000071.done 000000010000000200000074.done 000000010000000200000077.ready
[om3@lmt0003 archive_status]$ mv 000000010000000200000076.ready 000000010000000200000076.ready_bak
[om3@lmt0003 archive_status]$ rm -rf 000000010000000200000076.ready
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000072.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000073.done 000000010000000200000076.ready_bak
000000010000000200000071.done 000000010000000200000074.done 000000010000000200000077.ready
The log no longer reports the missing-xlog and archive-failure errors, and subsequent xlog files under pg_xlog are archived normally.
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000072.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000073.done 000000010000000200000076.ready_bak
000000010000000200000071.done 000000010000000200000074.done 000000010000000200000077.ready
[om3@lmt0003 archive_status]$ tail -f /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000076^C
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done 000000010000000200000072.done 000000010000000200000075.done
000000010000000200000070.done 000000010000000200000073.done 000000010000000200000076.ready_bak
000000010000000200000071.done 000000010000000200000074.done 000000010000000200000077.done
[om3@lmt0003 archive_status]$ cd ../../archivedir/
[om3@lmt0003 archivedir]$ ls 000000010000000200000077
000000010000000200000077
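The final check above (a .done marker plus the file present in the archive directory) can be scripted for any segment. This is a minimal sketch; `is_archived` is a hypothetical helper, and the three paths are passed in rather than assumed.

```shell
# Hypothetical helper: succeed only if the segment has a .done marker
# in archive_status and a regular file of the same name in the archive directory.
#   $1 = archive_status directory, $2 = archive directory, $3 = segment file name
is_archived() {
    status_dir="$1"
    archive_dir="$2"
    seg="$3"
    [ -e "$status_dir/$seg.done" ] && [ -f "$archive_dir/$seg" ]
}
```

For example, `is_archived archive_status /data/om3/data/archivedir 000000010000000200000077 && echo archived` corresponds to the manual check shown above, while the deleted segment 000000010000000200000076 would fail it.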