【How to Recover After Accidentally Deleting Unarchived xlog Files in MogDB/openGauss】

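When running MogDB/openGauss, heavy business load or bulk data imports can make the number of log files under pg_xlog grow continuously. If xlog is generated faster than it can be cleaned up automatically, the pg_xlog directory is very likely to fill up. An operator who is not familiar with how xlog works may then delete xlog files that have not been archived yet or, worse, delete xlog whose changes have not yet been written to the data files.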

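This article describes the states that xlog files under pg_xlog are normally in. It then summarizes several ways to resolve the recurring "missing xlog" log errors and archive failures that appear after xlog files whose data has been flushed to disk, but which have not been archived, are deleted by mistake.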

I. States of xlog files under pg_xlog

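In general, we do not recommend manually deleting files under pg_xlog: xlog there is cleaned up automatically, and the cleanup speed can be tuned through configuration parameters as needed.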

Normally, xlog files under pg_xlog are in one of the following three states. The discussion below assumes that archiving is enabled:

State 1: the corresponding data has been flushed to disk and the file has been archived. Its status file under pg_xlog/archive_status ends in .done.

Such files can be deleted manually without affecting the database, but this is still not recommended, because the automatic cleanup mechanism already handles them.

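State 2: the corresponding data has been flushed to disk but the file has not been archived. Its status file under pg_xlog/archive_status ends in .ready. (The .ready file itself only shows that the segment is complete and waiting to be archived; whether its changes have been flushed depends on the most recent checkpoint.)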

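Because the data is already on disk, deleting such an xlog file from pg_xlog does not affect the data currently in the database. However, PITR based on a full backup plus a continuous sequence of archived logs will be missing this log. Archiving also fails on the deleted file, every later archive attempt fails as well, and this in turn blocks the normal automatic cleanup of pg_xlog. The database log prints errors such as: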

DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000071 /data/om3/data/archivedir/000000010000000200000071 "
cp: cannot stat 'pg_xlog/000000010000000200000071': No such file or directory

State 3: the corresponding data has not been flushed to disk and the file has not been archived.

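The xlog has just been written but the data has not reached disk. Deleting the xlog at this point may lose data and may break the database service so that the database cannot start. You may then need the pg_resetxlog tool to clear the xlog and reset some of the control information in the pg_control file so that the database can start again. pg_resetxlog should only be used as a last resort for repairing a database. Even after the database is repaired and started, partially committed transactions may leave it inconsistent with its previous data.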
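
The .done and .ready status files described above can be listed per segment with a small helper. This is only a sketch: the function name xlog_archive_states and its output labels are made up for illustration, and the status files alone cannot tell state 2 from state 3, because they say nothing about whether the data has been flushed.

```shell
# Summarize the archive status of every xlog segment in a pg_xlog directory.
# Hypothetical helper; usage: xlog_archive_states /path/to/data/pg_xlog
xlog_archive_states() {
    local xlog_dir="$1"
    local seg name
    for seg in "$xlog_dir"/*; do
        [ -f "$seg" ] || continue
        name=$(basename "$seg")
        # Segment files are named with 24 hexadecimal characters; skip anything else.
        printf '%s\n' "$name" | grep -Eq '^[0-9A-F]{24}$' || continue
        if [ -e "$xlog_dir/archive_status/$name.done" ]; then
            echo "$name archived"
        elif [ -e "$xlog_dir/archive_status/$name.ready" ]; then
            echo "$name waiting-for-archive"
        else
            echo "$name no-status-file"
        fi
    done
}
```

Only segments reported as archived correspond to state 1; anything else should be left to the database.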


II. Resolving the errors after an unarchived xlog file in state 2 is deleted by mistake

The tests in this article mainly rely on the archive parameters archive_mode and archive_command. The database version is MogDB-3.0.5.

MogDB=# show archive_mode ;
 archive_mode 
--------------
 on
(1 row)

MogDB=# show archive_command ;
           archive_command           
-------------------------------------
 cp %p /data/om3/data/archivedir/%f 
(1 row)

MogDB=# show archive_dest ;
 archive_dest 
--------------
 
(1 row)

1. Temporarily change archive_command

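If archiving is driven by the archive_command parameter, you can work on archive_command itself: change the archive command so that the database is fooled into believing the archive succeeded.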

The environment below already reproduces an accidentally deleted, unarchived xlog file:

[om3@lmt0003 archive_status]$ rm ../000000010000000200000074

[om3@lmt0003 archive_status]$ cat  /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000074|more
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000074': No such file or directory2023-11-01 14:55:21.521 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 14:55:21.521 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/1/000000010000000200000074 " 
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000074': No such file or directory2023-11-01 14:55:22.527 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 14:55:22.527 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/1/000000010000000200000074 " 
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000074': No such file or directory2023-11-01 14:55:23.532 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 14:55:23.532 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/1/000000010000000200000074 " 
2023-11-01 14:55:23.532 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] WARNING:  xlog file "000000010000000200000074" could not be archived: too many failures

[om3@lmt0003 pg_xlog]$ cd archive_status/
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.ready
[om3@lmt0003 archive_status]$ gsql -r
gsql ((MogDB 3.0.5 build 76182eb6) compiled at 2023-07-20 16:53:13 commit 0 last mr 1801 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.

MogDB=# select pg_switch_xlog();
 pg_switch_xlog 
----------------
 2/750019D8
(1 row)

MogDB=# \q
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done   000000010000000200000075.ready
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.ready
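
Before changing anything, it helps to know exactly which segments the archiver is stuck on. The sketch below pulls the segment names out of the "could not be archived" warnings shown above. The function name failing_segments is hypothetical, and it assumes the warning text has the same wording as in this log.

```shell
# Print each distinct xlog segment named in "could not be archived" warnings.
# Hypothetical helper; usage: failing_segments /path/to/postgresql-xxx.log
failing_segments() {
    local log_file="$1"
    # Match the quoted 24-character segment name, then keep only the name itself.
    grep -o 'xlog file "[0-9A-F]\{24\}" could not be archived' "$log_file" |
        cut -d'"' -f2 |
        sort -u
}
```

Run against the log file used in this article, it would list the segment that was deleted in the simulation.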

(1) Edit postgresql.conf


[om3@lmt0003 archive_status]$ vi ../../postgresql.conf

archive_mode = on                                           
#archive_command = 'cp %p /data/om3/data/archivedir/%f '               
archive_command = 'ls -l /data/om3/data/ '           # any other command works too, as long as it runs without error and so fools the database

(2) Reload the configuration

[om3@lmt0003 archive_status]$ gs_ctl reload

(3) No more error logs are produced, and the status files under archive_status change to .done


[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done   000000010000000200000075.ready
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.ready
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done   000000010000000200000075.ready
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.ready
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.done

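Before the change, the archive error was printed about once a minute, several lines each time. The last such entries below are from 15:00:18, and none has appeared by 15:13: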

[om3@lmt0003 archive_status]$ cat  /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000074|tail -n 5
cp: cannot stat 'pg_xlog/000000010000000200000074': No such file or directory2023-11-01 15:00:17.297 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 15:00:17.297 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/000000010000000200000074 " 
cp: cannot stat 'pg_xlog/000000010000000200000074': No such file or directory2023-11-01 15:00:18.302 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 15:00:18.302 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000074 /data/om3/data/archivedir/000000010000000200000074 " 
2023-11-01 15:00:18.302 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] WARNING:  xlog file "000000010000000200000074" could not be archived: too many failures
[om3@lmt0003 archive_status]$ date
Wed Nov  1 15:13:43 CST 2023
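
Instead of running ls repeatedly as above, the wait for the .ready files to disappear can be scripted. This is a minimal sketch: the function name wait_for_archive_drain and the default timeout of 300 seconds are assumptions, not anything provided by MogDB.

```shell
# Poll archive_status until no *.ready files remain or the timeout expires.
# Hypothetical helper; usage: wait_for_archive_drain /path/to/pg_xlog/archive_status [timeout_seconds]
wait_for_archive_drain() {
    local status_dir="$1"
    local timeout="${2:-300}"
    local waited=0
    local pending f
    while :; do
        pending=0
        for f in "$status_dir"/*.ready; do
            # The glob stays unexpanded when nothing matches, so test each entry.
            if [ -e "$f" ]; then pending=$((pending + 1)); fi
        done
        if [ "$pending" -eq 0 ]; then return 0; fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "$pending segment(s) still waiting to be archived" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

A return status of 0 means every pending segment has been marked .done, so the original archive_command can be restored.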

(4) Restore postgresql.conf to the normal setting

archive_mode = on                                           
archive_command = 'cp %p /data/om3/data/archivedir/%f '               
#archive_command = 'ls -l /data/om3/data/ '  

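Then reload the configuration again and everything is back to normal. The only loss is the deleted xlog plus the xlog produced while the archive command was fooling the database; logs generated after the parameter is restored are archived as usual. This also stops the continuous stream of log errors.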

[om3@lmt0003 archive_status]$ gs_ctl reload
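
The manual vi edits in this method can also be scripted. The sketch below comments out the active archive_command line and appends a replacement, keeping a .bak copy of the file. The function name set_archive_command is hypothetical, it assumes the new command contains no single quotes, and each call overwrites the previous .bak file.

```shell
# Replace the active archive_command in a postgresql.conf-style file.
# The old setting is kept as a comment and the original file is saved as <file>.bak.
# Hypothetical helper; usage: set_archive_command /path/to/postgresql.conf 'new command'
set_archive_command() {
    local conf="$1"
    local new_cmd="$2"
    cp -p "$conf" "$conf.bak" || return 1
    # Comment out every uncommented archive_command line, then append the new one.
    sed 's/^[[:space:]]*archive_command[[:space:]]*=/#&/' "$conf.bak" > "$conf" || return 1
    printf "archive_command = '%s'\n" "$new_cmd" >> "$conf"
}

# Example, following the steps in this section (the conf path is a placeholder):
#   set_archive_command /path/to/postgresql.conf 'ls -l /data/om3/data/ '
#   gs_ctl reload
#   ... wait until the .ready files have turned into .done ...
#   cp -p /path/to/postgresql.conf.bak /path/to/postgresql.conf
#   gs_ctl reload
```

Restoring from the .bak copy rather than calling the function a second time avoids overwriting the saved original.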

2. Rename the xxx.ready status file of the deleted xlog under archive_status

The environment below already reproduces an accidentally deleted, unarchived xlog file:

[om3@lmt0003 pg_xlog]$ rm 000000010000000200000071

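The log starts reporting the corresponding archive failure (cp: cannot stat 'pg_xlog/000000010000000200000071': No such file or directory, as in the example in Section I), and subsequent xlog files can no longer be archived either.

Checking the log frequency shows that the error is printed about once a minute, several lines each time.

Under archive_status, manually rename the status file of the missing xlog named in the log from xxx.ready to xxx.done: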

[om3@lmt0003 archive_status]$ cp 000000010000000200000071.ready 000000010000000200000071.ready_bak
[om3@lmt0003 archive_status]$ mv 000000010000000200000071.ready 000000010000000200000071.done

The log no longer reports errors, and apart from the lost xlog, subsequent logs are archived normally.
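
The rename above can be wrapped in a helper that first checks the segment really is gone, so that a segment still present in pg_xlog is never marked as archived by mistake. This is a sketch only: the function name mark_segment_done and the extra safety check are additions for illustration, not part of the original procedure.

```shell
# Mark a lost segment as archived, but only if it is really missing from pg_xlog.
# Hypothetical helper; usage: mark_segment_done /path/to/pg_xlog 000000010000000200000071
mark_segment_done() {
    local xlog_dir="$1"
    local seg="$2"
    local ready="$xlog_dir/archive_status/$seg.ready"
    if [ -e "$xlog_dir/$seg" ]; then
        echo "segment $seg still exists; leave it to the archiver" >&2
        return 1
    fi
    if [ ! -e "$ready" ]; then
        echo "no .ready status file for $seg" >&2
        return 1
    fi
    # Keep a backup copy, as in the manual steps, then flip .ready to .done.
    cp -p "$ready" "${ready}_bak" && mv "$ready" "${ready%.ready}.done"
}
```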


3. Delete the xxx.ready status file of the deleted xlog under archive_status

Reproduce an accidentally deleted, unarchived xlog file:

[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.done

[om3@lmt0003 archive_status]$ gsql -r
gsql ((MogDB 3.0.5 build 76182eb6) compiled at 2023-07-20 16:53:13 commit 0 last mr 1801 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.

MogDB=# select pg_switch_xlog();
 pg_switch_xlog 
----------------
 2/7600BAC0
(1 row)

MogDB=# \q
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000071.done  000000010000000200000073.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000072.done  000000010000000200000074.done  000000010000000200000076.ready
[om3@lmt0003 archive_status]$ rm ../000000010000000200000076
[om3@lmt0003 archive_status]$ cat  /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000076|tail -n 5
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000076': No such file or directory2023-11-01 15:25:37.642 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 15:25:37.642 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000076 /data/om3/data/archivedir/1/000000010000000200000076 " 
cp: cannot create regular file '/data/om3/data/archivedir/1/000000010000000200000076': No such file or directory2023-11-01 15:25:38.647 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] LOG:  archive command failed with exit code 1
2023-11-01 15:25:38.647 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] DETAIL:  The failed archive command was: "cp pg_xlog/000000010000000200000076 /data/om3/data/archivedir/1/000000010000000200000076 " 
2023-11-01 15:25:38.647 [unknown] [unknown] localhost 70393223549024 0[0:0#0] 0 [BACKEND] WARNING:  xlog file "000000010000000200000076" could not be archived: too many failures

MogDB=# select pg_switch_xlog();
 pg_switch_xlog 
----------------
 2/77001AC8
(1 row)

MogDB=# \q
[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000072.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000073.done  000000010000000200000076.ready
000000010000000200000071.done  000000010000000200000074.done  000000010000000200000077.ready

Under pg_xlog/archive_status, delete the xxx.ready status file that corresponds to the missing xlog:

00000001000000020000006F.done  000000010000000200000072.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000073.done  000000010000000200000076.ready
000000010000000200000071.done  000000010000000200000074.done  000000010000000200000077.ready

[om3@lmt0003 archive_status]$ cp 000000010000000200000076.ready 000000010000000200000076.ready_bak
[om3@lmt0003 archive_status]$ rm -f 000000010000000200000076.ready

[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000072.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000073.done  000000010000000200000076.ready_bak
000000010000000200000071.done  000000010000000200000074.done  000000010000000200000077.ready

The log no longer reports the missing xlog or archive failure errors, and subsequent xlog files under pg_xlog are archived normally.

[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000072.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000073.done  000000010000000200000076.ready_bak
000000010000000200000071.done  000000010000000200000074.done  000000010000000200000077.ready
[om3@lmt0003 archive_status]$ tail -f  /data/om3/log/pg_log/dn_6001/postgresql-2023-11-01_115121.log | grep 000000010000000200000076^C

[om3@lmt0003 archive_status]$ ls
00000001000000020000006F.done  000000010000000200000072.done  000000010000000200000075.done
000000010000000200000070.done  000000010000000200000073.done  000000010000000200000076.ready_bak
000000010000000200000071.done  000000010000000200000074.done  000000010000000200000077.done

[om3@lmt0003 archive_status]$ cd ../../archivedir/
[om3@lmt0003 archivedir]$ ls 000000010000000200000077
000000010000000200000077
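
When several segments were deleted at once, the cleanup in this method can be applied to all of them in one pass. The sketch below removes the .ready status file only where the matching segment is missing from pg_xlog, and keeps a .ready_bak copy as in the manual steps. The function name drop_orphan_ready is hypothetical.

```shell
# For every *.ready status file whose segment no longer exists in pg_xlog,
# keep a *.ready_bak copy and remove the *.ready file. Prints each handled segment.
# Hypothetical helper; usage: drop_orphan_ready /path/to/pg_xlog
drop_orphan_ready() {
    local xlog_dir="$1"
    local ready seg
    for ready in "$xlog_dir"/archive_status/*.ready; do
        # The glob stays unexpanded when nothing matches.
        if [ ! -e "$ready" ]; then continue; fi
        seg=$(basename "$ready" .ready)
        # Segment still present: leave it for the archiver.
        if [ -e "$xlog_dir/$seg" ]; then continue; fi
        cp -p "$ready" "${ready}_bak" && rm -f "$ready" && echo "$seg"
    done
}
```

Segments that still exist in pg_xlog keep their .ready files, so they are archived normally afterwards.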
