Understanding PostgreSQL's zero_damaged_pages parameter

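This post was prompted by a fault on one of our virtual test machines. The message we saw was: invalid page header in block 59640 of relation base/175812/1077620; zeroing out page. That brings us to a little-known PostgreSQL system parameter, zero_damaged_pages, which the official documentation describes as follows: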
zero_damaged_pages (boolean)
Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current transaction. Setting zero_damaged_pages to on causes the system to instead report a warning, zero out the damaged page in memory, and continue processing. This behavior will destroy data, namely all the rows on the damaged page. However, it does allow you to get past the error and retrieve rows from any undamaged pages that might be present in the table. It is useful for recovering data if corruption has occurred due to a hardware or software error. You should generally not set this on until you have given up hope of recovering data from the damaged pages of a table. Zeroed-out pages are not forced to disk so it is recommended to recreate the table or the index before turning this parameter off again. The default setting is off, and it can only be changed by a superuser.

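In short: when PostgreSQL detects a damaged page header, it normally reports an error and aborts the current transaction. The parameter is a boolean and defaults to off, so by default any such corruption (typically caused by disk, memory, or other hardware or software faults) stops the query with an error. When it is set to on, the server reports a warning instead, zeroes out the damaged page in memory, and carries on. Every row on the damaged page is lost, but rows on undamaged pages can still be read. Only a superuser can change the setting.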

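The parameter has one serious drawback: it destroys the data on the damaged pages. The benefit is that scans skip past the bad blocks instead of failing, so the database becomes usable again. For the sake of data integrity, turning it on is not recommended; use it only when the database really cannot be opened or has crashed and there is no other hope of recovering the data. We hit the following failure while cleaning up large objects:
---- Command used to clean up the large objects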

test_db=# select lo_unlink(oid) from pg_largeobject_metadata where lomowner = 17305 limit 20000;
ERROR:  invalid page header in block 45736 of relation base/175812/1077620
---- Once this error appears, operations such as pg_dump and pg_dumpall usually fail with a similar error and abort, and VACUUM FULL cannot complete either

test_db=# vacuum full analyze verbose pg_largeobject;
INFO:  vacuuming "pg_catalog.pg_largeobject"
ERROR:  invalid page header in block 45736 of relation base/175812/1077620
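
Before touching a damaged relation it helps to know where the reported block sits on disk, so the right file can be copied out first. The sketch below is only an illustration, not PostgreSQL code: it assumes the default build settings of 8192-byte pages and 1 GB segment files (131072 blocks per segment), and the names SKETCH_BLCKSZ, SKETCH_RELSEG_SIZE, BlockLocation and locate_block are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults; a server built with other values needs other constants. */
#define SKETCH_BLCKSZ       8192u    /* page size in bytes */
#define SKETCH_RELSEG_SIZE  131072u  /* pages per 1 GB segment file */

/* Hypothetical helper type: where a block lives on disk. */
typedef struct {
    uint32_t segment;      /* 0 means "1077620", 1 means "1077620.1", ... */
    uint64_t byte_offset;  /* offset of the page inside that segment file */
} BlockLocation;

/* Map a block number from the error message to a segment and byte offset. */
static BlockLocation locate_block(uint32_t block_num)
{
    BlockLocation loc;

    loc.segment = block_num / SKETCH_RELSEG_SIZE;
    loc.byte_offset =
        (uint64_t)(block_num % SKETCH_RELSEG_SIZE) * SKETCH_BLCKSZ;
    return loc;
}
```

Under those assumptions, locate_block(45736) gives segment 0 and byte offset 374669312, that is, the page lies in the first file of base/175812/1077620.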


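These large objects only held single sign-on data used for testing, not user data, so they could be discarded. In production, take a backup first and copy the affected files out. This is how we handled it: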
test_db=# show zero_damaged_pages ;
zero_damaged_pages
--------------------
off
(1 row)
test_db=# SET zero_damaged_pages = on;
SET
test_db=# show zero_damaged_pages ;
zero_damaged_pages 
--------------------
on
(1 row) 

test_db=# vacuum full analyze verbose pg_largeobject;
WARNING:  invalid page header in block 59640 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59641 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59642 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59643 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59644 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59645 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59646 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59647 of relation base/175812/1077620; zeroing out page
...........
WARNING:  invalid page header in block 59703 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59703 of relation base/175812/1077620; zeroing out page
WARNING:  invalid page header in block 59704 of relation base/175812/1077620; zeroing out page
INFO:  "pg_largeobject": found 2 removable, 3650 nonremovable row versions in 59711 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.45s/0.28u sec elapsed 7.22 sec.
INFO:  analyzing "pg_catalog.pg_largeobject"
INFO:  "pg_largeobject": scanned 205 of 205 pages, containing 3650 live rows and 0 dead rows; 3650 rows in sample, 3650 estimated total rows
VACUUM
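At this point the database was usable again, but only temporarily: if the underlying cause is a disk or filesystem fault, the problem will soon come back. So we finished with a filesystem repair:
1. Reboot into single-user mode
2. umount the disk holding the damaged database
3. fsck -v -t -p /dev/sda1
4. reboot
(Note that fsck's -t option expects a filesystem type as its argument; the type is not recorded in the command above.)
In the time since the repair, we have not seen the error again.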


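The relevant source files are src/backend/storage/buffer/bufmgr.c and src/backend/storage/smgr/smgr.c. The key fragment is below, and the logic is pleasantly clear. Its message text reads "invalid page in block" rather than the "invalid page header in block" in our logs, so the excerpt appears to come from a different release than the server shown above: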
/* check for garbage data */
if (!PageIsVerified((Page) bufBlock, blockNum))
{
    if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
    {
        ereport(WARNING,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("invalid page in block %u of relation %s; zeroing out page",
                        blockNum,
                        relpath(smgr->smgr_rnode, forkNum))));
        MemSet((char *) bufBlock, 0, BLCKSZ);
    }
    else
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("invalid page in block %u of relation %s",
                        blockNum,
                        relpath(smgr->smgr_rnode, forkNum))));
}
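
To make the decision above easier to experiment with outside the server, here is a standalone sketch of the same check-then-zero idea. It is not the PostgreSQL implementation: the header layout is simplified, checksum verification is left out, 8-byte alignment is assumed for the special-space check, and every name prefixed with Sketch or sketch_ is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE        8192u   /* assumed default page size */
#define SKETCH_VALID_FLAG_BITS  0x0007u /* assumed set of legal flag bits */

/* Simplified stand-in for the start of a page header (an assumption). */
typedef struct {
    uint64_t lsn;
    uint16_t checksum;
    uint16_t flags;
    uint16_t lower;             /* start of free space */
    uint16_t upper;             /* end of free space */
    uint16_t special;           /* start of special space */
    uint16_t pagesize_version;
} SketchPageHeader;

typedef enum {
    SKETCH_PAGE_OK,      /* header looked sane, page left untouched */
    SKETCH_PAGE_ZEROED,  /* header was bad, page wiped (WARNING case) */
    SKETCH_PAGE_ERROR    /* header was bad, page left as-is (ERROR case) */
} SketchReadResult;

/* A page of all zero bytes counts as a valid, empty page. */
static bool sketch_page_is_all_zero(const unsigned char *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/* Cheap sanity checks on the header fields; no checksum is verified. */
static bool sketch_page_is_valid(const unsigned char *page)
{
    SketchPageHeader hdr;

    memcpy(&hdr, page, sizeof hdr);
    if ((hdr.flags & ~SKETCH_VALID_FLAG_BITS) == 0 &&
        hdr.lower <= hdr.upper &&
        hdr.upper <= hdr.special &&
        hdr.special <= SKETCH_PAGE_SIZE &&
        hdr.special % 8 == 0)
        return true;
    return sketch_page_is_all_zero(page);
}

/* Mirror of the branch above: zero the page only when the switch is on. */
static SketchReadResult sketch_check_page(unsigned char *page,
                                          bool zero_damaged_pages)
{
    if (sketch_page_is_valid(page))
        return SKETCH_PAGE_OK;
    if (zero_damaged_pages)
    {
        memset(page, 0, SKETCH_PAGE_SIZE);
        return SKETCH_PAGE_ZEROED;
    }
    return SKETCH_PAGE_ERROR;
}
```

Passing a buffer filled with garbage and zero_damaged_pages set to false leaves the buffer untouched and returns SKETCH_PAGE_ERROR; with true, the same buffer is wiped and later checks treat it as an empty page, which is why every row that was on it is gone for good.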


