在安装插件的时候使用\dx元命令的时候,突然发现报了一个错误:
postgres=# \dx
ERROR: missing chunk number 0 for toast value 32789 in pg_toast_2619
根据提示来看,主表字段还留存着Toast Pointer,但Toast表中已经没有对应的Chunk条目,怀疑toast表存在损坏或者缺失数据。
toast表的表名是字符串"pg_toast"与表的oid拼接而成,根据这个pg_toast_2619的2619的oid,可以定位到是pg_statistic的表。
postgres=# select 2619::regclass;
regclass
--------------
pg_statistic
(1 row)
postgres=# select * from pg_statistic;
ERROR: missing chunk number 0 for toast value 32789 in pg_toast_2619
postgres=#
对应toast表的数据可能存在损坏。
需要定位损坏的行的位置,然后可以删除有问题的行。因为pg_statistic的toast数据是更新统计信息的时候插入的,损坏后可以直接清理。vacuum analyze后会重新生成。
其他系统表损坏比较建议建议zero_damaged_pages设置为on来跳过损坏的块,然后备份业务数据,之后恢复到一个新的环境。
可以暂时先做下简单的修复,看是否能解决问题,可能执行过程会出现报错:
postgres=# REINDEX TABLE pg_toast.pg_toast_2619;
postgres=# REINDEX TABLE pg_statistic;
postgres=# VACUUM ANALYZE pg_statistic;
可以使用如下的存储过程
DO $$
DECLARE
rec record;
BEGIN
FOR rec in SELECT * FROM pg_statistic LOOP
raise notice 'Parameter is: %', rec.ctid;
raise notice 'Parameter is: %', rec;
END LOOP;
END;
$$
LANGUAGE plpgsql;
结果如下:
NOTICE: Parameter is: (0,1)
NOTICE: Parameter is: (1255,28,f,0.9830149,1294,-0.016985118,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,)
NOTICE: Parameter is: (0,2)
NOTICE: Parameter is: (1255,29,f,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,)
NOTICE: Parameter is: (0,3)
NOTICE: Parameter is: (2606,22,f,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,)
NOTICE: Parameter is: (0,4)
NOTICE: Parameter is: (1260,1,f,0,4,-1,2,3,0,0,0,609,609,0,0,0,0,0,0,0,0,,{0.4365325},,,,"{10,3373,3374,3375,3377,4200,4544,4549,4550,4569,4570,4571,6171,6181,6182,6304,16392,16393}",,,,)
NOTICE: Parameter is: (0,5)
NOTICE: Parameter is: (1260,2,f,0,64,-1,2,3,0,0,0,660,660,0,0,0,950,950,0,0,0,,{0.104231164},,,,"{pg_checkpoint,pg_create_subscription,pg_database_owner,pg_execute_server_program,pg_maintain,pg_monitor,pg_read_all_data,pg_read_all_settings,pg_read_all_stats,pg_read_server_files,pg_signal_backend,pg_stat_scan_tables,pg_use_reserved_connections,pg_write_all_data,pg_write_server_files,postgres,u1,u2}",,,,)
NOTICE: Parameter is: (0,6)
NOTICE: Parameter is: (1260,3,f,0,1,-0.11111111,1,3,0,0,0,91,58,0,0,0,0,0,0,0,0,{0.9444444},{0.68421054},,,,{f},,,,)
NOTICE: Parameter is: (0,7)
NOTICE: Parameter is: (1260,4,f,0,1,1,1,3,0,0,0,91,58,0,0,0,0,0,0,0,0,{1},{1},,,,{t},,,,)
... ...
NOTICE: Parameter is: (16,6)
NOTICE: Parameter is: (2618,7,f,0,5,2,1,3,0,0,0,98,664,0,0,0,950,950,0,0,0,{0.993007},{0.9716709},,,,{<>},,,,)
NOTICE: Parameter is: (16,7)
NOTICE: Parameter is: (2618,8,f,0,589,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,)
ERROR: missing chunk number 0 for toast value 32789 in pg_toast_2619
CONTEXT: PL/pgSQL function inline_code_block line 5 at FOR over SELECT rows·
最后在ctid=(16,7)的后边停了下来,我们根据查询下发现ctid= '(16,7)'的是正常的数据,接下来的ctid= '(16,8)'是有问题的数据。
postgres=# select * from pg_statistic where ctid= '(16,7)';
starelid | staattnum | stainherit | stanullfrac | stawidth | stadistinct | stakind1 | stakind2 | stakind3 | stakind4 | stakind5 | staop1 | staop2 | sta
op3 | staop4 | staop5 | stacoll1 | stacoll2 | stacoll3 | stacoll4 | stacoll5 | stanumbers1 | stanumbers2 | stanumbers3 | stanumbers4 | stanumbers5 | sta
values1 | stavalues2 | stavalues3 | stavalues4 | stavalues5
----------+-----------+------------+-------------+----------+-------------+----------+----------+----------+----------+----------+--------+--------+----
----+--------+--------+----------+----------+----------+----------+----------+-------------+-------------+-------------+-------------+-------------+----
--------+------------+------------+------------+------------
2618 | 8 | f | 0 | 589 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | | | | |
| | | |
(1 row)
postgres=# select * from pg_statistic where ctid= '(16,8)';
ERROR: missing chunk number 0 for toast value 32789 in pg_toast_2619
接下来删除此条有问题的数据,然后发现\dx的元命令可以执行了。
postgres=# delete from pg_statistic where ctid= '(16,8)';
DELETE 1
postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
--------------------+---------+------------+------------------------------------------------------------------------
pg_stat_statements | 1.10 | public | track planning and execution statistics of all SQL statements executed
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)
紧接着我们再次查询pg_statistic表,发现还存在问题数据。函数再执行一次,发现又报了问题。
postgres=# select * from pg_statistic;
ERROR: missing chunk number 0 for toast value 32791 in pg_toast_2619
NOTICE: Parameter is: (18,15)
NOTICE: Parameter is: (12624,1,f,0,5,-0.81349206,1,2,3,0,0,98,664,664,0,0,950,950,950,0,0,"{0.018518519,0.017195767,0.014550265,0.013227513,0.011904762,0.011904762,0.011904762,0.01058201,0.01058201,0.01058201,0.01058201,0.009259259,0.009259259,0.007936508,0.007936508,0.007936508,0.0066137565,0.005291005,0.005291005,0.003968254,0.003968254,0.003968254,0.0026455026,0.0026455026}",,{0.97739166},,,"{E061,E021,E081,E121,E051,E141,F051,E091,F031,F041,T321,E011,F021,E071,F131,F311,F261,E031,E101,E151,E152,F501,F302,S011}","{B011,B016,B033,B111,B117,B126,B203,B209,E161,F035,F054,F114,F171,F221,F271,F304,F341,F385,F393,F405,F432,F437,F471,F531,F641,F690,F711,F762,F812,F841,F847,F856,F862,F868,M005,M012,M018,M024,S023,S041,S091,S096,S161,S211,S251,S281,S404,T031,T044,T050,T061,T101,T133,T173,T180,T212,T218,T271,T322,T332,T434,T491,T522,T571,T612,T618,T624,T651,T662,T812,T824,T829,T835,T851,T865,T871,T877,X010,X016,X034,X041,X046,X052,X058,X068,X074,X081,X090,X111,X131,X142,X152,X170,X192,X205,X232,X254,X263,X281,X301,X410}",,,)
NOTICE: Parameter is: (18,16)
NOTICE: Parameter is: (12624,3,f,0,1,18,1,2,3,0,0,98,664,664,0,0,950,950,950,0,0,"{0.81349206,0.03042328,0.02910053,0.023809524,0.022486772,0.017195767,0.015873017,0.013227513,0.01058201,0.005291005,0.005291005,0.0026455026,0.0026455026,0.0026455026}",,{0.26829666},,,"{"""",01,02,03,04,05,06,07,08,09,10,11,12,13}","{14,16,17,19}",,,)
ERROR: missing chunk number 0 for toast value 32791 in pg_toast_2619
CONTEXT: PL/pgSQL function inline_code_block line 5 at FOR over SELECT rows
同样的再次删除问题的行,发现pg_statistic表已经可以正常访问了。
postgres=# delete from pg_statistic where ctid= '(18,17)';
DELETE 1
postgres=# select * from pg_statistic;
starelid | staattnum | stainherit | stanullfrac | stawidth | stadistinct | stakind1 | stakind2 | stakind3 | stakind4 | stakind5 | staop1 | staop2 | s
taop3 | staop4 | staop5 | stacoll1 | stacoll2 | stacoll3 | stacoll4 | stacoll5 |
... ...
完成以上步骤后对该表进行一次完整的维护和索引重建,。
postgres=# REINDEX TABLE pg_toast.pg_toast_2619;
postgres=# REINDEX TABLE pg_statistic;
postgres=# VACUUM ANALYZE pg_statistic;
在实际的使用中,如果业务的toast表发生了损坏,我们的首选,其实是根据备份恢复,如果存在全量的备份以及连续的wal日志的话,我们可以在另外的测试环境进行PITR基于时间点恢复,然后导出对应的表的数据,在沟通好变更窗口后,进行恢复操作。(切忌盲目在生产环境直接恢复,就算要操作生产环境也要沟通好变更窗口进行变更)
会用到一个参数zero_damaged_pages,可以将该参数设置为on来跳过损坏的块,然后重新备份,恢复。
但是如果没有完整的备份,也没有其他更好的恢复手段的时候,可以考虑删除掉损坏的行来恢复表。
这个时候也需要定位有问题的行,可以使用如下的脚本,适当进行调整,最好根据主键去定位:
#!/bin/bash
j=500 #500是表的总行数
for ((i=1; i<=j;i++))
do
psql -U user_recover -d postgres -c "SELECT * FROM test_tab order by id LIMIT 1 offset $i" >/dev/null || echo $i
done
执行的结果大致如下
postgres@ubuntu-linux-22-04-desktop:~$ sh check_toast_tup.sh
错误: missing chunk number 0 for toast value 3132142 in pg_toast_521351
100
错误: missing chunk number 0 for toast value 3132142 in pg_toast_521351
101
错误: missing chunk number 0 for toast value 3132143 in pg_toast_521351
102
错误: missing chunk number 0 for toast value 3132144 in pg_toast_521351
103
错误: missing chunk number 0 for toast value 3132145 in pg_toast_521351
104
错误: missing chunk number 0 for toast value 3132146 in pg_toast_521351
105
错误: missing chunk number 0 for toast value 3132147 in pg_toast_521351
106
错误: missing chunk number 0 for toast value 3132148 in pg_toast_521351
107
错误: missing chunk number 0 for toast value 3132149 in pg_toast_521351
108
定位到数据后,使用delete语句清除,然后再查询,表的数据恢复正常。
select * from test_tab;
id | name
-----+--------
1 | a
2 | b
3 | c
4 | d
5 | e
6 | f
7 | g
... ...
完成以上步骤后对该表进行一次完整的维护和索引重建。
postgres=# REINDEX TABLE test_tab;
postgres=# VACUUM ANALYZE test_tab;
postgresql