AntDB属性丢失问题分析

背景
[zgy@intel175 pgbench]$ coord1
psql (PGXC 1.2devel, based on PG 9.3.13)
Type "help" for help.

postgres=# select count(*) from test1000;
ERROR:catalog is missing 1 attribute(s) for relid 46954523

在数据库运行过程中,偶然出现属性丢失的现象。

测试场景

由于最近几次偶现的属性丢失的都是系统表,用户自建的表属性都很稳定。本次测试重点放在系统表的变更上。采用postgresql自带的pgbench测试工具。并发随机对1万张表进行写操作,同时并发对这批表进行ddl属性更新。最后逐个对这个表进行查询,来验证是否出现表属性丢失的现象。


测试集群架构如下(2C2D,一主一从架构):

AntDB属性丢失问题分析_第1张图片

Coord1和DB1,DB2均开启xlog日志归档:

[zgy@intel175 pgbench]$ adbmgr
psql (PGXC 1.2devel, based on PG 9.3.13)
Type "help" for help.
postgres=# set datanode all (archive_mode='on',archive_command = 'test ! -f /data/zgy/archive_xlog/%f && cp %p/data/zgy/archive_xlog/%f');                                 
NOTICE:  parameter "archive_mode" cannot be changed without restarting the server
SET PARAM
archive_mode相关的设置,需要重启服务才能生效。另外如果一台服务器部署多个节点,需要注意归档目录不要出现交集。


猜测属性丢失,可能跟autovacuum有关系,设置触发autovacuum的间隔为30秒(检测间隔):

[zgy@intel175 pgbench]$ adbmgr
psql (PGXC 1.2devel, based on PG 9.3.13)
Type "help" for help.

postgres=# set datanode all (autovacuum_naptime='30s',autovacuum_vacuum_threshold=10,autovacuum_vacuum_scale_factor=0);
SET PARAM
postgres=# set coordinator all (autovacuum_naptime='30s',autovacuum_vacuum_threshold=10,autovacuum_vacuum_scale_factor=0);        
SET PARAM

autovacuum触发条件只有两个:一、表上(update,delte 记录) >= autovacuum_vacuum_scale_factor* reltuples(表上记录数) + autovacuum_vacuum_threshold;二、表的max_age > autovacuum_freeze_max_age


测试步骤

初始化1万张测试表:

[zgy@intel175 pgbench]$ vim init.sql 
do language plpgsql $$
declare
begin
  execute 'drop table if exists test';
  execute 'create table test(id serial8, c1 int default 0,c2 int default 0,c3 int default 0,c4 int default 0,c5 int default 0,c6 int default 0,c7 int default 0,c8 int default 0,c9
  int default 0,c10 int default 0, crt_time timestamp)';

  for i in 1..10000 loop
    execute format('drop table if exists test%s', i);
    execute format('create table test%s (like test including all)', i);
  end loop;
end;
$$;

初始化完成后,准备测试脚本:

 public | test9983  | table | zgy
 public | test9984  | table | zgy
 public | test9985  | table | zgy
 public | test9986  | table | zgy
 public | test9987  | table | zgy
 public | test9988  | table | zgy
 public | test9989  | table | zgy
 public | test999   | table | zgy
 public | test9990  | table | zgy
 public | test9991  | table | zgy
 public | test9992  | table | zgy
 public | test9993  | table | zgy
 public | test9994  | table | zgy
 public | test9995  | table | zgy
 public | test9996  | table | zgy
 public | test9997  | table | zgy
 public | test9998  | table | zgy
 public | test9999  | table | zgy
(10001 rows)

pgbench测试脚本如下:

[zgy@intel175 pgbench]$ vim ddl.sql 
\setrandom id 1 10000
begin;
alter table test:id alter column c1 type int8;
alter table test:id rename column c1 to d1;
alter table test:id drop column d1;
alter table test:id add column c1 int default 0 ;
end;
[zgy@intel175 pgbench]$ vim dml.sql 
\setrandom id 1 10000
begin;
insert into test:id(id,crt_time) values(:id,now());
insert into test:id(crt_time) values(now());
update test:id set crt_time=now() where id=:id;
delete from test:id where id=:id;
end;

pgbench执行测试脚本如下:

[zgy@intel175 pgbench]$ vim pgbench.sh 
#!/bin/bash
dbname=$1
username=$2
port=$3
test_file=$4
bench_num=$5
excute_time=$6

nohup pgbench  -d $dbname  -U $username -p $port   -n -r -f ./$test_file -c $bench_num -j $bench_num -T $excute_time >>pgbench.log &

开始执行测试

100并发执行1小时

[zgy@intel175 pgbench]$ sh pgbench.sh  postgres zgy 6603 ddl.sql 100 3600
[zgy@intel175 pgbench]$ nohup: redirecting stderr to stdout
[zgy@intel175 pgbench]$ sh pgbench.sh  postgres zgy 6603 dml.sql 100 3600 
[zgy@intel175 pgbench]$ nohup: redirecting stderr to stdout

pgbench执行完成后,依次对1万张表进行普通查询:

#!/bin/bash
conn="psql -d postgres -U zgy -p 6603"
for ((i=1;i<=10000;i++));
do
  $conn -c "select * from test$i;"
done

 

压测1小时,未复现系统表属性丢失的,后续多压测几次尝试。



你可能感兴趣的:(AntDB)