HybridDB for PostgreSQL 列存表(AO表)的膨胀、垃圾检查与空间收缩

背景

Greenplum支持行存储（堆存储）与AO存储，堆存储的垃圾回收和膨胀检测方法请参考：

《如何检测、清理Greenplum膨胀、垃圾 - 阿里云HybridDB for PG最佳实践》

对于AO存储，虽然是appendonly，但实际上GP是支持DELETE和UPDATE的，被删除或更新的行，通过BITMAP来标记。

AO存储是块级组织，当一个块内的数据大部分都被删除或更新掉时，扫描它浪费的成本实际上是很高的。

如何检查AO表的膨胀，收缩AO表呢？

如何查看表的存储结构

pg_class.relstorage表示这个对象是什么存储：

postgres=# select distinct relstorage from pg_class ;  
 relstorage   
------------  
 a  -- 行存储AO表  
 h  -- heap堆表、索引  
 x  -- 外部表(external table)  
 v  -- 视图  
 c  -- 列存储AO表  
(5 rows)

查询当前数据库有哪些AO表：

postgres=# select t2.nspname, t1.relname from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relstorage in ('c', 'a');  
 nspname  |      relname        
----------+-------------------  
 postgres | tbl_tag  
 postgres | tbl_pos_1_prt_p1  
 postgres | tbl_pos_1_prt_p2  
 postgres | tbl_pos_1_prt_p3  
 postgres | tbl_pos_1_prt_p4  
 postgres | tbl_pos_1_prt_p5  
 postgres | tbl_pos_1_prt_p6  
 postgres | tbl_pos_1_prt_p7  
 postgres | tbl_pos_1_prt_p8  
 postgres | tbl_pos_1_prt_p9  
 postgres | tbl_pos_1_prt_p10  
 postgres | tbl_pos  
 postgres | xx_czrk_qm_col  
 postgres | ao1  
(14 rows)

查询当前数据库有哪些堆表：

select t2.nspname, t1.relname from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relstorage in ('h') and relkind='r';

查看AO表的膨胀率(有多少垃圾)

使用这个函数可以查看AO表的膨胀率

postgres=# \df+ gp_toolkit.__gp_aovisimap_compaction_info  
List of functions  
-[ RECORD 1 ]-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
Schema              | gp_toolkit  
Name                | __gp_aovisimap_compaction_info  
Result data type    | SETOF record  
Argument data types | ao_oid oid, OUT content integer, OUT datafile integer, OUT compaction_possible boolean, OUT hidden_tupcount bigint, OUT total_tupcount bigint, OUT percent_hidden numeric  
Type                | normal  
Data access         | no sql  
Volatility          | volatile  
Owner               | dege.zzz  
Language            | plpgsql  
Source code         |   
                    | DECLARE  
                    |     hinfo_row RECORD;  
                    |     threshold float;  
                    | BEGIN  
                    |     EXECUTE 'show gp_appendonly_compaction_threshold' INTO threshold;  
                    |     FOR hinfo_row IN SELECT gp_segment_id,  
                    |     gp_toolkit.__gp_aovisimap_hidden_typed(ao_oid)::gp_toolkit.__gp_aovisimap_hidden_t  
                    |     FROM gp_dist_random('gp_id') LOOP  
                    |         content := hinfo_row.gp_segment_id;  
                    |         datafile := (hinfo_row.__gp_aovisimap_hidden_typed).seg;  
                    |         hidden_tupcount := (hinfo_row.__gp_aovisimap_hidden_typed).hidden;  
                    |         total_tupcount := (hinfo_row.__gp_aovisimap_hidden_typed).total;  
                    |         compaction_possible := false;  
                    |         IF total_tupcount > 0 THEN  
                    |             percent_hidden := (100 * hidden_tupcount / total_tupcount::numeric)::numeric(5,2);  
                    |         ELSE  
                    |             percent_hidden := 0::numeric(5,2);  
                    |         END IF;  
                    |         IF percent_hidden > threshold THEN  
                    |             compaction_possible := true;  
                    |         END IF;  
                    |         RETURN NEXT;  
                    |     END LOOP;  
                    |     RAISE NOTICE 'gp_appendonly_compaction_threshold = %', threshold;  
                    |     RETURN;  
                    | END;  
                    |   
Description         |

例子

postgres=# select * from gp_toolkit.__gp_aovisimap_compaction_info('postgres.ao1'::regclass);  
NOTICE:  gp_appendonly_compaction_threshold = 10  
  
  
 content | datafile | compaction_possible | hidden_tupcount | total_tupcount | percent_hidden   
---------+----------+---------------------+-----------------+----------------+----------------  
       2 |        1 | f                   |              21 |         208283 |           0.01  
       9 |        1 | f                   |              25 |         208303 |           0.01  
      16 |        1 | f                   |              16 |         208352 |           0.01  
       4 |        1 | f                   |              23 |         208356 |           0.01  
       6 |        1 | f                   |              23 |         208386 |           0.01  
       3 |        1 | f                   |              14 |         208333 |           0.01  
       1 |        1 | f                   |              14 |         208329 |           0.01  
      14 |        1 | f                   |              12 |         208350 |           0.01  
      15 |        1 | f                   |              24 |         208346 |           0.01  
       7 |        1 | f                   |              22 |         208329 |           0.01  
       8 |        1 | f                   |              18 |         208334 |           0.01  
       0 |        1 | f                   |              21 |         208314 |           0.01  
      18 |        1 | f                   |              16 |         208417 |           0.01  
      11 |        1 | f                   |              24 |         208337 |           0.01  
      17 |        1 | f                   |              31 |         208380 |           0.01  
      12 |        1 | f                   |              12 |         208367 |           0.01  
      13 |        1 | f                   |              22 |         208365 |           0.01  
       5 |        1 | f                   |              22 |         208367 |           0.01  
      10 |        1 | f                   |              18 |         208347 |           0.01  
      20 |        1 | f                   |              17 |         208384 |           0.01  
      27 |        1 | f                   |              22 |         208348 |           0.01  
      19 |        1 | f                   |              31 |         208425 |           0.01  
      23 |        1 | f                   |              28 |         208344 |           0.01  
      26 |        1 | f                   |              14 |         208339 |           0.01  
      25 |        1 | f                   |              21 |         208386 |           0.01  
      24 |        1 | f                   |              14 |         208332 |           0.01  
      21 |        1 | f                   |              28 |         208360 |           0.01  
      33 |        1 | f                   |              18 |         208354 |           0.01  
      31 |        1 | f                   |              23 |         208335 |           0.01  
      22 |        1 | f                   |              17 |         208309 |           0.01  
      28 |        1 | f                   |              21 |         208314 |           0.01  
      29 |        1 | f                   |              23 |         208329 |           0.01  
      30 |        1 | f                   |              22 |         208342 |           0.01  
      35 |        1 | f                   |              18 |         208305 |           0.01  
      34 |        1 | f                   |              26 |         208363 |           0.01  
      32 |        1 | f                   |              25 |         208396 |           0.01  
      36 |        1 | f                   |              23 |         208323 |           0.01  
      38 |        1 | f                   |              22 |         208367 |           0.01  
      37 |        1 | f                   |              12 |         208334 |           0.01  
      39 |        1 | f                   |              12 |         208389 |           0.01  
      41 |        1 | f                   |              16 |         208373 |           0.01  
      42 |        1 | f                   |              30 |         208376 |           0.01  
      40 |        1 | f                   |              31 |         208397 |           0.01  
      43 |        1 | f                   |              16 |         208378 |           0.01  
      46 |        1 | f                   |              29 |         208371 |           0.01  
      47 |        1 | f                   |              16 |         208382 |           0.01  
      45 |        1 | f                   |              17 |         208368 |           0.01  
      44 |        1 | f                   |              29 |         208381 |           0.01  
(48 rows)

解读：

1、content：对应gp_configuration.content，表示greenplum每个节点的唯一编号。

2、datafile：这条记录对应的这个表的其中一个数据文件的编号，每个数据文件假设1GB。

3、compaction_possible：这个数据文件是否可以被收缩。（通过gp_appendonly_compaction_threshold参数和percent_hidden值判断）。

4、hidden_tupcount：有多少条记录已更新或删除（不可见）。

5、total_tupcount：总共有多少条记录（包括已更新或删除的记录）。

6、percent_hidden：不可见记录的占比。如果这个占比大于gp_appendonly_compaction_threshold参数，那么执行vacuum时，会收缩这个数据文件。

检查系统中膨胀率超过N的AO表

膨胀率超过千分之2的AO表：

select * from (  
  select t2.nspname, t1.relname, (gp_toolkit.__gp_aovisimap_compaction_info(t1.oid)).*   
  from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relstorage in ('c', 'a')   
) t   
where t.percent_hidden > 0.2;

返回结果

 nspname  |      relname      | content | datafile | compaction_possible | hidden_tupcount | total_tupcount | percent_hidden   
----------+-------------------+---------+----------+---------------------+-----------------+----------------+----------------  
 postgres | tbl_pos_1_prt_p1  |       1 |        1 | t                   |        20833382 |       20833382 |         100.00  
 postgres | tbl_pos_1_prt_p1  |       7 |        1 | t                   |        20833495 |       20833628 |         100.00  
 postgres | tbl_pos_1_prt_p1  |       5 |        1 | t                   |        20833628 |       20833495 |         100.00  
 postgres | tbl_pos_1_prt_p1  |       3 |        1 | t                   |        20833469 |       20833469 |         100.00  
....

显然膨胀了100%，验证如下（1条记录有效，占用了15GB空间）：

postgres=# select pg_size_pretty(pg_relation_size('tbl_pos_1_prt_p1'));  
 pg_size_pretty   
----------------  
 15 GB  
(1 row)  
  
postgres=# select count(*) from tbl_pos_1_prt_p1;  
 count   
-------  
     1  
(1 row)

vacuum可以直接收缩（因为膨胀率大于gp_appendonly_compaction_threshold参数）：

postgres=# vacuum tbl_pos_1_prt_p1;  
VACUUM  
postgres=# select count(*) from tbl_pos_1_prt_p1;  
 count   
-------  
     1  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('tbl_pos_1_prt_p1'));  
 pg_size_pretty   
----------------  
 40 bytes  
(1 row)

VACUUM后，只占用40字节。

收缩已膨胀的AO表

方法有三：

1、执行VACUUM。（当膨胀率大于gp_appendonly_compaction_threshold参数时），为共享锁。

2、执行VACUUM FULL。（不管gp_appendonly_compaction_threshold参数的设置，都会回收垃圾空间。），为DDL锁。

3、执行重分布。（不管gp_appendonly_compaction_threshold参数，都会回收垃圾空间。），为DDL锁。

alter table  set with (reorganize=true) distributed randomly;    
    
或    
    
alter table  set with (reorganize=true) distributed by (,....)

set distribute可以回收索引的膨胀空间。set distribute 加载的锁与DDL锁类似，是排它锁。建议在没有业务的时候执行，不要堵塞业务。

同时set distribute只要分布条件不变，就是在节点内完成的，不会涉及数据的重分布。

建议的操作流程：

1 记录下表的分布列    
    
2 执行set distribute (REORGANIZE=true)    
    
如果是随机分布，则设置为随机分布

参考

http://greenplum.org/docs/510/admin_guide/managing/maintain.html

If the ratio of hidden rows to total rows in a segment file is less than a threshold value (10, by default),   
the segment file is not compacted.   
  
The threshold value can be configured with the gp_appendonly_compaction_threshold server configuration parameter.   
  
VACUUM FULL ignores the value of gp_appendonly_compaction_threshold and rewrites the segment file regardless of the ratio.  
  
You can use the __gp_aovisimap_compaction_info() function in the the gp_toolkit schema to investigate   
the effectiveness of a VACUUM operation on append-optimized tables.