Condensing "PostgreSQL Database Kernel Analysis": How Tables and Tuples Are Organized

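0. Preface

These notes summarize my reading of 《PostgreSQL数据库内核分析》 (PostgreSQL Database Kernel Analysis) by Peng Zhiyong and Peng Yuwei. The aim is to condense this classic into something short enough to review regularly.

Some of the material is taken from EthanHe's kernel analysis series: https://www.jianshu.com/u/6b8fc3f18f72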

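1. Table storage structure

The figure below is taken from EthanHe's kernel analysis series (https://www.jianshu.com/p/012643cfba25). In PG, a table corresponds to one data file (which is split into segments when it grows too large). Each data file consists of a number of pages (file blocks), and each page stores a number of tuples.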

[Figure: storage structure of an ordinary data table]

2. Page structure

* a postgres disk page is always a slotted page of the form:
* +----------------+---------------------------------+
* | PageHeaderData | linp1 linp2 linp3 ...           |
* +-----------+----+---------------------------------+
* | ... linpN |                                      |
* +-----------+--------------------------------------+
* |           ^ pd_lower                             |
* |                                                  |
* |             v pd_upper                           |
* +-------------+------------------------------------+
* |             | tupleN ...                         |
* +-------------+------------------+-----------------+
* |       ... tuple3 tuple2 tuple1 | "special space" |
* +--------------------------------+-----------------+
*                                  ^ pd_special

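1) PageHeaderData:
Each page starts with a header defined by the PageHeaderData struct, which stores metadata such as the LSN and the checksum. It occupies at least 24 bytes. "At least" because the struct ends with the line pointer array pd_linp, whose size varies; with no tuples inserted, the whole header is exactly 24 bytes. The main members are: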

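  • pd_lsn: the LSN of the XLOG record written for the most recent change to this page (strictly, the next byte after the last byte of that record). Its type is PageXLogRecPtr, a struct with two members, xlogid and xrecoff, which hold the high and low 32 bits of the LSN respectively. Both are 32-bit unsigned integers, so pd_lsn is 8 bytes in total.
  • pd_checksum: the checksum of this page (present since version 9.3), a 2-byte unsigned integer.
  • pd_flags: flag bits, defined below, a 2-byte unsigned integer.
  • pd_lower, pd_upper: pd_lower points to the end of the line pointer array, which is the start of the free space. pd_upper points to the start of the most recently added heap tuple, which is the end of the free space. Both are 2-byte unsigned integers.
  • pd_special: the offset of the special space. Index pages use it; in a heap page it points to the end of the page. A 2-byte unsigned integer.
  • pd_pagesize_version: the page size and the page layout version number packed into 2 bytes. The page size is always a multiple of 256, so the high byte carries the size and the low byte carries the version.
  • pd_prune_xid: the oldest prunable transaction ID on the page, or zero if there is none; 4 bytes.
  • pd_linp[FLEXIBLE_ARRAY_MEMBER]: an array of ItemIdData. ItemIdData consists of three bit-fields, lp_off, lp_flags and lp_len. Each ItemIdData points to one tuple in the file block: lp_off is the tuple's offset within the block (page), lp_len is the tuple's length, and lp_flags is the state of the line pointer (one of unused, normal, HOT redirect, or dead). Each ItemIdData element is 4 bytes.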

The definitions of PageHeaderData and the related data structures are as follows:

Header file: src/include/storage/bufpage.h
typedef struct PageHeaderData
{
    /* XXX LSN is member of *any* block, not only page-organized ones */
    PageXLogRecPtr pd_lsn;        /* LSN: next byte after last byte of xlog
                                 * record for last change to this page */
    uint16        pd_checksum;    /* checksum */
    uint16        pd_flags;        /* flag bits, see below */
    LocationIndex pd_lower;        /* offset to start of free space */
    LocationIndex pd_upper;        /* offset to end of free space */
    LocationIndex pd_special;    /* offset to start of special space */
    uint16        pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData    pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

typedef PageHeaderData *PageHeader;

/*
* pd_flags contains the following flag bits.  Undefined bits are initialized
* to zero and may be used in the future.
*
* PD_HAS_FREE_LINES is set if there are any LP_UNUSED line pointers before
* pd_lower.  This should be considered a hint rather than the truth, since
* changes to it are not WAL-logged.
*
* PD_PAGE_FULL is set if an UPDATE doesn't find enough free space in the
* page for its new tuple version; this suggests that a prune is needed.
* Again, this is just a hint.
*/
#define PD_HAS_FREE_LINES    0x0001    /* are there any unused line pointers? */
#define PD_PAGE_FULL        0x0002    /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE        0x0004    /* all tuples on page are visible to
                                     * everyone */


#define PD_VALID_FLAG_BITS    0x0007    /* OR of all valid pd_flags bits */

Header file: src/include/storage/itemid.h
/*
* An item pointer (also called line pointer) on a buffer page
*
* In some cases an item pointer is "in use" but does not have any associated
* storage on the page.  By convention, lp_len == 0 in every item pointer
* that does not have storage, independently of its lp_flags state.
*/
typedef struct ItemIdData
{
    unsigned    lp_off:15,        /* offset to tuple (from start of page) */
                lp_flags:2,        /* state of item pointer, see below */
                lp_len:15;        /* byte length of tuple */
} ItemIdData;


typedef ItemIdData *ItemId;

/*
* lp_flags has these possible states.  An UNUSED line pointer is available
* for immediate re-use, the other states are not.
*/
#define LP_UNUSED        0        /* unused (should always have lp_len=0) */
#define LP_NORMAL        1        /* used (should always have lp_len>0) */
#define LP_REDIRECT        2        /* HOT redirect (should have lp_len=0) */
#define LP_DEAD            3        /* dead, may or may not have storage */


/*
* Item offsets and lengths are represented by these types when
* they're not actually stored in an ItemIdData.
*/
typedef uint16 ItemOffset;
typedef uint16 ItemLength;

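2) Free space: the region between pd_lower and pd_upper in the page header. A newly inserted tuple and its line pointer (linp) are both allocated from this region: line pointers grow from the start of the free space, tuples grow from the end.
3) Tuples: the actual row data.
4) Special space: holds data specific to an index access method; each index method stores different data here. Index files and ordinary table files share the same block layout, so ordinary table blocks do not use the special space and it is left empty.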
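The relationship between pd_lower, pd_upper and the line pointer array can be sketched in a few lines of Python. This is only an illustration: the function name `page_layout` and the sample values are assumptions made for the example, and the result is the raw gap between the two offsets, not what PostgreSQL's own free-space routines would report.

```python
PAGE_HEADER_SIZE = 24  # fixed part of PageHeaderData, as described above
ITEM_ID_SIZE = 4       # each ItemIdData (line pointer) takes 4 bytes


def page_layout(pd_lower: int, pd_upper: int) -> dict:
    """Derive the line pointer count and the raw free-space gap (hypothetical helper)."""
    return {
        # line pointers fill the space between the header and pd_lower
        "line_pointers": (pd_lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE,
        # free space is the hole between the pointer array and the tuples
        "free_bytes": pd_upper - pd_lower,
    }


# Illustrative values for an 8 kB heap page holding four tuples
layout = page_layout(pd_lower=40, pd_upper=8032)
print(layout)
```

Inserting one more tuple would raise pd_lower by 4 (one new line pointer) and lower pd_upper by the aligned tuple size, so the gap shrinks from both ends.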

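3. Tuple structure

This section covers the tuple data structures. Each tuple has two parts: the tuple header, followed by the actual data.

HeapTupleHeaderData and the related data structures are as follows: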

//--------------------- src/include/storage/off.h
/*
* OffsetNumber:
*
* this is a 1-based index into the linp (ItemIdData) array in the
* header of each disk page.
*/
typedef uint16 OffsetNumber;

//--------------------- src/include/storage/block.h
/*
* BlockId:
*
* this is a storage type for BlockNumber. in other words, this type
* is used for on-disk structures (e.g., in HeapTupleData) whereas
* BlockNumber is the type on which calculations are performed (e.g.,
* in access method code).
*
* there doesn't appear to be any reason to have separate types except
* for the fact that BlockIds can be SHORTALIGN'd (and therefore any
* structures that contains them, such as ItemPointerData, can also be
* SHORTALIGN'd). this is an important consideration for reducing the
* space requirements of the line pointer (ItemIdData) array on each
* page and the header of each heap or index tuple, so it doesn't seem
* wise to change this without good reason.
*/
typedef struct BlockIdData
{
    uint16      bi_hi;
    uint16      bi_lo;
} BlockIdData;

typedef BlockIdData *BlockId; /* block identifier */

//--------------------- src/include/storage/itemptr.h
/*
* ItemPointer:
*
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
* (ItemIdData) array we want.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
* structure padding bytes. The struct is designed to be six bytes long
* (it contains three int16 fields) but a few compilers will pad it to
* eight bytes unless coerced. We apply appropriate persuasion where
* possible. If your compiler can't be made to play along, you'll waste
* lots of space.
*/
typedef struct ItemPointerData
{
    BlockIdData ip_blkid;
    OffsetNumber ip_posid;
} ItemPointerData;

//--------------------- src/include/access/htup_details.h
typedef struct HeapTupleFields
{
    TransactionId t_xmin;       /* inserting xact ID */
    TransactionId t_xmax;       /* deleting or locking xact ID */
    union
    {
        CommandId   t_cid;      /* inserting or deleting command ID, or both */
        TransactionId t_xvac;   /* old-style VACUUM FULL xact ID */
    }           t_field3;
} HeapTupleFields;

typedef struct DatumTupleFields
{
    int32       datum_len_;     /* varlena header (do not touch directly!) */
    int32       datum_typmod;   /* -1, or identifier of a record type */
    Oid         datum_typeid;   /* composite type OID, or RECORDOID */
    /*
     * Note: field ordering is chosen with thought that Oid might someday
     * widen to 64 bits.
     */
} DatumTupleFields;

struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    }           t_choice;
    ItemPointerData t_ctid;     /* current TID of this or newer tuple (or a
                                 * speculative insertion token) */
    /* Fields below here must match MinimalTupleData! */
    uint16      t_infomask2;    /* number of attributes + various flags */
    uint16      t_infomask;     /* various flag bits, see below */
    uint8       t_hoff;         /* sizeof header incl. bitmap, padding */
    /* ^ - 23 bytes - ^ */
    bits8       t_bits[FLEXIBLE_ARRAY_MEMBER];  /* bitmap of NULLs */
    /* MORE DATA FOLLOWS AT END OF STRUCT */
};

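Analysis of the HeapTupleHeaderData struct:

1) t_choice is a union with two members:

  • t_heap: records the transaction IDs and command ID of the operations that inserted or deleted the tuple. This information is mainly used by concurrency control to check whether the tuple is visible to a transaction. Expanded, t_heap has the following members: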
Field       Type            Length    Offset Description
--------------------------------------------------------------------------------------------------
t_xmin      TransactionId   4 bytes     0    insert XID stamp
t_xmax      TransactionId   4 bytes     4    delete XID stamp
t_cid       CommandId       4 bytes     8    insert and/or delete CID stamp (overlays with t_xvac)
t_xvac      TransactionId   4 bytes     8    XID for VACUUM operation moving a row version
// Note: t_cid and t_xvac form a union and share the same storage
// The remaining members of HeapTupleHeaderData follow
t_ctid      ItemPointerData 6 bytes     12   current TID of this or newer row version
t_infomask2 uint16          2 bytes     18   number of attributes, plus various flag bits
t_infomask  uint16          2 bytes     20   various flag bits
t_hoff      uint8           1 byte      22   offset to user data
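  • t_datum: when a new tuple is formed in memory, its transaction visibility does not matter yet, so t_choice only needs the DatumTupleFields struct to record information such as the tuple length. When the tuple is inserted into a table file, the header must record the inserting transaction and command IDs, and the memory occupied by t_choice is then reinterpreted as a HeapTupleFields struct.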

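2) t_ctid records the physical location of the current tuple or of its newer version, as a block number plus a line pointer index within that block. If the tuple has been updated, it points to the new version.

3) t_infomask identifies the current state of the tuple, for example whether it has an OID (in releases before PostgreSQL 12) or any null attributes. It is 16 bits wide, and each bit is a separate flag.

4) t_infomask2 uses its low 11 bits for the number of attributes in the tuple. The remaining bits are flags used by HOT and by tuple visibility checks.

5) t_hoff is the size of the tuple header, including the null bitmap and padding; equivalently, it is the offset at which the user data starts.

6) The t_bits[] array is a bitmap that marks which attributes of the tuple are NULL.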
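The "23 bytes" comment in the struct and the role of t_hoff can be checked with a small Python sketch. It assumes a platform where MAXALIGN rounds up to 8 bytes (typical for 64-bit builds) and ignores OIDs; `expected_hoff` is a hypothetical helper written for this note, not a PostgreSQL function.

```python
import struct

# Fixed-size part of HeapTupleHeaderData, little-endian, no padding:
# t_xmin, t_xmax, t_cid/t_xvac (3 x uint32), t_ctid (3 x uint16),
# t_infomask2, t_infomask (2 x uint16), t_hoff (uint8)
HEADER_FMT = "<IIIHHHHHB"
MAXIMUM_ALIGNOF = 8  # assumption: typical 64-bit platform


def maxalign(n: int, align: int = MAXIMUM_ALIGNOF) -> int:
    """Round n up to the next multiple of align (align must be a power of two)."""
    return (n + align - 1) & ~(align - 1)


fixed_size = struct.calcsize(HEADER_FMT)


def expected_hoff(natts: int, has_nulls: bool) -> int:
    """Header size: fixed part, optional null bitmap (1 bit per attribute), padding."""
    bitmap_bytes = (natts + 7) // 8 if has_nulls else 0
    return maxalign(fixed_size + bitmap_bytes)


print(fixed_size, expected_hoff(3, False), expected_hoff(9, True))
```

Under these assumptions a three-column tuple without NULLs gets a 24-byte header, and the header only grows to 32 bytes once the null bitmap needs a second byte.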

The HeapTupleData data structure:

typedef struct HeapTupleData
{
    uint32      t_len;          /* length of *t_data */
    ItemPointerData t_self;     /* SelfItemPointer */
    Oid         t_tableOid;     /* table the tuple came from */
    HeapTupleHeader t_data;     /* -> tuple header and data */
} HeapTupleData;

typedef HeapTupleData *HeapTuple;

#define HEAPTUPLESIZE   MAXALIGN(sizeof(HeapTupleData))

4. Examining the physical storage of pages and tuples with hexdump

1) Preparing test data

-- Create a table and insert a few rows

drop table if exists t_page;

create table t_page (id int,c1 char(8),c2 varchar(16));

insert into t_page values(1,'1','a');

insert into t_page values(2,'2','b');

insert into t_page values(3,'3','c');

insert into t_page values(4,'4','d');

-- Get the data file that backs the table

postgres=# select pg_relation_filepath('t_page');
-[ RECORD 1 ]--------+-----------------
pg_relation_filepath | base/13451/17020

-- Dump the contents of the data file

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020
00000000  00 00 00 00 e0 2d 02 0f  19 b5 00 00 28 00 60 1f  |.....-......(.`.|
00000010  00 20 04 20 00 00 00 00  d8 9f 4e 00 b0 9f 4e 00  |. . ......N...N.|
00000020  88 9f 4e 00 60 9f 4e 00  00 00 00 00 00 00 00 00  |..N.`.N.........|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001f60  73 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |s...............|
00001f70  04 00 03 00 02 08 18 00  04 00 00 00 13 34 20 20  |.............4  |
00001f80  20 20 20 20 20 05 64 00  72 02 00 00 00 00 00 00  |     .d.r.......|
00001f90  00 00 00 00 00 00 00 00  03 00 03 00 02 08 18 00  |................|
00001fa0  03 00 00 00 13 33 20 20  20 20 20 20 20 05 63 00  |.....3       .c.|
00001fb0  71 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |q...............|
00001fc0  02 00 03 00 02 08 18 00  02 00 00 00 13 32 20 20  |.............2  |
00001fd0  20 20 20 20 20 05 62 00  70 02 00 00 00 00 00 00  |     .b.p.......|
00001fe0  00 00 00 00 00 00 00 00  01 00 03 00 02 08 18 00  |................|
00001ff0  01 00 00 00 13 31 20 20  20 20 20 20 20 05 61 00  |.....1       .a.|
00002000

2) Examining the page header (PageHeaderData) with hexdump

pd_lsn(8bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 0 -n 8
00000000  00 00 00 00 e0 2d 02 0f                           |.....-..|
00000008

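The first 8 bytes of the data file store the LSN. The first 4 bytes are xlogid (the high 32 bits), here \x00000000, i.e. 0. The next 4 bytes are xrecoff (the low 32 bits), \x0F022DE0. Together the LSN is 0/0F022DE0.

Notes:

A. The 00000000 and 00000008 at the start of each line are offsets printed by hexdump, not file contents.

B. x86 is little-endian, so reverse the byte order when reading multi-byte values from the dump.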
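The byte swapping can be reproduced in Python as a quick cross-check. This is a small sketch that hard-codes the 8 bytes from the dump above and assumes a little-endian server; the variable names are chosen for the example.

```python
# First 8 bytes of the page, copied from the hexdump output above
raw = bytes.fromhex("00000000e02d020f")

# Each half is a little-endian uint32: xlogid first, then xrecoff
xlogid = int.from_bytes(raw[0:4], "little")
xrecoff = int.from_bytes(raw[4:8], "little")

# Format the LSN the way it is written in the text above
lsn_text = f"{xlogid:X}/{xrecoff:08X}"
print(lsn_text)
```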

pd_checksum(2bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 8 -n 2
00000008  19 b5                                             |..|
0000000a

The checksum is 0xb519.

pd_flags(2bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 10 -n 2
0000000a  00 00                                             |..|
0000000c

The flags are 0x0000.

pd_lower(2bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 12 -n 2
0000000c  28 00                                             |(.|
0000000e

pd_lower is 0x0028, which is 40 in decimal: the 24-byte header plus four 4-byte line pointers.

pd_upper(2bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 14 -n 2
0000000e  60 1f                                             |`.|
00000010

[postgres@sndspstdb62 ~]$ echo $((0x1f60))
8032

pd_upper is 0x1f60, which is 8032 in decimal.

pd_special(2bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 16 -n 2
00000010  00 20                                             |. |
00000012

pd_special is 0x2000, which is 8192 in decimal, i.e. the end of the page.

pd_pagesize_version(2bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 18 -n 2
00000012  04 20                                             |. |
00000014

pd_pagesize_version is 0x2004 (8196 in decimal): the page size 0x2000 = 8192 combined with layout version 4.

pd_prune_xid(4bytes)

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 20 -n 4
00000014  00 00 00 00                                       |....|
00000018

pd_prune_xid is 0x00000000, i.e. 0.
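The field-by-field walk above can also be done in one pass with Python's struct module. The following is a sketch under the assumption of a little-endian server and the 24-byte layout described earlier; the bytes are copied from the dump, and the dictionary keys simply reuse the struct member names.

```python
import struct

# First 24 bytes of the page, copied from the hexdump output above
header = bytes.fromhex(
    "00000000e02d020f"  # pd_lsn (xlogid, xrecoff)
    "19b5"              # pd_checksum
    "0000"              # pd_flags
    "2800"              # pd_lower
    "601f"              # pd_upper
    "0020"              # pd_special
    "0420"              # pd_pagesize_version
    "00000000"          # pd_prune_xid
)

names = ("xlogid", "xrecoff", "pd_checksum", "pd_flags", "pd_lower",
         "pd_upper", "pd_special", "pd_pagesize_version", "pd_prune_xid")
# "<" = little-endian without padding; I = uint32, H = uint16
page_header = dict(zip(names, struct.unpack("<IIHHHHHHI", header)))

# Page size lives in the high byte, layout version in the low byte
page_size = page_header["pd_pagesize_version"] & 0xFF00
layout_version = page_header["pd_pagesize_version"] & 0x00FF

print(page_header)
print(page_size, layout_version)
```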

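3) Examining the line pointer (linp, i.e. ItemId) array with hexdump

The linp array follows PageHeaderData directly, and each element takes 4 bytes. The data structure is repeated here for convenience: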

typedef struct ItemIdData
{
    unsigned    lp_off:15,        /* offset to tuple (from start of page) */
                lp_flags:2,        /* state of item pointer, see below */
                lp_len:15;        /* byte length of tuple */
} ItemIdData;

typedef ItemIdData *ItemId;

/*
* lp_flags has these possible states.  An UNUSED line pointer is available
* for immediate re-use, the other states are not.
*/
#define LP_UNUSED        0        /* unused (should always have lp_len=0) */
#define LP_NORMAL        1        /* used (should always have lp_len>0) */
#define LP_REDIRECT        2        /* HOT redirect (should have lp_len=0) */
#define LP_DEAD            3        /* dead, may or may not have storage */

lp_off

The offset of the tuple (from the start of the page).

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 24 -n 2
00000018  d8 9f                                             |..|
0000001a

The two bytes read as 0x9fd8; lp_off is the low 15 bits.

[xdb@localhost utf8db]$ echo $((0x9fd8 & ~$((1<<15))))
8152

lp_off is 8152 in decimal, so the first item (tuple) starts at offset 8152.

lp_len

[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 26 -n 2
0000001a  4e 00                                             |N.|
0000001c

lp_len is the high 15 bits.

[xdb@localhost utf8db]$ echo $((0x004e >> 1))
39

So the first item (tuple) is 39 bytes long.

lp_flags

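lp_flags is the two bits in between (bits 15 and 16 of the 32-bit word, counting from 0): the top bit of 0x9fd8 and the lowest bit of 0x004e. Here they are 01, i.e. 1 (LP_NORMAL), which means the tuple is in normal use.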
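The three bit-fields can be extracted together once the 4 bytes are read as one little-endian 32-bit word. The sketch below decodes all four line pointers from the dump (bytes 24 to 39). It assumes the bit-field layout that GCC produces on little-endian x86, where lp_off occupies the lowest bits; `decode_item_id` is a hypothetical helper name.

```python
# Bytes 24..39 of the page: four ItemIdData entries from the dump above
linp_bytes = bytes.fromhex("d89f4e00" "b09f4e00" "889f4e00" "609f4e00")


def decode_item_id(raw4: bytes) -> tuple:
    """Split one 4-byte line pointer into (lp_off, lp_flags, lp_len)."""
    word = int.from_bytes(raw4, "little")
    lp_off = word & 0x7FFF           # bits 0..14
    lp_flags = (word >> 15) & 0x3    # bits 15..16
    lp_len = (word >> 17) & 0x7FFF   # bits 17..31
    return lp_off, lp_flags, lp_len


items = [decode_item_id(linp_bytes[i:i + 4]) for i in range(0, len(linp_bytes), 4)]
for number, item in enumerate(items, start=1):
    print(number, item)
```

The offsets step down by 40 bytes per tuple although each tuple is 39 bytes long, which is consistent with tuples being placed at aligned addresses.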

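4) Examining the tuple header (HeapTupleHeaderData) with hexdump

t_xmin

t_xmin holds the txid of the transaction that inserted the tuple. Note that the outputs in this and the following steps were captured from a different instance (base/16477/24801), so the transaction ID differs from the full dump in step 1, where the four bytes at offset 0x1fd8 are 70 02 00 00, i.e. 0x00000270 = 624. The remaining header bytes are the same in both.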

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8152 -n 4
00001fd8 e2 1b 18 00 |....|
00001fdc

t_xmin = 0x00181be2
Converting t_xmin to decimal:

[xdb@localhost ~]$ echo $((0x00181be2))
1580002

t_xmax

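t_xmax holds the txid of the transaction that deleted or updated the tuple. If the tuple has not been deleted or updated, t_xmax is 0, i.e. invalid.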

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8156 -n 4
00001fdc 00 00 00 00 |....|
00001fe0

t_xmax is 0, so this tuple has not been deleted or updated.

t_cid/t_xvac

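t_cid holds the command ID (cid): the number of SQL commands executed in the current transaction before the command that inserted or deleted this tuple, counting from zero.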

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8160 -n 4
00001fe0 00 00 00 00 |....|
00001fe4

t_cid = 0, so the tuple was inserted by the first command of its transaction.

t_ctid

t_ctid holds the tuple identifier (tid) that points to the tuple itself or to its newer version.

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8164 -n 6
00001fe4 00 00 00 00 01 00 |......|
00001fea

ip_blkid = 0x00000000 (bi_hi = 0x0000, bi_lo = 0x0000), i.e. block 0
ip_posid = 0x0001, i.e. line pointer 1, the first tuple
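As a cross-check, the 6 bytes can be unpacked as three little-endian uint16 values and recombined into the familiar (block, offset) form. This is a minimal sketch with the bytes hard-coded from the output above; it assumes a little-endian server.

```python
import struct

# The 6 bytes of t_ctid, copied from the hexdump output above
raw = bytes.fromhex("000000000100")

# ItemPointerData is three uint16 fields: bi_hi, bi_lo, ip_posid
bi_hi, bi_lo, ip_posid = struct.unpack("<HHH", raw)

# The block number is split into a high and a low half
block_number = (bi_hi << 16) | bi_lo
ctid = (block_number, ip_posid)
print(ctid)
```

The pair corresponds to the ctid value (0,1) that SQL shows for the first row of the table, meaning the tuple points to itself.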

t_infomask2

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8170 -n 2
00001fea 03 00 |..|
00001fec

t_infomask2 = 0x0003. The low 11 bits hold the number of attributes; 3 means the tuple has three columns.

t_infomask

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8172 -n 2
00001fec 02 08 |..|
00001fee
[xdb@localhost ~]$ echo $((0x0802))
2050
[xdb@localhost ~]$ echo "obase=2;2050"|bc
100000000010

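t_infomask = 0x0802, which is 2050 in decimal and 100000000010 in binary. It encodes the current state of the tuple. Checked against the masks in htup_details.h, the set bits are HEAP_HASVARWIDTH (0x0002) and HEAP_XMAX_INVALID (0x0800): the tuple has variable-width attributes and its t_xmax is invalid.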
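The mask analysis can be automated with a small lookup table. The sketch below lists only a subset of the t_infomask bits, transcribed from src/include/access/htup_details.h; treat the table as an assumption to verify against your own source tree, since some names have changed between releases. `decode_infomask` is a hypothetical helper.

```python
# Subset of t_infomask bits, transcribed from htup_details.h (verify per version)
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x2000: "HEAP_UPDATED",
}
HEAP_NATTS_MASK = 0x07FF  # low 11 bits of t_infomask2


def decode_infomask(value: int) -> list:
    """Return the names of the known flag bits set in value, lowest bit first."""
    return [name for bit, name in INFOMASK_FLAGS.items() if value & bit]


flags = decode_infomask(0x0802)
natts = 0x0003 & HEAP_NATTS_MASK
print(flags, natts)
```

HEAP_XMIN_COMMITTED is not set in this dump. That bit is a hint that is filled in lazily, so its absence only means nothing had yet checked the tuple's visibility after the insert committed when the file was dumped.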

t_hoff

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8174 -n 1
00001fee 18 |.|
00001fef
[xdb@localhost ~]$ echo $((0x18))
24

The user data starts at offset 24 within the tuple, i.e. at 8152 + 24 = 8176 within the page.

5) Examining the stored tuple data with hexdump

With the tuple header covered, let's look at the data itself.
The first tuple is 39 bytes long in total and t_hoff = 24, so the data takes 39 - 24 = 15 bytes.

[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8176 -n 15
00001ff0 01 00 00 00 13 31 20 20 20 20 20 20 20 05 61 |.....1 .a|
00001fff

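Recall the table definition:
create table t_page (id int,c1 char(8),c2 varchar(16));
The first column is an int, the second a fixed-length character type, and the third a variable-length character type.
The corresponding data:
id=\x00000001, the number 1 (stored little-endian as 01 00 00 00)
c1=\x133120202020202020, a string, no byte swapping needed. The first byte \x13 is a 1-byte varlena header: its low bit marks the short format, and 0x13 >> 1 = 9 is the total length including the header byte. It is followed by the character '1' and 7 spaces
c2=\x0561, a string. The first byte \x05 is again a 1-byte varlena header (0x05 >> 1 = 2), followed by the character 'a'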
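The decoding above can be written out as a short Python sketch. It hard-codes the 15 data bytes from the dump and handles only the case seen here: a little-endian server, no NULLs, and values short enough to use the 1-byte varlena header. `read_short_varlena` is a hypothetical helper; 4-byte headers, alignment padding and TOAST pointers are deliberately not covered.

```python
# The 15 bytes of user data of the first tuple, copied from the dump above
data = bytes.fromhex("01000000" "133120202020202020" "0561")


def read_short_varlena(buf: bytes, pos: int) -> tuple:
    """Read one value with a 1-byte varlena header; return (text, next position)."""
    header = buf[pos]
    if not header & 0x01:
        raise ValueError("not a 1-byte varlena header")
    total_len = header >> 1  # length includes the header byte itself
    payload = buf[pos + 1:pos + total_len]
    return payload.decode("ascii"), pos + total_len


# id: 4-byte little-endian integer
id_value = int.from_bytes(data[0:4], "little", signed=True)
# c1 and c2 follow back to back
c1, pos = read_short_varlena(data, 4)
c2, pos = read_short_varlena(data, pos)

print(id_value, repr(c1), repr(c2), pos)
```

The final position equals the data length of 15, so the three columns account for every byte between t_hoff and the end of the tuple.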
