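0. Preface
These notes summarize my reading of "PostgreSQL数据库内核分析" (PostgreSQL Database Kernel Analysis) by 彭智勇 (Peng Zhiyong) and 彭煜玮 (Peng Yuwei). The goal is to condense this classic into something I can review regularly to reinforce my understanding.
Parts of the content are taken from EthanHe's kernel analysis series: https://www.jianshu.com/u/6b8fc3f18f72
1. Table storage structure
The figure below is taken from EthanHe's kernel analysis series: https://www.jianshu.com/p/012643cfba25. In PG, a table corresponds to one data file (which is split into segments when it grows too large). Each data file consists of a number of pages (file blocks), and each page stores a number of tuples.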
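To make the file/page relationship concrete, here is a minimal sketch that maps a block number to a segment file and a byte offset. It assumes the default build settings of 8 kB pages and 1 GB segments; the constant names mirror PostgreSQL's compile-time settings, while the helper `locate_block` is hypothetical and only for illustration.

```python
BLCKSZ = 8192          # assumed default page size in bytes
RELSEG_SIZE = 131072   # assumed default: 1 GB segment / 8 kB pages


def locate_block(relpath, blkno):
    """Map a block number to (segment file path, byte offset in that file).

    Hypothetical helper: segment 0 is `relpath` itself; later segments
    are named `relpath.1`, `relpath.2`, and so on.
    """
    segno, blk_in_seg = divmod(blkno, RELSEG_SIZE)
    path = relpath if segno == 0 else f"{relpath}.{segno}"
    return path, blk_in_seg * BLCKSZ
```

With these assumptions, block 3 of a relation sits 24576 bytes into the first segment file, and block 131072 is the first page of the `.1` segment.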
2. Page structure
 * a postgres disk page is always a slotted page of the form:
 *
 * +----------------+---------------------------------+
 * | PageHeaderData | linp1 linp2 linp3 ...           |
 * +-----------+----+---------------------------------+
 * | ... linpN |                                      |
 * +-----------+--------------------------------------+
 * |           ^ pd_lower                             |
 * |                                                  |
 * |             v pd_upper                           |
 * +-------------+------------------------------------+
 * |             | tupleN ...                         |
 * +-------------+------------------+-----------------+
 * |       ... tuple3 tuple2 tuple1 | "special space" |
 * +--------------------------------+-----------------+
 *                                  ^ pd_special
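1) PageHeaderData:
Every page begins with a header defined by the PageHeaderData structure, which stores metadata such as the LSN and the checksum. The header occupies at least 24 bytes. "At least" because the structure ends with the line pointer array pd_linp, whose size varies with the number of tuples; if no tuple has been inserted, the whole header is exactly 24 bytes. The main members are:
- pd_lsn: the LSN of the XLOG record written for the most recent change to this page. Its type is PageXLogRecPtr, which consists of two members, xlogid and xrecoff. The former is the logical ID of the WAL file and the latter is the offset within the WAL; both are 32-bit unsigned integers, so pd_lsn is an 8-byte unsigned value.
- pd_checksum: checksum of this page (available since version 9.3), a 2-byte unsigned integer.
- pd_flags: flag bits, see the definitions below, a 2-byte unsigned integer.
- pd_lower, pd_upper: pd_lower points to the end of the line pointer array, i.e. the start of the free space. pd_upper points to the start of the most recently added heap tuple, i.e. the end of the free space. Both are 2-byte unsigned integers.
- pd_special: used by index pages; in heap pages it points to the end of the page. A 2-byte unsigned integer.
- pd_pagesize_version: the page size and the page layout version packed into 2 bytes. In the dump analyzed in section 4 the value is 0x2004, i.e. page size 0x2000 (8192) plus layout version 4.
- pd_prune_xid: the oldest prunable transaction ID on the page, or zero if there is none, 4 bytes.
- pd_linp[FLEXIBLE_ARRAY_MEMBER]: an array of ItemIdData. ItemIdData consists of three members: lp_off, lp_flags and lp_len. Each ItemIdData points to one tuple in the page: lp_off is the offset of the tuple within the page, lp_len is the length of the tuple, and lp_flags is the state of the line pointer (one of four states: unused, normal, HOT redirect, dead). Each ItemIdData element is 4 bytes.
PageHeaderData and its related data structures are defined as follows:
Header file: src/include/storage/bufpage.h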
typedef struct PageHeaderData
{
/* XXX LSN is member of *any* block, not only page-organized ones */
PageXLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog
* record for last change to this page */
uint16 pd_checksum; /* checksum */
uint16 pd_flags; /* flag bits, see below */
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
uint16 pd_pagesize_version;
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;
typedef PageHeaderData *PageHeader;
/*
* pd_flags contains the following flag bits. Undefined bits are initialized
* to zero and may be used in the future.
*
* PD_HAS_FREE_LINES is set if there are any LP_UNUSED line pointers before
* pd_lower. This should be considered a hint rather than the truth, since
* changes to it are not WAL-logged.
*
* PD_PAGE_FULL is set if an UPDATE doesn't find enough free space in the
* page for its new tuple version; this suggests that a prune is needed.
* Again, this is just a hint.
*/
#define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */
#define PD_PAGE_FULL 0x0002 /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
Header file: src/include/storage/itemid.h
/*
* An item pointer (also called line pointer) on a buffer page
*
* In some cases an item pointer is "in use" but does not have any associated
* storage on the page. By convention, lp_len == 0 in every item pointer
* that does not have storage, independently of its lp_flags state.
*/
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of item pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
typedef ItemIdData *ItemId;
/*
* lp_flags has these possible states. An UNUSED line pointer is available
* for immediate re-use, the other states are not.
*/
#define LP_UNUSED 0 /* unused (should always have lp_len=0) */
#define LP_NORMAL 1 /* used (should always have lp_len>0) */
#define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
#define LP_DEAD 3 /* dead, may or may not have storage */
/*
* Item offsets and lengths are represented by these types when
* they're not actually stored in an ItemIdData.
*/
typedef uint16 ItemOffset;
typedef uint16 ItemLength;
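2) Free space: the space between pd_lower and pd_upper in the page header. Newly inserted tuples and their line pointers (linp entries) are both allocated from it: line pointers are allocated from the start of the free space, tuples from its end.
3) Tuples: the actual stored row data.
4) Special space: holds data specific to an index access method; different index methods store different data here. Index pages share the same page layout as ordinary table pages, but ordinary table (heap) pages do not use the special space, so it is empty there.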
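The relationships between the three offsets can be checked with a few lines of arithmetic. The sketch below is only an illustration: `page_summary` is a hypothetical helper, and it assumes the 24-byte fixed header, 4-byte line pointers and an 8192-byte page.

```python
SIZE_OF_PAGE_HEADER = 24   # fixed part of PageHeaderData, before pd_linp
SIZE_OF_ITEM_ID = 4        # one ItemIdData line pointer


def page_summary(pd_lower, pd_upper, pd_special, page_size=8192):
    """Derive simple layout facts from the three offsets in the page header."""
    return {
        "line_pointers": (pd_lower - SIZE_OF_PAGE_HEADER) // SIZE_OF_ITEM_ID,
        "free_bytes": pd_upper - pd_lower,
        "tuple_bytes": pd_special - pd_upper,
        "special_bytes": page_size - pd_special,
    }
```

For the page dumped in section 4 (pd_lower = 40, pd_upper = 8032, pd_special = 8192) this gives 4 line pointers, 7992 free bytes, 160 bytes of tuple storage and an empty special space.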
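3. Tuple structure
This part describes the tuple data structures. Each tuple has two parts: the tuple header, followed by the actual data.
HeapTupleHeaderData and its related data structures are as follows: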
//--------------------- src/include/storage/off.h
/*
* OffsetNumber:
*
* this is a 1-based index into the linp (ItemIdData) array in the
* header of each disk page.
*/
typedef uint16 OffsetNumber;
//--------------------- src/include/storage/block.h
/*
* BlockId:
*
* this is a storage type for BlockNumber. in other words, this type
* is used for on-disk structures (e.g., in HeapTupleData) whereas
* BlockNumber is the type on which calculations are performed (e.g.,
* in access method code).
*
* there doesn't appear to be any reason to have separate types except
* for the fact that BlockIds can be SHORTALIGN'd (and therefore any
* structures that contains them, such as ItemPointerData, can also be
* SHORTALIGN'd). this is an important consideration for reducing the
* space requirements of the line pointer (ItemIdData) array on each
* page and the header of each heap or index tuple, so it doesn't seem
* wise to change this without good reason.
*/
typedef struct BlockIdData
{
uint16 bi_hi;
uint16 bi_lo;
} BlockIdData;
typedef BlockIdData *BlockId; /* block identifier */
//--------------------- src/include/storage/itemptr.h
/*
* ItemPointer:
*
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
* (ItemIdData) array we want.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
* structure padding bytes. The struct is designed to be six bytes long
* (it contains three int16 fields) but a few compilers will pad it to
* eight bytes unless coerced. We apply appropriate persuasion where
* possible. If your compiler can't be made to play along, you'll waste
* lots of space.
*/
typedef struct ItemPointerData
{
BlockIdData ip_blkid;
OffsetNumber ip_posid;
} ItemPointerData;
//--------------------- src/include/access/htup_details.h
typedef struct HeapTupleFields
{
TransactionId t_xmin; /* inserting xact ID */
TransactionId t_xmax; /* deleting or locking xact ID */
union
{
CommandId t_cid; /* inserting or deleting command ID, or both */
TransactionId t_xvac; /* old-style VACUUM FULL xact ID */
} t_field3;
} HeapTupleFields;
typedef struct DatumTupleFields
{
int32 datum_len_; /* varlena header (do not touch directly!) */
int32 datum_typmod; /* -1, or identifier of a record type */
Oid datum_typeid; /* composite type OID, or RECORDOID */
/*
* Note: field ordering is chosen with thought that Oid might someday
* widen to 64 bits.
*/
} DatumTupleFields;
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
ItemPointerData t_ctid; /* current TID of this or newer tuple (or a
* speculative insertion token) */
/* Fields below here must match MinimalTupleData! */
uint16 t_infomask2; /* number of attributes + various flags */
uint16 t_infomask; /* various flag bits, see below */
uint8 t_hoff; /* sizeof header incl. bitmap, padding */
/* ^ - 23 bytes - ^ */
bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
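Analysis of the HeapTupleHeaderData structure:
1) t_choice is a union with two members:
- t_heap: records the transaction IDs and command ID of the operations that inserted or deleted the tuple. This information is mainly used by concurrency control to check whether the tuple is visible to a transaction. Expanded, t_heap contains the following members: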
Field        Type             Length   Offset  Description
-----------------------------------------------------------------------------------------------------
t_xmin       TransactionId    4 bytes  0       insert XID stamp
t_xmax       TransactionId    4 bytes  4       delete XID stamp
t_cid        CommandId        4 bytes  8       insert and/or delete CID stamp (overlays with t_xvac)
t_xvac       TransactionId    4 bytes  8       XID for VACUUM operation moving a row version
// Note: t_cid and t_xvac form a union and share the same storage
// The remaining members of HeapTupleHeaderData follow
t_ctid       ItemPointerData  6 bytes  12      current TID of this or newer row version
t_infomask2  uint16           2 bytes  18      number of attributes, plus various flag bits
t_infomask   uint16           2 bytes  20      various flag bits
t_hoff       uint8            1 byte   22      offset to user data
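- t_datum: when a new tuple is formed in memory, its transaction visibility is irrelevant, so t_choice only needs the DatumTupleFields structure to record information such as the tuple length. When the tuple is inserted into a table file, the header must record the inserting transaction ID and command ID, and at that point the memory occupied by t_choice is reinterpreted as a HeapTupleFields structure.
2) t_ctid records the physical location of the current tuple or of its newer version, as a block number plus the index of the line pointer within that block. If the tuple has been updated, it points to the location of the new version.
3) t_infomask describes the current state of the tuple, for example whether it has an OID or whether it contains NULL attributes. Each of its 16 bits stands for one state flag.
4) t_infomask2 uses its low 11 bits for the number of attributes in the tuple. The remaining bits are flags for HOT and tuple visibility.
5) t_hoff is the size of the tuple header, i.e. the offset at which user data starts.
6) The t_bits[] array marks which attributes of the tuple are NULL.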
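The value of t_hoff follows from the layout above: 23 fixed bytes, plus the NULL bitmap if present, rounded up to the maximum alignment. The sketch below illustrates that calculation under stated assumptions: an alignment of 8 bytes (typical 64-bit platform), no OID column, and hypothetical helper names.

```python
SIZEOF_FIXED_HEADER = 23   # offsetof(HeapTupleHeaderData, t_bits)
MAXIMUM_ALIGNOF = 8        # assumption: typical 64-bit platform


def maxalign(n):
    """Round n up to the next multiple of MAXIMUM_ALIGNOF."""
    return (n + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def heap_header_size(natts, has_nulls):
    """Estimate t_hoff for a tuple with natts attributes (no OID column)."""
    size = SIZEOF_FIXED_HEADER
    if has_nulls:
        size += (natts + 7) // 8   # one bit per attribute in t_bits
    return maxalign(size)
```

For a three-column tuple without NULLs this yields 24, which is why a header of 23 meaningful bytes shows up on disk with t_hoff = 24.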
The HeapTupleData structure:
typedef struct HeapTupleData
{
uint32 t_len; /* length of *t_data */
ItemPointerData t_self; /* SelfItemPointer */
Oid t_tableOid; /* table the tuple came from */
HeapTupleHeader t_data; /* -> tuple header and data */
} HeapTupleData;
typedef HeapTupleData *HeapTuple;
#define HEAPTUPLESIZE MAXALIGN(sizeof(HeapTupleData))
4. Dissecting the physical storage of pages and tuples with hexdump
1) Preparing test data
-- Create a table and insert a few rows
drop table if exists t_page;
create table t_page (id int,c1 char(8),c2 varchar(16));
insert into t_page values(1,'1','a');
insert into t_page values(2,'2','b');
insert into t_page values(3,'3','c');
insert into t_page values(4,'4','d');
-- Get the data file that backs this table
postgres=# select pg_relation_filepath('t_page');
-[ RECORD 1 ]--------+-----------------
pg_relation_filepath | base/13451/17020
-- Dump the contents of the data file
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020
00000000 00 00 00 00 e0 2d 02 0f 19 b5 00 00 28 00 60 1f |.....-......(.`.|
00000010 00 20 04 20 00 00 00 00 d8 9f 4e 00 b0 9f 4e 00 |. . ......N...N.|
00000020 88 9f 4e 00 60 9f 4e 00 00 00 00 00 00 00 00 00 |..N.`.N.........|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001f60 73 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |s...............|
00001f70 04 00 03 00 02 08 18 00 04 00 00 00 13 34 20 20 |.............4 |
00001f80 20 20 20 20 20 05 64 00 72 02 00 00 00 00 00 00 | .d.r.......|
00001f90 00 00 00 00 00 00 00 00 03 00 03 00 02 08 18 00 |................|
00001fa0 03 00 00 00 13 33 20 20 20 20 20 20 20 05 63 00 |.....3 .c.|
00001fb0 71 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |q...............|
00001fc0 02 00 03 00 02 08 18 00 02 00 00 00 13 32 20 20 |.............2 |
00001fd0 20 20 20 20 20 05 62 00 70 02 00 00 00 00 00 00 | .b.p.......|
00001fe0 00 00 00 00 00 00 00 00 01 00 03 00 02 08 18 00 |................|
00001ff0 01 00 00 00 13 31 20 20 20 20 20 20 20 05 61 00 |.....1 .a.|
00002000
2) Dissecting the page header (PageHeaderData) with hexdump
pd_lsn(8bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 0 -n 8
00000000 00 00 00 00 e0 2d 02 0f |.....-..|
00000008
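The first 8 bytes of the data file store the LSN. The first 4 bytes are the logical file ID (xlogid), here \x00000000 (the number 0); the next 4 bytes (xrecoff) are \x0F022DE0. Combined, the LSN is 0/0F022DE0.
Notes:
A. 00000000 and 00000008 are offsets printed by hexdump, not data content.
B. x86 is little-endian, so reverse the byte order when reading multi-byte values.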
pd_checksum(2bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 8 -n 2
00000008 19 b5 |..|
0000000a
checksum is 0xb519
pd_flags(2bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 10 -n 2
0000000a 00 00 |..|
0000000c
flags is 0x0000
pd_lower(2bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 12 -n 2
0000000c 28 00 |(.|
0000000e
lower is 0x0028, decimal 40 (the 24-byte header plus four 4-byte line pointers)
pd_upper(2bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 14 -n 2
0000000e 60 1f |`.|
00000010
[postgres@sndspstdb62 ~]$ echo $((0x1f60))
8032
upper is 0x1f60, decimal 8032
pd_special(2bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 16 -n 2
00000010 00 20 |. |
00000012
pd_special is 0x2000, decimal 8192, i.e. the end of the page, so this heap page has no special space
pd_pagesize_version(2bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 18 -n 2
00000012 04 20 |. |
00000014
pagesize_version is 0x2004, decimal 8196: page size 0x2000 (8192) plus layout version 4
pd_prune_xid(4bytes)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 20 -n 4
00000014 00 00 00 00 |....|
00000018
prune_xid is 0x00000000, i.e. 0
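The field-by-field walk above can also be done in one step. The following sketch unpacks the first 24 bytes of the dump with Python's `struct` module; the function name `parse_page_header` and the dictionary keys are illustrative choices, and the code assumes a little-endian machine such as the x86 host used here.

```python
import struct

# The first 24 bytes of the page, copied from the hexdump output above.
HEADER = bytes.fromhex(
    "00 00 00 00 e0 2d 02 0f 19 b5 00 00 28 00 60 1f "
    "00 20 04 20 00 00 00 00"
)


def parse_page_header(buf):
    """Unpack the fixed 24-byte part of PageHeaderData (little-endian)."""
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, pagesize_version, prune_xid) = struct.unpack("<IIHHHHHHI", buf[:24])
    return {
        "pd_lsn": f"{xlogid:X}/{xrecoff:X}",
        "pd_checksum": checksum,
        "pd_flags": flags,
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        "page_size": pagesize_version & 0xFF00,       # high byte: page size
        "layout_version": pagesize_version & 0x00FF,  # low byte: layout version
        "pd_prune_xid": prune_xid,
    }


header = parse_page_header(HEADER)
```

The LSN is formatted without zero padding here, so the same value reads as 0/F022DE0.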
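3) Dissecting the line pointer (linp, i.e. ItemId) array with hexdump
The linp array immediately follows the fixed part of PageHeaderData, and each element occupies 4 bytes. The data structure is repeated here for convenience: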
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of item pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
typedef ItemIdData *ItemId;
/*
* lp_flags has these possible states. An UNUSED line pointer is available
* for immediate re-use, the other states are not.
*/
#define LP_UNUSED 0 /* unused (should always have lp_len=0) */
#define LP_NORMAL 1 /* used (should always have lp_len>0) */
#define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
#define LP_DEAD 3 /* dead, may or may not have storage */
lp_off
Offset of the tuple (from the start of the page)
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 24 -n 2
00000018 d8 9f |..|
0000001a
The raw 16-bit value is 0x9fd8; lp_off is its low 15 bits
[xdb@localhost utf8db]$ echo $((0x9fd8 & ~$((1<<15))))
8152
lp_off is 8152 in decimal, so the first item (tuple) starts at page offset 8152
lp_len
[postgres@sndspstdb62 ~]$ hexdump -C $PGDATA/base/13451/17020 -s 26 -n 2
0000001a 4e 00 |N.|
0000001c
Take the high 15 bits
[xdb@localhost utf8db]$ echo $((0x004e >> 1))
39
So the first item (tuple) is 39 bytes long
lp_flags
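Take bits 16 and 17 of the 32-bit line pointer (counting from 1): they are 01, i.e. 1 (LP_NORMAL), meaning the tuple is in normal use.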
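The three bit fields can be extracted together by reading the 4 bytes as one little-endian integer. This is a minimal sketch, assuming the compiler allocates bit fields starting from the least significant bit (as GCC does on x86); `decode_item_id` is a hypothetical helper name.

```python
def decode_item_id(raw):
    """Split a 4-byte ItemIdData into (lp_off, lp_flags, lp_len)."""
    word = int.from_bytes(raw, "little")
    lp_off = word & 0x7FFF            # bits 0-14
    lp_flags = (word >> 15) & 0x3     # bits 15-16
    lp_len = (word >> 17) & 0x7FFF    # bits 17-31
    return lp_off, lp_flags, lp_len


# The four line pointers from the dump (page offsets 0x18 to 0x27).
LINE_POINTERS = bytes.fromhex("d8 9f 4e 00 b0 9f 4e 00 88 9f 4e 00 60 9f 4e 00")
items = [decode_item_id(LINE_POINTERS[i:i + 4]) for i in range(0, 16, 4)]
```

All four entries decode to LP_NORMAL with a length of 39, and their offsets step down by 40 bytes because each 39-byte tuple is padded to an 8-byte boundary.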
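4) Dissecting the tuple header (HeapTupleHeaderData) with hexdump
t_xmin
t_xmin stores the txid of the transaction that inserted this tuple. Note that the sample output in this part was captured from a different data file (base/16477/24801), so the values need not match the full dump above.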
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8152 -n 4
00001fd8 e2 1b 18 00 |....|
00001fdc
t_xmin = 0x00181be2
Convert t_xmin to decimal
[xdb@localhost ~]$ echo $((0x00181be2))
1580002
t_xmax
Stores the txid of the transaction that deleted or updated this tuple. If the tuple has not been deleted or updated, t_xmax is 0, i.e. invalid.
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8156 -n 4
00001fdc 00 00 00 00 |....|
00001fe0
t_xmax is 0, so this tuple has not been deleted or updated.
t_cid/t_xvac
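t_cid stores the command ID (cid): the number of SQL commands executed in the current transaction before the command that inserted or deleted this tuple, counting from zero.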
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8160 -n 4
00001fe0 00 00 00 00 |....|
00001fe4
t_cid = 0, so the command that inserted this tuple was the first command in its transaction.
t_ctid
Stores the tuple identifier (tid) that points to the tuple itself or to its newer version
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8164 -n 6
00001fe4 00 00 00 00 01 00 |......|
00001fea
ip_blkid = 0x00000000 (bi_hi = 0x0000, bi_lo = 0x0000), i.e. block 0
ip_posid = 0x0001, i.e. posid = 1, the first line pointer in the block
t_infomask2
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8170 -n 2
00001fea 03 00 |..|
00001fec
t_infomask2 = 0x0003; the low 11 bits hold the number of attributes, so 3 means this tuple has three columns
t_infomask
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8172 -n 2
00001fec 02 08 |..|
00001fee
[xdb@localhost ~]$ echo $((0x0802))
2050
[xdb@localhost ~]$ echo "obase=2;2050"|bc
100000000010
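// t_infomask = 0x0802 (decimal 2050, binary 100000000010) describes the current state of the tuple. Checked against the flag masks in htup_details.h, it is HEAP_XMAX_INVALID (0x0800) | HEAP_HASVARWIDTH (0x0002).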
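Matching a mask against the flag definitions is easy to script. The sketch below lists only a subset of the t_infomask flags, with values as I understand them from src/include/access/htup_details.h; treat the table as an assumption and check it against the header of your PostgreSQL version. `decode_infomask` is a hypothetical helper.

```python
# Subset of the t_infomask flag bits from htup_details.h (verify per version).
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x2000: "HEAP_UPDATED",
}


def decode_infomask(mask):
    """Return the names of the known flags set in t_infomask, low bit first."""
    return [name for bit, name in sorted(INFOMASK_FLAGS.items()) if mask & bit]
```

For 0x0802 this returns HEAP_HASVARWIDTH and HEAP_XMAX_INVALID: the tuple has variable-width attributes and no valid deleting transaction.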
t_hoff
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8174 -n 1
00001fee 18 |.|
00001fef
[xdb@localhost ~]$ echo $((0x18))
24
User data starts 24 bytes into the tuple, i.e. at page offset 8152+24=8176
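As a cross-check, the 23 fixed header bytes collected in the steps above can be unpacked in one call. This is a sketch assuming little-endian storage; the byte string is assembled from the individual hexdump outputs, and `parse_tuple_header` is an illustrative name.

```python
import struct

# The 23 fixed header bytes, assembled from the hexdump outputs above.
TUPLE_HEADER = bytes.fromhex(
    "e2 1b 18 00 "        # t_xmin
    "00 00 00 00 "        # t_xmax
    "00 00 00 00 "        # t_cid / t_xvac
    "00 00 00 00 01 00 "  # t_ctid (bi_hi, bi_lo, ip_posid)
    "03 00 "              # t_infomask2
    "02 08 "              # t_infomask
    "18"                  # t_hoff
)


def parse_tuple_header(buf):
    """Unpack the fixed 23-byte part of HeapTupleHeaderData (little-endian)."""
    (xmin, xmax, cid, bi_hi, bi_lo, posid,
     infomask2, infomask, hoff) = struct.unpack("<IIIHHHHHB", buf[:23])
    return {
        "t_xmin": xmin,
        "t_xmax": xmax,
        "t_cid": cid,
        "t_ctid": ((bi_hi << 16) | bi_lo, posid),  # (block number, line pointer index)
        "natts": infomask2 & 0x07FF,               # low 11 bits of t_infomask2
        "t_infomask": infomask,
        "t_hoff": hoff,
    }


tuple_header = parse_tuple_header(TUPLE_HEADER)
```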
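5) Dissecting the tuple's stored data with hexdump
With the tuple header covered, let's look at how the actual data is stored.
The first tuple is 39 bytes long in total and t_hoff = 24, so the data occupies 39-24=15 bytes.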
[xdb@localhost ~]$ hexdump -C $PGDATA/base/16477/24801 -s 8176 -n 15
00001ff0 01 00 00 00 13 31 20 20 20 20 20 20 20 05 61 |.....1 .a|
00001fff
Recall the table definition:
create table t_page (id int,c1 char(8),c2 varchar(16));
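The first column is an int, the second is a fixed-length (blank-padded) character type, and the third is a variable-length character type. As the dump shows, both character columns are stored with a varlena header.
The corresponding data:
id=\x00000001, the number 1 (stored little-endian as 01 00 00 00)
c1=\x133120202020202020, a string, no byte swapping needed. The first byte \x13 is a 1-byte varlena header: the low bit 1 marks the short header and 0x13 >> 1 = 9 is the total length including the header, followed by the character '1' and 7 spaces
c2=\x0561, a string. The first byte \x05 is again a 1-byte varlena header (0x05 >> 1 = 2 bytes in total), followed by the character 'a'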
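The same decoding can be expressed as a short script. It is a sketch for this specific row layout only: it assumes little-endian storage, no NULLs, and values short enough to use the 1-byte varlena header; the function names are hypothetical.

```python
import struct

# The 15 data bytes of the first tuple, from the hexdump output above.
TUPLE_DATA = bytes.fromhex("01 00 00 00 13 31 20 20 20 20 20 20 20 05 61")


def read_short_varlena(buf, pos):
    """Read a varlena with a 1-byte header; return (payload, next position)."""
    header = buf[pos]
    if header & 0x01 == 0 or header == 0x01:
        raise ValueError("only plain 1-byte varlena headers are handled here")
    total_len = header >> 1          # length includes the header byte
    return buf[pos + 1:pos + total_len], pos + total_len


def decode_t_page_row(buf):
    """Decode (id int, c1 char(8), c2 varchar(16)) for a row without NULLs."""
    (row_id,) = struct.unpack_from("<i", buf, 0)
    c1, pos = read_short_varlena(buf, 4)
    c2, pos = read_short_varlena(buf, pos)
    return row_id, c1.decode("ascii"), c2.decode("ascii")


row = decode_t_page_row(TUPLE_DATA)
```

Longer values use a 4-byte header and alignment padding, which this sketch deliberately rejects rather than guesses at.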