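This article analyzes the PostgreSQL 14 source code; the demonstrations were run on CentOS 8.
A table that you access through SQL is physically stored on disk as files: each table corresponds to one or more files.
PostgreSQL uses initdb to initialize a database cluster directory. This directory holds all of the cluster's data, organized on disk as directories and files. Let's look at the structure of the cluster directory.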
./zptest/
├── base
│ ├── 1
│ ├── 4
│ └── 5
├── global
│ ├── 1213
│ ├── 1213_fsm
│ ├── 1213_vm
│ ├── 1214
│ ├── 1232
│ ├── 1233
│ ├── 1260
│ ├── 1260_fsm
│ ├── 1260_vm
│ ├── 1261
│ ├── 1261_fsm
│ ├── 1261_vm
│ ├── 1262
│ ├── 1262_fsm
│ ├── 1262_vm
│ ├── 2396
│ ├── 2396_fsm
│ ├── 2396_vm
│ ├── 2397
│ ├── 2671
│ ├── 2672
│ ├── 2676
│ ├── 2677
│ ├── 2694
│ ├── 2695
│ ├── 2697
│ ├── 2698
│ ├── 2846
│ ├── 2847
│ ├── 2964
│ ├── 2965
│ ├── 2966
│ ├── 2967
│ ├── 3592
│ ├── 3593
│ ├── 4060
│ ├── 4061
│ ├── 4175
│ ├── 4176
│ ├── 4177
│ ├── 4178
│ ├── 4181
│ ├── 4182
│ ├── 4183
│ ├── 4184
│ ├── 4185
│ ├── 4186
│ ├── 6000
│ ├── 6001
│ ├── 6002
│ ├── 6100
│ ├── 6114
│ ├── 6115
│ ├── pg_control
│ └── pg_filenode.map
├── pg_commit_ts
├── pg_dynshmem
├── pg_hba.conf
├── pg_ident.conf
├── pg_logical
│ ├── mappings
│ ├── replorigin_checkpoint
│ └── snapshots
├── pg_multixact
│ ├── members
│ └── offsets
├── pg_notify
├── pg_replslot
├── pg_serial
├── pg_snapshots
├── pg_stat
├── pg_stat_tmp
├── pg_subtrans
│ └── 0000
├── pg_tblspc
├── pg_twophase
├── PG_VERSION
├── pg_wal
│ ├── 000000010000000000000001
│ └── archive_status
├── pg_xact
│ └── 0000
├── postgresql.auto.conf
└── postgresql.conf
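Table files are stored under the base directory. A cluster usually contains several databases, so which subdirectory of base belongs to which database?
Each database has an OID, and its directory is named after that OID.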
/*
* Object ID is a fundamental type in Postgres.
*/
typedef unsigned int Oid;
postgres=# select oid, datname from pg_database order by oid;
oid | datname
-----+-----------
1 | template1
4 | template0
5 | postgres
(3 rows)
The default database we are connected to, postgres, has OID 5, so its files should be under base/5/. Let's verify.
postgres=# select pg_relation_filepath('pg_class');
pg_relation_filepath
----------------------
base/5/1259
(1 row)
The file for the pg_class table of the current database is indeed under base/5/.
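The naming rule seen so far (base/, then the database OID, then the table's file name) can be written down as a small C sketch. The function name sketch_relation_path is hypothetical and is not part of PostgreSQL; it only covers tables in the default tablespace, and it assumes the caller already knows the file name reported by pg_relation_filepath.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same definition as the Oid typedef quoted above. */
typedef unsigned int Oid;

/*
 * Hypothetical helper (not a PostgreSQL function): format the path of a
 * table's first file, relative to the cluster directory, for a table in
 * the default tablespace. Returns whatever snprintf returns, i.e. the
 * length of the formatted path, or a negative value on error.
 */
int
sketch_relation_path(char *buf, size_t buflen, Oid db_oid, Oid filenode)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "base/%u/%u", db_oid, filenode);
}
```

Calling sketch_relation_path(buf, sizeof(buf), 5, 1259) fills buf with base/5/1259, the same path that pg_relation_filepath('pg_class') returned above.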
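(1) Relationship between the table file and the table's OID:
Every database object is identified by an OID, and tables are no exception. In general, the file name of a newly created table is the same as the table's OID, as shown below: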
postgres=# create table test(id integer);
CREATE TABLE
postgres=# select oid from pg_class where relname='test';
oid
-------
16384
(1 row)
postgres=# select pg_relation_filepath('test');
pg_relation_filepath
----------------------
base/5/16384
(1 row)
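However, this correspondence can change. Operations such as VACUUM FULL rewrite the table into a new file, after which the file name (the relfilenode) no longer matches the table's OID. To find the correct file, query pg_relation_filepath.
For ordinary tables, the current file name is recorded in pg_class.relfilenode. For the shared catalogs and a few critical catalogs such as pg_class itself, relfilenode is 0, and the mapping from OID to file name is maintained in the pg_filenode.map file.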
/*
* The map file is critical data: we have no automatic method for recovering
* from loss or corruption of it. We use a CRC so that we can detect
* corruption. To minimize the risk of failed updates, the map file should
* be kept to no more than one standard-size disk sector (ie 512 bytes),
* and we use overwrite-in-place rather than playing renaming games.
* The struct layout below is designed to occupy exactly 512 bytes, which
* might make filesystem updates a bit more efficient.
*
* Entries in the mappings[] array are in no particular order. We could
* speed searching by insisting on OID order, but it really shouldn't be
* worth the trouble given the intended size of the mapping sets.
*/
#define RELMAPPER_FILENAME "pg_filenode.map"
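To make the "exactly 512 bytes" remark concrete, here is a minimal sketch of a layout that satisfies it, together with the unordered linear search that the comment describes. All names carry a Sketch prefix because they are illustrative. The field list and the value 62 are assumptions modeled on the author's reading of relmapper.c in PostgreSQL 14 and should be checked against the actual source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned int Oid;

/* Assumed value: 62 entries * 8 bytes + 16 bytes of other fields = 512. */
#define SKETCH_MAX_MAPPINGS 62

typedef struct SketchRelMapping
{
    Oid mapoid;        /* OID of a mapped catalog */
    Oid mapfilenode;   /* its current file name on disk */
} SketchRelMapping;

typedef struct SketchRelMapFile
{
    int32_t          magic;         /* identifies the file format */
    int32_t          num_mappings;  /* number of valid entries */
    SketchRelMapping mappings[SKETCH_MAX_MAPPINGS];
    uint32_t         crc;           /* checksum used to detect corruption */
    int32_t          pad;           /* rounds the struct up to 512 bytes */
} SketchRelMapFile;

/* Fails at compile time if the layout is not one 512-byte sector. */
_Static_assert(sizeof(SketchRelMapFile) == 512,
               "map file sketch must be exactly 512 bytes");

/*
 * Entries are in no particular order, so scan them linearly.
 * Returns the mapped file name, or 0 if the OID is not in the map.
 */
Oid
sketch_map_lookup(const SketchRelMapFile *map, Oid relation_oid)
{
    assert(map != NULL);
    for (int32_t i = 0; i < map->num_mappings; i++)
    {
        if (map->mappings[i].mapoid == relation_oid)
            return map->mappings[i].mapfilenode;
    }
    return 0;
}
```

With at most a few dozen entries, the linear scan is cheap, which matches the comment's point that sorting by OID would not be worth the trouble.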
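(2) Table file size:
Because operating systems impose limits on file size, PostgreSQL caps each table file at 1 GB by default.
When a table grows beyond 1 GB, additional segment files are created. The first file is named after the relfilenode alone, and later segments append a sequence number: relfilenode.1, relfilenode.2, …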
/* RELSEG_SIZE is the maximum number of blocks allowed in one disk file. Thus,
the maximum size of a single file is RELSEG_SIZE * BLCKSZ; relations bigger
than that are divided into multiple files. RELSEG_SIZE * BLCKSZ must be
less than your OS' limit on file size. This is often 2 GB or 4GB in a
32-bit operating system, unless you have large file support enabled. By
default, we make the limit 1 GB to avoid any possible integer-overflow
problems within the OS. A limit smaller than necessary only means we divide
a large relation into more chunks than necessary, so it seems best to err
in the direction of a small limit. A power-of-2 value is recommended to
save a few cycles in md.c, but is not absolutely required. Changing
RELSEG_SIZE requires an initdb. */
#define RELSEG_SIZE 131072
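RELSEG_SIZE limits the number of blocks in each table file, so the maximum size of a single file is BLCKSZ * RELSEG_SIZE. BLCKSZ defaults to 8 KB, which gives 8192 * 131072 bytes = 1 GB.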
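The arithmetic and the segment naming rule can be checked with a short C sketch. The two constants are the default values discussed above. The sketch_ functions are hypothetical helpers written for this article and are not PostgreSQL APIs; the real logic lives in md.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ      8192     /* default block size in bytes */
#define RELSEG_SIZE 131072   /* maximum blocks per segment file */

/* Maximum size of one segment file in bytes. */
uint64_t
sketch_segment_bytes(void)
{
    return (uint64_t) BLCKSZ * RELSEG_SIZE;
}

/* Number of the segment file that holds a given block of the table. */
uint32_t
sketch_segment_of_block(uint32_t block_number)
{
    return block_number / RELSEG_SIZE;
}

/*
 * File name of a segment: the first segment is just the file name,
 * later segments append ".1", ".2", and so on.
 * Returns whatever snprintf returns.
 */
int
sketch_segment_name(char *buf, size_t buflen, unsigned int filenode,
                    uint32_t segment_number)
{
    assert(buf != NULL);
    if (segment_number == 0)
        return snprintf(buf, buflen, "%u", filenode);
    return snprintf(buf, buflen, "%u.%u", filenode,
                    (unsigned int) segment_number);
}
```

For the test table above, block 131071 is the last block of file 16384, and block 131072 is the first block of file 16384.1.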
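Author email: [email protected]
If you find any errors or omissions, please point them out so we can learn from each other.
Note: do not repost without permission!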