A buffer manager manages data transfers between shared memory and persistent storage (for example, the disk that holds the database cluster) and can have a significant impact on the performance of the DBMS. The PostgreSQL buffer manager works very efficiently.
A buffer_tag consists of three values: the RelFileNode of the relation to which the page belongs (the tablespace, database, and relation OIDs that locate the relation in storage), the fork number of that relation, and the block number of the page. For example, the buffer_tag '{(16821, 16384, 37721), 0, 7}' identifies the page in block number 7 of the relation whose OID and fork number are 37721 and 0, respectively; the relation belongs to the database whose OID is 16384, under the tablespace whose OID is 16821. Similarly, the buffer_tag '{(16821, 16384, 37721), 1, 3}' identifies the page in block number 3 of the free space map whose OID and fork number are 37721 and 1, respectively.
When a backend process modifies a page in the buffer pool (e.g., by inserting tuples), the modified page, which has not yet been flushed to storage, is referred to as a dirty page.
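The three-part buffer_tag described above can be pictured as a small C struct. This is a minimal sketch: the type and field names (`RelFileNodeSketch`, `BufferTagSketch`, `buffer_tag_equal`) are hypothetical and chosen for illustration, not copied from the PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch types; the names are illustrative only. */
typedef struct RelFileNodeSketch {
    uint32_t spc_oid;   /* tablespace OID */
    uint32_t db_oid;    /* database OID */
    uint32_t rel_oid;   /* relation OID */
} RelFileNodeSketch;

typedef struct BufferTagSketch {
    RelFileNodeSketch rnode; /* which relation the page belongs to */
    int32_t  fork_num;       /* 0 = main fork, 1 = free space map */
    uint32_t block_num;      /* block number of the page */
} BufferTagSketch;

/* Two tags identify the same page only if every component matches. */
static bool buffer_tag_equal(const BufferTagSketch *a, const BufferTagSketch *b)
{
    assert(a != NULL && b != NULL);
    return a->rnode.spc_oid == b->rnode.spc_oid &&
           a->rnode.db_oid  == b->rnode.db_oid  &&
           a->rnode.rel_oid == b->rnode.rel_oid &&
           a->fork_num      == b->fork_num      &&
           a->block_num     == b->block_num;
}
```

With this layout, the two example tags from the text would be written as `{{16821, 16384, 37721}, 0, 7}` and `{{16821, 16384, 37721}, 1, 3}`; they share a relation but differ in fork and block, so they compare as different pages.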
Section 8.4 describes how the buffer manager works.
For example, the slot 'Tag_A, id=1' means that the buffer descriptor with buffer_id 1 stores the metadata of the page tagged with Tag_A.
A buffer descriptor holds the metadata of the page stored in the corresponding buffer pool slot. The buffer descriptor structure is defined by the structure BufferDesc. While this structure has many fields, the main ones are shown in the following:
BufferTag tag; /* ID of page contained in buffer */
tag holds the buffer_tag of the page stored in the corresponding buffer pool slot (the buffer tag is defined in Section 8.1.2), for example '{(16821, 16384, 37721), 1, 3}'.
int buf_id; /* buffer's index number (from 0) */
buf_id identifies the descriptor (equivalent to the buffer_id of the corresponding buffer pool slot).
unsigned refcount; /* # of backends holding pins on buffer */
refcount holds the number of PostgreSQL processes currently accessing the associated stored page. It is also referred to as the pin count. When a PostgreSQL process accesses the stored page, its refcount must be incremented by 1 (refcount++). After accessing the page, its refcount must be decreased by 1 (refcount--). When the refcount is zero, i.e. the associated stored page is not currently being accessed, the page is unpinned; otherwise it is pinned.
uint16 usage_count; /* usage counter for clock sweep code */
usage_count holds the number of times the associated stored page has been accessed since it was loaded into the corresponding buffer pool slot. Note that usage_count is used in the page replacement algorithm (Section 8.4.4).
LWLockId content_lock; /* to lock access to buffer contents */
LWLockId io_in_progress_lock; /* to wait for I/O to complete */
content_lock and io_in_progress_lock are light-weight locks that are used to control access to the associated stored page. These fields are described in Section 8.3.2.
BufFlags flags; /* see bit definitions above */
flags can hold several states of the associated stored page, stored as individual bits such as the dirty bit (BM_DIRTY), the valid bit (BM_VALID), and the io_in_progress bit (BM_IO_IN_PROGRESS).
int freeNext; /* link in freelist chain */
freeNext links to the next empty descriptor; these links form the freelist, which lets an empty descriptor be found quickly when a new page is loaded. The freelist is described in the next subsection.
Fig. 8.5. Buffer manager initial state.
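To make the pin-count rules concrete, the fields above can be collected into a reduced struct with pin and unpin helpers. This is a minimal single-process sketch under stated assumptions: `BufferDescSketch`, `pin`, `unpin`, and `is_pinned` are hypothetical names, the lock fields are omitted, `uint16_t` stands in for `uint16`, and no spinlock is taken, so it is not safe for concurrent use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced descriptor: lock fields are left out on purpose. */
typedef struct BufferDescSketch {
    int      buf_id;       /* index of the descriptor, from 0 */
    unsigned refcount;     /* number of processes pinning the page */
    uint16_t usage_count;  /* accesses since the page was loaded */
    uint16_t flags;        /* state bits, e.g. dirty or valid */
    int      free_next;    /* next empty descriptor, -1 if none */
} BufferDescSketch;

/* Accessing a page: both counters go up by one. */
static void pin(BufferDescSketch *d)
{
    d->refcount++;
    d->usage_count++;
}

/* Finishing the access: only the pin count goes down. */
static void unpin(BufferDescSketch *d)
{
    assert(d->refcount > 0);   /* unpinning an unpinned page is a bug */
    d->refcount--;
}

/* A page is pinned while at least one process is accessing it. */
static bool is_pinned(const BufferDescSketch *d)
{
    return d->refcount > 0;
}
```

Note that `unpin` leaves `usage_count` untouched: the pin count tracks current accessors, while the usage count is a history that only the replacement algorithm decreases.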
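A BufferDesc entry is inserted into the empty buffer descriptor slot retrieved in the previous step; the entry records the relation between the buffer_tag of the first page and the buffer_id of the retrieved descriptor.
Fig. 8.6. Loading the first page.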
The buffer table requires many other locks. For example, the buffer table internally uses a spin lock to delete an entry. However, descriptions of these other locks are omitted because they are not required in this document.
The content_lock is a typical lock that enforces access limits. It can be used in shared and exclusive modes.
When reading a page, a backend process acquires a shared content_lock of the buffer descriptor that stores the page.
However, an exclusive content_lock is acquired when, for example, changing the t_xmin or t_xmax fields of tuples within the stored page (put simply, when rows are deleted or updated).
The official README file shows more details.
When the flags or other fields (e.g. refcount and usage_count) are checked or changed, a spinlock is used. Two specific examples of spinlock usage are given below:
(1) The following shows how to pin the buffer descriptor:
LockBufHdr(bufferdesc); /* Acquire a spinlock */
bufferdesc->refcount++; /* One more process is accessing the page */
bufferdesc->usage_count++; /* The page has been accessed once more */
UnlockBufHdr(bufferdesc); /* Release the spinlock */
(2) The following shows how to set the dirty bit to '1':
#define BM_DIRTY (1 << 0) /* data needs writing */
#define BM_VALID (1 << 1) /* data is valid */
#define BM_TAG_VALID (1 << 2) /* tag is assigned */
#define BM_IO_IN_PROGRESS (1 << 3) /* read or write in progress */
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
LockBufHdr(bufferdesc); /* Acquire the spinlock */
bufferdesc->flags |= BM_DIRTY; /* Set the dirty bit */
UnlockBufHdr(bufferdesc); /* Release the spinlock */
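The two examples above rely on a header spinlock whose implementation is not shown in this text. As a self-contained illustration, the sketch below builds a comparable lock from the C11 `atomic_flag` type. This is an assumed substitute, not how `LockBufHdr` is implemented; `HdrSketch`, the `hdr_*` helpers, and the `SK_*` bit names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_DIRTY (1 << 0)   /* illustrative bit: data needs writing */
#define SK_VALID (1 << 1)   /* illustrative bit: data is valid */

/* Hypothetical header: a spinlock plus the fields it protects. */
typedef struct HdrSketch {
    atomic_flag lock;
    unsigned    refcount;
    uint16_t    usage_count;
    uint16_t    flags;
} HdrSketch;

static void hdr_init(HdrSketch *h)
{
    atomic_flag_clear(&h->lock);   /* start unlocked */
    h->refcount = 0;
    h->usage_count = 0;
    h->flags = SK_VALID;
}

/* Busy-wait until the flag is acquired. */
static void hdr_lock(HdrSketch *h)
{
    while (atomic_flag_test_and_set_explicit(&h->lock, memory_order_acquire)) {
        /* spin */
    }
}

static void hdr_unlock(HdrSketch *h)
{
    atomic_flag_clear_explicit(&h->lock, memory_order_release);
}

/* Example (1): pin under the spinlock. */
static void pin_buffer(HdrSketch *h)
{
    hdr_lock(h);
    h->refcount++;
    h->usage_count++;
    hdr_unlock(h);
}

/* Counterpart of pinning: drop one pin under the spinlock. */
static void unpin_buffer(HdrSketch *h)
{
    hdr_lock(h);
    assert(h->refcount > 0);
    h->refcount--;
    hdr_unlock(h);
}

/* Example (2): set the dirty bit under the spinlock. */
static void mark_dirty(HdrSketch *h)
{
    hdr_lock(h);
    h->flags |= SK_DIRTY;   /* other bits are left unchanged */
    hdr_unlock(h);
}

static bool is_dirty(HdrSketch *h)
{
    hdr_lock(h);
    bool dirty = (h->flags & SK_DIRTY) != 0;
    hdr_unlock(h);
    return dirty;
}
```

Because `|=` only turns bits on, marking a page dirty preserves the valid bit; clearing the dirty bit after a flush would use `flags &= ~SK_DIRTY` inside the same lock and unlock pair.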
The behavior of the ReadBufferExtended() function depends on three different cases. Each case is described in one of the following subsections, and the last subsection describes PostgreSQL's clock-sweep page replacement algorithm.
Acquire the BufMappingLock partition that covers the obtained hash bucket slot in shared mode (this lock will be released in step (5)).
The refcount and usage_count of the descriptor are both increased by 1.
Fig. 8.8. Accessing a page stored in the buffer pool.
In this case, assume that all buffer pool slots are occupied by pages but the desired page is not stored. The buffer manager performs the following steps:
(1) Create the buffer_tag of the desired page and look up the buffer table. In this example, we assume that the buffer_tag is 'Tag_M' (the desired page is not found).
(2) Select a victim buffer pool slot using the clock-sweep algorithm, obtain the old entry, which contains the buffer_id of the victim pool slot, from the buffer table, and pin the victim pool slot in the buffer descriptors layer. In this example, the buffer_id of the victim slot is 5 and the old entry is 'Tag_F, id=5'. The clock sweep is described in the next subsection.
(3) Flush (write and fsync) the victim page data if it is dirty; otherwise proceed to step (4).
The dirty page must be written to storage before it is overwritten with new data. Flushing a dirty page (here, the victim page with buffer_id 5) is performed as follows:
The XLogFlush() function writes the WAL (write-ahead logging) data in the WAL buffer to the current WAL segment file.
(4) Acquire the old BufMappingLock partition that covers the slot that contains the old entry, in exclusive mode.
(5) Acquire the new BufMappingLock partition and insert the new entry into the buffer table:
The flag bits in the flags field of the descriptor are updated; the dirty bit is set to 0.
Imagine the buffer descriptors as a circular list. nextVictimBuffer, an unsigned 32-bit integer, always points to one of the buffer descriptors and rotates clockwise. The pseudocode and description of the algorithm are as follows:
     WHILE true
(1)     Obtain the candidate buffer descriptor pointed to by nextVictimBuffer
(2)     IF the candidate descriptor is unpinned THEN
(3)         IF the candidate descriptor's usage_count == 0 THEN
                BREAK WHILE LOOP /* the corresponding slot of this descriptor is the victim slot. */
            ELSE
                Decrease the candidate descriptor's usage_count by 1
            END IF
        END IF
(4)     Advance nextVictimBuffer to the next one
     END WHILE
(5)  RETURN buffer_id of the victim
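The pseudocode above translates almost line for line into C. The following is a minimal sketch, not the PostgreSQL implementation: `SweepDesc` and `clock_sweep` are hypothetical names, no locking is done, and one guard is added that the pseudocode does not have, returning -1 when every descriptor is pinned so that the loop cannot spin forever.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor reduced to the two fields the sweep needs. */
typedef struct SweepDesc {
    unsigned refcount;     /* 0 means unpinned */
    uint16_t usage_count;
} SweepDesc;

/*
 * Returns the buffer_id of the victim, or -1 if all descriptors are pinned.
 * *next_victim plays the role of nextVictimBuffer and is left pointing at
 * the victim, because the pseudocode breaks before step (4).
 */
static int clock_sweep(SweepDesc *descs, uint32_t n, uint32_t *next_victim)
{
    assert(n > 0);

    /* Added guard (an assumption of this sketch): bail out if nothing is unpinned. */
    uint32_t unpinned = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (descs[i].refcount == 0) {
            unpinned++;
        }
    }
    if (unpinned == 0) {
        return -1;
    }

    for (;;) {
        uint32_t id = *next_victim % n;          /* (1) candidate descriptor */
        SweepDesc *cand = &descs[id];

        if (cand->refcount == 0) {               /* (2) unpinned? */
            if (cand->usage_count == 0) {        /* (3) victim found */
                return (int) id;                 /* (5) */
            }
            cand->usage_count--;
        }
        *next_victim = (id + 1) % n;             /* (4) advance, wrapping around */
    }
}
```

Each pass over an unpinned descriptor lowers its usage_count, so a frequently accessed page survives several sweeps while a page that is never touched again is evicted on the first pass that reaches it.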
Obtain the candidate buffer descriptor pointed to by nextVictimBuffer.
Fig. 8.12. Clock Sweep.
nextVictimBuffer points to the slot with buffer_id=1; however, the descriptor with buffer_id=1 is pinned, so nextVictimBuffer advances directly to the next slot, buffer_id=2.
nextVictimBuffer points to the descriptor with buffer_id=2, which is unpinned but whose usage_count is not 0; its usage_count is therefore decreased by 1, and nextVictimBuffer advances to the next slot, buffer_id=3.
nextVictimBuffer points to the descriptor with buffer_id=3, which is unpinned and has usage_count == 0, so it is the victim of this round of page replacement.
Whenever nextVictimBuffer sweeps an unpinned descriptor, its usage_count is decreased by 1. Therefore, if unpinned descriptors exist in the buffer pool, this algorithm can always find a victim, whose usage_count is 0, by rotating nextVictimBuffer.
When reading or writing a huge table, PostgreSQL uses a ring buffer rather than the buffer pool. The ring buffer is a small and temporary buffer area. When any condition listed below is met, a ring buffer is allocated in shared memory:
Bulk-reading
Bulk-writing
Vacuum-processing
The allocated ring buffer is released immediately after use.
The benefit of the ring buffer is that it protects the rest of the cache. If a backend process reads a huge table without using a ring buffer, all stored pages in the buffer pool may be removed (kicked out), and the cache hit ratio decreases as a result. The ring buffer avoids this issue.
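The effect can be illustrated with a toy ring that reuses a fixed handful of slots in order. This is a minimal sketch of the idea only: `RingSketch`, `ring_init`, and `ring_load` are hypothetical names, and `RING_SLOTS` is an arbitrary illustrative size, not a PostgreSQL setting.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4   /* illustrative size chosen for this sketch */

/* Hypothetical ring: a few slots that are overwritten cyclically. */
typedef struct RingSketch {
    long   pages[RING_SLOTS];  /* page held by each slot, -1 if empty */
    size_t next;               /* slot that will be reused next */
} RingSketch;

static void ring_init(RingSketch *r)
{
    for (size_t i = 0; i < RING_SLOTS; i++) {
        r->pages[i] = -1;
    }
    r->next = 0;
}

/* Load a page into the ring and return the slot it occupies. */
static size_t ring_load(RingSketch *r, long page)
{
    assert(page >= 0);
    size_t slot = r->next;
    r->pages[slot] = page;                 /* overwrite the oldest slot */
    r->next = (slot + 1) % RING_SLOTS;     /* wrap around */
    return slot;
}
```

Scanning 100 pages through this ring touches only its 4 slots, so however large the table is, the pages cached outside the ring are left alone.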