PostgreSQL Internals Study Notes: Chapter 8 Buffer Manager

Chapter 8 Buffer Manager

A buffer manager manages data transfers between shared memory and persistent storage (for example, the disk on which the database cluster is stored) and can have a significant impact on the performance of the DBMS. The PostgreSQL buffer manager works very efficiently.

  • In this chapter, the PostgreSQL buffer manager is described. The first section provides an overview and the subsequent sections describe the following topics:
    • Buffer manager structure
    • Buffer manager locks
    • How the buffer manager works
    • Ring buffer
    • Flushing of dirty pages


8.1 Overview

  • This section introduces key concepts required to facilitate descriptions in the subsequent sections.
8.1.1 Buffer Manager Structure
  • The PostgreSQL buffer manager comprises a buffer table, buffer descriptors, and a buffer pool, which are described in the next section. The buffer pool layer stores data file pages, such as tables and indexes, as well as free space maps and visibility maps. The buffer pool is an array, i.e., each slot stores one page of a data file. Indices of the buffer pool array are referred to as buffer_ids.
8.1.2 Buffer Tag
  • In PostgreSQL, each page of all data files can be assigned a unique tag, i.e. a buffer tag. When the buffer manager receives a request, PostgreSQL uses the buffer_tag of the desired page.
  • The buffer_tag comprises three values: the RelFileNode and the fork number of the relation to which its page belongs, and the block number of its page. The fork numbers of tables, free space maps and visibility maps are defined as 0, 1 and 2, respectively.
    • "Relation" is a general term for tables, indexes, views and so on.
    • The RelFileNode identifies the relation itself. In the example below it is the triple of tablespace OID, database OID, and relation OID; it is not a memory address.
    • The fork number tells which file (fork) of the relation the page belongs to: 0 for the main data file, 1 for the free space map file, and 2 for the visibility map file.
    • The block number is the position of the page within that fork.
  • For example, the buffer_tag ‘{(16821, 16384, 37721), 0, 7}’ identifies the page at block number 7 of the relation whose OID and fork number are 37721 and 0, respectively; the relation is contained in the database whose OID is 16384 under the tablespace whose OID is 16821. Similarly, the buffer_tag ‘{(16821, 16384, 37721), 1, 3}’ identifies the page at block number 3 of the free space map (fork number 1) of the relation whose OID is 37721.
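
The three components of a buffer_tag can be pictured as a small C structure. The following is only a simplified sketch: the type and field names are modeled on the description above, and the comparison helper `buffer_tags_equal` is a hypothetical name introduced for illustration, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in: the three OIDs that identify a relation. */
typedef struct RelFileNode {
    uint32_t spcNode;   /* tablespace OID */
    uint32_t dbNode;    /* database OID */
    uint32_t relNode;   /* relation OID */
} RelFileNode;

/* Fork numbers as listed in the text. */
typedef enum ForkNumber {
    MAIN_FORKNUM = 0,           /* table or index data */
    FSM_FORKNUM = 1,            /* free space map */
    VISIBILITYMAP_FORKNUM = 2   /* visibility map */
} ForkNumber;

typedef struct BufferTag {
    RelFileNode rnode;      /* which relation */
    ForkNumber  forkNum;    /* which fork of that relation */
    uint32_t    blockNum;   /* which block within the fork */
} BufferTag;

/* Hypothetical helper: two tags name the same page only if all parts match. */
static bool buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    assert(a != NULL && b != NULL);
    return a->rnode.spcNode == b->rnode.spcNode &&
           a->rnode.dbNode  == b->rnode.dbNode  &&
           a->rnode.relNode == b->rnode.relNode &&
           a->forkNum       == b->forkNum       &&
           a->blockNum      == b->blockNum;
}
```

With this layout, the tag ‘{(16821, 16384, 37721), 0, 7}’ would be written as `BufferTag t = {{16821, 16384, 37721}, MAIN_FORKNUM, 7};`.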
8.1.3 How a Backend Process Reads Pages
  • This subsection describes how a backend process reads a page from the buffer manager (Fig. 8.2).
  • Note the distinction between a buffer_tag, which identifies a page, and a buffer_id, which identifies a buffer pool slot.
  • (1) When reading a table or index page, a backend process sends a request that includes the page’s buffer_tag to the buffer manager.
  • (2) The buffer manager returns the buffer_id of the slot that stores the requested page. If the requested page is not stored in the buffer pool, the buffer manager loads the page from persistent storage into one of the buffer pool slots and then returns the buffer_id of that slot. (Each buffer pool slot holds one page, so returning the index of the slot amounts to returning the page named by the buffer_tag in the request.)
  • (3) The backend process accesses the slot with that buffer_id (to read the desired page).

When a backend process modifies a page in the buffer pool (e.g., by inserting tuples), the modified page, which has not yet been flushed to storage, is referred to as a dirty page.

Section 8.4 describes how the buffer manager works.

8.1.4. Page Replacement Algorithm
  • When all buffer pool slots are occupied but the requested page is not stored, the buffer manager must select one page in the buffer pool that will be replaced by the requested page. Typically, in the field of computer science, page selection algorithms are called page replacement algorithms and the selected page is referred to as a victim page.
  • Research on page replacement algorithms has been ongoing since the advent of computer science; thus, many replacement algorithms have been proposed previously. Since version 8.1, PostgreSQL has used clock sweep because it is simpler and more efficient than the LRU (least recently used) algorithm used in previous versions.
  • The clock-sweep algorithm is described in Section 8.4.4.
8.1.5. Flushing Dirty Pages
  • Dirty pages should eventually be flushed to storage; however, the buffer manager requires help to perform this task. In PostgreSQL, two background processes, checkpointer and background writer, are responsible for this task.
  • Section 8.6 describes the checkpointer and background writer.

8.2. Buffer Manager Structure

  • The PostgreSQL buffer manager comprises three layers, i.e. the buffer table, buffer descriptors, and buffer pool (Fig. 8.3):
  • Fig. 8.3. Buffer manager’s three-layer structure.
  • The buffer pool is an array. Each slot stores a data file page. The indices of the array slots are referred to as buffer_ids.
  • The buffer descriptors layer is an array of buffer descriptors. Each descriptor has one-to-one correspondence to a buffer pool slot and holds metadata of the stored page in the corresponding slot. Note that the term ‘buffer descriptors layer’ has been adopted for convenience and it is only used in this document.
  • The buffer table is a hash table that stores the relations between the buffer_tags of stored pages and the buffer_ids of the descriptors that hold the stored pages’ respective metadata.
  • These three layers are described in detail in subsections 8.2.1 through 8.2.4.
8.2.1. Buffer Table
  • A buffer table can be logically divided into three parts: a hash function, hash bucket slots, and data entries (Fig. 8.4).
  • The built-in hash function maps buffer_tags to the hash bucket slots. Even though the number of hash bucket slots is greater than the number of the buffer pool slots, collisions may occur.
  • Therefore, the buffer table uses separate chaining with linked lists to resolve collisions. When data entries are mapped to the same bucket slot, this method stores the entries in the same linked list, as shown in Fig. 8.4.


  • A data entry comprises two values: the buffer_tag of a page, and the buffer_id of the descriptor that holds the page’s metadata. For example, a data entry ‘Tag_A, id=1’ means that the buffer descriptor with buffer_id 1 stores metadata of the page tagged with Tag_A.
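
The hash function, bucket slots, and chained data entries described above can be sketched in a few lines of C. This is a toy model only: a single integer stands in for a full buffer_tag, the bucket count `NUM_BUCKETS` and every function name are hypothetical, and entries are allocated with `malloc` purely for illustration (the real buffer table lives in shared memory).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_BUCKETS 8   /* hypothetical; kept tiny so collisions are easy to see */

/* One data entry: (buffer_tag, buffer_id), chained within its bucket. */
typedef struct Entry {
    uint32_t      tag;        /* stand-in for a full buffer_tag */
    int           buffer_id;
    struct Entry *next;       /* next entry that hashed to the same bucket */
} Entry;

typedef struct BufferTable {
    Entry *buckets[NUM_BUCKETS];
} BufferTable;

/* Toy hash function: maps a tag to a bucket slot. */
static unsigned bucket_of(uint32_t tag)
{
    return tag % NUM_BUCKETS;
}

/* Insert an entry at the head of its bucket's list; 0 on success, -1 on failure. */
static int buftable_insert(BufferTable *t, uint32_t tag, int buffer_id)
{
    assert(t != NULL);
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    unsigned b = bucket_of(tag);
    e->tag = tag;
    e->buffer_id = buffer_id;
    e->next = t->buckets[b];
    t->buckets[b] = e;
    return 0;
}

/* Walk the bucket's list; return the buffer_id, or -1 if the tag is absent. */
static int buftable_lookup(const BufferTable *t, uint32_t tag)
{
    for (const Entry *e = t->buckets[bucket_of(tag)]; e != NULL; e = e->next)
        if (e->tag == tag)
            return e->buffer_id;
    return -1;
}

/* Unlink and free the entry; return its buffer_id, or -1 if the tag is absent. */
static int buftable_delete(BufferTable *t, uint32_t tag)
{
    Entry **link = &t->buckets[bucket_of(tag)];
    while (*link != NULL) {
        Entry *e = *link;
        if (e->tag == tag) {
            int id = e->buffer_id;
            *link = e->next;
            free(e);
            return id;
        }
        link = &e->next;
    }
    return -1;
}
```

In this sketch, tags 1 and 9 land in the same bucket, so both entries sit in one linked list and a lookup has to compare tags while walking it.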
8.2.2. Buffer Descriptor
  • The structure of the buffer descriptor is described in this subsection, and the buffer descriptors layer in the next subsection.
  • A buffer descriptor holds the metadata of the stored page in the corresponding buffer pool slot. The buffer descriptor structure is defined by the structure BufferDesc. While this structure has many fields, the main ones are shown below:
    • BufferTag tag; /* ID of page contained in buffer */
      tag holds the buffer_tag of the stored page in the corresponding buffer pool slot (the buffer tag is defined in Section 8.1.2), for example {(16821, 16384, 37721), 1, 3}.
    • int buf_id; /* buffer's index number (from 0) */
      buf_id identifies the descriptor (equivalent to the buffer_id of the corresponding buffer pool slot).
    • unsigned refcount; /* # of backends holding pins on buffer */
      refcount holds the number of PostgreSQL processes currently accessing the associated stored page. It is also referred to as pin count. When a PostgreSQL process accesses the stored page, its refcount must be incremented by 1 (refcount++). After accessing the page, its refcount must be decreased by 1 (refcount--). When the refcount is zero, i.e. the associated stored page is not currently being accessed, the page is unpinned; otherwise it is pinned.
    • uint16 usage_count; /* usage counter for clock sweep code */
      usage_count holds the number of times the associated stored page has been accessed since it was loaded into the corresponding buffer pool slot. Note that usage_count is used in the page replacement algorithm (Section 8.4.4).
    • LWLockId content_lock; /* to lock access to buffer contents */
      LWLockId io_in_progress_lock; /* to wait for I/O to complete */
      content_lock and io_in_progress_lock are light-weight locks that are used to control access to the associated stored page. These fields are described in Section 8.3.2.
    • BufFlags flags; /* see bit definitions above */
      flags can hold several states of the associated stored page. The main states are as follows:
      • dirty bit indicates whether the stored page is dirty.
      • valid bit indicates whether the stored page can be read or written (valid). For example, if this bit is valid, then the corresponding buffer pool slot stores a page and this descriptor holds the page metadata; thus, the stored page can be read or written. If this bit is invalid, then this descriptor does not hold any metadata; this means that the stored page cannot be read or written or the buffer manager is replacing the stored page.
      • io_in_progress bit indicates whether the buffer manager is reading/writing the associated page from/to storage. In other words, this bit indicates whether a single process holds the io_in_progress_lock of this descriptor.
    • int freeNext; /* link in freelist chain */
      freeNext links to the next descriptor in the freelist, so that an empty descriptor can be found quickly when a new page is loaded. The freelist is described in the next subsection.
  • To simplify the following descriptions, three descriptor states are defined:
    • Empty: When the corresponding buffer pool slot does not store a page (i.e. refcount and usage_count are 0), the state of this descriptor is empty.
    • Pinned: When the corresponding buffer pool slot stores a page and one or more PostgreSQL processes are accessing the page (i.e. refcount and usage_count are greater than or equal to 1), the state of this buffer descriptor is pinned.
    • Unpinned: When the corresponding buffer pool slot stores a page but no PostgreSQL processes are accessing the page (i.e. usage_count is greater than or equal to 1, but refcount is 0), the state of this buffer descriptor is unpinned.
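
The three states can be expressed as a small classification function. This is a sketch under stated assumptions: `MiniDesc`, `DescState`, and `desc_state` are hypothetical names, and the `has_page` flag is an assumption added here because the clock sweep in Section 8.4.4 can lower the usage_count of an unpinned descriptor to 0 while its slot still stores a page, so the counters alone cannot always tell "empty" from "unpinned".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum DescState { DESC_EMPTY, DESC_PINNED, DESC_UNPINNED } DescState;

/* Hypothetical, cut-down descriptor holding only what the states depend on. */
typedef struct MiniDesc {
    unsigned refcount;      /* processes currently accessing the page */
    unsigned usage_count;   /* accesses since the page was loaded */
    bool     has_page;      /* assumption: true once the slot stores a page */
} MiniDesc;

/* Classify a descriptor into one of the three states defined in the text. */
static DescState desc_state(const MiniDesc *d)
{
    assert(d != NULL);
    if (!d->has_page)
        return DESC_EMPTY;      /* slot stores no page */
    return d->refcount > 0 ? DESC_PINNED : DESC_UNPINNED;
}
```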
8.2.3. Buffer Descriptors Layer
  • A collection of buffer descriptors forms an array. In this document, the array is referred to as the buffer descriptors layer.
  • When the PostgreSQL server starts, the state of all buffer descriptors is empty. In PostgreSQL, those descriptors comprise a linked list called freelist (Fig. 8.5).

Fig. 8.5. Buffer manager initial state.

  • Figure 8.6 shows how the first page is loaded:
    • (1) Retrieve an empty descriptor from the top of the freelist, and pin it (i.e. increase its refcount and usage_count by 1).
    • (2) Insert the new entry, which holds the relation between the tag of the first page and the buffer_id of the retrieved descriptor, in the buffer table.
    • (3) Load the new page from storage to the corresponding buffer pool slot.
    • (4) Save the metadata of the new page to the retrieved descriptor.
  • The second and subsequent pages are loaded in a similar manner. Additional details are provided in Section 8.4.2.

Fig. 8.6. Loading the first page.

  • Descriptors that have been retrieved from the freelist always hold a page’s metadata. In other words, non-empty descriptors continue to be used and do not return to the freelist. However, related descriptors are added to the freelist again and the descriptor state becomes ‘empty’ when one of the following occurs:
    1. Tables or indexes have been dropped.
    2. Databases have been dropped.
    3. Tables or indexes have been cleaned up using the VACUUM FULL command.
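
The freelist behaviour described in this subsection (all descriptors start out empty and chained, one is taken from the top and pinned when a page is loaded, and a descriptor goes back on the list when its relation is dropped) can be sketched as follows. The names `DescLayer`, `firstFree`, `layer_init`, `freelist_pop_and_pin`, and `freelist_push` are hypothetical, and the end-of-list marker value -1 is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NBUFFERS 4                  /* hypothetical pool size */
#define FREENEXT_END_OF_LIST (-1)   /* assumed marker: no further free descriptor */

typedef struct MiniDesc {
    int      buf_id;
    int      freeNext;      /* index of the next free descriptor */
    unsigned refcount;
    unsigned usage_count;
} MiniDesc;

typedef struct DescLayer {
    MiniDesc descs[NBUFFERS];
    int      firstFree;     /* top of the freelist */
} DescLayer;

/* Server start: every descriptor is empty and chained into the freelist. */
static void layer_init(DescLayer *l)
{
    for (int i = 0; i < NBUFFERS; i++) {
        l->descs[i].buf_id = i;
        l->descs[i].refcount = 0;
        l->descs[i].usage_count = 0;
        l->descs[i].freeNext = (i + 1 < NBUFFERS) ? i + 1 : FREENEXT_END_OF_LIST;
    }
    l->firstFree = 0;
}

/* Take the descriptor at the top of the freelist and pin it.
 * Returns its buf_id, or -1 when the freelist is exhausted. */
static int freelist_pop_and_pin(DescLayer *l)
{
    int id = l->firstFree;
    if (id == FREENEXT_END_OF_LIST)
        return -1;
    MiniDesc *d = &l->descs[id];
    l->firstFree = d->freeNext;
    d->freeNext = FREENEXT_END_OF_LIST;
    d->refcount++;
    d->usage_count++;
    return id;
}

/* Return a descriptor to the freelist, e.g. after its table was dropped. */
static void freelist_push(DescLayer *l, int id)
{
    assert(id >= 0 && id < NBUFFERS);
    MiniDesc *d = &l->descs[id];
    d->refcount = 0;
    d->usage_count = 0;
    d->freeNext = l->firstFree;
    l->firstFree = id;
}
```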
  • The buffer descriptors layer contains an unsigned 32-bit integer variable, i.e. nextVictimBuffer. This variable is used in the page replacement algorithm described in Section 8.4.4.
8.2.4. Buffer Pool
  • The buffer pool is a simple array that stores data file pages, such as tables and indexes. Indices of the buffer pool array are referred to as buffer_ids.
  • The buffer pool slot size is 8 KB, which is equal to the size of a page. Thus, each slot can store an entire page.

8.3. Buffer Manager Locks

  • The buffer manager uses many locks for many different purposes. This section describes the locks necessary for the explanations in the subsequent sections.
  • Please note that the locks described in this section are parts of a synchronization mechanism for the buffer manager; they do not relate to any SQL statements and SQL options.
8.3.1. Buffer Table Locks
  • BufMappingLock protects the data integrity of the entire buffer table. It is a light-weight lock that can be used in both shared and exclusive modes. When searching an entry in the buffer table, a backend process holds a shared BufMappingLock. When inserting or deleting entries, a backend process holds an exclusive lock.
  • The BufMappingLock is split into partitions to reduce the contention in the buffer table (the default is 128 partitions). Each BufMappingLock partition guards the portion of the corresponding hash bucket slots.
  • Figure 8.7 shows a typical example of the effect of splitting BufMappingLock. Two backend processes can simultaneously hold respective BufMappingLock partitions in exclusive mode in order to insert new data entries. If the BufMappingLock were a single system-wide lock, one of the two processes would have to wait until the other finished, depending on which acquired the lock first.


The buffer table requires many other locks. For example, the buffer table internally uses a spin lock to delete an entry. However, descriptions of these other locks are omitted because they are not required in this document.
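
To make the effect of partitioning concrete, the sketch below maps a tag's hash code to one of the 128 partitions by taking it modulo the partition count, and checks whether two inserts would need different partition locks. It is an illustration under assumptions: `toy_tag_hash` is a made-up FNV-1a-style mixer (not PostgreSQL's hash function), and `partition_of` and `may_insert_concurrently` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 128u  /* the default partition count named in the text */

/* Toy hash over three tag components (FNV-1a-style mixing, one word at a time). */
static uint32_t toy_tag_hash(uint32_t relNode, uint32_t forkNum, uint32_t blockNum)
{
    uint32_t parts[3] = { relNode, forkNum, blockNum };
    uint32_t h = 2166136261u;
    for (int i = 0; i < 3; i++) {
        h ^= parts[i];
        h *= 16777619u;
    }
    return h;
}

/* Which lock partition guards the bucket this hash code falls into. */
static uint32_t partition_of(uint32_t hashcode, uint32_t nparts)
{
    assert(nparts > 0);
    return hashcode % nparts;
}

/* Two exclusive inserts can proceed at the same time only if their hash
 * codes are guarded by different partitions. */
static bool may_insert_concurrently(uint32_t hash_a, uint32_t hash_b)
{
    return partition_of(hash_a, NUM_BUFFER_PARTITIONS) !=
           partition_of(hash_b, NUM_BUFFER_PARTITIONS);
}
```

With a single system-wide lock the second function would always return false, which is the contention that the partitioning is meant to reduce.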

8.3.2. Locks for Each Buffer Descriptor
  • Each buffer descriptor uses two light-weight locks, content_lock and io_in_progress_lock, to control access to the stored page in the corresponding buffer pool slot. When the values of the descriptor's own fields are checked or changed, a spinlock is used.
8.3.2.1 content_lock
  • The content_lock is a typical lock that enforces access limits. It can be used in shared and exclusive modes.

  • When reading a page, a backend process acquires a shared content_lock of the buffer descriptor that stores the page.

  • However, an exclusive content_lock is acquired when doing one of the following:

    • Inserting rows (i.e. tuples) into the stored page or changing the t_xmin/t_xmax fields of tuples within the stored page (t_xmin and t_xmax are described in Section 5.2; simply, when deleting or updating rows, these fields of the associated tuples are changed).
    • Removing tuples physically or compacting free space on the stored page (performed by vacuum processing and HOT, which are described in Chapters 6 and 7, respectively).
    • Freezing tuples within the stored page (freezing is described in Section 5.10.1 and Section 6.3).

    The official README file shows more details.

8.3.2.2 io_in_progress_lock
  • The io_in_progress_lock is used to wait for I/O on a buffer to complete. When a PostgreSQL process loads/writes page data from/to storage, the process holds an exclusive io_in_progress_lock of the corresponding descriptor while accessing the storage.
8.3.2.3 spinlock
  • When the flags or other fields (e.g. refcount and usage_count) are checked or changed, a spinlock is used. Two specific examples of spinlock usage are given below:

  • (1) The following shows how to pin the buffer descriptor:

    • 1. Acquire a spinlock of the buffer descriptor.
    • 2. Increase the values of its refcount and usage_count by 1.
    • 3. Release the spinlock.

    LockBufHdr(bufferdesc);    /* acquire the spinlock */
    bufferdesc->refcount++;    /* one more process is accessing the page */
    bufferdesc->usage_count++; /* one more access since the page was loaded */
    UnlockBufHdr(bufferdesc);  /* release the spinlock */
    
  • (2) The following shows how to set the dirty bit to ‘1’:

    • 1. Acquire a spinlock of the buffer descriptor.
    • 2. Set the dirty bit to ‘1’ using a bitwise operation.
    • 3. Release the spinlock.

    #define BM_DIRTY             (1 << 0)    /* data needs writing */
    #define BM_VALID             (1 << 1)    /* data is valid */
    #define BM_TAG_VALID         (1 << 2)    /* tag is assigned */
    #define BM_IO_IN_PROGRESS    (1 << 3)    /* read or write in progress */
    #define BM_JUST_DIRTIED      (1 << 5)    /* dirtied since write started */

    LockBufHdr(bufferdesc);        /* acquire the spinlock */
    bufferdesc->flags |= BM_DIRTY; /* set the dirty bit, leave the other bits unchanged */
    UnlockBufHdr(bufferdesc);      /* release the spinlock */
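
For readers who want to run the two examples outside PostgreSQL, here is a self-contained sketch that replaces LockBufHdr/UnlockBufHdr with a minimal test-and-set spinlock built on C11 `atomic_flag`. Everything in it is an assumption for illustration: `MiniDesc`, `desc_init`, `lock_hdr`, `unlock_hdr`, `pin`, `unpin`, and `mark_dirty` are hypothetical names, and the busy-wait loop is far simpler than a production spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define BM_DIRTY (1u << 0)  /* data needs writing */

/* Hypothetical, cut-down descriptor with its own header spinlock. */
typedef struct MiniDesc {
    atomic_flag hdr_lock;
    unsigned    refcount;
    uint16_t    usage_count;
    uint16_t    flags;
} MiniDesc;

static void desc_init(MiniDesc *d)
{
    atomic_flag_clear(&d->hdr_lock);
    d->refcount = 0;
    d->usage_count = 0;
    d->flags = 0;
}

/* Spin until the flag was previously clear, i.e. until we own the lock. */
static void lock_hdr(MiniDesc *d)
{
    while (atomic_flag_test_and_set_explicit(&d->hdr_lock, memory_order_acquire))
        ;   /* busy-wait */
}

static void unlock_hdr(MiniDesc *d)
{
    atomic_flag_clear_explicit(&d->hdr_lock, memory_order_release);
}

/* Example (1): pin the descriptor. */
static void pin(MiniDesc *d)
{
    lock_hdr(d);
    d->refcount++;
    d->usage_count++;
    unlock_hdr(d);
}

/* Counterpart of pin: drop one reference; usage_count is left untouched. */
static void unpin(MiniDesc *d)
{
    lock_hdr(d);
    assert(d->refcount > 0);
    d->refcount--;
    unlock_hdr(d);
}

/* Example (2): set the dirty bit with a bitwise OR. */
static void mark_dirty(MiniDesc *d)
{
    lock_hdr(d);
    d->flags = (uint16_t)(d->flags | BM_DIRTY);
    unlock_hdr(d);
}
```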
    

8.4. How the Buffer Manager Works

  • This section describes how the buffer manager works. When a backend process wants to access a desired page, it calls the ReadBufferExtended function.
  • The behavior of the ReadBufferExtended function depends on three logical cases. Each case is described in the following subsections. In addition, the PostgreSQL clock sweep page replacement algorithm is described in the final subsection.
8.4.1. Accessing a Page Stored in the Buffer Pool
  • First, the simplest case is described, i.e. the desired page is already stored in the buffer pool. In this case, the buffer manager performs the following steps:
    • (1) Create the buffer_tag of the desired page (in this example, the buffer_tag is ‘Tag_C’) and compute the hash bucket slot, which contains the associated entry of the created buffer_tag, using the hash function.
    • (2) Acquire the BufMappingLock partition that covers the obtained hash bucket slot in shared mode (this lock will be released in step (5)).
    • (3) Look up the entry whose tag is ‘Tag_C’ and obtain the buffer_id from the entry. In this example, the buffer_id is 2. (In the figure, Tag_A and Tag_C share one hash bucket slot because collisions are resolved by separate chaining.)
    • (4) Pin the buffer descriptor for buffer_id 2, i.e. the refcount and usage_count of the descriptor are increased by 1 (Section 8.3.2 describes pinning).
    • (5) Release the BufMappingLock.
    • (6) Access the buffer pool slot with buffer_id 2.

Fig. 8.8. Accessing a page stored in the buffer pool.

  • Then, when reading rows from the page in the buffer pool slot, the PostgreSQL process acquires the shared content_lock of the corresponding buffer descriptor. Thus, buffer pool slots can be read by multiple processes simultaneously.
  • When inserting (and updating or deleting) rows to the page, a PostgreSQL process acquires the exclusive content_lock of the corresponding buffer descriptor (note that the dirty bit of the page must be set to ‘1’).
  • After accessing the pages, the refcount values of the corresponding buffer descriptors are decreased by 1.
8.4.2. Loading a Page from Storage to Empty Slot
  • In this second case, assume that the desired page is not in the buffer pool and the freelist has free elements (empty descriptors). In this case, the buffer manager performs the following steps:
    • (1) Look up the buffer table (we assume the page is not found).
      • 1. Create the buffer_tag of the desired page (in this example, the buffer_tag is ‘Tag_E’) and compute the hash bucket slot.
      • 2. Acquire the BufMappingLock partition in shared mode.
      • 3. Look up the buffer table (not found according to the assumption).
      • 4. Release the BufMappingLock.
    • (2) Obtain the empty buffer descriptor from the freelist, and pin it. In this example, the buffer_id of the obtained descriptor is 4.
    • (3) Acquire the BufMappingLock partition in exclusive mode (this lock will be released in step (6)).
    • (4) Create a new data entry that comprises the buffer_tag ‘Tag_E’ and buffer_id 4; insert the created entry to the buffer table.
    • (5) Load the desired page data from storage to the buffer pool slot with buffer_id 4 as follows:
      • 1. Acquire the exclusive io_in_progress_lock of the corresponding descriptor.
      • 2. Set the io_in_progress bit of the corresponding descriptor to ‘1’ to prevent access by other processes.
      • 3. Load the desired page data from storage to the buffer pool slot.
      • 4. Change the states of the corresponding descriptor; the io_in_progress bit is set to ‘0’, and the valid bit is set to ‘1’.
      • 5. Release the io_in_progress_lock.
    • (6) Release the BufMappingLock.
    • (7) Access the buffer pool slot with buffer_id 4.
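
The first two cases (page already in the pool, and page loaded into an empty slot) can be combined into one toy read routine. This is a deliberately simplified sketch: all locking is omitted, a linear scan stands in for the buffer table and the freelist, an integer stands in for the buffer_tag, and the names `MiniPool`, `read_buffer`, `release_buffer`, `find_tag`, and `NO_TAG` are hypothetical. When no empty descriptor is left, the function just returns -1; that is the point where the victim selection of Section 8.4.3 would take over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUFFERS 4
#define NO_TAG   (-1)   /* hypothetical marker: the slot stores no page */

typedef struct MiniDesc {
    int      tag;           /* stand-in for a full buffer_tag */
    unsigned refcount;
    unsigned usage_count;
    bool     valid;
} MiniDesc;

typedef struct MiniPool {
    MiniDesc descs[NBUFFERS];
    int      loads;         /* how many pages were "read from storage" */
} MiniPool;

static void pool_init(MiniPool *p)
{
    for (int i = 0; i < NBUFFERS; i++)
        p->descs[i] = (MiniDesc){ NO_TAG, 0, 0, false };
    p->loads = 0;
}

/* Linear scan standing in for the hash lookup in the buffer table. */
static int find_tag(const MiniPool *p, int tag)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (p->descs[i].tag == tag)
            return i;
    return -1;
}

/* Returns the pinned buffer_id, or -1 when no empty descriptor remains. */
static int read_buffer(MiniPool *p, int tag)
{
    assert(tag != NO_TAG);
    int id = find_tag(p, tag);          /* case 8.4.1: already in the pool? */
    if (id < 0) {
        id = find_tag(p, NO_TAG);       /* case 8.4.2: take an empty descriptor */
        if (id < 0)
            return -1;                  /* case 8.4.3 would pick a victim here */
        p->descs[id].tag = tag;         /* record the new table entry */
        p->loads++;                     /* load the page from storage */
        p->descs[id].valid = true;
    }
    p->descs[id].refcount++;            /* pin */
    p->descs[id].usage_count++;
    return id;
}

/* Called after the page has been accessed: drop the pin. */
static void release_buffer(MiniPool *p, int id)
{
    assert(id >= 0 && id < NBUFFERS && p->descs[id].refcount > 0);
    p->descs[id].refcount--;
}
```

Reading the same tag twice loads the page only once; the second call is a hit that merely pins the descriptor again.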
8.4.3. Loading a Page from Storage to a Victim Buffer Pool Slot
  • In this case, assume that all buffer pool slots are occupied by pages but the desired page is not stored. The buffer manager performs the following steps:

    • (1) Create the buffer_tag of the desired page and look up the buffer table. In this example, we assume that the buffer_tag is ‘Tag_M’ (the desired page is not found).

    • (2) Select a victim buffer pool slot using the clock-sweep algorithm, obtain from the buffer table the old entry that contains the buffer_id of the victim pool slot, and pin the victim pool slot in the buffer descriptors layer. In this example, the buffer_id of the victim slot is 5 and the old entry is ‘Tag_F, id=5’. The clock sweep is described in the next subsection.

    • (3) Flush (write and fsync) the victim page data if it is dirty; otherwise proceed to step (4).

      The dirty page must be written to storage before overwriting with new data. Flushing a dirty page is performed as follows:

      • 1. Acquire the shared content_lock and the exclusive io_in_progress_lock of the descriptor with buffer_id 5 (both are released in sub-step 6 below, not in step (6)).
      • 2. Change the states of the corresponding descriptor; the io_in_progress bit is set to ‘1’ and the just_dirtied bit is set to ‘0’.
      • 3. Depending on the situation, the XLogFlush() function is invoked to write WAL data on the WAL buffer to the current WAL segment file (details are omitted; WAL and the XLogFlush function are described in Chapter 9).
      • 4. Flush the victim page data to storage.
      • 5. Change the states of the corresponding descriptor; the io_in_progress bit is set to ‘0’ and the valid bit is set to ‘1’.
      • 6. Release the io_in_progress_lock and the content_lock.
    • (4) Acquire the old BufMappingLock partition that covers the slot that contains the old entry, in exclusive mode.

    • (5) Acquire the new BufMappingLock partition and insert the new entry to the buffer table:

      • 1. Create the new entry comprised of the new buffer_tag ‘Tag_M’ and the victim’s buffer_id. (The buffer_id is reused because it only identifies the descriptor and the buffer pool slot, not the page stored there.)
      • 2. Acquire the new BufMappingLock partition that covers the slot containing the new entry in exclusive mode.
      • 3. Insert the new entry to the buffer table.


    • (6) Delete the old entry from the buffer table, and release the old BufMappingLock partition.
    • (7) Load the desired page data from the storage to the victim buffer slot. Then, update the flags of the descriptor with buffer_id 5; the dirty bit is set to ‘0’ and the other bits are initialized.
    • (8) Release the new BufMappingLock partition.
    • (9) Access the buffer pool slot with buffer_id 5.


8.4.4. Page Replacement Algorithm: Clock Sweep
  • The rest of this section describes the clock-sweep algorithm. This algorithm is a variant of NFU (Not Frequently Used) with low overhead; it selects less frequently used pages efficiently.
  • Imagine buffer descriptors as a circular list (Fig. 8.12). The nextVictimBuffer, an unsigned 32-bit integer, is always pointing to one of the buffer descriptors and rotates clockwise. The pseudocode and description of the algorithm are as follows:
      WHILE true
(1)     Obtain the candidate buffer descriptor pointed to by nextVictimBuffer
(2)     IF the candidate descriptor is unpinned THEN
(3)         IF the candidate descriptor's usage_count == 0 THEN
                BREAK WHILE LOOP  /* the corresponding slot of this descriptor is the victim slot. */
            ELSE
                Decrease the candidate descriptor's usage_count by 1
            END IF
        END IF
(4)     Advance nextVictimBuffer to the next one
      END WHILE
(5)   RETURN buffer_id of the victim
  • (1) Obtain the candidate buffer descriptor pointed to by nextVictimBuffer.
  • (2) If the candidate buffer descriptor is unpinned (that is, no backend is currently accessing it), proceed to step (3); otherwise, proceed to step (4).
  • (3) If the usage_count of the candidate descriptor is 0, select the corresponding slot of this descriptor as the victim and proceed to step (5); otherwise, decrease this descriptor’s usage_count by 1 and proceed to step (4).
  • (4) Advance the nextVictimBuffer to the next descriptor (if at the end, wrap around) and return to step (1). Repeat until a victim is found.
  • (5) Return the buffer_id of the victim.

Fig. 8.12. Clock Sweep.


  • A specific example is shown in Fig. 8.12. The buffer descriptors are shown as blue or cyan boxes, and the numbers in the boxes show the usage_count of each descriptor.
    • 1) The nextVictimBuffer points to the first descriptor (buffer_id 1); however, this descriptor is skipped because it is pinned.
    • 2) The nextVictimBuffer points to the second descriptor (buffer_id 2). This descriptor is unpinned but its usage_count is 2; thus, the usage_count is decreased by 1 and the nextVictimBuffer advances to the third candidate.
    • 3) The nextVictimBuffer points to the third descriptor (buffer_id 3). This descriptor is unpinned and its usage_count is 0; thus, this is the victim in this round.
  • Whenever the nextVictimBuffer sweeps an unpinned descriptor, its usage_count is decreased by 1. Therefore, if unpinned descriptors exist in the buffer pool, this algorithm can always find a victim whose usage_count is 0 by rotating the nextVictimBuffer.
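
As a rough illustration, the pseudocode can be written as a small single-threaded Python sketch and run against the three descriptors from the example. The names `BufferDesc` and `clock_sweep` are hypothetical, a refcount above 0 stands in for "pinned", and the usage_count of the pinned first descriptor is an assumed value, since the text does not state it. The early check for "everything pinned" is an addition to keep the toy loop from spinning forever.

```python
from dataclasses import dataclass


@dataclass
class BufferDesc:
    buffer_id: int
    refcount: int      # > 0 means the buffer is pinned
    usage_count: int


def clock_sweep(descriptors, next_victim):
    """Return (victim buffer_id, final position of next_victim)."""
    if all(d.refcount > 0 for d in descriptors):
        raise RuntimeError("no unpinned buffers available")
    n = len(descriptors)
    while True:
        cand = descriptors[next_victim]              # (1)
        if cand.refcount == 0:                       # (2)
            if cand.usage_count == 0:                # (3)
                return cand.buffer_id, next_victim   # (5)
            cand.usage_count -= 1
        next_victim = (next_victim + 1) % n          # (4) wrap around at the end


# The example above: buffer 1 is pinned (its usage_count of 3 is assumed),
# buffer 2 is unpinned with usage_count 2, buffer 3 is unpinned with usage_count 0.
descs = [
    BufferDesc(buffer_id=1, refcount=1, usage_count=3),
    BufferDesc(buffer_id=2, refcount=0, usage_count=2),
    BufferDesc(buffer_id=3, refcount=0, usage_count=0),
]
victim, hand = clock_sweep(descs, next_victim=0)
print(victim, descs[1].usage_count)
```

In this run the pinned descriptor is left untouched, buffer 2 has its usage_count reduced from 2 to 1, and buffer 3 is returned as the victim, matching the three steps of the example.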

8.5. Ring Buffer

  • When reading or writing a huge table, PostgreSQL uses a ring buffer rather than the buffer pool. The ring buffer is a small, temporary buffer area. When any of the conditions listed below is met, a ring buffer is allocated in shared memory:

    • Bulk-reading

      • When a relation whose size exceeds one-quarter of the buffer pool size (shared_buffers/4) is scanned. In this case, the ring buffer size is 256 KB.
    • Bulk-writing

      • When one of the SQL commands listed below is executed. In this case, the ring buffer size is 16 MB.
      • COPY FROM command.
      • CREATE TABLE AS command.
      • CREATE MATERIALIZED VIEW or REFRESH MATERIALIZED VIEW command.
      • ALTER TABLE command.
    • Vacuum-processing

      • When autovacuum performs vacuum processing. In this case, the ring buffer size is 256 KB.
  • The allocated ring buffer is released immediately after use.
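
The conditions above can be summarised as a small decision function. This is a sketch under stated assumptions, not PostgreSQL's implementation: the function name `ring_buffer_size` and the access labels `"bulk_read"`, `"bulk_write"` and `"vacuum"` are hypothetical, and the sizes are simply the ones quoted in the list.

```python
KB = 1024
MB = 1024 * KB


def ring_buffer_size(access, relation_bytes=0, shared_buffers_bytes=0):
    """Return the ring buffer size in bytes, or None if the buffer pool is used."""
    if access == "bulk_read":
        # Only relations larger than shared_buffers/4 get a ring buffer.
        if relation_bytes * 4 > shared_buffers_bytes:
            return 256 * KB
        return None
    if access == "bulk_write":
        return 16 * MB
    if access == "vacuum":
        return 256 * KB
    return None


shared_buffers = 128 * MB  # example setting, chosen for illustration
print(ring_buffer_size("bulk_read", 40 * MB, shared_buffers))
print(ring_buffer_size("bulk_read", 32 * MB, shared_buffers))
print(ring_buffer_size("bulk_write"))
```

With a 128 MB buffer pool, a 40 MB relation exceeds the 32 MB threshold and is scanned through a 256 KB ring, while a relation of exactly 32 MB does not exceed it and goes through the buffer pool as usual.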

  • The benefit of the ring buffer is obvious. If a backend process reads a huge table without using a ring buffer, all stored pages in the buffer pool can be evicted (kicked out) to make room for it; therefore, the cache hit ratio decreases. The ring buffer avoids this issue.

8.6. Flushing Dirty Pages

  • In addition to replacing victim pages, the checkpointer and background writer processes flush dirty pages to storage. Both processes have the same function (flushing dirty pages); however, they have different roles and behaviours.
  • The checkpointer process writes a checkpoint record to the WAL segment file and flushes dirty pages whenever checkpointing starts. Section 9.7 describes checkpointing and when it begins.
  • The role of the background writer is to reduce the impact of the intensive writing done by checkpointing. The background writer flushes dirty pages little by little, with minimal impact on database activity. By default, the background writer wakes every 200 msec (defined by bgwriter_delay) and flushes at most bgwriter_lru_maxpages pages (100 by default).
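
To get a feel for what these defaults mean, the following back-of-the-envelope sketch computes an upper bound on the background writer's flush rate. It assumes the default 8 KB page size and ignores the time spent on the writes themselves, so the real rate is lower; the function name is hypothetical.

```python
PAGE_SIZE = 8192  # bytes; assumes the default PostgreSQL block size


def bgwriter_max_pages_per_second(bgwriter_delay_ms=200, bgwriter_lru_maxpages=100):
    """Upper bound on pages flushed per second by the background writer."""
    rounds_per_second = 1000 / bgwriter_delay_ms
    return rounds_per_second * bgwriter_lru_maxpages


pages = bgwriter_max_pages_per_second()
mib_per_second = pages * PAGE_SIZE / (1024 * 1024)
print(pages, mib_per_second)
```

With the defaults this gives at most 500 pages per second, or roughly 3.9 MiB per second, which illustrates why the background writer only smooths out writes rather than replacing checkpoints.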
