This section takes a brief look at BufferAlloc, a function in the PostgreSQL buffer manager (bufmgr.c) reached via ReadBuffer_common -> BufferAlloc. It is a subroutine of ReadBuffer and handles the lookup of a shared buffer.
BufferDesc
Shared descriptor/state data for a single shared buffer.
/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */
/*
 * BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either.  To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int			buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pid;	/* backend PID of pin-count waiter */
	int			freeNext;		/* link in freelist chain */

	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;
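Because flags, refcount and usage count all live in the single atomic state word (the flag bits above occupy bits 22-31), the buffer manager reads them back with mask/shift macros. The following is a paraphrased sketch of the companion definitions in buf_internals.h; treat it as orientation rather than a verbatim quote:

/* Layout of the state word (paraphrased from buf_internals.h):
 *   bits  0-17 : reference count
 *   bits 18-21 : usage count (for the clock-sweep algorithm)
 *   bits 22-31 : flag bits (BM_LOCKED ... BM_PERMANENT above)
 */
#define BUF_REFCOUNT_ONE        1
#define BUF_REFCOUNT_MASK       ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK     0x003C0000U
#define BUF_USAGECOUNT_ONE      (1U << 18)
#define BUF_USAGECOUNT_SHIFT    18
#define BUF_FLAG_MASK           0xFFC00000U

/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state)   ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)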
BufferTag
Buffer tag identifies which disk block the buffer contains.
/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
	RelFileNode rnode;			/* physical relation identifier */
	ForkNumber	forkNum;
	BlockNumber blockNum;		/* blknum relative to begin of reln */
} BufferTag;
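Because this struct is used directly as the hash key of the buffer mapping table, tags are always initialized and compared field by field. The helper macros look roughly like this (lightly condensed from buf_internals.h):

#define INIT_BUFFERTAG(a,xx_rnode,xx_forkNum,xx_blockNum) \
( \
	(a).rnode = (xx_rnode), \
	(a).forkNum = (xx_forkNum), \
	(a).blockNum = (xx_blockNum) \
)

#define BUFFERTAGS_EQUAL(a,b) \
( \
	RelFileNodeEquals((a).rnode, (b).rnode) && \
	(a).blockNum == (b).blockNum && \
	(a).forkNum == (b).forkNum \
)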
SMgrRelation
smgr.c maintains a hash table of SMgrRelation objects, which are essentially cached file handles.
/*
 * smgr.c maintains a table of SMgrRelation objects, which are essentially
 * cached file handles.  An SMgrRelation is created (if not already present)
 * by smgropen(), and destroyed by smgrclose().  Note that neither of these
 * operations imply I/O, they just create or destroy a hashtable entry.
 * (But smgrclose() may release associated resources, such as OS-level file
 * descriptors.)
 *
 * An SMgrRelation may have an "owner", which is just a pointer to it from
 * somewhere else; smgr.c will clear this pointer if the SMgrRelation is
 * closed.  We use this to avoid dangling pointers from relcache to smgr
 * without having to make the smgr explicitly aware of relcache.  There
 * can't be more than one "owner" pointer per SMgrRelation, but that's
 * all we need.
 *
 * SMgrRelations that do not have an "owner" are considered to be transient,
 * and are deleted at end of transaction.
 */
typedef struct SMgrRelationData
{
	/* rnode is the hashtable lookup key, so it must be first! */
	RelFileNodeBackend smgr_rnode;	/* relation physical identifier */

	/* pointer to owning pointer, or NULL if none */
	struct SMgrRelationData **smgr_owner;

	/*
	 * These next three fields are not actually used or manipulated by smgr,
	 * except that they are reset to InvalidBlockNumber upon a cache flush
	 * event (in particular, upon truncation of the relation).  Higher levels
	 * store cached state here so that it will be reset when truncation
	 * happens.  In all three cases, InvalidBlockNumber means "unknown".
	 */
	BlockNumber smgr_targblock; /* current insertion target block */
	BlockNumber smgr_fsm_nblocks;	/* last known size of fsm fork */
	BlockNumber smgr_vm_nblocks;	/* last known size of vm fork */

	/* additional public fields may someday exist here */

	/*
	 * Fields below here are intended to be private to smgr.c and its
	 * submodules.  Do not touch them from elsewhere.
	 */
	int			smgr_which;		/* storage manager selector */

	/*
	 * for md.c; per-fork arrays of the number of open segments
	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
	 */
	int			md_num_open_segs[MAX_FORKNUM + 1];
	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

	/* if unowned, list link in list of all unowned SMgrRelations */
	struct SMgrRelationData *next_unowned_reln;
} SMgrRelationData;

typedef SMgrRelationData *SMgrRelation;
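For orientation only, a hypothetical snippet using this cached-handle API might look as follows; smgropen(), smgrnblocks() and smgrread() are the real entry points in smgr.c, but the surrounding variables (rnode, blockNum, buffer) are assumed to be set up elsewhere:

	/* Illustrative sketch only: obtain a cached handle and read one block */
	SMgrRelation reln = smgropen(rnode, InvalidBackendId);	/* no I/O, just a hashtable entry */
	BlockNumber nblocks = smgrnblocks(reln, MAIN_FORKNUM);	/* size of the main fork in blocks */

	if (blockNum < nblocks)
		smgrread(reln, MAIN_FORKNUM, blockNum, buffer);		/* read the block into "buffer" */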
RelFileNodeBackend
Augmenting a relfilenode with the backend ID provides all the information needed to locate the physical storage.
/*
 * Augmenting a relfilenode with the backend ID provides all the information
 * we need to locate the physical storage.  The backend ID is InvalidBackendId
 * for regular relations (those accessible to more than one backend), or the
 * owning backend's ID for backend-local relations.  Backend-local relations
 * are always transient and removed in case of a database crash; they are
 * never WAL-logged or fsync'd.
 */
typedef struct RelFileNodeBackend
{
	RelFileNode node;			/* node */
	BackendId	backend;		/* backend */
} RelFileNodeBackend;
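The embedded RelFileNode identifies the physical file by tablespace, database and relation OIDs; its definition (from relfilenode.h, comments abridged) is essentially:

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace OID */
	Oid			dbNode;			/* database OID */
	Oid			relNode;		/* relation filenode OID */
} RelFileNode;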
BufferAlloc is a subroutine of ReadBuffer. It handles the lookup of a shared buffer. If no buffer already holds the requested block, it selects a replacement victim and evicts the old page, but it does NOT read in the new page.
The main processing logic of the function is as follows:
1. Initialization: compute the hash value and partition lock ID from the buffer tag (see the sketch after this list).
2. Check whether the block is already in the buffer pool.
3. The buffer is found in the pool (buf_id >= 0):
3.1 Get the buffer descriptor and pin the buffer.
3.2 If PinBuffer returns false (the buffer is not yet valid), call StartBufferIO; if StartBufferIO returns true (we must do the read ourselves), set *foundPtr to false.
3.3 Return buf.
4. The buffer is not found in the pool (buf_id < 0):
4.1 Release newPartitionLock.
4.2 Loop to find a suitable buffer:
4.2.1 Ensure, while the spinlock is not yet held, that there is a free refcount entry.
4.2.2 Select a victim buffer.
4.2.3 Copy the buffer flags into oldFlags.
4.2.4 Pin the buffer, then release the buffer spinlock.
4.2.5 If the buffer is marked BM_DIRTY, FlushBuffer it.
4.2.6 If the buffer is marked BM_TAG_VALID, compute the old tag's hash code and partition lock ID and lock both the old and the new partition locks;
otherwise only the new partition is needed: lock the new partition lock and reset the old partition lock and old hash value.
4.2.7 Try to make a hashtable entry for the buffer under its new tag.
4.2.8 On a collision (buf_id >= 0), handle it just as if the buffer had been found in the buffer pool in the first place.
4.2.9 If there is no collision (buf_id < 0), lock the buffer header; if the buffer has not been dirtied or pinned in the meantime, the victim is found, so break out of the loop;
otherwise unlock the buffer header, delete the hashtable entry, release the locks, and look for another buffer.
4.3 Now the buffer tag can be renamed; afterwards unlock the buffer header, delete the old hashtable entry, and release the partition locks.
4.4 Call StartBufferIO and set the *foundPtr flag.
4.5 Return buf.
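Step 1 maps the tag to a hash code and then to one of the buffer-mapping partition locks. A condensed view of the machinery involved (paraphrased from buf_table.c and buf_internals.h, where NUM_BUFFER_PARTITIONS defaults to 128):

/* Hash the whole BufferTag using the shared buffer hash table's hash function */
uint32
BufTableHashCode(BufferTag *tagPtr)
{
	return get_hash_value(SharedBufHash, (void *) tagPtr);
}

/* The buffer mapping table is partitioned; each partition has its own LWLock */
#define BufTableHashPartition(hashcode) \
	((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) \
	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
					  BufTableHashPartition(hashcode)].lock)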
/*
 * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
 *		buffer.  If no buffer exists already, selects a replacement
 *		victim and evicts the old page, but does NOT read in new page.
 *
 * "strategy" can be a buffer replacement strategy object, or NULL for
 * the default strategy.  The selected buffer's usage_count is advanced when
 * using the default strategy, but otherwise possibly not (see PinBuffer).
 *
 * The returned buffer is pinned and is already marked as holding the
 * desired page.  If it already did have the desired page, *foundPtr is
 * set true.  Otherwise, *foundPtr is set false and the buffer is marked
 * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
 *
 * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
 * we keep it for simplicity in ReadBuffer.
 *
 * No locks are held either at entry or exit.
 */
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
			BlockNumber blockNum,
			BufferAccessStrategy strategy,
			bool *foundPtr)
{
	BufferTag	newTag;			/* identity of requested block */
	uint32		newHash;		/* hash value for newTag */
	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
	BufferTag	oldTag;			/* previous identity of selected buffer */
	uint32		oldHash;		/* hash value for oldTag */
	LWLock	   *oldPartitionLock;	/* buffer partition lock for it */
	uint32		oldFlags;
	int			buf_id;
	BufferDesc *buf;
	bool		valid;
	uint32		buf_state;

	/* create a tag so we can lookup the buffer */
	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

	/* determine its hash code and partition lock ID */
	newHash = BufTableHashCode(&newTag);
	newPartitionLock = BufMappingPartitionLock(newHash);

	/* see if the block is in the buffer pool already */
	LWLockAcquire(newPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(&newTag, newHash);
	if (buf_id >= 0)
	{
		//---- The buffer was found in the buffer pool
		/*
		 * Found it.  Now, pin the buffer so no one can steal it from the
		 * buffer pool, and check to see if the correct data has been loaded
		 * into the buffer.
		 */
		buf = GetBufferDescriptor(buf_id);

		valid = PinBuffer(buf, strategy);

		/* Can release the mapping lock as soon as we've pinned it */
		LWLockRelease(newPartitionLock);

		*foundPtr = true;

		if (!valid)
		{
			/*
			 * We can only get here if (a) someone else is still reading in
			 * the page, or (b) a previous read attempt failed.  We have to
			 * wait for any active read attempt to finish, and then set up our
			 * own read attempt if the page is still not BM_VALID.
			 * StartBufferIO does it all.
			 */
			if (StartBufferIO(buf, true))
			{
				/*
				 * If we get here, previous attempts to read the buffer must
				 * have failed ... but we shall bravely try again.
				 */
				*foundPtr = false;
			}
		}

		return buf;
	}
	/*
	 * Didn't find it in the buffer pool.  We'll have to initialize a new
	 * buffer.  Remember to unlock the mapping lock while doing the work.
	 */
	LWLockRelease(newPartitionLock);

	/* Loop here in case we have to try another victim buffer */
	for (;;)
	{
		/*
		 * Ensure, while the spinlock's not yet held, that there's a free
		 * refcount entry.
		 */
		ReservePrivateRefCountEntry();

		/*
		 * Select a victim buffer.  The buffer is returned with its header
		 * spinlock still held!
		 */
		buf = StrategyGetBuffer(strategy, &buf_state);

		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

		/* Must copy buffer flags while we still hold the spinlock */
		oldFlags = buf_state & BUF_FLAG_MASK;

		/* Pin the buffer and then release the buffer spinlock */
		PinBuffer_Locked(buf);

		/*
		 * If the buffer was dirty, try to write it out.  There is a race
		 * condition here, in that someone might dirty it after we released it
		 * above, or even while we are writing it out (since our share-lock
		 * won't prevent hint-bit updates).  We will recheck the dirty bit
		 * after re-locking the buffer header.
		 */
		if (oldFlags & BM_DIRTY)
		{
			/*
			 * We need a share-lock on the buffer contents to write it out
			 * (else we might write invalid data, eg because someone else is
			 * compacting the page contents while we write).  We must use a
			 * conditional lock acquisition here to avoid deadlock.  Even
			 * though the buffer was not pinned (and therefore surely not
			 * locked) when StrategyGetBuffer returned it, someone else could
			 * have pinned and exclusive-locked it by the time we get here. If
			 * we try to get the lock unconditionally, we'd block waiting for
			 * them; if they later block waiting for us, deadlock ensues.
			 * (This has been observed to happen when two backends are both
			 * trying to split btree index pages, and the second one just
			 * happens to be trying to split the page the first one got from
			 * StrategyGetBuffer.)
			 */
			if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
										 LW_SHARED))
			{
				//---- Acquired the conditional share-lock on the buffer contents
				/*
				 * If using a nondefault strategy, and writing the buffer
				 * would require a WAL flush, let the strategy decide whether
				 * to go ahead and write/reuse the buffer or to choose another
				 * victim.  We need lock to inspect the page LSN, so this
				 * can't be done inside StrategyGetBuffer.
				 */
				if (strategy != NULL)
				{
					XLogRecPtr	lsn;

					/* Read the LSN while holding buffer header lock */
					buf_state = LockBufHdr(buf);
					lsn = BufferGetLSN(buf);
					UnlockBufHdr(buf, buf_state);

					if (XLogNeedsFlush(lsn) &&
						StrategyRejectBuffer(strategy, buf))
					{
						/* Drop lock/pin and loop around for another buffer */
						LWLockRelease(BufferDescriptorGetContentLock(buf));
						UnpinBuffer(buf, true);
						continue;
					}
				}

				/* OK, do the I/O */
				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
														  smgr->smgr_rnode.node.spcNode,
														  smgr->smgr_rnode.node.dbNode,
														  smgr->smgr_rnode.node.relNode);

				FlushBuffer(buf, NULL);
				LWLockRelease(BufferDescriptorGetContentLock(buf));

				ScheduleBufferTagForWriteback(&BackendWritebackContext,
											  &buf->tag);

				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
														 smgr->smgr_rnode.node.spcNode,
														 smgr->smgr_rnode.node.dbNode,
														 smgr->smgr_rnode.node.relNode);
			}
			else
			{
				/*
				 * Someone else has locked the buffer, so give it up and loop
				 * back to get another one.
				 */
				UnpinBuffer(buf, true);
				continue;
			}
		}
		/*
		 * To change the association of a valid buffer, we'll need to have
		 * exclusive lock on both the old and new mapping partitions.
		 */
		if (oldFlags & BM_TAG_VALID)
		{
			//----------- The buffer still carries a valid tag (BM_TAG_VALID)
			/*
			 * Need to compute the old tag's hashcode and partition lock ID.
			 * XXX is it worth storing the hashcode in BufferDesc so we need
			 * not recompute it here?  Probably not.
			 */
			oldTag = buf->tag;
			oldHash = BufTableHashCode(&oldTag);
			oldPartitionLock = BufMappingPartitionLock(oldHash);

			/*
			 * Must lock the lower-numbered partition first to avoid
			 * deadlocks.
			 */
			if (oldPartitionLock < newPartitionLock)
			{
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
			else if (oldPartitionLock > newPartitionLock)
			{
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
			}
			else
			{
				/* only one partition, only one lock */
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
		}
		else
		{
			//----------- The buffer does not carry a valid tag (no BM_TAG_VALID)
			/* if it wasn't valid, we need only the new partition */
			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			/* remember we have no old-partition lock or tag */
			oldPartitionLock = NULL;
			/* this just keeps the compiler quiet about uninit variables */
			oldHash = 0;
		}
		/*
		 * Try to make a hashtable entry for the buffer under its new tag.
		 * This could fail because while we were writing someone else
		 * allocated another buffer for the same block we want to read in.
		 * Note that we have not yet removed the hashtable entry for the old
		 * tag.
		 */
		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

		if (buf_id >= 0)
		{
			/*
			 * Got a collision. Someone has already done what we were about to
			 * do. We'll just handle this as if it were found in the buffer
			 * pool in the first place.  First, give up the buffer we were
			 * planning to use.
			 */
			UnpinBuffer(buf, true);

			/* Can give up that buffer's mapping partition lock now */
			if (oldPartitionLock != NULL &&
				oldPartitionLock != newPartitionLock)
				LWLockRelease(oldPartitionLock);

			/* remaining code should match code at top of routine */
			buf = GetBufferDescriptor(buf_id);

			valid = PinBuffer(buf, strategy);

			/* Can release the mapping lock as soon as we've pinned it */
			LWLockRelease(newPartitionLock);

			*foundPtr = true;

			if (!valid)
			{
				/*
				 * We can only get here if (a) someone else is still reading
				 * in the page, or (b) a previous read attempt failed.  We
				 * have to wait for any active read attempt to finish, and
				 * then set up our own read attempt if the page is still not
				 * BM_VALID.  StartBufferIO does it all.
				 */
				if (StartBufferIO(buf, true))
				{
					/*
					 * If we get here, previous attempts to read the buffer
					 * must have failed ... but we shall bravely try again.
					 */
					*foundPtr = false;
				}
			}

			return buf;
		}
		/*
		 * Need to lock the buffer header too in order to change its tag.
		 */
		buf_state = LockBufHdr(buf);

		/*
		 * Somebody could have pinned or re-dirtied the buffer while we were
		 * doing the I/O and making the new hashtable entry.  If so, we can't
		 * recycle this buffer; we must undo everything we've done and start
		 * over with a new victim buffer.
		 */
		oldFlags = buf_state & BUF_FLAG_MASK;
		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
			break;				/* the victim buffer is ours */

		/* otherwise undo everything and go look for another victim buffer */
		UnlockBufHdr(buf, buf_state);
		BufTableDelete(&newTag, newHash);
		if (oldPartitionLock != NULL &&
			oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
		LWLockRelease(newPartitionLock);
		UnpinBuffer(buf, true);
	}
	/*
	 * Okay, it's finally safe to rename the buffer.
	 *
	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
	 * paranoia.  We also reset the usage_count since any recency of use of
	 * the old content is no longer relevant.  (The usage_count starts out at
	 * 1 so that the buffer can survive one clock-sweep pass.)
	 *
	 * Make sure BM_PERMANENT is set for buffers that must be written at every
	 * checkpoint.  Unlogged buffers only need to be written at shutdown
	 * checkpoints, except for their "init" forks, which need to be treated
	 * just like permanent relations.
	 */
	buf->tag = newTag;
	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
				   BUF_USAGECOUNT_MASK);
	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
	else
		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

	UnlockBufHdr(buf, buf_state);

	if (oldPartitionLock != NULL)
	{
		BufTableDelete(&oldTag, oldHash);
		if (oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
	}

	LWLockRelease(newPartitionLock);

	/*
	 * Buffer contents are currently invalid.  Try to get the io_in_progress
	 * lock.  If StartBufferIO returns false, then someone else managed to
	 * read it before we did, so there's nothing left for BufferAlloc() to do.
	 */
	if (StartBufferIO(buf, true))
		*foundPtr = false;
	else
		*foundPtr = true;

	return buf;
}
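For context, the caller consumes *foundPtr roughly as follows (a condensed paraphrase of the shared-buffer branch of ReadBuffer_common, not a verbatim quote); the gdb session below ends up exactly in this code:

	/* Condensed paraphrase of the shared-buffer path in ReadBuffer_common() */
	bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
						 strategy, &found);
	if (found)
		pgBufferUsage.shared_blks_hit++;	/* page was already in shared buffers */
	else
		pgBufferUsage.shared_blks_read++;	/* BufferAlloc marked it IO_IN_PROGRESS; we must read it */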
Test script: query the table:
10:01:54 (xdb@[local]:5432)testdb=# select * from t1 limit 10;
Start gdb and set a breakpoint:
(gdb) b BufferAlloc
Breakpoint 1 at 0x8778ad: file bufmgr.c, line 1005.
(gdb) c
Continuing.
Breakpoint 1, BufferAlloc (smgr=0x2267430, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, strategy=0x0,
foundPtr=0x7ffcc97fb4f3) at bufmgr.c:1005
1005 INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
(gdb)
Input parameters:
smgr - pointer to an SMgrRelationData structure
relpersistence - persistence of the relation ('p' = permanent)
forkNum - fork type; MAIN_FORKNUM is the data file, the other forks being the fsm/vm files
blockNum - block number
strategy - buffer access strategy, NULL here
*foundPtr - output parameter
(gdb) p *smgr
$1 = {smgr_rnode = {node = {spcNode = 1663, dbNode = 16402, relNode = 51439}, backend = -1}, smgr_owner = 0x7f86133f3778,
smgr_targblock = 4294967295, smgr_fsm_nblocks = 4294967295, smgr_vm_nblocks = 4294967295, smgr_which = 0,
md_num_open_segs = {0, 0, 0, 0}, md_seg_fds = {0x0, 0x0, 0x0, 0x0}, next_unowned_reln = 0x0}
(gdb) p *smgr->smgr_owner
$2 = (struct SMgrRelationData *) 0x2267430
(gdb) p **smgr->smgr_owner
$3 = {smgr_rnode = {node = {spcNode = 1663, dbNode = 16402, relNode = 51439}, backend = -1}, smgr_owner = 0x7f86133f3778,
smgr_targblock = 4294967295, smgr_fsm_nblocks = 4294967295, smgr_vm_nblocks = 4294967295, smgr_which = 0,
md_num_open_segs = {0, 0, 0, 0}, md_seg_fds = {0x0, 0x0, 0x0, 0x0}, next_unowned_reln = 0x0}
(gdb)
1. Initialization: compute the hash value and partition lock ID from the tag
(gdb) n
1008 newHash = BufTableHashCode(&newTag);
(gdb) p newTag
$4 = {rnode = {spcNode = 1663, dbNode = 16402, relNode = 51439}, forkNum = MAIN_FORKNUM, blockNum = 0}
(gdb) n
1009 newPartitionLock = BufMappingPartitionLock(newHash);
(gdb)
1012 LWLockAcquire(newPartitionLock, LW_SHARED);
(gdb)
1013 buf_id = BufTableLookup(&newTag, newHash);
(gdb) p newHash
$5 = 1398580903
(gdb) p newPartitionLock
$6 = (LWLock *) 0x7f85e5db9600
(gdb) p *newPartitionLock
$7 = {tranche = 59, state = {value = 536870913}, waiters = {head = 2147483647, tail = 2147483647}}
(gdb)
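As an aside (assuming the default NUM_BUFFER_PARTITIONS = 128), this hash code selects buffer-mapping partition 1398580903 % 128 = 39, and newPartitionLock points at that partition's LWLock.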
2. Check whether the block is already in the buffer pool
(gdb) n
1014 if (buf_id >= 0)
(gdb) p buf_id
$8 = -1
4. The buffer is not found in the buffer pool (buf_id < 0)
4.1 Release newPartitionLock
4.2 Loop to find a suitable buffer
4.2.1 Ensure, while the spinlock is not yet held, that there is a free refcount entry --> ReservePrivateRefCountEntry
(gdb) n
1056 LWLockRelease(newPartitionLock);
(gdb)
1065 ReservePrivateRefCountEntry();
(gdb)
4.2.2 Select a victim buffer
(gdb) n
1071 buf = StrategyGetBuffer(strategy, &buf_state);
(gdb) n
1073 Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
(gdb) p buf
$9 = (BufferDesc *) 0x7f85e705fd80
(gdb) p *buf
$10 = {tag = {rnode = {spcNode = 0, dbNode = 0, relNode = 0}, forkNum = InvalidForkNumber, blockNum = 4294967295},
buf_id = 104, state = {value = 4194304}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 54, state = {
value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
(gdb)
4.2.3 Copy the buffer flags into oldFlags
(gdb) n
1076 oldFlags = buf_state & BUF_FLAG_MASK;
(gdb)
4.2.4 Pin the buffer, then release the buffer spinlock
(gdb)
1079 PinBuffer_Locked(buf);
(gdb)
4.2.5 If the buffer is marked BM_DIRTY, FlushBuffer it
1088 if (oldFlags & BM_DIRTY)
(gdb)
4.2.6 If the buffer is marked BM_TAG_VALID, compute the old tag's hash code and partition lock ID and lock both the old and new partition locks;
otherwise only the new partition is needed: lock the new partition lock and reset the old partition lock and old hash value
(gdb)
1166 if (oldFlags & BM_TAG_VALID)
(gdb)
1200 LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
(gdb)
1202 oldPartitionLock = NULL;
(gdb)
1204 oldHash = 0;
(gdb) p oldFlags
$11 = 4194304
(gdb)
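For reference, oldFlags = 4194304 is 0x00400000, i.e. exactly BM_LOCKED (1 << 22), the header-spinlock bit that StrategyGetBuffer left set. Neither BM_DIRTY nor BM_TAG_VALID is present, which is why both the flush-dirty branch (line 1088) and the old-partition branch (line 1166) are skipped in the trace above.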
4.2.7 Try to make a hashtable entry for the buffer under its new tag
(gdb)
1214 buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
(gdb) n
1216 if (buf_id >= 0)
(gdb) p buf_id
$12 = -1
(gdb)
4.2.9 No collision (buf_id < 0): lock the buffer header; since the buffer has not been dirtied or pinned in the meantime, the victim is found and we break out of the loop
(otherwise we would unlock the buffer header, delete the hashtable entry, release the locks, and look for another buffer)
(gdb) n
1267 buf_state = LockBufHdr(buf);
(gdb)
1275 oldFlags = buf_state & BUF_FLAG_MASK;
(gdb)
1276 if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
(gdb)
1277 break;
(gdb)
4.3 Now the buffer tag can be reset; afterwards unlock the buffer header, delete the old hashtable entry, and release the partition locks
1301 buf->tag = newTag;
(gdb)
1302 buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
(gdb)
1305 if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
(gdb)
1306 buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
(gdb)
1310 UnlockBufHdr(buf, buf_state);
(gdb)
1312 if (oldPartitionLock != NULL)
(gdb)
1319 LWLockRelease(newPartitionLock);
(gdb) p *buf
$13 = {tag = {rnode = {spcNode = 1663, dbNode = 16402, relNode = 51439}, forkNum = MAIN_FORKNUM, blockNum = 0},
buf_id = 104, state = {value = 2181300225}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 54, state = {
value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
(gdb)
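To decode the new state value: 2181300225 is 0x82040001, i.e. BM_PERMANENT (1 << 31) | BM_TAG_VALID (1 << 25) | a usage count of 1 (assuming the bit layout sketched earlier) | a refcount of 1 from our own pin, which is exactly what the tag-renaming code above set.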
4.4 Call StartBufferIO and set the *foundPtr flag
(gdb)
1326 if (StartBufferIO(buf, true))
(gdb) n
1327 *foundPtr = false;
(gdb)
4.5 Return buf
(gdb)
1331 return buf;
(gdb)
1332 }
(gdb)
Execution completes and control returns to ReadBuffer_common:
(gdb)
ReadBuffer_common (smgr=0x2267430, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0,
hit=0x7ffcc97fb5eb) at bufmgr.c:747
747 if (found)
(gdb)
750 pgBufferUsage.shared_blks_read++;
(gdb)
DONE!
PG Source Code