《PostgreSQL 流程—全表遍历》
《PostgreSQL重启恢复—Checkpoint&Redo》
在《PostgreSQL 流程—全表遍历》中我们讲解过一个函数heapgetpage,该函数用于获取页面中或有可见的元组。在全表遍历中,我们遗留了一个问题,就是如何判断元组的可见性,本文就来重点描述关于可见性的相关问题。
考虑这样一个场景,当一个进程P1正在修改某元组R,进程P2希望读取R,此时应该如何处理?这是一个典型的多进程并发控制的场景。对于多进程并发控制,常用的方式就是采用读写锁,即读加共享锁,写加互斥锁,从而使读写操作互斥。但在数据库中,这样的方式会极大的降低系统的并发性。为此数据库采用了Mutil-Version Concurrency Control(MVCC,多版本并发控制),为元组保留多个不同的版本,读写操作作用于不同的版本,从而可以并行执行,解决读写互斥的问题。关于MVCC更详细的内容可以参见相关资料。
在MVCC中,对某条元组执行update操作后,就会存在两个版本,修改前的版本和修改后的版本。多次修改后,就会存在多个版本。然而对于一个查询操作,最终只可能返回其中的一个版本。那么应该返回哪个版本呢?这也就是所谓的可见性问题。即多个版本中,只有一个版本对于当前事务可见。那么如何判断元组是否可见呢?元组可见的准则如下:
在当前事务开启之前提交的最新版本,对当前事务可见。
该准则包含了两个要点:
下面分别阐述PostgreSQL是如何校验这两个要点的。
PostgreSQL会为每个事务生成一个事务ID,即xid。xid是一个32位的整数,按照事务开启的先后顺序递增,所以通过xid就可以判断事务开启的先后顺序。当事务提交后,PostgreSQL会记录clog,表明该事务已经提交,所以通过clog也可以判断事务是否提交。通过xid和clog可以很方便的判断事务何时开启以及是否提交。此外PostgreSQL还维护了全局的活跃事务数组,里面存放了所有当前活跃事务的xid。所以当一个事务开启时,我们可以很方便的获取两个重要的信息:
此时有哪些事务是活跃的
获取这些活跃事务中最小的xid,即xmin。那么所有< xmin的事务一定是当前事务开启之前提交的。这些事务所作的操作对于当前事务都是可见的。
这些活跃事务本身在当前事务开启后依然是活跃的,所以这些事务的操作对当前事务是不可见的。
此时有哪些事务是提交的
获取这些提交事务中最大的xid,即xmax。那么所有> xmax的事务一定是当前事务开启之后还未提交的。这些事务所作的操作对于当前事务都是不可见的。
这里还有一个细节,xmax实际是最大提交事务的xid+1,代码如下:
/* procarray.c 1565行 */ /* xmax is always latestCompletedXid + 1 */ xmax = ShmemVariableCache->latestCompletedXid; Assert(TransactionIdIsNormal(xmax)); TransactionIdAdvance(xmax); /* xmax++ */
然后所有 >= xmax的事务所作的操作,对当前事务不可见。
设某事务T的事务id为xid,根据上述性质,我们可以很方便的得出事务T可见性的判断流程:
比较xid与xmin
如果xid < xmin,则T对当前事务可见,否则执行步骤2。
比较xid与xmax
如果xid >= xmax,则T对当前事务不可见,否则执行步骤3。
在活跃事务数组中查找xid
如果xid存在于活跃事务数组,则T对当前事务不可见,否则T对当前事务可见。
对于步骤1,是一个优化步骤,去掉步骤1,不会出现任何正确性问题,只不过对于xid < xmin的事务都需要到在步骤3中通过遍历活跃数组的方式来判断可见性,这样性能会比较低。对于步骤2,是一个影响正确性的步骤,因为对于在当前事务开启之后再开启的事务,可能不会出现在活跃事务数组中(后面会说明为什么会出现这样的情况),显然这些事务对当前事务都不可见,所以需要步骤2来过滤。(那些未出现在活跃事务数组中的xid一定>= xmax,后面会具体说明)。
下面我们来看看事务可见性判断的实现函数,XidInMVCCSnapshot。
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
uint32 i;
/*
* Make a quick range check to eliminate most XIDs without looking at the
* xip arrays. Note that this is OK even if we convert a subxact XID to
* its parent below, because a subxact with XID < xmin has surely also got
* a parent with XID < xmin, while one with XID >= xmax must belong to a
* parent that was not yet committed at the time of this snapshot.
*/
/*
* Any xid < xmin is not in-progress
* 步骤1:比较xid与xmin
*/
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
/*
* Any xid >= xmax is in-progress
* 步骤2:比较xid与xmax
*/
if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
return true;
/*
* Snapshot information is stored slightly differently in snapshots taken
* during recovery.
*/
if (!snapshot->takenDuringRecovery)
{
/*
* If the snapshot contains full subxact data, the fastest way to
* check things is just to compare the given XID against both subxact
* XIDs and top-level XIDs. If the snapshot overflowed, we have to
* use pg_subtrans to convert a subxact XID to its parent XID, but
* then we need only look at top-level XIDs not subxacts.
*/
if (!snapshot->suboverflowed)
{
/* we have full data, so search subxip */
int32 j;
for (j = 0; j < snapshot->subxcnt; j++)
{
if (TransactionIdEquals(xid, snapshot->subxip[j]))
return true;
}
/* not there, fall through to search xip[] */
}
else
{
/*
* Snapshot overflowed, so convert xid to top-level. This is safe
* because we eliminated too-old XIDs above.
*/
xid = SubTransGetTopmostTransaction(xid);
/*
* If xid was indeed a subxact, we might now have an xid < xmin,
* so recheck to avoid an array scan. No point in rechecking
* xmax.
*/
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
}
/* 步骤3:在活跃事务数组中查找xid */
for (i = 0; i < snapshot->xcnt; i++)
{
if (TransactionIdEquals(xid, snapshot->xip[i]))
return true;
}
}
else
{
int32 j;
/*
* In recovery we store all xids in the subxact array because it is by
* far the bigger array, and we mostly don't know which xids are
* top-level and which are subxacts. The xip array is empty.
*
* We start by searching subtrans, if we overflowed.
*/
if (snapshot->suboverflowed)
{
/*
* Snapshot overflowed, so convert xid to top-level. This is safe
* because we eliminated too-old XIDs above.
*/
xid = SubTransGetTopmostTransaction(xid);
/*
* If xid was indeed a subxact, we might now have an xid < xmin,
* so recheck to avoid an array scan. No point in rechecking
* xmax.
*/
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
}
/*
* We now have either a top-level xid higher than xmin or an
* indeterminate xid. We don't know whether it's top level or subxact
* but it doesn't matter. If it's present, the xid is visible.
*/
for (j = 0; j < snapshot->subxcnt; j++)
{
if (TransactionIdEquals(xid, snapshot->subxip[j]))
return true;
}
}
return false;
}
这里需要注意一下XidInMVCCSnapshot的返回值,XidInMVCCSnapshot返回true表示xid对当前事务不可见。
说完事务可见性,我们来看看元组可见性,元组可见性是基于事务可见性的。在PostgreSQL中,每条元组上记录了两个xid,t_xmin和t_xmax,注意这里的t_xmin、t_xmax和前面事务可见性的xmin、xmax完全不是一回事!!在元组中,t_xmin表示对该元组执行插入操作的事务的xid,t_xmax表示对该元组执行删除操作的事务的xid,如果元组没有被删除t_xmax为0。在PostgreSQL中,update操作删除+插入操作,先删除原始的元组,再插入新的元组,所以t_xmin和t_xmax逻辑也完全适用于更新操作。下面我们先从源代码的角度来调试下t_xmin和t_xmax。
用例1:元组插入
-- 建表
drop table if exists t1;
create table t1(a int);
-- 插入
begin;
select txid_current(); -- 查看当前事务id
insert into t1 values(1); -- 该元组的t_xmin等于txid_current(),t_xmax为0
commit;
-- 查询
select * from t1; -- 验证t_xmin和t_xmax
插入操作的事务id如下:
执行查询操作,我们可以看到元组的xmin和xmax的情况,如下图:
HeapTupleSatisfiesMVCC是判断元组可见性的函数,后面我们会详细阐述。从图中可以看出,元组的t_xmin为688,就是前面插入事务的xid,而由于此时元组没有被删除,所以t_xmax为0。
用例2:元组更新
接着我们把用例1插入的元组进行更新。
-- 更新
begin;
select txid_current(); -- 查看当前事务id
update t1 set a = 2; -- 该操作会将a = 1的元组的t_xmax变为txid_current(),
-- 然后插入a = 2的元组,该元组的t_xmin为txid_current(),t_xmax为0
commit;
-- 查询
select * from t1; -- 验证t_xmin和t_xmax
执行查询操作,此时我们会看到两条元组,由于是全表遍历,所以会先看到原始元组(即a = 1的元组),然后看到更新后的元组(即a = 2的元组)。
可以看到原始元组的t_xmax被改为了698,正是此次update操作的事务id。
这是update操作插入的新元组,新元组的t_xmin为此次update操作的事务id,新元组的t_xmax为0。
用例3:元组删除
接着我们把用例2更新的元组进行删除。
-- 删除
begin;
select txid_current(); -- 查看当前事务id
delete from t1; -- 该操作会将元组的t_xmax变为txid_current(),
commit;
-- 查询
select * from t1; -- 验证t_xmin和t_xmax
明白了更新操作,删除其实就非常简单了,删除操作会修改元组的t_xmax。
明白了元组的t_xmin和t_xmax,再来谈元组的可见性其实就非常简单了。
如果某元组的t_xmin对当前事务不可见,那么该元组对当前事务不可见。
这是显而易见的,t_xmin不可见意味着,插入这条元组的事务对当前事务不可见,自然该元组对当前事务也就不可见。
如果元组的t_xmax对当前事务可见,那么该元组对当前事务不可见。
t_xmax对当前事务可见,就意味着删除对当前事务可见,删除既然可见,就说明了对当前事务而言,该元组已经被删除了,元组既然被删除自然也就不可见了。
下面我们来看看HeapTupleSatisfiesMVCC函数的实现。
在文件tqual.h和tqual.c中定义了一系列HeapTupleSatisfies函数,如下图所示:
这些函数用于在不同情况下判断元组的可见性。而在查询时用于判断元组可见性的函数为HeapTupleSatisfiesMVCC,其实现如下:
bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
HeapTupleHeader tuple = htup->t_data;
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
/* 关键函数,暂时忽略 */
if (!HeapTupleHeaderXminCommitted(tuple))
{
/* 关键函数,暂时忽略 */
if (HeapTupleHeaderXminInvalid(tuple))
return false;
/*
* Used by pre-9.0 binary upgrades
* 兼容老版本,可以忽略
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
if (TransactionIdIsCurrentTransactionId(xvac))
return false;
if (!XidInMVCCSnapshot(xvac, snapshot))
{
if (TransactionIdDidCommit(xvac))
{
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
return false;
}
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
InvalidTransactionId);
}
}
/*
* Used by pre-9.0 binary upgrades
* 兼容老版本,可以忽略
*/
else if (tuple->t_infomask & HEAP_MOVED_IN)
{
TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
if (!TransactionIdIsCurrentTransactionId(xvac))
{
if (XidInMVCCSnapshot(xvac, snapshot))
return false;
if (TransactionIdDidCommit(xvac))
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
InvalidTransactionId);
else
{
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
return false;
}
}
}
/* 判断该元组是否由当前事务插入 */
else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
{
if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
return false; /* inserted after scan started */
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid */
return true;
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)) /* not deleter */
return true;
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
TransactionId xmax;
xmax = HeapTupleGetUpdateXid(tuple);
/* not LOCKED_ONLY, so it has to have an xmax */
Assert(TransactionIdIsValid(xmax));
/* updating subtransaction must have aborted */
if (!TransactionIdIsCurrentTransactionId(xmax))
return true;
else if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
return true; /* updated after scan started */
else
return false; /* updated before scan started */
}
if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
{
/* deleting subtransaction must have aborted */
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
return true;
}
if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
return true; /* deleted after scan started */
else
return false; /* deleted before scan started */
}
/* 判断元组t_xmin的可见性 */
else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false; /* 如果t_xmin不可见,则元组不可见 */
/* 关键函数,暂时忽略 */
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetRawXmin(tuple));
else
{
/* it must have aborted or crashed */
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
return false;
}
}
else
{
/*
* xmin is committed, but maybe not according to our snapshot
* 判断元组t_xmin的可见性,如果t_xmin不可见,则元组不可见
*/
if (!HeapTupleHeaderXminFrozen(tuple) &&
XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false; /* treat as still in progress */
}
/* by here, the inserting transaction has committed */
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
TransactionId xmax;
/* already checked above */
Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
xmax = HeapTupleGetUpdateXid(tuple);
/* not LOCKED_ONLY, so it has to have an xmax */
Assert(TransactionIdIsValid(xmax));
if (TransactionIdIsCurrentTransactionId(xmax))
{
if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
return true; /* deleted after scan started */
else
return false; /* deleted before scan started */
}
if (XidInMVCCSnapshot(xmax, snapshot))
return true;
if (TransactionIdDidCommit(xmax))
return false; /* updating transaction committed */
/* it must have aborted or crashed */
return true;
}
if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
{
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
{
if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
return true; /* deleted after scan started */
else
return false; /* deleted before scan started */
}
/* 判断t_xmax的可见性, t_xmax不可见,则元组可见*/
if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
return true;
if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
{
/* it must have aborted or crashed */
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
return true;
}
/* xmax transaction committed */
SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
HeapTupleHeaderGetRawXmax(tuple));
}
else
{
/*
* xmax is committed, but maybe not according to our snapshot
* 判断t_xmax的可见性, t_xmax不可见,则元组可见
*/
if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
return true; /* treat as still in progress */
}
/* xmax transaction committed */
return false;
}
我们忽略了HeapTupleSatisfiesMVCC中很多重要的问题,比如HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid以及TransactionIdDidCommit的作用,关于这几个函数,会在本文最后进行说明,现在我们只关注对于t_xmin和t_xmax的判断。回到前面在MVCC中我们提出的可见性的两个要素:
其实前面的描述,通过事务的可见性判断规则,我们可以判断出一个事务是否在当前事务开启前提交。那么如何获取元组的最新版本呢?其实从元组的可见性部分,我们不难看出根本不存在这个问题,对于一个事务来讲,只可能有一个版本的元组对其可见。为什么?首先,元组存在多个版本的原因只可能是对改元组执行了多次更新,从前面的描述中不难发现,更新操作有一个特点,原始元组的t_xmax与新元组的t_xmin相同,而t_xmax和t_xmin对于元组的可见性有相反的作用。所以,要么原始元组与新元组都不可见,要么只有一个可见!
前面我们讲解了XidInMVCCSnapshot作用与实现,现在我们来讨论一下XidInMVCCSnapshot的名字,哈哈!我们分解下这个函数名Xid_In_MVCC_Snapshot。xid已经介绍过,in就是字面意思,MVCC也已经介绍过,那么什么是Snapshot?PostgreSQL可是出了名的见名知意啊。
在解释Snapshot之前,我们来回顾下heapgetpage的流程:获取页面中一条元组,判断元组可见性,获取下一条元组,判断元组可见性,循环往复直到页面尽头。通过前面的描述我们知道,元组可见性的判断需要依赖三个元素:
xmin、xmax、活跃事务数组。这三个元组会随着数据库事务的开启和提交实时变化。那么如果我们在查询时直接使用全局xmin、xmax和活跃事务数组来判断可见性可以么?考虑一个简单的场景:
-- 事务T1
begin;
insert into table1 values(1),(2),(3),(4),(5);
commit;
-- 事务T2
select * from table1;
事务T1向table1中插入了5条元组。假设事务T2在全表遍历时使用全局事务数组来进行可见性判断,T2遍历到1,2两条元组时,事务T1还没有提交,所以1,2元组不可见,当遍历到元组3时,T1提交了,所以全局xmin、xmax和活跃事务数组发生了变化,T1不再是活跃的,于是元组3,4,5变为了可见!这显然违背了事务的原子性。所以在判断元组可见性时一定不能使用全局xmin、xmax和活跃事务数组。而是需要将xmin、xmax和活跃事务数组做一个快照,这也就是Snapshot!如此才能保证同一个事务产生的元组具有相同的可见性!
明白了Snapshot的作用之后,其实就非常好理解事务的隔离级别了。
read-uncommit
脏读:不判断元组可见性,只要存在就读。这个比较非主流,一般没有数据库会支持这个级别,也没这种需求。
read-commit
提交读:在每次执行SQL语句之前,做一次Snapshot,保证语句执行过程中可见性是一致的,但同一个事务语句与语句之前的Snapshot可能不一样。
repeatable-read
可重读:在事务执行第一条语句时,做一次Snapshot,后面执行的所有语句都使用这个Snapshot,保证事务执行过程中可见性是一致的。
serializable
串行化:事务之前串行执行,这个也就不用考虑可见性问题了,串行是由锁来实现。
下面我们来看看Snapshot是如何产生的。
Snapshot是通过GetTransactionSnapshot函数产生的,代码如下:
Snapshot
GetTransactionSnapshot(void)
{
/*
* Return historic snapshot if doing logical decoding. We'll never need a
* non-historic transaction snapshot in this (sub-)transaction, so there's
* no need to be careful to set one up for later calls to
* GetTransactionSnapshot().
*/
if (HistoricSnapshotActive())
{
Assert(!FirstSnapshotSet);
return HistoricSnapshot;
}
/* First call in transaction? */
if (!FirstSnapshotSet)
{
/*
* Don't allow catalog snapshot to be older than xact snapshot. Must
* do this first to allow the empty-heap Assert to succeed.
*/
InvalidateCatalogSnapshot();
Assert(pairingheap_is_empty(&RegisteredSnapshots));
Assert(FirstXactSnapshot == NULL);
if (IsInParallelMode())
elog(ERROR,
"cannot take query snapshot during a parallel operation");
/*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
* directly. Furthermore, if we're running in serializable mode,
* predicate.c needs to wrap the snapshot fetch in its own processing.
*/
if (IsolationUsesXactSnapshot())
{
/* First, create the snapshot in CurrentSnapshotData */
if (IsolationIsSerializable())
CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
else
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
/* Make a saved copy */
CurrentSnapshot = CopySnapshot(CurrentSnapshot);
FirstXactSnapshot = CurrentSnapshot;
/* Mark it as "registered" in FirstXactSnapshot */
FirstXactSnapshot->regd_count++;
pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
}
else
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
FirstSnapshotSet = true;
return CurrentSnapshot;
}
if (IsolationUsesXactSnapshot())
return CurrentSnapshot;
/* Don't allow catalog snapshot to be older than xact snapshot. */
InvalidateCatalogSnapshot();
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
return CurrentSnapshot;
}
该函数有两个要点:
FirstSnapshotSet变量
该变量用来控制是直接只用当前的snapshot还是重新生成snapshot。初始值为false,生成snapshot后设置为true。如果隔离级别是read-commit,则当sql执行完成后会将FirstSnapshotSet设为false,如果是repeatable-read只有在事务结束后才会将FirstSnapshotSet设为false。通过监视FirstSnapshotSet变量的变化(添加数据断点&FirstSnapshotSet)可以观察语句结束和事务提交的情况。
GetSnapshotData函数
该函数用于生成snapshot,下面会详细讲解。
GetSnapshotData是实际生成snapshot的函数,代码如下:
Snapshot
GetSnapshotData(Snapshot snapshot)
{
ProcArrayStruct *arrayP = procArray;
TransactionId xmin;
TransactionId xmax;
TransactionId globalxmin;
int index;
int count = 0;
int subcount = 0;
bool suboverflowed = false;
volatile TransactionId replication_slot_xmin = InvalidTransactionId;
volatile TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
Assert(snapshot != NULL);
/*
* Allocating space for maxProcs xids is usually overkill; numProcs would
* be sufficient. But it seems better to do the malloc while not holding
* the lock, so we can't look at numProcs. Likewise, we allocate much
* more subxip storage than is probably needed.
*
* This does open a possibility for avoiding repeated malloc/free: since
* maxProcs does not change at runtime, we can simply reuse the previous
* xip arrays if any. (This relies on the fact that all callers pass
* static SnapshotData structs.)
*
* 要点1
*/
if (snapshot->xip == NULL)
{
/*
* First call for this snapshot. Snapshot is same size whether or not
* we are in recovery, see later comments.
*/
snapshot->xip = (TransactionId *)
malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
if (snapshot->xip == NULL)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
Assert(snapshot->subxip == NULL);
snapshot->subxip = (TransactionId *)
malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
if (snapshot->subxip == NULL)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
}
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyPgXact->xmin.
* 要点2
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
/*
* xmax is always latestCompletedXid + 1
* 要点3
*/
xmax = ShmemVariableCache->latestCompletedXid;
Assert(TransactionIdIsNormal(xmax));
TransactionIdAdvance(xmax);
/* initialize xmin calculation with xmax */
globalxmin = xmin = xmax;
snapshot->takenDuringRecovery = RecoveryInProgress();
if (!snapshot->takenDuringRecovery)
{
int *pgprocnos = arrayP->pgprocnos;
int numProcs;
/*
* Spin over procArray checking xid, xmin, and subxids. The goal is
* to gather all active xids, find the lowest xmin, and try to record
* subxids.
* 要点4
*/
numProcs = arrayP->numProcs;
for (index = 0; index < numProcs; index++)
{
int pgprocno = pgprocnos[index];
/* 要点5 */
volatile PGXACT *pgxact = &allPgXact[pgprocno];
TransactionId xid;
/*
* Backend is doing logical decoding which manages xmin
* separately, check below.
*/
if (pgxact->vacuumFlags & PROC_IN_LOGICAL_DECODING)
continue;
/* Ignore procs running LAZY VACUUM */
if (pgxact->vacuumFlags & PROC_IN_VACUUM)
continue;
/* Update globalxmin to be the smallest valid xmin */
xid = pgxact->xmin; /* fetch just once */
if (TransactionIdIsNormal(xid) &&
NormalTransactionIdPrecedes(xid, globalxmin))
globalxmin = xid;
/*
* Fetch xid just once - see GetNewTransactionId
* 要点6
*/
xid = pgxact->xid;
/*
* If the transaction has no XID assigned, we can skip it; it
* won't have sub-XIDs either. If the XID is >= xmax, we can also
* skip it; such transactions will be treated as running anyway
* (and any sub-XIDs will also be >= xmax).
*/
if (!TransactionIdIsNormal(xid)
|| !NormalTransactionIdPrecedes(xid, xmax))
continue;
/*
* We don't include our own XIDs (if any) in the snapshot, but we
* must include them in xmin.
* 要点7
*/
if (NormalTransactionIdPrecedes(xid, xmin))
xmin = xid;
if (pgxact == MyPgXact)
continue;
/*
* Add XID to snapshot.
* 要点8
*/
snapshot->xip[count++] = xid;
/*
* Save subtransaction XIDs if possible (if we've already
* overflowed, there's no point). Note that the subxact XIDs must
* be later than their parent, so no need to check them against
* xmin. We could filter against xmax, but it seems better not to
* do that much work while holding the ProcArrayLock.
*
* The other backend can add more subxids concurrently, but cannot
* remove any. Hence it's important to fetch nxids just once.
* Should be safe to use memcpy, though. (We needn't worry about
* missing any xids added concurrently, because they must postdate
* xmax.)
*
* Again, our own XIDs are not included in the snapshot.
*/
if (!suboverflowed)
{
if (pgxact->overflowed)
suboverflowed = true;
else
{
int nxids = pgxact->nxids;
if (nxids > 0)
{
volatile PGPROC *proc = &allProcs[pgprocno];
memcpy(snapshot->subxip + subcount,
(void *) proc->subxids.xids,
nxids * sizeof(TransactionId));
subcount += nxids;
}
}
}
}
}
else
{
/*
* We're in hot standby, so get XIDs from KnownAssignedXids.
*
* We store all xids directly into subxip[]. Here's why:
*
* In recovery we don't know which xids are top-level and which are
* subxacts, a design choice that greatly simplifies xid processing.
*
* It seems like we would want to try to put xids into xip[] only, but
* that is fairly small. We would either need to make that bigger or
* to increase the rate at which we WAL-log xid assignment; neither is
* an appealing choice.
*
* We could try to store xids into xip[] first and then into subxip[]
* if there are too many xids. That only works if the snapshot doesn't
* overflow because we do not search subxip[] in that case. A simpler
* way is to just store all xids in the subxact array because this is
* by far the bigger array. We just leave the xip array empty.
*
* Either way we need to change the way XidInMVCCSnapshot() works
* depending upon when the snapshot was taken, or change normal
* snapshot processing so it matches.
*
* Note: It is possible for recovery to end before we finish taking
* the snapshot, and for newly assigned transaction ids to be added to
* the ProcArray. xmax cannot change while we hold ProcArrayLock, so
* those newly added transaction ids would be filtered away, so we
* need not be concerned about them.
*/
subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
xmax);
if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
suboverflowed = true;
}
/* fetch into volatile var while ProcArrayLock is held */
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
LWLockRelease(ProcArrayLock);
/*
* Update globalxmin to include actual process xids. This is a slightly
* different way of computing it than GetOldestXmin uses, but should give
* the same result.
*/
if (TransactionIdPrecedes(xmin, globalxmin))
globalxmin = xmin;
/* Update global variables too */
RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(RecentGlobalXmin))
RecentGlobalXmin = FirstNormalTransactionId;
/* Check whether there's a replication slot requiring an older xmin. */
if (TransactionIdIsValid(replication_slot_xmin) &&
NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
RecentGlobalXmin = replication_slot_xmin;
/* Non-catalog tables can be vacuumed if older than this xid */
RecentGlobalDataXmin = RecentGlobalXmin;
/*
* Check whether there's a replication slot requiring an older catalog
* xmin.
*/
if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
RecentGlobalXmin = replication_slot_catalog_xmin;
RecentXmin = xmin;
snapshot->xmin = xmin;
snapshot->xmax = xmax;
snapshot->xcnt = count;
snapshot->subxcnt = subcount;
snapshot->suboverflowed = suboverflowed;
snapshot->curcid = GetCurrentCommandId(false);
/*
* This is a new snapshot, so set both refcounts are zero, and mark it as
* not copied in persistent memory.
*/
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->copied = false;
if (old_snapshot_threshold < 0)
{
/*
* If not using "snapshot too old" feature, fill related fields with
* dummy values that don't require any locking.
*/
snapshot->lsn = InvalidXLogRecPtr;
snapshot->whenTaken = 0;
}
else
{
/*
* Capture the current time and WAL stream location in case this
* snapshot becomes old enough to need to fall back on the special
* "old snapshot" logic.
*/
snapshot->lsn = GetXLogInsertRecPtr();
snapshot->whenTaken = GetSnapshotCurrentTimestamp();
MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
}
return snapshot;
}
该函数较长,我们只关注其中的要点。
要点1:if (snapshot->xip == NULL)
xip是活跃事务链表,在客户端连接服务端后,创建服务进程时进行初始化。具体可以调试服务进程的创建流程。在snapshot->xip创建时会分配GetMaxSnapshotXidCount()个xid的空间,GetMaxSnapshotXidCount()是事务数的上限,如此初始化后就可以直接使用这块空间,而不必每次创建时都需要分配空间,从而提升了性能。
要点2:LWLockAcquire
既然要产生snapshot,显然就要对全局活动事务链进行上锁。
要点3:获取xmax
这个步骤前面已经提到过了。
要点4:numProcs
numProcs当前PostgreSQL的用户连接数,不管该用户连接是否开启了事务。
要点5:allPgXact
存储用户的事务信息,allPgXact中元组个数与numProcs相同。
要点6:pgxact->xid
如果当前连接没有创建事务,那么pgxact->xid就为0。
要点7:xmin = xid
获取xmin,snapshot的活动事务数组中不包含自己,但xmin可能会是自己。
要点8:snapshot->xip
向活动事务数组中添加xid。
最后,我们来解决前面遗留的问题:HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid以及TransactionIdDidCommit的作用。在《PostgreSQL重启恢复—Checkpoint&Redo》中,我们提到过,PostgreSQL在重启恢复时只会无脑地redo所有记录在XLOG中的数据,而不管这些数据对应的事务是否提交。所以,完成重启恢复后,数据库中就包含了一些未提交的事务产生的数据。在我们执行查询操作时,需要过滤掉这些数据。所以现在的问题就变成了:如何在查询时,过滤掉未提交事务产生的数据。所以我们需要一种机制。来判断一条元组上t_xmin和t_xmax是否提交。于是就有了CLOG。
CLOG是一种用于记录事务状态的日志,有如下四种状态:
#define TRANSACTION_STATUS_IN_PROGRESS 0x00
#define TRANSACTION_STATUS_COMMITTED 0x01
#define TRANSACTION_STATUS_ABORTED 0x02
#define TRANSACTION_STATUS_SUB_COMMITTED 0x03
当开启一个事务时,事务的状态为TRANSACTION_STATUS_IN_PROGRESS,当事务提交后状态为TRANSACTION_STATUS_COMMITTED,当用户执行了rollback后事务的状态为TRANSACTION_STATUS_ABORTED。所以通过CLOG可以非常方便的判断事务是否提交,从而判断事务的可见性。而TransactionIdDidCommit
就是通过CLOG来判断事务是否提交的。
然而,CLOG毕竟是日志,访问日志通常是很耗时的,所以如果每次需要判断元组可见性时,都去访问CLOG,那么效率是十分低下的。所以在PostgreSQL中,只有第一次判断元组可见性时,才调用TransactionIdDidCommit读取CLOG,获取可见性之后,就调用SetHintBits
将结果存储到元组的t_infomask
成员中。在后面的判断中只需要访问t_infomask就可以判断一条元组上t_xmin和t_xmax是否提交,而这就是HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid函数的作用。
延伸
前面说到,在查询时,会通过SetHintBits来设置元组的t_infomask成员。而这个操作,是不会产生XLOG的,原因是,我们不需要保证这个操作的持久性。因为即便是系统发生故障,导致修改后的t_infomask没有落盘,也没有关系,判断可见性时再从CLOG中取一次便好。
那如果CLOG不见了呢?我测试过的情况是这样,如果直接把CLOG删了,数据库是启不起来的。但是可以自己构建一个CLOG让数据库启起来,但是元组的可见性肯定就会出问题。