PostgreSQL 事务—MVCC

MVCC

预备知识

《PostgreSQL 流程—全表遍历》

《PostgreSQL重启恢复—Checkpoint&Redo》

概述

在《PostgreSQL 流程—全表遍历》中我们讲解过一个函数heapgetpage,该函数用于获取页面中或有可见的元组。在全表遍历中,我们遗留了一个问题,就是如何判断元组的可见性,本文就来重点描述关于可见性的相关问题。

MVCC

考虑这样一个场景,当一个进程P1正在修改某元组R,进程P2希望读取R,此时应该如何处理?这是一个典型的多进程并发控制的场景。对于多进程并发控制,常用的方式就是采用读写锁,即读加共享锁,写加互斥锁,从而使读写操作互斥。但在数据库中,这样的方式会极大的降低系统的并发性。为此数据库采用了Mutil-Version Concurrency Control(MVCC,多版本并发控制),为元组保留多个不同的版本,读写操作作用于不同的版本,从而可以并行执行,解决读写互斥的问题。关于MVCC更详细的内容可以参见相关资料。

在MVCC中,对某条元组执行update操作后,就会存在两个版本,修改前的版本和修改后的版本。多次修改后,就会存在多个版本。然而对于一个查询操作,最终只可能返回其中的一个版本。那么应该返回哪个版本呢?这也就是所谓的可见性问题。即多个版本中,只有一个版本对于当前事务可见。那么如何判断元组是否可见呢?元组可见的准则如下:

在当前事务开启之前提交最新版本,对当前事务可见。

该准则包含了两个要点:

  • 在当前事务开启之前提交。
  • 最新版本。

下面分别阐述PostgreSQL是如何校验这两个要点的。

事务可见性判断

PostgreSQL会为每个事务生成一个事务ID,即xid。xid是一个32位的整数,按照事务开启的先后顺序递增,所以通过xid就可以判断事务开启的先后顺序。当事务提交后,PostgreSQL会记录clog,表明该事务已经提交,所以通过clog也可以判断事务是否提交。通过xid和clog可以很方便的判断事务何时开启以及是否提交。此外PostgreSQL还维护了全局的活跃事务数组,里面存放了所有当前活跃事务的xid。所以当一个事务开启时,我们可以很方便的获取两个重要的信息:

  1. 此时有哪些事务是活跃的

    获取这些活跃事务中最小的xid,即xmin。那么所有< xmin的事务一定是当前事务开启之前提交的。这些事务所作的操作对于当前事务都是可见的。

    这些活跃事务本身在当前事务开启后依然是活跃的,所以这些事务的操作对当前事务是不可见的。

  2. 此时有哪些事务是提交的

    获取这些提交事务中最大的xid,即xmax。那么所有> xmax的事务一定是当前事务开启之后还提交的。这些事务所作的操作对于当前事务都是可见的。

    这里还有一个细节,xmax实际是最大提交事务的xid+1,代码如下:

    /* procarray.c 1565行 */
    /* xmax is always latestCompletedXid + 1 */
    xmax = ShmemVariableCache->latestCompletedXid;
    Assert(TransactionIdIsNormal(xmax));
    TransactionIdAdvance(xmax);  /* xmax++ */
    

    然后所有 >= xmax的事务所作的操作,对当前事务不可见。

设某事务T的事务id为xid,根据上述性质,我们可以很方便的得出事务T可见性的判断流程:

  1. 比较xid与xmin

    如果xid < xmin,则T对当前事务可见,否则执行步骤2。

  2. 比较xid与xmax

    如果xid >= xmax,则T对当前事务不可见,否则执行步骤3。

  3. 在活跃事务数组中查找xid

    如果xid存在于活跃事务数组,则T对当前事务不可见,否则T对当前事务可见。

对于步骤1,是一个优化步骤,去掉步骤1,不会出现任何正确性问题,只不过对于xid < xmin的事务都需要到在步骤3中通过遍历活跃数组的方式来判断可见性,这样性能会比较低。对于步骤2,是一个影响正确性的步骤,因为对于在当前事务开启之后再开启的事务,可能不会出现在活跃事务数组中(后面会说明为什么会出现这样的情况),显然这些事务对当前事务都不可见,所以需要步骤2来过滤。(那些未出现在活跃事务数组中的xid一定>= xmax,后面会具体说明)。

下面我们来看看事务可见性判断的实现函数,XidInMVCCSnapshot。

XidInMVCCSnapshot

bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	uint32		i;

	/*
	 * Make a quick range check to eliminate most XIDs without looking at the
	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
	 * its parent below, because a subxact with XID < xmin has surely also got
	 * a parent with XID < xmin, while one with XID >= xmax must belong to a
	 * parent that was not yet committed at the time of this snapshot.
	 */

	/* 
	 * Any xid < xmin is not in-progress 
	 * 步骤1:比较xid与xmin
	 */
	if (TransactionIdPrecedes(xid, snapshot->xmin))
		return false;
	/* 
	 * Any xid >= xmax is in-progress 
	 * 步骤2:比较xid与xmax
	 */
	if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
		return true;

	/*
	 * Snapshot information is stored slightly differently in snapshots taken
	 * during recovery.
	 */
	if (!snapshot->takenDuringRecovery)
	{
		/*
		 * If the snapshot contains full subxact data, the fastest way to
		 * check things is just to compare the given XID against both subxact
		 * XIDs and top-level XIDs.  If the snapshot overflowed, we have to
		 * use pg_subtrans to convert a subxact XID to its parent XID, but
		 * then we need only look at top-level XIDs not subxacts.
		 */
		if (!snapshot->suboverflowed)
		{
			/* we have full data, so search subxip */
			int32		j;

			for (j = 0; j < snapshot->subxcnt; j++)
			{
				if (TransactionIdEquals(xid, snapshot->subxip[j]))
					return true;
			}

			/* not there, fall through to search xip[] */
		}
		else
		{
			/*
			 * Snapshot overflowed, so convert xid to top-level.  This is safe
			 * because we eliminated too-old XIDs above.
			 */
			xid = SubTransGetTopmostTransaction(xid);

			/*
			 * If xid was indeed a subxact, we might now have an xid < xmin,
			 * so recheck to avoid an array scan.  No point in rechecking
			 * xmax.
			 */
			if (TransactionIdPrecedes(xid, snapshot->xmin))
				return false;
		}
		
        /* 步骤3:在活跃事务数组中查找xid */
		for (i = 0; i < snapshot->xcnt; i++)
		{
			if (TransactionIdEquals(xid, snapshot->xip[i]))
				return true;
		}
	}
	else
	{
		int32		j;

		/*
		 * In recovery we store all xids in the subxact array because it is by
		 * far the bigger array, and we mostly don't know which xids are
		 * top-level and which are subxacts. The xip array is empty.
		 *
		 * We start by searching subtrans, if we overflowed.
		 */
		if (snapshot->suboverflowed)
		{
			/*
			 * Snapshot overflowed, so convert xid to top-level.  This is safe
			 * because we eliminated too-old XIDs above.
			 */
			xid = SubTransGetTopmostTransaction(xid);

			/*
			 * If xid was indeed a subxact, we might now have an xid < xmin,
			 * so recheck to avoid an array scan.  No point in rechecking
			 * xmax.
			 */
			if (TransactionIdPrecedes(xid, snapshot->xmin))
				return false;
		}

		/*
		 * We now have either a top-level xid higher than xmin or an
		 * indeterminate xid. We don't know whether it's top level or subxact
		 * but it doesn't matter. If it's present, the xid is visible.
		 */
		for (j = 0; j < snapshot->subxcnt; j++)
		{
			if (TransactionIdEquals(xid, snapshot->subxip[j]))
				return true;
		}
	}

	return false;
}

这里需要注意一下XidInMVCCSnapshot的返回值,XidInMVCCSnapshot返回true表示xid对当前事务不可见

元组可见性

说完事务可见性,我们来看看元组可见性,元组可见性是基于事务可见性的。在PostgreSQL中,每条元组上记录了两个xid,t_xmint_xmax注意这里的t_xmin、t_xmax和前面事务可见性的xmin、xmax完全不是一回事!!在元组中,t_xmin表示对该元组执行插入操作的事务的xid,t_xmax表示对该元组执行删除操作的事务的xid,如果元组没有被删除t_xmax为0。在PostgreSQL中,update操作删除+插入操作,先删除原始的元组,再插入新的元组,所以t_xmin和t_xmax逻辑也完全适用于更新操作。下面我们先从源代码的角度来调试下t_xmin和t_xmax。

用例1:元组插入

-- 建表
drop table if exists t1;
create table t1(a int);
-- 插入
begin;
select txid_current(); 		-- 查看当前事务id
insert into t1 values(1);	-- 该元组的t_xmin等于txid_current(),t_xmax为0
commit;
-- 查询
select * from t1;			-- 验证t_xmin和t_xmax

插入操作的事务id如下:

PostgreSQL 事务—MVCC_第1张图片

执行查询操作,我们可以看到元组的xmin和xmax的情况,如下图:

PostgreSQL 事务—MVCC_第2张图片

HeapTupleSatisfiesMVCC是判断元组可见性的函数,后面我们会详细阐述。从图中可以看出,元组的t_xmin为688,就是前面插入事务的xid,而由于此时元组没有被删除,所以t_xmax为0。

用例2:元组更新

接着我们把用例1插入的元组进行更新。

-- 更新
begin;
select txid_current(); 		-- 查看当前事务id
update t1 set a = 2;		-- 该操作会将a = 1的元组的t_xmax变为txid_current(),
							-- 然后插入a = 2的元组,该元组的t_xmin为txid_current(),t_xmax为0
commit;
-- 查询
select * from t1;			-- 验证t_xmin和t_xmax

执行查询操作,此时我们会看到两条元组,由于是全表遍历,所以会先看到原始元组(即a = 1的元组),然后看到更新后的元组(即a = 2的元组)。

PostgreSQL 事务—MVCC_第3张图片

  • 原始元组

PostgreSQL 事务—MVCC_第4张图片

可以看到原始元组的t_xmax被改为了698,正是此次update操作的事务id。

  • 新元组

PostgreSQL 事务—MVCC_第5张图片

这是update操作插入的新元组,新元组的t_xmin为此次update操作的事务id,新元组的t_xmax为0。

用例3:元组删除

接着我们把用例2更新的元组进行删除。

-- 删除
begin;
select txid_current(); 		-- 查看当前事务id
delete from t1;				-- 该操作会将元组的t_xmax变为txid_current(),
commit;
-- 查询
select * from t1;			-- 验证t_xmin和t_xmax

明白了更新操作,删除其实就非常简单了,删除操作会修改元组的t_xmax。

PostgreSQL 事务—MVCC_第6张图片
PostgreSQL 事务—MVCC_第7张图片

明白了元组的t_xmin和t_xmax,再来谈元组的可见性其实就非常简单了。

  1. 如果某元组的t_xmin对当前事务不可见,那么该元组对当前事务不可见。

    这是显而易见的,t_xmin不可见意味着,插入这条元组的事务对当前事务不可见,自然该元组对当前事务也就不可见。

  2. 如果元组的t_xmax对当前事务可见,那么该元组对当前事务不可见。

    t_xmax对当前事务可见,就意味着删除对当前事务可见,删除既然可见,就说明了对当前事务而言,该元组已经被删除了,元组既然被删除自然也就不可见了。

下面我们来看看HeapTupleSatisfiesMVCC函数的实现。

HeapTupleSatisfiesMVCC

在文件tqual.h和tqual.c中定义了一系列HeapTupleSatisfies函数,如下图所示:

PostgreSQL 事务—MVCC_第8张图片

这些函数用于在不同情况下判断元组的可见性。而在查询时用于判断元组可见性的函数为HeapTupleSatisfiesMVCC,其实现如下:

bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
					   Buffer buffer)
{
	HeapTupleHeader tuple = htup->t_data;

	Assert(ItemPointerIsValid(&htup->t_self));
	Assert(htup->t_tableOid != InvalidOid);

    /* 关键函数,暂时忽略 */
	if (!HeapTupleHeaderXminCommitted(tuple))
	{
        /* 关键函数,暂时忽略 */
		if (HeapTupleHeaderXminInvalid(tuple))
			return false;

		/* 
		 * Used by pre-9.0 binary upgrades 
		 * 兼容老版本,可以忽略
		 */
		if (tuple->t_infomask & HEAP_MOVED_OFF)
		{
			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);

			if (TransactionIdIsCurrentTransactionId(xvac))
				return false;
			if (!XidInMVCCSnapshot(xvac, snapshot))
			{
				if (TransactionIdDidCommit(xvac))
				{
					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
								InvalidTransactionId);
					return false;
				}
				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
							InvalidTransactionId);
			}
		}
		/* 
		 * Used by pre-9.0 binary upgrades 
		 * 兼容老版本,可以忽略
		 */
		else if (tuple->t_infomask & HEAP_MOVED_IN)
		{
			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);

			if (!TransactionIdIsCurrentTransactionId(xvac))
			{
				if (XidInMVCCSnapshot(xvac, snapshot))
					return false;
				if (TransactionIdDidCommit(xvac))
					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
								InvalidTransactionId);
				else
				{
					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
								InvalidTransactionId);
					return false;
				}
			}
		}
        /* 判断该元组是否由当前事务插入 */
		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
		{
			if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
				return false;	/* inserted after scan started */

			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
				return true;

			if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))	/* not deleter */
				return true;

			if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
			{
				TransactionId xmax;

				xmax = HeapTupleGetUpdateXid(tuple);

				/* not LOCKED_ONLY, so it has to have an xmax */
				Assert(TransactionIdIsValid(xmax));

				/* updating subtransaction must have aborted */
				if (!TransactionIdIsCurrentTransactionId(xmax))
					return true;
				else if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
					return true;	/* updated after scan started */
				else
					return false;		/* updated before scan started */
			}

			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
			{
				/* deleting subtransaction must have aborted */
				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
							InvalidTransactionId);
				return true;
			}

			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
				return true;	/* deleted after scan started */
			else
				return false;	/* deleted before scan started */
		}
        /* 判断元组t_xmin的可见性 */
		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
			return false; /* 如果t_xmin不可见,则元组不可见 */
         /* 关键函数,暂时忽略 */
		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
						HeapTupleHeaderGetRawXmin(tuple));
		else
		{
			/* it must have aborted or crashed */
			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
						InvalidTransactionId);
			return false;
		}
	}
	else
	{
		/* 
		 * xmin is committed, but maybe not according to our snapshot 
		 * 判断元组t_xmin的可见性,如果t_xmin不可见,则元组不可见
		 */
		if (!HeapTupleHeaderXminFrozen(tuple) &&
			XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
			return false;		/* treat as still in progress */
	}

	/* by here, the inserting transaction has committed */

	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
		return true;

	if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
		return true;

	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
	{
		TransactionId xmax;

		/* already checked above */
		Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));

		xmax = HeapTupleGetUpdateXid(tuple);

		/* not LOCKED_ONLY, so it has to have an xmax */
		Assert(TransactionIdIsValid(xmax));

		if (TransactionIdIsCurrentTransactionId(xmax))
		{
			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
				return true;	/* deleted after scan started */
			else
				return false;	/* deleted before scan started */
		}
		if (XidInMVCCSnapshot(xmax, snapshot))
			return true;
		if (TransactionIdDidCommit(xmax))
			return false;		/* updating transaction committed */
		/* it must have aborted or crashed */
		return true;
	}

	if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
	{
		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
		{
			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
				return true;	/* deleted after scan started */
			else
				return false;	/* deleted before scan started */
		}
		/* 判断t_xmax的可见性, t_xmax不可见,则元组可见*/
		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
			return true;

		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
		{
			/* it must have aborted or crashed */
			SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
						InvalidTransactionId);
			return true;
		}

		/* xmax transaction committed */
		SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
					HeapTupleHeaderGetRawXmax(tuple));
	}
	else
	{
		/* 
		 * xmax is committed, but maybe not according to our snapshot 
		 * 判断t_xmax的可见性, t_xmax不可见,则元组可见
		 */
		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
			return true;		/* treat as still in progress */
	}

	/* xmax transaction committed */

	return false;
}

我们忽略了HeapTupleSatisfiesMVCC中很多重要的问题,比如HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid以及TransactionIdDidCommit的作用,关于这几个函数,会在本文最后进行说明,现在我们只关注对于t_xmin和t_xmax的判断。回到前面在MVCC中我们提出的可见性的两个要素:

  • 在当前事务开启之前提交。
  • 最新版本。

其实前面的描述,通过事务的可见性判断规则,我们可以判断出一个事务是否在当前事务开启前提交。那么如何获取元组的最新版本呢?其实从元组的可见性部分,我们不难看出根本不存在这个问题,对于一个事务来讲,只可能有一个版本的元组对其可见。为什么?首先,元组存在多个版本的原因只可能是对改元组执行了多次更新,从前面的描述中不难发现,更新操作有一个特点,原始元组的t_xmax与新元组的t_xmin相同,而t_xmax和t_xmin对于元组的可见性有相反的作用。所以,要么原始元组与新元组都不可见,要么只有一个可见!

再议XidInMVCCSnapshot

前面我们讲解了XidInMVCCSnapshot作用与实现,现在我们来讨论一下XidInMVCCSnapshot的名字,哈哈!我们分解下这个函数名Xid_In_MVCC_Snapshot。xid已经介绍过,in就是字面意思,MVCC也已经介绍过,那么什么是Snapshot?PostgreSQL可是出了名的见名知意啊。

在解释Snapshot之前,我们来回顾下heapgetpage的流程:获取页面中一条元组,判断元组可见性,获取下一条元组,判断元组可见性,循环往复直到页面尽头。通过前面的描述我们知道,元组可见性的判断需要依赖三个元素:

xmin、xmax、活跃事务数组。这三个元组会随着数据库事务的开启和提交实时变化。那么如果我们在查询时直接使用全局xmin、xmax和活跃事务数组来判断可见性可以么?考虑一个简单的场景:

-- 事务T1
begin;
insert into table1 values(1),(2),(3),(4),(5);
commit;
-- 事务T2
select * from table1;

事务T1向table1中插入了5条元组。假设事务T2在全表遍历时使用全局事务数组来进行可见性判断,T2遍历到1,2两条元组时,事务T1还没有提交,所以1,2元组不可见,当遍历到元组3时,T1提交了,所以全局xmin、xmax和活跃事务数组发生了变化,T1不再是活跃的,于是元组3,4,5变为了可见!这显然违背了事务的原子性。所以在判断元组可见性时一定不能使用全局xmin、xmax和活跃事务数组。而是需要将xmin、xmax和活跃事务数组做一个快照,这也就是Snapshot!如此才能保证同一个事务产生的元组具有相同的可见性!

明白了Snapshot的作用之后,其实就非常好理解事务的隔离级别了。

  1. read-uncommit

    脏读:不判断元组可见性,只要存在就读。这个比较非主流,一般没有数据库会支持这个级别,也没这种需求。

  2. read-commit

    提交读:在每次执行SQL语句之前,做一次Snapshot,保证语句执行过程中可见性是一致的,但同一个事务语句与语句之前的Snapshot可能不一样。

  3. repeatable-read

    可重读:在事务执行第一条语句时,做一次Snapshot,后面执行的所有语句都使用这个Snapshot,保证事务执行过程中可见性是一致的。

  4. serializable

    串行化:事务之前串行执行,这个也就不用考虑可见性问题了,串行是由锁来实现。

下面我们来看看Snapshot是如何产生的。

GetTransactionSnapshot

Snapshot是通过GetTransactionSnapshot函数产生的,代码如下:

Snapshot
GetTransactionSnapshot(void)
{
	/*
	 * Return historic snapshot if doing logical decoding. We'll never need a
	 * non-historic transaction snapshot in this (sub-)transaction, so there's
	 * no need to be careful to set one up for later calls to
	 * GetTransactionSnapshot().
	 */
	if (HistoricSnapshotActive())
	{
		Assert(!FirstSnapshotSet);
		return HistoricSnapshot;
	}

	/* First call in transaction? */
	if (!FirstSnapshotSet)
	{
		/*
		 * Don't allow catalog snapshot to be older than xact snapshot.  Must
		 * do this first to allow the empty-heap Assert to succeed.
		 */
		InvalidateCatalogSnapshot();

		Assert(pairingheap_is_empty(&RegisteredSnapshots));
		Assert(FirstXactSnapshot == NULL);

		if (IsInParallelMode())
			elog(ERROR,
				 "cannot take query snapshot during a parallel operation");

		/*
		 * In transaction-snapshot mode, the first snapshot must live until
		 * end of xact regardless of what the caller does with it, so we must
		 * make a copy of it rather than returning CurrentSnapshotData
		 * directly.  Furthermore, if we're running in serializable mode,
		 * predicate.c needs to wrap the snapshot fetch in its own processing.
		 */
		if (IsolationUsesXactSnapshot())
		{
			/* First, create the snapshot in CurrentSnapshotData */
			if (IsolationIsSerializable())
				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
			else
				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
			/* Make a saved copy */
			CurrentSnapshot = CopySnapshot(CurrentSnapshot);
			FirstXactSnapshot = CurrentSnapshot;
			/* Mark it as "registered" in FirstXactSnapshot */
			FirstXactSnapshot->regd_count++;
			pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
		}
		else
			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

		FirstSnapshotSet = true;
		return CurrentSnapshot;
	}

	if (IsolationUsesXactSnapshot())
		return CurrentSnapshot;

	/* Don't allow catalog snapshot to be older than xact snapshot. */
	InvalidateCatalogSnapshot();

	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

	return CurrentSnapshot;
}

该函数有两个要点:

  1. FirstSnapshotSet变量

    该变量用来控制是直接只用当前的snapshot还是重新生成snapshot。初始值为false,生成snapshot后设置为true。如果隔离级别是read-commit,则当sql执行完成后会将FirstSnapshotSet设为false,如果是repeatable-read只有在事务结束后才会将FirstSnapshotSet设为false。通过监视FirstSnapshotSet变量的变化(添加数据断点&FirstSnapshotSet)可以观察语句结束和事务提交的情况。

  2. GetSnapshotData函数

    该函数用于生成snapshot,下面会详细讲解。

GetSnapshotData

GetSnapshotData是实际生成snapshot的函数,代码如下:

Snapshot
GetSnapshotData(Snapshot snapshot)
{
	ProcArrayStruct *arrayP = procArray;
	TransactionId xmin;
	TransactionId xmax;
	TransactionId globalxmin;
	int			index;
	int			count = 0;
	int			subcount = 0;
	bool		suboverflowed = false;
	volatile TransactionId replication_slot_xmin = InvalidTransactionId;
	volatile TransactionId replication_slot_catalog_xmin = InvalidTransactionId;

	Assert(snapshot != NULL);

	/*
	 * Allocating space for maxProcs xids is usually overkill; numProcs would
	 * be sufficient.  But it seems better to do the malloc while not holding
	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
	 * more subxip storage than is probably needed.
	 *
	 * This does open a possibility for avoiding repeated malloc/free: since
	 * maxProcs does not change at runtime, we can simply reuse the previous
	 * xip arrays if any.  (This relies on the fact that all callers pass
	 * static SnapshotData structs.)
	 *
	 * 要点1
	 */
	if (snapshot->xip == NULL)
	{
		/*
		 * First call for this snapshot. Snapshot is same size whether or not
		 * we are in recovery, see later comments.
		 */
		snapshot->xip = (TransactionId *)
			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
		if (snapshot->xip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		Assert(snapshot->subxip == NULL);
		snapshot->subxip = (TransactionId *)
			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
		if (snapshot->subxip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
	}

	/*
	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
	 * going to set MyPgXact->xmin.
	 * 要点2
	 */
	LWLockAcquire(ProcArrayLock, LW_SHARED);

	/* 
	 * xmax is always latestCompletedXid + 1 
	 * 要点3
	 */
	xmax = ShmemVariableCache->latestCompletedXid;
	Assert(TransactionIdIsNormal(xmax));
	TransactionIdAdvance(xmax);

	/* initialize xmin calculation with xmax */
	globalxmin = xmin = xmax;

	snapshot->takenDuringRecovery = RecoveryInProgress();

	if (!snapshot->takenDuringRecovery)
	{
		int		   *pgprocnos = arrayP->pgprocnos;
		int			numProcs;

		/*
		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
		 * to gather all active xids, find the lowest xmin, and try to record
		 * subxids.
		 * 要点4
		 */
		numProcs = arrayP->numProcs;
		for (index = 0; index < numProcs; index++)
		{
			int			pgprocno = pgprocnos[index];
            /* 要点5 */
			volatile PGXACT *pgxact = &allPgXact[pgprocno];
			TransactionId xid;

			/*
			 * Backend is doing logical decoding which manages xmin
			 * separately, check below.
			 */
			if (pgxact->vacuumFlags & PROC_IN_LOGICAL_DECODING)
				continue;

			/* Ignore procs running LAZY VACUUM */
			if (pgxact->vacuumFlags & PROC_IN_VACUUM)
				continue;

			/* Update globalxmin to be the smallest valid xmin */
			xid = pgxact->xmin; /* fetch just once */
			if (TransactionIdIsNormal(xid) &&
				NormalTransactionIdPrecedes(xid, globalxmin))
				globalxmin = xid;

			/* 
			 * Fetch xid just once - see GetNewTransactionId 
			 * 要点6
			 */
			xid = pgxact->xid;

			/*
			 * If the transaction has no XID assigned, we can skip it; it
			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
			 * skip it; such transactions will be treated as running anyway
			 * (and any sub-XIDs will also be >= xmax).
			 */
			if (!TransactionIdIsNormal(xid)
				|| !NormalTransactionIdPrecedes(xid, xmax))
				continue;

			/*
			 * We don't include our own XIDs (if any) in the snapshot, but we
			 * must include them in xmin.
			 * 要点7
			 */
			if (NormalTransactionIdPrecedes(xid, xmin))
				xmin = xid;
			if (pgxact == MyPgXact)
				continue;

			/* 
			 * Add XID to snapshot. 
			 * 要点8
			 */
			snapshot->xip[count++] = xid;

			/*
			 * Save subtransaction XIDs if possible (if we've already
			 * overflowed, there's no point).  Note that the subxact XIDs must
			 * be later than their parent, so no need to check them against
			 * xmin.  We could filter against xmax, but it seems better not to
			 * do that much work while holding the ProcArrayLock.
			 *
			 * The other backend can add more subxids concurrently, but cannot
			 * remove any.  Hence it's important to fetch nxids just once.
			 * Should be safe to use memcpy, though.  (We needn't worry about
			 * missing any xids added concurrently, because they must postdate
			 * xmax.)
			 *
			 * Again, our own XIDs are not included in the snapshot.
			 */
			if (!suboverflowed)
			{
				if (pgxact->overflowed)
					suboverflowed = true;
				else
				{
					int			nxids = pgxact->nxids;

					if (nxids > 0)
					{
						volatile PGPROC *proc = &allProcs[pgprocno];

						memcpy(snapshot->subxip + subcount,
							   (void *) proc->subxids.xids,
							   nxids * sizeof(TransactionId));
						subcount += nxids;
					}
				}
			}
		}
	}
	else
	{
		/*
		 * We're in hot standby, so get XIDs from KnownAssignedXids.
		 *
		 * We store all xids directly into subxip[]. Here's why:
		 *
		 * In recovery we don't know which xids are top-level and which are
		 * subxacts, a design choice that greatly simplifies xid processing.
		 *
		 * It seems like we would want to try to put xids into xip[] only, but
		 * that is fairly small. We would either need to make that bigger or
		 * to increase the rate at which we WAL-log xid assignment; neither is
		 * an appealing choice.
		 *
		 * We could try to store xids into xip[] first and then into subxip[]
		 * if there are too many xids. That only works if the snapshot doesn't
		 * overflow because we do not search subxip[] in that case. A simpler
		 * way is to just store all xids in the subxact array because this is
		 * by far the bigger array. We just leave the xip array empty.
		 *
		 * Either way we need to change the way XidInMVCCSnapshot() works
		 * depending upon when the snapshot was taken, or change normal
		 * snapshot processing so it matches.
		 *
		 * Note: It is possible for recovery to end before we finish taking
		 * the snapshot, and for newly assigned transaction ids to be added to
		 * the ProcArray.  xmax cannot change while we hold ProcArrayLock, so
		 * those newly added transaction ids would be filtered away, so we
		 * need not be concerned about them.
		 */
		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
												  xmax);

		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
			suboverflowed = true;
	}


	/* fetch into volatile var while ProcArrayLock is held */
	replication_slot_xmin = procArray->replication_slot_xmin;
	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;

	if (!TransactionIdIsValid(MyPgXact->xmin))
		MyPgXact->xmin = TransactionXmin = xmin;

	LWLockRelease(ProcArrayLock);

	/*
	 * Update globalxmin to include actual process xids.  This is a slightly
	 * different way of computing it than GetOldestXmin uses, but should give
	 * the same result.
	 */
	if (TransactionIdPrecedes(xmin, globalxmin))
		globalxmin = xmin;

	/* Update global variables too */
	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
	if (!TransactionIdIsNormal(RecentGlobalXmin))
		RecentGlobalXmin = FirstNormalTransactionId;

	/* Check whether there's a replication slot requiring an older xmin. */
	if (TransactionIdIsValid(replication_slot_xmin) &&
		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
		RecentGlobalXmin = replication_slot_xmin;

	/* Non-catalog tables can be vacuumed if older than this xid */
	RecentGlobalDataXmin = RecentGlobalXmin;

	/*
	 * Check whether there's a replication slot requiring an older catalog
	 * xmin.
	 */
	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
		RecentGlobalXmin = replication_slot_catalog_xmin;

	RecentXmin = xmin;

	snapshot->xmin = xmin;
	snapshot->xmax = xmax;
	snapshot->xcnt = count;
	snapshot->subxcnt = subcount;
	snapshot->suboverflowed = suboverflowed;

	snapshot->curcid = GetCurrentCommandId(false);

	/*
	 * This is a new snapshot, so set both refcounts are zero, and mark it as
	 * not copied in persistent memory.
	 */
	snapshot->active_count = 0;
	snapshot->regd_count = 0;
	snapshot->copied = false;

	if (old_snapshot_threshold < 0)
	{
		/*
		 * If not using "snapshot too old" feature, fill related fields with
		 * dummy values that don't require any locking.
		 */
		snapshot->lsn = InvalidXLogRecPtr;
		snapshot->whenTaken = 0;
	}
	else
	{
		/*
		 * Capture the current time and WAL stream location in case this
		 * snapshot becomes old enough to need to fall back on the special
		 * "old snapshot" logic.
		 */
		snapshot->lsn = GetXLogInsertRecPtr();
		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
	}

	return snapshot;
}

该函数较长,我们只关注其中的要点。

  1. 要点1:if (snapshot->xip == NULL)

    xip是活跃事务链表,在客户端连接服务端后,创建服务进程时进行初始化。具体可以调试服务进程的创建流程。在snapshot->xip创建时会分配GetMaxSnapshotXidCount()个xid的空间,GetMaxSnapshotXidCount()是事务数的上限,如此初始化后就可以直接使用这块空间,而不必每次创建时都需要分配空间,从而提升了性能。

  2. 要点2:LWLockAcquire

    既然要产生snapshot,显然就要对全局活动事务链进行上锁。

  3. 要点3:获取xmax

    这个步骤前面已经提到过了。

  4. 要点4:numProcs

    numProcs当前PostgreSQL的用户连接数,不管该用户连接是否开启了事务。

  5. 要点5:allPgXact

    存储用户的事务信息,allPgXact中元组个数与numProcs相同。

  6. 要点6:pgxact->xid

    如果当前连接没有创建事务,那么pgxact->xid就为0。

  7. 要点7:xmin = xid

    获取xmin,snapshot的活动事务数组中不包含自己,但xmin可能会是自己。

  8. 要点8:snapshot->xip

    向活动事务数组中添加xid。

CLOG

最后,我们来解决前面遗留的问题:HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid以及TransactionIdDidCommit的作用。在《PostgreSQL重启恢复—Checkpoint&Redo》中,我们提到过,PostgreSQL在重启恢复时只会无脑地redo所有记录在XLOG中的数据,而不管这些数据对应的事务是否提交。所以,完成重启恢复后,数据库中就包含了一些未提交的事务产生的数据。在我们执行查询操作时,需要过滤掉这些数据。所以现在的问题就变成了:如何在查询时,过滤掉未提交事务产生的数据。所以我们需要一种机制。来判断一条元组上t_xmin和t_xmax是否提交。于是就有了CLOG。

CLOG是一种用于记录事务状态的日志,有如下四种状态:

#define TRANSACTION_STATUS_IN_PROGRESS		0x00
#define TRANSACTION_STATUS_COMMITTED		0x01
#define TRANSACTION_STATUS_ABORTED			0x02
#define TRANSACTION_STATUS_SUB_COMMITTED	0x03

当开启一个事务时,事务的状态为TRANSACTION_STATUS_IN_PROGRESS,当事务提交后状态为TRANSACTION_STATUS_COMMITTED,当用户执行了rollback后事务的状态为TRANSACTION_STATUS_ABORTED。所以通过CLOG可以非常方便的判断事务是否提交,从而判断事务的可见性。而TransactionIdDidCommit就是通过CLOG来判断事务是否提交的。

然而,CLOG毕竟是日志,访问日志通常是很耗时的,所以如果每次需要判断元组可见性时,都去访问CLOG,那么效率是十分低下的。所以在PostgreSQL中,只有第一次判断元组可见性时,才调用TransactionIdDidCommit读取CLOG,获取可见性之后,就调用SetHintBits将结果存储到元组的t_infomask成员中。在后面的判断中只需要访问t_infomask就可以判断一条元组上t_xmin和t_xmax是否提交,而这就是HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid函数的作用。

延伸

前面说到,在查询时,会通过SetHintBits来设置元组的t_infomask成员。而这个操作,是不会产生XLOG的,原因是,我们不需要保证这个操作的持久性。因为即便是系统发生故障,导致修改后的t_infomask没有落盘,也没有关系,判断可见性时再从CLOG中取一次便好。

那如果CLOG不见了呢?我测试过的情况是这样,如果直接把CLOG删了,数据库是启不起来的。但是可以自己构建一个CLOG让数据库启起来,但是元组的可见性肯定就会出问题。

你可能感兴趣的:(postgresql,postgresql,数据库)