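The Postgres2015 National User Conference will be held on November 20 and 21 at the Beijing Liting Huayuan Hotel (北京丽亭华苑酒店). The conference has a strong speaker lineup: China's top PostgreSQL experts will all attend, joined by invited database experts from Europe, Russia, Japan, the United States and other countries and regions.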
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed 256 */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted 512 */
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed 1024 */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted 2048 */
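Each hint bit is a separate flag in t_infomask, so a raw value like the t_infomask=2048 printed by the traces later in this post can be decoded with plain bit tests. Here is a minimal sketch. describe_hint_bits is a hypothetical helper written for illustration, not a PostgreSQL function, and it only recognizes the four bits defined above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hint-bit values, copied from the #defines above. */
#define HEAP_XMIN_COMMITTED 0x0100
#define HEAP_XMIN_INVALID   0x0200
#define HEAP_XMAX_COMMITTED 0x0400
#define HEAP_XMAX_INVALID   0x0800

/* Hypothetical helper (not part of PostgreSQL): writes the names of the hint
 * bits set in infomask into buf, separated by '|', or "none" if no hint bit
 * is set. All other t_infomask bits are ignored; output is truncated to len. */
static void describe_hint_bits(uint16_t infomask, char *buf, size_t len)
{
    static const struct { uint16_t bit; const char *name; } hints[] = {
        { HEAP_XMIN_COMMITTED, "XMIN_COMMITTED" },
        { HEAP_XMIN_INVALID,   "XMIN_INVALID"   },
        { HEAP_XMAX_COMMITTED, "XMAX_COMMITTED" },
        { HEAP_XMAX_INVALID,   "XMAX_INVALID"   },
    };

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof hints / sizeof hints[0]; i++) {
        if (!(infomask & hints[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);      /* separator */
        strncat(buf, hints[i].name, len - strlen(buf) - 1);
    }
    if (buf[0] == '\0')
        strncat(buf, "none", len - 1);
}
```

For example, describe_hint_bits(2048, buf, sizeof buf) produces "XMAX_INVALID". That is the state of a freshly inserted tuple: its xmax is 0, so XMAX_INVALID is set at insert time. XMIN_COMMITTED (256) is added only later, by SetHintBits().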
If neither of the XMIN bits is set, then either:
1. The creating transaction is still in progress, which you can check by examining the list of running transactions in shared memory; or
2. You are the first one to check since it ended, in which case you need to consult pg_clog to know the transaction's status, and you can update the hint bits if you find out its final state.
If the tuple has been marked deleted, then similar remarks apply to the XMAX bits.
/*
 *
 * tqual.c
 *    POSTGRES "time qualification" code, ie, tuple visibility rules.
 *
 * NOTE: all the HeapTupleSatisfies routines will update the tuple's
 * "hint" status bits if we see that the inserting or deleting transaction
 * has now committed or aborted (and it is safe to set the hint bits).
 * If the hint bits are changed, MarkBufferDirtyHint is called on
 * the passed-in buffer.  The caller must hold not only a pin, but at least
 * shared buffer content lock on the buffer containing the tuple.
 *
 * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
 * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
 * pg_clog).  Otherwise we have a race condition: we might decide that a
 * just-committed transaction crashed, because none of the tests succeed.
 * xact.c is careful to record commit/abort in pg_clog before it unsets
 * MyPgXact->xid in PGXACT array.  That fixes that problem, but it also
 * means there is a window where TransactionIdIsInProgress and
 * TransactionIdDidCommit will both return true.  If we check only
 * TransactionIdDidCommit, we could consider a tuple committed when a
 * later GetSnapshotData call will still think the originating transaction
 * is in progress, which leads to application-level inconsistency.  The
 * upshot is that we gotta check TransactionIdIsInProgress first in all
 * code paths, except for a few cases where we are looking at
 * subtransactions of our own main transaction and so there can't be any
 * race condition.
 *
 * Summary of visibility functions:
 *
 *   HeapTupleSatisfiesMVCC()
 *        visible to supplied snapshot, excludes current command
 *   HeapTupleSatisfiesUpdate()
 *        visible to instant snapshot, with user-supplied command
 *        counter and more complex result
 *   HeapTupleSatisfiesSelf()
 *        visible to instant snapshot and current command
 *   HeapTupleSatisfiesDirty()
 *        like HeapTupleSatisfiesSelf(), but includes open transactions
 *   HeapTupleSatisfiesVacuum()
 *        visible to any running transaction, used by VACUUM
 *   HeapTupleSatisfiesToast()
 *        visible unless part of interrupted vacuum, used for TOAST
 *   HeapTupleSatisfiesAny()
 *        all tuples are visible
 *
 * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *    src/backend/utils/time/tqual.c
 *
 */
/*
 * SetHintBits()
 *
 * Set commit/abort hint bits on a tuple, if appropriate at this time.
 *
 * It is only safe to set a transaction-committed hint bit if we know the
 * transaction's commit record has been flushed to disk, or if the table is
 * temporary or unlogged and will be obliterated by a crash anyway.  We
 * cannot change the LSN of the page here because we may hold only a share
 * lock on the buffer, so we can't use the LSN to interlock this; we have to
 * just refrain from setting the hint bit until some future re-examination
 * of the tuple.
 *
 * We can always set hint bits when marking a transaction aborted.  (Some
 * code in heapam.c relies on that!)
 *
 * Also, if we are cleaning up HEAP_MOVED_IN or HEAP_MOVED_OFF entries, then
 * we can always set the hint bits, since pre-9.0 VACUUM FULL always used
 * synchronous commits and didn't move tuples that weren't previously
 * hinted.  (This is not known by this subroutine, but is applied by its
 * callers.)  Note: old-style VACUUM FULL is gone, but we have to keep this
 * module's support for MOVED_OFF/MOVED_IN flag bits for as long as we
 * support in-place update from pre-9.0 databases.
 *
 * Normal commits may be asynchronous, so for those we need to get the LSN
 * of the transaction and then check whether this is flushed.
 *
 * The caller should pass xid as the XID of the transaction to check, or
 * InvalidTransactionId if no check is needed.
 */
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    if (TransactionIdIsValid(xid))
    {
        /* NB: xid must be known committed here! */
        XLogRecPtr  commitLSN = TransactionIdGetCommitLSN(xid);  // get the commit LSN of the transaction

        // Before setting a hint bit, the transaction's WAL must already be flushed to disk.
        // Otherwise the data can become inconsistent: after a crash the commit record could be
        // missing from WAL while the tuple's hint bit still says the transaction committed.
        if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
            return;             /* not flushed yet, so don't set hint */
    }

    tuple->t_infomask |= infomask;  // set the hint bits
    // Mark the buffer dirty. If data checksums were enabled at initdb or wal_log_hints is on,
    // and this is the page's first modification since the last checkpoint, a full-page image
    // is written to WAL.
    MarkBufferDirtyHint(buffer, true);
}
postgres=# truncate t;
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           5497
(1 row)
[root@digoal ~]# cat trc.stp
global f_start[999999]

probe process("/opt/pgsql/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c").call {
  f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
  # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  # printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$parms$$)
  printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 5497
postgres=# insert into t values (1);
INSERT 0 1
postgres=# select * from t;
 id
----
  1
(1 row)
71259406 postgres(5497): <- time:1441448520839, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734170, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734170, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=1}, .t_infomask2=1, .t_infomask=2048, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734170
71259458 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
postgres=# insert into t values (2);
INSERT 0 1
postgres=# update t set id=3;
UPDATE 2
5356459357 postgres(5497): <- time:1441453806039, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734178, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734178, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=2}, .t_infomask2=1, .t_infomask=2048, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734178
5356459410 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
postgres=# select * from t;
 id
----
  3
  3
(2 rows)
5464475078 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734177, .t_xmax=390734179, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734177, .datum_typmod=390734179, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=3}, .t_infomask2=16385, .t_infomask=256, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=1024 xid=390734179
5464475132 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
5464475156 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734178, .t_xmax=390734179, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734178, .datum_typmod=390734179, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=4}, .t_infomask2=16385, .t_infomask=256, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=1024 xid=390734179
5464475190 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
5464475210 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734179, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734179, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=3}, .t_infomask2=32769, .t_infomask=10240, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734179
5464475243 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
5464475263 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734179, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734179, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=4}, .t_infomask2=32769, .t_infomask=10240, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734179
5464475294 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
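The other header fields in these traces decode the same way. The sketch below assumes the PostgreSQL 9.4 flag values from src/include/access/htup_details.h: HEAP_UPDATED in t_infomask, and HEAP_NATTS_MASK, HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE in t_infomask2. HeaderFlags and decode_header_flags are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in PostgreSQL 9.4 (src/include/access/htup_details.h). */
#define HEAP_UPDATED     0x2000  /* t_infomask: this is an UPDATEd version of a row */
#define HEAP_NATTS_MASK  0x07FF  /* t_infomask2: number of attributes */
#define HEAP_HOT_UPDATED 0x4000  /* t_infomask2: tuple was HOT-updated */
#define HEAP_ONLY_TUPLE  0x8000  /* t_infomask2: this is a heap-only tuple */

/* Hypothetical decoded view of the non-hint flags seen in the traces. */
typedef struct {
    unsigned natts;
    bool     updated_version;
    bool     hot_updated;
    bool     heap_only;
} HeaderFlags;

static HeaderFlags decode_header_flags(uint16_t infomask, uint16_t infomask2)
{
    HeaderFlags f;
    f.natts           = infomask2 & HEAP_NATTS_MASK;
    f.updated_version = (infomask & HEAP_UPDATED) != 0;
    f.hot_updated     = (infomask2 & HEAP_HOT_UPDATED) != 0;
    f.heap_only       = (infomask2 & HEAP_ONLY_TUPLE) != 0;
    return f;
}
```

The traced values decode as follows:

- t_infomask2=16385 (0x4001): a HOT-updated old row version with one attribute.
- t_infomask2=32769 (0x8001): a heap-only new row version with one attribute.
- t_infomask=10240 (0x2800): HEAP_UPDATED | HEAP_XMAX_INVALID. This is the new version produced by the UPDATE, whose xmin has not been hinted yet.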
/*
 * MarkBufferDirtyHint
 *
 *  Mark a buffer dirty for non-critical changes.
 *
 * This is essentially the same as MarkBufferDirty, except:
 *
 * 1. The caller does not write WAL; so if checksums are enabled, we may need
 *    to write an XLOG_HINT WAL record to protect against torn pages.
 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *    buffer's content lock.
 * 3. This function does not guarantee that the buffer is always marked dirty
 *    (due to a race condition), so it cannot be used for important changes.
 */
void
MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
        if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
        {
            ...
            MyPgXact->delayChkpt = delayChkpt = true;
            lsn = XLogSaveBufferForHint(buffer, buffer_std);
            ......
#define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)
/*
 * Are checksums enabled for data pages?
 */
bool
DataChecksumsEnabled(void)
{
    Assert(ControlFile != NULL);
    return (ControlFile->data_checksum_version > 0);
}
/*
 * Write a backup block if needed when we are setting a hint. Note that
 * this may be called for a variety of page types, not just heaps.
 *
 * Callable while holding just share lock on the buffer content.
 *
 * We can't use the plain backup block mechanism since that relies on the
 * Buffer being exclusively locked. Since some modifications (setting LSN, hint
 * bits) are allowed in a sharelocked buffer that can lead to wal checksum
 * failures. So instead we copy the page and insert the copied data as normal
 * record data.
 *
 * We only need to do something if page has not yet been full page written in
 * this checkpoint round. The LSN of the inserted wal record is returned if we
 * had to write, InvalidXLogRecPtr otherwise.
 *
 * It is possible that multiple concurrent backends could attempt to write WAL
 * records. In that case, multiple copies of the same block would be recorded
 * in separate WAL records by different backends, though that is still OK from
 * a correctness perspective. // note: so the same page may be written to WAL more than once
 */
XLogRecPtr
XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
{
/*
 * Determine whether the buffer referenced by an XLogRecData item has to
 * be backed up, and if so fill a BkpBlock struct for it.  In any case
 * save the buffer's LSN at *lsn.
 */
static bool
XLogCheckBuffer(XLogRecData *rdata, bool holdsExclusiveLock,
                XLogRecPtr *lsn, BkpBlock *bkpb)
{
[root@digoal ~]# cat trc.stp
global f_start[999999]

probe process("/opt/pgsql/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c").call {
  f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
  # printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
  printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}
postgres=# update t set id=4;
UPDATE 2
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from t;
 id
----
  4
  4
(2 rows)
0 postgres(5497): -> time:1441457600685, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").call
30 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").return, par:'\001'
postgres=# update t set id=5;
UPDATE 2
0 postgres(5497): -> time:1441457627431, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").call
27 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").return, par:'\000'
0 postgres(5497): -> time:1441457627431, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").call
20 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").return, par:'\000'
/*
 * UpdateXmaxHintBits - update tuple hint bits after xmax transaction ends
 *
 * This is called after we have waited for the XMAX transaction to terminate.
 * If the transaction aborted, we guarantee the XMAX_INVALID hint bit will
 * be set on exit.  If the transaction committed, we set the XMAX_COMMITTED
 * hint bit if possible --- but beware that that may not yet be possible,
 * if the transaction committed asynchronously.
 *
 * Note that if the transaction was a locker only, we set HEAP_XMAX_INVALID
 * even if it commits.
 *
 * Hence callers should look only at XMAX_INVALID.
 *
 * Note this is not allowed for tuples whose xmax is a multixact.
 */
static void
UpdateXmaxHintBits(HeapTupleHeader tuple, Buffer buffer, TransactionId xid)
{
    Assert(TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple), xid));
    Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));

    if (!(tuple->t_infomask & (HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID)))
    {
        if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
            TransactionIdDidCommit(xid))
            HeapTupleSetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
                                 xid);
        else
            HeapTupleSetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
                                 InvalidTransactionId);
    }
}