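MVCC (Multi-Version Concurrency Control): when I first came across the term I was careless and took it to mean version control for source code. Pretty silly of me. MVCC is actually about keeping data consistent under concurrent access, and developers who have written code that guards shared data across multiple threads will find it easier to grasp. Later, when blockchains tried to raise transaction throughput, some chains adopted parallel transaction execution and used MVCC-style control to manage those transactions. In MySQL, when multiple clients access the server and reads are mixed with writes, inconsistent data can be observed (dirty reads, non-repeatable reads and phantom reads). Which anomalies are possible depends on the isolation level; the two levels that matter here are RC (Read Committed) and RR (Repeatable Read), and MySQL defaults to RR. This is where MVCC comes in. Different MySQL versions may apply MVCC differently, so check the official documentation for your version; always treat the official documentation or the source code as the baseline and don't assume. If you want to dig further into data consistency in databases, I recommend "Designing Data-Intensive Applications" (《数据密集型应用系统设计》), which not only explains MVCC clearly but also analyses it at a deeper level.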
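In MySQL, MVCC is in effect only at the Read Committed and Repeatable Read isolation levels, so those are the only two cases in which discussing MVCC makes sense. To implement MVCC, the InnoDB engine adds three hidden columns to every row (Oracle and other databases do something similar):
DB_ROW_ID: a 6-byte row ID. InnoDB generates it automatically when the table has no primary key; Oracle has a similar ROWID.
DB_TRX_ID: a 6-byte transaction ID, holding the ID of the last transaction that inserted or updated the row.
DB_ROLL_PTR: a 7-byte roll pointer that points to an undo log record written to the rollback segment. It links the versions of a row together into a version chain. If transactions are left open for a long time instead of being committed regularly, old undo records cannot be purged and the rollback space fills up.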
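As a rough illustration of the sizes listed above, the following C++ sketch decodes the three hidden columns from a byte buffer. The struct, the constant names and the back-to-back layout are assumptions made for this example and are not InnoDB's actual record format. The one detail imitated from InnoDB is that its mach_read_from_* helpers read integers most significant byte first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Column sizes in bytes, as listed in the text above (names are illustrative).
constexpr std::size_t kRowIdLen = 6;
constexpr std::size_t kTrxIdLen = 6;
constexpr std::size_t kRollPtrLen = 7;

// Hypothetical decoded form of the three hidden columns.
struct HiddenColumns {
  std::uint64_t row_id;
  std::uint64_t trx_id;
  std::uint64_t roll_ptr;
};

// Read an n-byte unsigned integer stored most significant byte first.
inline std::uint64_t read_big_endian(const std::uint8_t *p, std::size_t n) {
  assert(n <= 8);
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < n; ++i) {
    value = (value << 8) | p[i];
  }
  return value;
}

// Decode a buffer that is ASSUMED to hold the three columns back to back.
inline HiddenColumns decode_hidden_columns(const std::uint8_t *buf) {
  HiddenColumns h;
  h.row_id = read_big_endian(buf, kRowIdLen);
  h.trx_id = read_big_endian(buf + kRowIdLen, kTrxIdLen);
  h.roll_ptr = read_big_endian(buf + kRowIdLen + kTrxIdLen, kRollPtrLen);
  return h;
}
```

A 6-byte transaction ID still leaves room for 2^48 values, which is why the sketch can hold every column in a 64-bit integer.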
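MVCC distinguishes two kinds of read: the snapshot read and the current read. A snapshot read takes no locks and reads only the version that is visible to it. A current read works on the latest version and takes locks; it covers INSERT, UPDATE and DELETE as well as locking reads such as SELECT ... FOR UPDATE. As for why a write counts as a "read": you have to read your way to the target row before you can modify it!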
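MySQL has two schemes for implementing transaction isolation. Besides MVCC, the main topic today, here is a brief word on the LBCC (lock-based concurrency control) scheme, which relies on two kinds of lock:
Record lock: locks an index record rather than the row itself. If the table defines no primary key index, InnoDB creates a hidden primary key index, as mentioned above, and locks that.
Gap lock: a lock on the gap between index records, or on the gap before the first or after the last record. It is mainly used to prevent phantom reads at the RR isolation level.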
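To make the idea of a "gap" concrete, here is a minimal C++ sketch that finds the gap a missing key falls into and checks whether an insert would land inside a locked gap. It models the index as a plain std::set of integers; the type and function names are hypothetical and have nothing to do with InnoDB's lock manager.

```cpp
#include <cassert>
#include <iterator>
#include <optional>
#include <set>
#include <utility>

// (lower, upper) bounds of a gap; nullopt means the gap is open-ended.
using Gap = std::pair<std::optional<int>, std::optional<int>>;

// Find the gap that a key NOT present in the index falls into.
inline Gap gap_around(const std::set<int> &index_keys, int key) {
  assert(index_keys.count(key) == 0);
  Gap gap;
  // First existing key >= key; since key is absent, this is the first key > key.
  auto next = index_keys.lower_bound(key);
  if (next != index_keys.end()) {
    gap.second = *next;
  }
  if (next != index_keys.begin()) {
    gap.first = *std::prev(next);
  }
  return gap;
}

// True if inserting `key` would land strictly inside the locked gap.
inline bool gap_blocks_insert(const Gap &locked, int key) {
  const bool above_lower = !locked.first || key > *locked.first;
  const bool below_upper = !locked.second || key < *locked.second;
  return above_lower && below_upper;
}
```

With keys 10, 20 and 30 in the index, a lock on the gap around 15 covers the open interval (10, 20), so a concurrent insert of 12 would have to wait while an insert of 25 would not. That is how a gap lock keeps new "phantom" rows from appearing inside a range that was already read.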
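You can't talk about MVCC without talking about the Read View (this thing is somewhat similar to the situation in PBFT). How a Read View is created depends on the isolation level (the RC and RR mentioned earlier): under RR, the Read View is created by the first snapshot read and reused for the rest of the transaction, whereas under RC every snapshot read generates a new Read View.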
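The difference between the two levels can be sketched in a few lines of C++. This is only a toy model: Snapshot, Transaction and view_for_statement are hypothetical names, and the "snapshot" merely records the logical time at which it was taken.

```cpp
#include <cassert>
#include <cstdint>

enum class Isolation { ReadCommitted, RepeatableRead };

// Hypothetical snapshot: it only remembers when it was taken.
struct Snapshot {
  std::uint64_t taken_at = 0;
};

class Transaction {
 public:
  explicit Transaction(Isolation level) : level_(level) {}

  // Called at the start of every snapshot-read statement.
  // `now` is a logical clock that never goes backwards.
  const Snapshot &view_for_statement(std::uint64_t now) {
    assert(now >= view_.taken_at);
    // RC: refresh on every statement. RR: create once, then reuse.
    const bool refresh =
        (level_ == Isolation::ReadCommitted) || !has_view_;
    if (refresh) {
      view_.taken_at = now;
      has_view_ = true;
    }
    return view_;
  }

 private:
  Isolation level_;
  bool has_view_ = false;
  Snapshot view_;
};
```

Under RR a second statement at time 20 still sees the snapshot taken at time 10, which is what makes the read repeatable; under RC it sees a fresh snapshot taken at time 20.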
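A query falls into one of two cases: the data it reads was modified either within the querying transaction itself or by a different transaction. In MySQL, a transaction that only reads is not assigned a transaction ID; an ID is allocated only once the transaction performs a modification (insert, update or delete), not when the transaction begins.
The difference is that within the same transaction, the data it has updated is visible to its own reads.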
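So what is a Read View? The undo log was mentioned earlier; a Read View is the read view through which the snapshot data kept there is seen. Each row version is identified by the DB_TRX_ID described above, and its DB_ROLL_PTR points to the next version in the chain, which is the older one. Anyone who has worked with linked lists in C will find this easy to follow. Transaction IDs are normally allocated by incrementing a counter, so the newest transaction has the largest ID. Once the Read View is clear, the MVCC flow can be understood: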
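1. Divide the existing transactions into three groups: committed transactions; a mixed range of uncommitted and committed transactions; and transactions that have not started yet. The boundaries are the smallest and the largest ID among the currently known active transactions, which the Read View maintains.
2. The meaning of the three groups: an ID below the minimum belongs to a transaction that has already committed, so its data is visible to the query. An ID above the maximum belongs to a transaction that had not yet started, so its data is not visible. The "uncommitted and committed" range needs a word of explanation: if the transaction ID is in the Read View's array of uncommitted transactions, the data is not visible; if it is not in that array, the data is visible. Remember, there is only one array of uncommitted transactions, and it decides this case.
3. Classify the transaction ID of the row version being read against these boundaries: below the minimum means committed; above the maximum means not started; everything else falls into the mixed range.
4. Based on the result, decide which version's data to use.
5. Process that version's data and return it.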
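The classification steps above can be sketched as a small C++ structure. This is a simplified model under stated assumptions: SimpleReadView and its field names are made up for illustration, the upper boundary is taken as "the next ID that would be assigned" (so the test is `>=` rather than "greater than the largest active ID"), and a row written by the view's own transaction is always treated as visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using TrxId = std::uint64_t;

// Hypothetical, simplified read view; the field names are illustrative.
struct SimpleReadView {
  TrxId min_active_id;            // IDs below this had committed before the view
  TrxId next_trx_id;              // IDs at or above this started after the view
  TrxId creator_id;               // the transaction that owns the view
  std::vector<TrxId> active_ids;  // sorted IDs of still-uncommitted transactions

  // Decide whether a row version written by transaction `id` is visible.
  bool is_visible(TrxId id) const {
    assert(std::is_sorted(active_ids.begin(), active_ids.end()));
    if (id < min_active_id || id == creator_id) {
      return true;  // committed before the view, or our own change
    }
    if (id >= next_trx_id) {
      return false;  // started after the view was taken
    }
    // Mixed range: visible only if the writer was NOT active at view creation.
    return !std::binary_search(active_ids.begin(), active_ids.end(), id);
  }
};
```

For a view with active IDs {5, 8}, owned by transaction 8 and with 10 as the next ID to assign: ID 3 is visible (already committed), 5 is not (still active), 6 is visible (in the middle range but committed), 8 is visible (the view's own transaction) and 10 is not (started later).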
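With that analysis in hand, let's look at how the source code implements it:
1. Basic data structures
The basic data structures are the transaction system, MVCC and the Read View: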
//storage/innobase/include
/** The transaction system central memory data structure. */
struct trx_sys_t {
TrxSysMutex mutex; /*!< mutex protecting most fields in
this structure except when noted
otherwise */
MVCC *mvcc; /*!< Multi version concurrency control
manager */
volatile trx_id_t max_trx_id; /*!< The smallest number not yet
assigned as a transaction id or
transaction number. This is declared
volatile because it can be accessed
without holding any mutex during
AC-NL-RO view creation. */
std::atomic<trx_id_t> min_active_id;
/*!< Minimal transaction id which is
still in active state. */
trx_ut_list_t serialisation_list;
/*!< Ordered on trx_t::no of all the
currently active RW transactions */
#ifdef UNIV_DEBUG
trx_id_t rw_max_trx_no; /*!< Max trx number of read-write
transactions added for purge. */
#endif /* UNIV_DEBUG */
char pad1[64]; /*!< To avoid false sharing */
trx_ut_list_t rw_trx_list; /*!< List of active and committed in
memory read-write transactions, sorted
on trx id, biggest first. Recovered
transactions are always on this list. */
char pad2[64]; /*!< To avoid false sharing */
trx_ut_list_t mysql_trx_list; /*!< List of transactions created
for MySQL. All user transactions are
on mysql_trx_list. The rw_trx_list
can contain system transactions and
recovered transactions that will not
be in the mysql_trx_list.
mysql_trx_list may additionally contain
transactions that have not yet been
started in InnoDB. */
trx_ids_t rw_trx_ids; /*!< Array of Read write transaction IDs
for MVCC snapshot. A ReadView would take
a snapshot of these transactions whose
changes are not visible to it. We should
remove transactions from the list before
committing in memory and releasing locks
to ensure right order of removal and
consistent snapshot. */
char pad3[64]; /*!< To avoid false sharing */
Rsegs rsegs; /*!< Vector of pointers to rollback
segments. These rsegs are iterated
and added to the end under a read
lock. They are deleted under a write
lock while the vector is adjusted.
They are created and destroyed in
single-threaded mode. */
Rsegs tmp_rsegs; /*!< Vector of pointers to rollback
segments within the temp tablespace;
This vector is created and destroyed
in single-threaded mode so it is not
protected by any mutex because it is
read-only during multi-threaded
operation. */
/** Length of the TRX_RSEG_HISTORY list (update undo logs for committed
* transactions). */
std::atomic<uint64_t> rseg_history_len;
TrxIdSet rw_trx_set; /*!< Mapping from transaction id
to transaction instance */
ulint n_prepared_trx; /*!< Number of transactions currently
in the XA PREPARED state */
bool found_prepared_trx; /*!< True if XA PREPARED trxs are
found. */
};
/** The MVCC read view manager */
//storage/innobase/include/read0read.h
class MVCC {
public:
/** Constructor
@param size Number of views to pre-allocate */
explicit MVCC(ulint size);
/** Destructor.
Free all the views in the m_free list */
~MVCC();
/** Allocate and create a view.
@param view View owned by this class created for the caller. Must be
freed by calling view_close()
@param trx Transaction instance of caller */
void view_open(ReadView *&view, trx_t *trx);
/**
Close a view created by the above function.
@param view view allocated by trx_open.
@param own_mutex true if caller owns trx_sys_t::mutex */
void view_close(ReadView *&view, bool own_mutex);
/**
Release a view that is inactive but not closed. Caller must own
the trx_sys_t::mutex.
@param view View to release */
void view_release(ReadView *&view);
/** Clones the oldest view and stores it in view. No need to
call view_close(). The caller owns the view that is passed in.
It will also move the closed views from the m_views list to the
m_free list. This function is called by Purge to determine whether it should
purge the delete marked record or not.
@param view Preallocated view, owned by the caller */
void clone_oldest_view(ReadView *view);
/**
@return the number of active views */
ulint size() const;
/**
@return true if the view is active and valid */
static bool is_view_active(ReadView *view) {
ut_a(view != reinterpret_cast<ReadView *>(0x1));
return (view != nullptr && !(intptr_t(view) & 0x1));
}
/**
Set the view creator transaction id. Note: This should be set only
for views created by RW transactions. */
static void set_view_creator_trx_id(ReadView *view, trx_id_t id);
private:
/**
Validates a read view list. */
bool validate() const;
/**
Find a free view from the active list, if none found then allocate
a new view. This function will also attempt to move delete marked
views from the active list to the freed list.
@return a view to use */
inline ReadView *get_view();
/**
Get the oldest view in the system. It will also move the delete
marked read views from the views list to the freed list.
@return oldest view if found or NULL */
inline ReadView *get_oldest_view() const;
ReadView *get_view_created_by_trx_id(trx_id_t trx_id) const;
private:
// Prevent copying
MVCC(const MVCC &);
MVCC &operator=(const MVCC &);
private:
typedef UT_LIST_BASE_NODE_T(ReadView) view_list_t;
/** Free views ready for reuse. */
view_list_t m_free;
/** Active and closed views, the closed views will have the
creator trx id set to TRX_ID_MAX */
view_list_t m_views;
};
/** Mapping read-write transactions from id to transaction instance, for
creating read views and during trx id lookup for MVCC and locking. */
struct TrxTrack {
explicit TrxTrack(trx_id_t id, trx_t *trx = nullptr) : m_id(id), m_trx(trx) {
// Do nothing
}
trx_id_t m_id;
trx_t *m_trx;
};
struct TrxTrackHash {
size_t operator()(const TrxTrack &key) const { return (size_t(key.m_id)); }
};
/**
Comparator for TrxMap */
struct TrxTrackHashCmp {
bool operator()(const TrxTrack &lhs, const TrxTrack &rhs) const {
return (lhs.m_id == rhs.m_id);
}
};
/**
Comparator for TrxMap */
struct TrxTrackCmp {
bool operator()(const TrxTrack &lhs, const TrxTrack &rhs) const {
return (lhs.m_id < rhs.m_id);
}
};
// typedef std::unordered_set TrxIdSet;
typedef std::set<TrxTrack, TrxTrackCmp, ut_allocator<TrxTrack>> TrxIdSet;
//storage/innobase/include
// Friend declaration
class MVCC;
/** Read view lists the trx ids of those transactions for which a consistent
read should not see the modifications to the database. */
class ReadView {
/** This is similar to a std::vector but it is not a drop
in replacement. It is specific to ReadView. */
class ids_t {
typedef trx_ids_t::value_type value_type;
/**
Constructor */
ids_t() : m_ptr(), m_size(), m_reserved() {}
/**
Destructor */
~ids_t() { UT_DELETE_ARRAY(m_ptr); }
/** Try and increase the size of the array. Old elements are copied across.
It is a no-op if n is < current size.
@param n Make space for n elements */
void reserve(ulint n);
/**
Resize the array, sets the current element count.
@param n new size of the array, in elements */
void resize(ulint n) {
ut_ad(n <= capacity());
m_size = n;
}
/**
Reset the size to 0 */
void clear() { resize(0); }
/**
@return the capacity of the array in elements */
ulint capacity() const { return (m_reserved); }
/**
Copy and overwrite the current array contents
@param start Source array
@param end Pointer to end of array */
void assign(const value_type *start, const value_type *end);
/**
Insert the value in the correct slot, preserving the order.
Doesn't check for duplicates. */
void insert(value_type value);
/**
@return the value of the first element in the array */
value_type front() const {
ut_ad(!empty());
return (m_ptr[0]);
}
/**
@return the value of the last element in the array */
value_type back() const {
ut_ad(!empty());
return (m_ptr[m_size - 1]);
}
/**
Append a value to the array.
@param value the value to append */
void push_back(value_type value);
/**
@return a pointer to the start of the array */
trx_id_t *data() { return (m_ptr); }
/**
@return a const pointer to the start of the array */
const trx_id_t *data() const { return (m_ptr); }
/**
@return the number of elements in the array */
ulint size() const { return (m_size); }
/**
@return true if size() == 0 */
bool empty() const { return (size() == 0); }
private:
// Prevent copying
ids_t(const ids_t &);
ids_t &operator=(const ids_t &);
private:
/** Memory for the array */
value_type *m_ptr;
/** Number of active elements in the array */
ulint m_size;
/** Size of m_ptr in elements */
ulint m_reserved;
friend class ReadView;
};
public:
ReadView();
~ReadView();
/** Check whether transaction id is valid.
@param[in] id transaction id to check
@param[in] name table name */
static void check_trx_id_sanity(trx_id_t id, const table_name_t &name);
/** Check whether the changes by id are visible.
@param[in] id transaction id to check against the view
@param[in] name table name
@return whether the view sees the modifications of id. */
bool changes_visible(trx_id_t id, const table_name_t &name) const
MY_ATTRIBUTE((warn_unused_result)) {
ut_ad(id > 0);
if (id < m_up_limit_id || id == m_creator_trx_id) {
return (true);
}
check_trx_id_sanity(id, name);
if (id >= m_low_limit_id) {
return (false);
} else if (m_ids.empty()) {
return (true);
}
const ids_t::value_type *p = m_ids.data();
return (!std::binary_search(p, p + m_ids.size(), id));
}
/**
@param id transaction to check
@return true if view sees transaction id */
bool sees(trx_id_t id) const { return (id < m_up_limit_id); }
/**
Mark the view as closed */
void close() {
ut_ad(m_creator_trx_id != TRX_ID_MAX);
m_creator_trx_id = TRX_ID_MAX;
}
/**
@return true if the view is closed */
bool is_closed() const { return (m_closed); }
/**
Write the limits to the file.
@param file file to write to */
void print_limits(FILE *file) const {
fprintf(file,
"Trx read view will not see trx with"
" id >= " TRX_ID_FMT ", sees < " TRX_ID_FMT "\n",
m_low_limit_id, m_up_limit_id);
}
/** Check and reduce low limit number for read view. Used to
block purge till GTID is persisted on disk table.
@param[in] trx_no transaction number to check with */
void reduce_low_limit(trx_id_t trx_no) {
if (trx_no < m_low_limit_no) {
/* Save low limit number set for Read View for MVCC. */
ut_d(m_view_low_limit_no = m_low_limit_no);
m_low_limit_no = trx_no;
}
}
/**
@return the low limit no */
trx_id_t low_limit_no() const { return (m_low_limit_no); }
/**
@return the low limit id */
trx_id_t low_limit_id() const { return (m_low_limit_id); }
/**
@return true if there are no transaction ids in the snapshot */
bool empty() const { return (m_ids.empty()); }
#ifdef UNIV_DEBUG
/**
@return the view low limit number */
trx_id_t view_low_limit_no() const { return (m_view_low_limit_no); }
/**
@param rhs view to compare with
@return true if this view is less than or equal rhs */
bool le(const ReadView *rhs) const {
return (m_low_limit_no <= rhs->m_low_limit_no);
}
#endif /* UNIV_DEBUG */
private:
/**
Copy the transaction ids from the source vector */
inline void copy_trx_ids(const trx_ids_t &trx_ids);
/**
Opens a read view where exactly the transactions serialized before this
point in time are seen in the view.
@param id Creator transaction id */
inline void prepare(trx_id_t id);
/**
Copy state from another view. Must call copy_complete() to finish.
@param other view to copy from */
inline void copy_prepare(const ReadView &other);
/**
Complete the copy, insert the creator transaction id into the
m_trx_ids too and adjust the m_up_limit_id *, if required */
inline void copy_complete();
/**
Set the creator transaction id, existing id must be 0 */
void creator_trx_id(trx_id_t id) {
ut_ad(m_creator_trx_id == 0);
m_creator_trx_id = id;
}
friend class MVCC;
private:
// Disable copying
ReadView(const ReadView &);
ReadView &operator=(const ReadView &);
private:
/** The read should not see any transaction with trx id >= this
value. In other words, this is the "high water mark". */
trx_id_t m_low_limit_id;
/** The read should see all trx ids which are strictly
smaller (<) than this value. In other words, this is the
"low water mark". */
trx_id_t m_up_limit_id;
/** trx id of creating transaction, set to TRX_ID_MAX for free
views. */
trx_id_t m_creator_trx_id;
/** Set of RW transactions that was active when this snapshot
was taken */
ids_t m_ids;
/** The view does not need to see the undo logs for transactions
whose transaction number is strictly smaller (<) than this value:
they can be removed in purge if not needed by other views */
trx_id_t m_low_limit_no;
#ifdef UNIV_DEBUG
/** The low limit number up to which read views don't need to access
undo log records for MVCC. This could be higher than m_low_limit_no
if purge is blocked for GTID persistence. Currently used for debug
variable INNODB_PURGE_VIEW_TRX_ID_AGE. */
trx_id_t m_view_low_limit_no;
#endif /* UNIV_DEBUG */
/** AC-NL-RO transaction view that has been "closed". */
bool m_closed;
typedef UT_LIST_NODE_T(ReadView) node_t;
/** List of read views in trx_sys */
byte pad1[64 - sizeof(node_t)];
node_t m_view_list;
};
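Looking at the data structures above, their cohesion is quite good, and good cohesion makes them much easier to study; at least you don't have to keep jumping around. The English comments are clear too.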
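2. Read flow
From the outside, a complete MVCC read starts with a SELECT. Its call stack was mentioned earlier:
do_command->dispatch_sql_command->mysql_execute_command ->m_sql_cmd->execute---->row_sel->row_sel_get_clust_rec, which eventually calls one of the following (one handles clustered index records and the other secondary, non-clustered index records, depending on the actual case):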
//storage/innobase/lock/lock0lock.cc
/** Checks that a record is seen in a consistent read.
@return true if sees, or false if an earlier version of the record
should be retrieved */
bool lock_clust_rec_cons_read_sees(
    const rec_t *rec,     /*!< in: user record which should be read or
                          passed over by a read cursor */
    dict_index_t *index,  /*!< in: clustered index */
    const ulint *offsets, /*!< in: rec_get_offsets(rec, index) */
    ReadView *view)       /*!< in: consistent read view */
{
  ut_ad(index->is_clustered());
  ut_ad(page_rec_is_user_rec(rec));
  ut_ad(rec_offs_validate(rec, index, offsets));
  /* Temp-tables are not shared across connections and multiple
  transactions from different connections cannot simultaneously
  operate on same temp-table and so read of temp-table is
  always consistent read. */
  if (srv_read_only_mode || index->table->is_temporary()) {
    ut_ad(view == nullptr || index->table->is_temporary());
    return (true);
  }
  /* NOTE that we call this function while holding the search
  system latch. */
  trx_id_t trx_id = row_get_rec_trx_id(rec, index, offsets);
  return (view->changes_visible(trx_id, index->table->name));
}
/** Checks that a non-clustered index record is seen in a consistent read.
NOTE that a non-clustered index page contains so little information on
its modifications that also in the case false, the present version of
rec may be the right, but we must check this from the clustered index
record.
@return true if certainly sees, or false if an earlier version of the
clustered index record might be needed */
bool lock_sec_rec_cons_read_sees(
    const rec_t *rec,          /*!< in: user record which
                               should be read or passed over
                               by a read cursor */
    const dict_index_t *index, /*!< in: index */
    const ReadView *view)      /*!< in: consistent read view */
{
  ut_ad(page_rec_is_user_rec(rec));
  /* NOTE that we might call this function while holding the search
  system latch. */
  if (recv_recovery_is_on()) {
    return (false);
  } else if (index->table->is_temporary()) {
    /* Temp-tables are not shared across connections and multiple
    transactions from different connections cannot simultaneously
    operate on same temp-table and so read of temp-table is
    always consistent read. */
    return (true);
  }
  trx_id_t max_trx_id = page_get_max_trx_id(page_align(rec));
  ut_ad(max_trx_id > 0);
  return (view->sees(max_trx_id));
}
Now look at the function that produces the final return value:
/** Check whether the changes by id are visible.
@param[in] id transaction id to check against the view
@param[in] name table name
@return whether the view sees the modifications of id. */
bool changes_visible(trx_id_t id, const table_name_t &name) const
    MY_ATTRIBUTE((warn_unused_result)) {
  ut_ad(id > 0);
  if (id < m_up_limit_id || id == m_creator_trx_id) {
    return (true);
  }
  check_trx_id_sanity(id, name);
  if (id >= m_low_limit_id) {
    return (false);
  } else if (m_ids.empty()) {
    return (true);
  }
  const ids_t::value_type *p = m_ids.data();
  return (!std::binary_search(p, p + m_ids.size(), id));
}
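Note that this check differs in a few details from the description earlier; the source code is authoritative, and the earlier analysis was mainly meant to explain the overall process. The source adds two checks, one for equality and one for an empty array. Equality means the data was written by this transaction itself, so it is of course visible; an empty array also means visible (the ID is in the middle range and no transaction was active).
3. Read View creation
As just mentioned, under RR the first query generates the Read View. Here is the actual process: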
//row0sel.cc
dberr_t row_search_mvcc(byte *buf, page_cur_mode_t mode,
                        row_prebuilt_t *prebuilt, ulint match_mode,
                        const ulint direction) {
  DBUG_TRACE;
  dict_index_t *index = prebuilt->index;
  ibool comp = dict_table_is_comp(index->table);
  const dtuple_t *search_tuple = prebuilt->search_tuple;
  ......
  /* Do some start-of-statement preparations */
  if (!prebuilt->sql_stat_start) {
    /* No need to set an intention lock or assign a read view */
    if (!MVCC::is_view_active(trx->read_view) && !srv_read_only_mode &&
        prebuilt->select_lock_type == LOCK_NONE) {
      ib::error(ER_IB_MSG_1031) << "MySQL is trying to perform a"
                                   " consistent read but the read view is not"
                                   " assigned!";
      trx_print(stderr, trx, 600);
      fputc('\n', stderr);
      ut_error;
    }
  } else if (prebuilt->select_lock_type == LOCK_NONE) {
    /* This is a consistent read */
    /* Assign a read view for the query */
    if (!srv_read_only_mode) {
      trx_assign_read_view(trx);  // the read view is assigned here
    }
    prebuilt->sql_stat_start = FALSE;
  } else {
  wait_table_again:
    err = lock_table(0, index->table,
                     prebuilt->select_lock_type == LOCK_S ? LOCK_IS : LOCK_IX,
                     thr);
    if (err != DB_SUCCESS) {
      table_lock_waited = TRUE;
      goto lock_table_wait;
    }
    prebuilt->sql_stat_start = FALSE;
  }
  ......
}
/** Assigns a read view for a consistent read query. All the consistent reads
within the same transaction will get the same read view, which is created
when this function is first called for a new started transaction.
@return consistent read view */
ReadView *trx_assign_read_view(trx_t *trx) /*!< in/out: active transaction */
{
  ut_ad(trx->state == TRX_STATE_ACTIVE);
  if (srv_read_only_mode) {
    ut_ad(trx->read_view == nullptr);
    return (nullptr);
  } else if (!MVCC::is_view_active(trx->read_view)) {
    trx_sys->mvcc->view_open(trx->read_view, trx);
  }
  return (trx->read_view);
}
/** Allocate and create a view.
@param view View owned by this class created for the caller. Must be
freed by calling view_close()
@param trx Transaction instance of caller */
void MVCC::view_open(ReadView *&view, trx_t *trx) {
  ut_ad(!srv_read_only_mode);
  /** If no new RW transaction has been started since the last view
  was created then reuse the existing view. */
  if (view != nullptr) {
    uintptr_t p = reinterpret_cast<uintptr_t>(view);
    view = reinterpret_cast<ReadView *>(p & ~1);
    ut_ad(view->m_closed);
    /* NOTE: This can be optimised further, for now we only
    reuse the view iff there are no active RW transactions.
    There is an inherent race here between purge and this
    thread. Purge will skip views that are marked as closed.
    Therefore we must set the low limit id after we reset the
    closed status after the check. */
    if (trx_is_autocommit_non_locking(trx) && view->empty()) {
      view->m_closed = false;
      if (view->m_low_limit_id == trx_sys_get_max_trx_id()) {
        return;
      } else {
        view->m_closed = true;
      }
    }
    mutex_enter(&trx_sys->mutex);
    UT_LIST_REMOVE(m_views, view);
  } else {
    mutex_enter(&trx_sys->mutex);
    view = get_view();
  }
  if (view != nullptr) {
    view->prepare(trx->id);
    UT_LIST_ADD_FIRST(m_views, view);  // add the view to the list kept by MVCC
    ut_ad(!view->is_closed());
    ut_ad(validate());
  }
  trx_sys_mutex_exit();
}
/**
Find a free view from the active list, if none found then allocate
a new view.
@return a view to use */
ReadView *MVCC::get_view() {
  ut_ad(mutex_own(&trx_sys->mutex));
  ReadView *view;
  if (UT_LIST_GET_LEN(m_free) > 0) {
    view = UT_LIST_GET_FIRST(m_free);
    UT_LIST_REMOVE(m_free, view);
  } else {
    view = UT_NEW_NOKEY(ReadView());
    if (view == nullptr) {
      ib::error(ER_IB_MSG_918) << "Failed to allocate MVCC view";
    }
  }
  return (view);
}
/**
Opens a read view where exactly the transactions serialized before this
point in time are seen in the view.
@param id Creator transaction id */
void ReadView::prepare(trx_id_t id) {
  ut_ad(mutex_own(&trx_sys->mutex));
  m_creator_trx_id = id;
  m_low_limit_no = m_low_limit_id = m_up_limit_id = trx_sys->max_trx_id;
  if (!trx_sys->rw_trx_ids.empty()) {
    copy_trx_ids(trx_sys->rw_trx_ids);
  } else {
    m_ids.clear();
  }
  ut_ad(m_up_limit_id <= m_low_limit_id);
  if (UT_LIST_GET_LEN(trx_sys->serialisation_list) > 0) {
    const trx_t *trx;
    trx = UT_LIST_GET_FIRST(trx_sys->serialisation_list);
    if (trx->no < m_low_limit_no) {
      m_low_limit_no = trx->no;
    }
  }
  ut_d(m_view_low_limit_no = m_low_limit_no);
  m_closed = false;
}
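Looking at the Read View creation at the end, there are two cases, depending on whether the view pointer is null. If it is not null, the existing view is reused; if it is null, a view is taken from the free views (or a new one is allocated when none is free). The view is then prepared and returned.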
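What prepare() leaves behind can be summarised in a short C++ sketch: given the IDs of the read-write transactions that are active right now and the next ID to be assigned, it derives the two limits and the sorted ID array. ViewSnapshot and make_snapshot are hypothetical names. The sketch also assumes that copy_trx_ids(), whose body is not quoted above, lowers m_up_limit_id to the smallest active ID and leaves the creator's own ID out of the array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using TrxId = std::uint64_t;

// Hypothetical result of taking a snapshot; names mirror the quoted fields.
struct ViewSnapshot {
  TrxId up_limit_id;       // everything below this is visible
  TrxId low_limit_id;      // everything at or above this is invisible
  TrxId creator_id;        // owner of the view (0 for a read-only owner)
  std::vector<TrxId> ids;  // sorted IDs active at creation, creator excluded
};

inline ViewSnapshot make_snapshot(std::vector<TrxId> active_rw_ids,
                                  TrxId next_trx_id, TrxId creator_id) {
  // ASSUMPTION: the creator's own ID is not kept in the array.
  active_rw_ids.erase(
      std::remove(active_rw_ids.begin(), active_rw_ids.end(), creator_id),
      active_rw_ids.end());
  std::sort(active_rw_ids.begin(), active_rw_ids.end());
  // Every active ID must already have been handed out.
  assert(active_rw_ids.empty() || active_rw_ids.back() < next_trx_id);

  ViewSnapshot view;
  view.creator_id = creator_id;
  view.low_limit_id = next_trx_id;
  // With no other active transaction both limits coincide.
  view.up_limit_id =
      active_rw_ids.empty() ? next_trx_id : active_rw_ids.front();
  view.ids = std::move(active_rw_ids);
  return view;
}
```

For example, with transactions 5, 7 and 9 active, 12 as the next ID and transaction 7 opening the view, the sketch yields up_limit_id = 5, low_limit_id = 12 and ids = {5, 9}.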
4. MVCC version creation and analysis
First, look at where a new version originates, namely the update operation mentioned earlier:
/** Updates a record when the update causes no size changes in its fields.
@param[in] flags Undo logging and locking flags
@param[in] cursor Cursor on the record to update; cursor stays valid and
positioned on the same record
@param[in,out] offsets Offsets on cursor->page_cur.rec
@param[in] update Update vector
@param[in] cmpl_info Compiler info on secondary index updates
@param[in] thr Query thread, or null if flags & (btr_no_locking_flag |
btr_no_undo_log_flag | btr_create_flag | btr_keep_sys_flag)
@param[in] trx_id Transaction id
@param[in,out] mtr Mini-transaction; if this is a secondary index, the caller
must mtr_commit(mtr) before latching any further pages
@return locking or undo log related error code, or
@retval DB_SUCCESS on success
@retval DB_ZIP_OVERFLOW if there is not enough space left
on the compressed page (IBUF_BITMAP_FREE was reset outside mtr) */
dberr_t btr_cur_update_in_place(ulint flags, btr_cur_t *cursor, ulint *offsets,
const upd_t *update, ulint cmpl_info,
que_thr_t *thr, trx_id_t trx_id, mtr_t *mtr) {
dict_index_t *index;
buf_block_t *block;
page_zip_des_t *page_zip;
dberr_t err;
rec_t *rec;
roll_ptr_t roll_ptr = 0;
ulint was_delete_marked;
ibool is_hashed;
rec = btr_cur_get_rec(cursor);
index = cursor->index;
ut_ad(rec_offs_validate(rec, index, offsets));
ut_ad(!!page_rec_is_comp(rec) == dict_table_is_comp(index->table));
ut_ad(trx_id > 0 || (flags & BTR_KEEP_SYS_FLAG) ||
index->table->is_intrinsic());
/* The insert buffer tree should never be updated in place. */
ut_ad(!dict_index_is_ibuf(index));
ut_ad(dict_index_is_online_ddl(index) == !!(flags & BTR_CREATE_FLAG) ||
index->is_clustered());
ut_ad((flags & ~(BTR_KEEP_POS_FLAG | BTR_KEEP_IBUF_BITMAP)) ==
(BTR_NO_UNDO_LOG_FLAG | BTR_NO_LOCKING_FLAG | BTR_CREATE_FLAG |
BTR_KEEP_SYS_FLAG) ||
thr_get_trx(thr)->id == trx_id);
ut_ad(fil_page_index_page_check(btr_cur_get_page(cursor)));
ut_ad(btr_page_get_index_id(btr_cur_get_page(cursor)) == index->id);
DBUG_PRINT("ib_cur",
("update-in-place %s (" IB_ID_FMT ") by " TRX_ID_FMT ": %s",
index->name(), index->id, trx_id,
rec_printer(rec, offsets).str().c_str()));
block = btr_cur_get_block(cursor);
page_zip = buf_block_get_page_zip(block);
/* Check that enough space is available on the compressed page. */
if (page_zip) {
ut_ad(!index->table->is_temporary());
if (!btr_cur_update_alloc_zip(page_zip, btr_cur_get_page_cur(cursor), index,
offsets, rec_offs_size(offsets), false,
mtr)) {
return (DB_ZIP_OVERFLOW);
}
rec = btr_cur_get_rec(cursor);
}
/* Do lock checking and undo logging */
err = btr_cur_upd_lock_and_undo(flags, cursor, offsets, update, cmpl_info,
thr, mtr, &roll_ptr);
if (UNIV_UNLIKELY(err != DB_SUCCESS)) {
/* We may need to update the IBUF_BITMAP_FREE
bits after a reorganize that was done in
btr_cur_update_alloc_zip(). */
goto func_exit;
}
if (!(flags & BTR_KEEP_SYS_FLAG) && !index->table->is_intrinsic()) {
row_upd_rec_sys_fields(rec, nullptr, index, offsets, thr_get_trx(thr),
roll_ptr);
}
was_delete_marked =
rec_get_deleted_flag(rec, page_is_comp(buf_block_get_frame(block)));
is_hashed = (block->index != nullptr);
if (is_hashed) {
/* TO DO: Can we skip this if none of the fields
index->search_info->curr_n_fields
are being updated? */
/* The function row_upd_changes_ord_field_binary works only
if the update vector was built for a clustered index, we must
NOT call it if index is secondary */
if (!index->is_clustered() ||
row_upd_changes_ord_field_binary(index, update, thr, nullptr, nullptr,
nullptr)) {
/* Remove possible hash index pointer to this record */
btr_search_update_hash_on_delete(cursor);
}
rw_lock_x_lock(btr_get_search_latch(index));
}
assert_block_ahi_valid(block);
row_upd_rec_in_place(rec, index, offsets, update, page_zip);
if (is_hashed) {
rw_lock_x_unlock(btr_get_search_latch(index));
}
btr_cur_update_in_place_log(flags, rec, index, update, trx_id, roll_ptr, mtr);
if (was_delete_marked &&
!rec_get_deleted_flag(rec, page_is_comp(buf_block_get_frame(block)))) {
/* The new updated record owns its possible externally
stored fields */
lob::BtrContext btr_ctx(mtr, nullptr, index, rec, offsets, block);
btr_ctx.unmark_extern_fields();
}
ut_ad(err == DB_SUCCESS);
func_exit:
if (page_zip && !(flags & BTR_KEEP_IBUF_BITMAP) && !index->is_clustered() &&
page_is_leaf(buf_block_get_frame(block))) {
/* Update the free bits in the insert buffer. */
ibuf_update_free_bits_zip(block, mtr);
}
return (err);
}
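There are also insert and other operations; if you are interested, look at the corresponding functions. The query side is initiated in row_search_mvcc(), the function mentioned earlier: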
dberr_t row_search_mvcc(byte *buf, page_cur_mode_t mode,
row_prebuilt_t *prebuilt, ulint match_mode,
const ulint direction)
{
else if (index == clust_index) {
/* Fetch a previous version of the row if the current
one is not visible in the snapshot; if we have a very
high force recovery level set, we try to avoid crashes
by skipping this lookup */
if (srv_force_recovery < 5 &&
!lock_clust_rec_cons_read_sees(rec, index, offsets,
trx_get_read_view(trx))) {
rec_t *old_vers;
/* The following call returns 'offsets' associated with 'old_vers' */
err = row_sel_build_prev_vers_for_mysql(
trx->read_view, clust_index, prebuilt, rec, &offsets, &heap,
&old_vers, need_vrow ? &vrow : nullptr, &mtr,
prebuilt->get_lob_undo());
if (err != DB_SUCCESS) {
goto lock_wait_or_error;
}
if (old_vers == nullptr) {
/* The row did not exist yet in
the read view */
goto next_rec;
}
rec = old_vers;
prev_rec = rec;
ut_d(prev_rec_debug = row_search_debug_copy_rec_order_prefix(
pcur, index, prev_rec, &prev_rec_debug_n_fields,
&prev_rec_debug_buf, &prev_rec_debug_buf_size));
}
}
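Next come the creation, matching and checking of the view, which have already been covered. Now look at how the data of a specific record version is produced: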
row_search_mvcc -> row_sel_build_prev_vers_for_mysql -> row_vers_build_for_consistent_read -> trx_undo_prev_version_build
bool trx_undo_prev_version_build(
const rec_t *index_rec ATTRIB_USED_ONLY_IN_DEBUG,
mtr_t *index_mtr ATTRIB_USED_ONLY_IN_DEBUG, const rec_t *rec,
const dict_index_t *const index, ulint *offsets, mem_heap_t *heap,
rec_t **old_vers, mem_heap_t *v_heap, const dtuple_t **vrow, ulint v_status,
lob::undo_vers_t *lob_undo) {
DBUG_TRACE;
trx_undo_rec_t *undo_rec = nullptr;
dtuple_t *entry;
trx_id_t rec_trx_id;
ulint type;
undo_no_t undo_no;
table_id_t table_id;
trx_id_t trx_id;
roll_ptr_t roll_ptr;
upd_t *update = nullptr;
byte *ptr;
ulint info_bits;
ulint cmpl_info;
bool dummy_extern;
byte *buf;
ut_ad(!rw_lock_own(&purge_sys->latch, RW_LOCK_S));
ut_ad(mtr_memo_contains_page(index_mtr, index_rec, MTR_MEMO_PAGE_S_FIX) ||
mtr_memo_contains_page(index_mtr, index_rec, MTR_MEMO_PAGE_X_FIX));
ut_ad(rec_offs_validate(rec, index, offsets));
ut_a(index->is_clustered());
roll_ptr = row_get_rec_roll_ptr(rec, index, offsets);
*old_vers = nullptr;
if (trx_undo_roll_ptr_is_insert(roll_ptr)) {
/* The record rec is the first inserted version */
return true;
}
rec_trx_id = row_get_rec_trx_id(rec, index, offsets);
/* REDO rollback segments are used only for non-temporary objects.
For temporary objects NON-REDO rollback segments are used. */
bool is_temp = index->table->is_temporary();
ut_ad(!index->table->skip_alter_undo);
if (trx_undo_get_undo_rec(roll_ptr, rec_trx_id, heap, is_temp,
index->table->name, &undo_rec)) {
if (v_status & TRX_UNDO_PREV_IN_PURGE) {
/* We are fetching the record being purged */
undo_rec = trx_undo_get_undo_rec_low(roll_ptr, heap, is_temp);
} else {
/* The undo record may already have been purged,
during purge or semi-consistent read. */
return false;
}
}
type_cmpl_t type_cmpl;
ptr = trx_undo_rec_get_pars(undo_rec, &type, &cmpl_info, &dummy_extern,
&undo_no, &table_id, type_cmpl);
if (table_id != index->table->id) {
/* The table should have been rebuilt, but purge has
not yet removed the undo log records for the
now-dropped old table (table_id). */
return true;
}
ptr = trx_undo_update_rec_get_sys_cols(ptr, &trx_id, &roll_ptr, &info_bits);
/* (a) If a clustered index record version is such that the
trx id stamp in it is bigger than purge_sys->view, then the
BLOBs in that version are known to exist (the purge has not
progressed that far);
(b) if the version is the first version such that trx id in it
is less than purge_sys->view, and it is not delete-marked,
then the BLOBs in that version are known to exist (the purge
cannot have purged the BLOBs referenced by that version
yet).
This function does not fetch any BLOBs. The callers might, by
possibly invoking row_ext_create() via row_build(). However,
they should have all needed information in the *old_vers
returned by this function. This is because *old_vers is based
on the transaction undo log records. The function
trx_undo_page_fetch_ext() will write BLOB prefixes to the
transaction undo log that are at least as long as the longest
possible column prefix in a secondary index. Thus, secondary
index entries for *old_vers can be constructed without
dereferencing any BLOB pointers. */
ptr = trx_undo_rec_skip_row_ref(ptr, index);
ptr = trx_undo_update_rec_get_update(ptr, index, type, trx_id, roll_ptr,
info_bits, nullptr, heap, &update,
lob_undo, type_cmpl);
ut_a(ptr);
if (row_upd_changes_field_size_or_external(index, offsets, update)) {
/* We should confirm the existence of disowned external data,
if the previous version record is delete marked. If the trx_id
of the previous record is seen by purge view, we should treat
it as missing history, because the disowned external data
might be purged already.
The inherited external data (BLOBs) can be freed (purged)
after trx_id was committed, provided that no view was started
before trx_id. If the purge view can see the committed
delete-marked record by trx_id, no transactions need to access
the BLOB. */
/* the row_upd_changes_disowned_external(update) call could be
omitted, but the synchronization on purge_sys->latch is likely
more expensive. */
if ((update->info_bits & REC_INFO_DELETED_FLAG) &&
row_upd_changes_disowned_external(update)) {
bool missing_extern;
rw_lock_s_lock(&purge_sys->latch);
missing_extern =
purge_sys->view.changes_visible(trx_id, index->table->name);
rw_lock_s_unlock(&purge_sys->latch);
if (missing_extern) {
/* treat as a fresh insert, not to
cause assertion error at the caller. */
return true;
}
}
/* We have to set the appropriate extern storage bits in the
old version of the record: the extern bits in rec for those
fields that update does NOT update, as well as the bits for
those fields that update updates to become externally stored
fields. Store the info: */
entry = row_rec_to_index_entry(rec, index, offsets, heap);
/* The page containing the clustered index record
corresponding to entry is latched in mtr. Thus the
following call is safe. */
row_upd_index_replace_new_col_vals(entry, index, update, heap);
buf = static_cast<byte *>(
mem_heap_alloc(heap, rec_get_converted_size(index, entry)));
*old_vers = rec_convert_dtuple_to_rec(buf, index, entry);
} else {
buf = static_cast<byte *>(mem_heap_alloc(heap, rec_offs_size(offsets)));
*old_vers = rec_copy(buf, rec, offsets);
rec_offs_make_valid(*old_vers, index, offsets);
row_upd_rec_in_place(*old_vers, index, offsets, update, nullptr);
}
/* Set the old value (which is the after image of an update) in the
update vector to dtuple vrow */
if (v_status & TRX_UNDO_GET_OLD_V_VALUE) {
row_upd_replace_vcol((dtuple_t *)*vrow, index->table, update, false,
nullptr, nullptr);
}
#if defined UNIV_DEBUG || defined UNIV_BLOB_LIGHT_DEBUG
ut_a(!rec_offs_any_null_extern(
*old_vers,
rec_get_offsets(*old_vers, index, nullptr, ULINT_UNDEFINED, &heap)));
#endif // defined UNIV_DEBUG || defined UNIV_BLOB_LIGHT_DEBUG
/* If vrow is not NULL it means that the caller is interested in the values of
the virtual columns for this version.
If the UPD_NODE_NO_ORD_CHANGE flag is set on cmpl_info, it means that the
change which created this entry in undo log did not affect any column of any
secondary index (in particular: virtual), and thus the values of virtual
columns were not recorded in undo. In such case the caller may assume that the
values of (virtual) columns present in secondary index are exactly the same as
they are in the next (more recent) version.
If on the other hand the UPD_NODE_NO_ORD_CHANGE flag is not set, then we will
make sure that *vrow points to a properly allocated memory and contains the
values of virtual columns for this version recovered from undo log.
This implies that if the caller has provided a non-NULL vrow, and the *vrow is
still NULL after the call, (and old_vers is not NULL) it must be because the
UPD_NODE_NO_ORD_CHANGE flag was set for this version.
This last statement is an important assumption made by the
row_vers_impl_x_locked_low() function. */
if (vrow && !(cmpl_info & UPD_NODE_NO_ORD_CHANGE)) {
if (!(*vrow)) {
*vrow = dtuple_create_with_vcol(v_heap ? v_heap : heap,
index->table->get_n_cols(),
dict_table_get_n_v_cols(index->table));
dtuple_init_v_fld(*vrow);
}
ut_ad(index->table->n_v_cols);
trx_undo_read_v_cols(index->table, ptr, *vrow,
v_status & TRX_UNDO_PREV_IN_PURGE, false, nullptr,
(v_heap != nullptr ? v_heap : heap));
}
if (update != nullptr) {
update->reset();
}
return true;
}
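This is one of the functions behind the version chain introduced earlier. By parsing the undo log and following the roll pointers one by one, the versions are linked into a live version chain.
In this way, view creation, the visibility check and the matching rules applied to the version chain built by MVCC together yield the actual version of the data to return.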
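The interplay between the visibility check and the version chain can be sketched as a linked-list walk in C++. This is a deliberately simplified model: RowVersion and find_visible_version are hypothetical names, every older version is assumed to be already materialised in memory (InnoDB instead rebuilds each one from an undo record on demand), and the visibility rule is passed in as a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

using TrxId = std::uint64_t;

// Hypothetical in-memory row version; `older` plays the role of DB_ROLL_PTR.
struct RowVersion {
  TrxId trx_id;             // transaction that wrote this version
  std::string value;        // the row data in this version
  const RowVersion *older;  // previous version, or nullptr at the first insert
};

// Walk from the newest version towards older ones and return the first
// version the reader is allowed to see, or nullptr if none is visible.
inline const RowVersion *find_visible_version(
    const RowVersion *newest, const std::function<bool(TrxId)> &is_visible) {
  assert(static_cast<bool>(is_visible));
  for (const RowVersion *v = newest; v != nullptr; v = v->older) {
    if (is_visible(v->trx_id)) {
      return v;
    }
  }
  return nullptr;  // the row did not exist yet for this reader
}
```

Returning nullptr corresponds to the "row did not exist yet in the read view" branch of the quoted row_search_mvcc() excerpt, where the record is simply skipped.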
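MVCC is one way of handling data synchronization and consistency, and an effective means of isolating transactions. If a database executed reads and writes strictly serially, no such mechanism would be needed; in practice, though, quite a few techniques have been proposed to get better results, higher concurrency and faster access, and "Designing Data-Intensive Applications" (《数据密集型应用系统设计》) covers them. So make sure you understand the principles first, then compare them against a concrete implementation, and the whole story becomes clear. Know what it is and know why it is so; that is true knowledge.
Keep at it, returning youth!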