Case description:
An erroneous operation in production: archiving the data of a large table into a history table. Roughly 16 million rows of historical data were involved, about 15 GB. To save effort, I simply ran insert into history select * from table where xxxx; The statement ran for about 70 minutes, and the execution itself went without incident. But at the moment of commit, the server load spiked, MySQL threads shot up to over 1,500 (normally around 300), and a large number of transactions failed to commit. The error log showed nothing at all. The binlog, however, showed that this single operation produced a binlog file of about 13 GB (the configured binlog size limit is 1 GB). Yet all of that was secondary. The critical problem: every application connected to this MySQL instance errored out at that same instant, with every transaction reporting a timeout.
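A quick aside: the oversized file is easy to confirm, because MySQL writes each transaction to a single binlog file in one piece, so one huge transaction can blow straight past the configured limit. A minimal check with standard statements:

SHOW BINARY LOGS;                        -- File_size column is in bytes
SHOW VARIABLES LIKE 'max_binlog_size';   -- per-file limit; a single transaction is never
                                         -- split across files, so it can exceed this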
Case analysis:
That is roughly what happened; now let's analyze why.
The initial inference was that because this INSERT touched 16 million rows (about 15 GB), everything was flushed to disk in one go at commit, blocking all other transactions outside. Exactly which step did the blocking was unclear, so let's prove where things went wrong.
Enable debug mode and check what the log shows at the moment of commit:
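A side note on how such a trace is produced: it comes from MySQL's DBUG facility, which only exists in servers compiled with debugging support (e.g. cmake -DWITH_DEBUG=1). A minimal sketch, with an illustrative output path:

SET SESSION debug = 'd:t:i:o,/tmp/mysqld.trace';  -- d=debug output, t=call trace, i=thread ids
-- ... run the transaction and issue COMMIT here ...
SET SESSION debug = '';                           -- turn tracing off again

The trace captured around the COMMIT looks like this: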
- ...
- ...
- ...
- T@2: | | | | >trans_commit
- T@2: | | | | | >trans_check
- T@2: | | | | | <trans_check 88
- T@2: | | | | | info: clearing SERVER_STATUS_IN_TRANS
- T@2: | | | | | >commit_owned_gtids(...)
- T@2: | | | | | <commit_owned_gtids(...) 1622
- T@2: | | | | | >ha_commit_trans
- T@2: | | | | | | info: all=1 thd->in_sub_stmt=0 ha_info=0x7fffc8002570 is_real_trans=1
- T@2: | | | | | | debug: Acquire MDL commit lock
- T@2: | | | | | | >check_readonly
- T@2: | | | | | | <check_readonly 522
- T@2: | | | | | | >MYSQL_BIN_LOG::prepare
- T@2: | | | | | | | >ha_prepare_low
- T@2: | | | | | | | | >binlog_prepare
- T@2: | | | | | | | | <binlog_prepare 1587
- T@2: | | | | | | | | >innobase_trx_init
- T@2: | | | | | | | | <innobase_trx_init 2765
- T@2: | | | | | | | <ha_prepare_low 2358
- T@2: | | | | | | <MYSQL_BIN_LOG::prepare 8153
- T@2: | | | | | | >MYSQL_BIN_LOG::commit
- T@2: | | | | | | | info: query='commit'
- T@2: | | | | | | | enter: thd: 0x7fffc8000df0, all: yes, xid: 25, cache_mngr: 0x7fffc8060570
- T@2: | | | | | | | debug: in_transaction: yes, no_2pc: no, rw_ha_count: 2
- T@2: | | | | | | | debug: trx_cache - pending: 0x0, bytes: 723876
- T@2: | | | | | | | debug: all.cannot_safely_rollback(): no, trx_cache_empty: no
- T@2: | | | | | | | debug: stmt_cache - pending: 0x0, bytes: 0
- T@2: | | | | | | | debug: stmt.cannot_safely_rollback(): no, stmt_cache_empty: yes
- T@2: | | | | | | | debug: stmt_cache - pending: 0x0, bytes: 0
- T@2: | | | | | | | debug: trx_cache - pending: 0x0, bytes: 723876
- T@2: | | | | | | | >binlog_cache_data::finalize
- T@2: | | | | | | | | debug: trx_cache - pending: 0x0, bytes: 723876
- T@2: | | | | | | | | >binlog_cache_data::write_event
- T@2: | | | | | | | | | >Log_event::write_header
- T@2: | | | | | | | | | | >Log_event::need_checksum
- T@2: | | | | | | | | | | <Log_event::need_checksum 969
- T@2: | | | | | | | | | | >Log_event::need_checksum
- T@2: | | | | | | | | | | <Log_event::need_checksum 969
- T@2: | | | | | | | | | <Log_event::write_header 1118
- T@2: | | | | | | | | | >Log_event::need_checksum
- T@2: | | | | | | | | | <Log_event::need_checksum 969
- T@2: | | | | | | | | | >Log_event::need_checksum
- T@2: | | | | | | | | | <Log_event::need_checksum 969
- T@2: | | | | | | | | <binlog_cache_data::write_event 1142
- T@2: | | | | | | | | debug: flags.finalized: yes
- T@2: | | | | | | | <binlog_cache_data::finalize 1351
- T@2: | | | | | | | debug: is_empty: 1
- T@2: | | | | | | | >Rpl_transaction_ctx::is_transaction_rollback
- T@2: | | | | | | | <Rpl_transaction_ctx::is_transaction_rollback 62
- T@2: | | | | | | | debug: Acquiring binlog protection lock
- T@2: | | | | | | | >Global_backup_lock::acquire_protection
- T@2: | | | | | | | | >Global_backup_lock::init_protection_request
- T@2: | | | | | | | | <Global_backup_lock::init_protection_request 1384
- T@2: | | | | | | | <Global_backup_lock::acquire_protection 1365
- T@2: | | | | | | | >MYSQL_BIN_LOG::ordered_commit
- T@2: | | | | | | | | enter: flags.pending: yes, commit_error: 0, thread_id: 2
- T@2: | | | | | | | | >MYSQL_BIN_LOG::change_stage
- T@2: | | | | | | | | | enter: thd: 0x7fffc8000df0, stage: FLUSH, queue: 0x7fffc8000df0
- T@2: | | | | | | | | | debug: Enqueue 0x7fffc8000df0 to queue for stage 0
- T@2: | | | | | | | | | >Stage_manager::Mutex_queue::append
- T@2: | | | | | | | | | | enter: first: 0x7fffc8000df0
- T@2: | | | | | | | | | | info: m_first: 0x0, &m_first: 0x2dfe480, m_last: 0x2dfe480
- T@2: | | | | | | | | | | info: m_first: 0x7fffc8000df0, &m_first: 0x2dfe480, m_last: 0x2dfe480
- T@2: | | | | | | | | | | info: m_first: 0x7fffc8000df0, &m_first: 0x2dfe480, m_last: 0x7fffc80030c0
- T@2: | | | | | | | | | | return: empty: yes
- T@2: | | | | | | | | | <Stage_manager::Mutex_queue::append 1904
- T@2: | | | | | | | | <MYSQL_BIN_LOG::change_stage 8755
- T@2: | | | | | | | | >MYSQL_BIN_LOG::process_flush_stage_queue
- T@2: | | | | | | | | | debug: Fetching queue for stage 0
- T@2: | | | | | | | | | >Stage_manager::Mutex_queue::fetch_and_empty
- T@2: | | | | | | | | | | enter: m_first: 0x7fffc8000df0, &m_first: 0x2dfe480, m_last: 0x7fffc80030c0
- T@2: | | | | | | | | | | info: m_first: 0x0, &m_first: 0x2dfe480, m_last: 0x2dfe480
- T@2: | | | | | | | | | | return: result: 0x7fffc8000df0
- T@2: | | | | | | | | | <Stage_manager::Mutex_queue::fetch_and_empty 2012
- T@2: | | | | | | | | | >plugin_foreach_with_mask
- T@2: | | | | | | | | | | >innobase_flush_logs
- T@2: | | | | | | | | | | | ib_log: write 207335827 to 207336069
- T@2: | | | | | | | | | | | ib_log: write 207335424 to 8329216: group 0 len 1024 blocks 404953..404954
- T@2: | | | | | | | | | | <innobase_flush_logs 4409
- T@2: | | | | | | | | | <plugin_foreach_with_mask 2322
- T@2: | | | | | | | | | >binlog_cache_data::flush
- T@2: | | | | | | | | | | debug: flags.finalized: no
- T@2: | | | | | | | | | <binlog_cache_data::flush 1471
- T@2: | | | | | | | | | >binlog_cache_data::flush
- T@2: | | | | | | | | | | debug: flags.finalized: yes
- T@2: | | | | | | | | | | debug: bytes_in_cache: 723903
- T@2: | | | | | | | | | | >MYSQL_BIN_LOG::write_gtid
- T@2: | | | | | | | | | | | >Rpl_transaction_ctx::get_gno
- T@2: | | | | | | | | | | | <Rpl_transaction_ctx::get_gno 80
- T@2: | | | | | | | | | | | >Rpl_transaction_ctx::get_sidno
- T@2: | | | | | | | | | | | <Rpl_transaction_ctx::get_sidno 74
- T@2: | | | | | | | | | | | >Gtid_state::generate_automatic_gtid
- T@2: | | | | | | | | | | | | info: 0x2fbaf40.rdlock()
- T@2: | | | | | | | | | | | | >Gtid_state::acquire_anonymous_ownership
- T@2: | | | | | | | | | | | | | info: anonymous_gtid_count increased to 1
- T@2: | | | | | | | | | | | | <Gtid_state::acquire_anonymous_ownership 2476
- T@2: | | | | | | | | | | | | >Gtid::to_string
- T@2: | | | | | | | | | | | | <Gtid::to_string 195
- T@2: | | | | | | | | | | | | info: set owned_gtid (anonymous) in generate_automatic_gtid: -2:0
- T@2: | | | | | | | | | | | | info: 0x2fbaf40.unlock()
- T@2: | | | | | | | | | | | | >gtid_set_performance_schema_values
- T@2: | | | | | | | | | | | | <gtid_set_performance_schema_values 629
- T@2: | | | | | | | | | | | <Gtid_state::generate_automatic_gtid 652
- T@2: | | | | | | | | | | | >Gtid_log_event::Gtid_log_event(THD *)
- T@2: | | | | | | | | | | | | >Gtid_specification::to_string(char*)
- T@2: | | | | | | | | | | | | <Gtid_specification::to_string(char*) 83
- T@2: | | | | | | | | | | | | info: SET @@SESSION.GTID_NEXT= 'ANONYMOUS'
- T@2: | | | | | | | | | | | <Gtid_log_event::Gtid_log_event(THD *) 13079
- T@2: | | | | | | | | | | | >Gtid_log_event::write_data_header_to_memory
- T@2: | | | | | | | | | | | | info: sid=00000000-0000-0000-0000-000000000000 sidno=0 gno=0
- T@2: | | | | | | | | | | | <Gtid_log_event::write_data_header_to_memory 13208
- T@2: | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | <Binlog_event_writer::write_event_part 1033
- T@2: | | | | | | | | | | <MYSQL_BIN_LOG::write_gtid 1233
- T@2: | | | | | | | | | | >MYSQL_BIN_LOG::write_cache(THD *, binlog_cache_data *, bool)
- T@2: | | | | | | | | | | | >MYSQL_BIN_LOG::do_write_cache
- T@2: | | | | | | | | | | | | >reinit_io_cache
- T@2: | | | | | | | | | | | | | enter: cache: 0x7fffc8060700 type: 1 seek_offset: 0 clear_cache: 0
- T@2: | | | | | | | | | | | | | >my_b_flush_io_cache
- T@2: | | | | | | | | | | | | | | enter: cache: 0x7fffc8060700
- T@2: | | | | | | | | | | | | | | >my_write
- T@2: | | | | | | | | | | | | | | | my: fd: 61 Buffer: 0x7fffc8073f30 Count: 3007 MyFlags: 20
- T@2: | | | | | | | | | | | | | | <my_write 115
- T@2: | | | | | | | | | | | | | <my_b_flush_io_cache 1583
- T@2: | | | | | | | | | | | | <reinit_io_cache 387
- T@2: | | | | | | | | | | | | >my_seek
- T@2: | | | | | | | | | | | | | my: fd: 61 Pos: 0 Whence: 0 MyFlags: 0
- T@2: | | | | | | | | | | | | <my_seek 80
- T@2: | | | | | | | | | | | | >my_read
- T@2: | | | | | | | | | | | | | my: fd: 61 Buffer: 0x7fffc8073f30 Count: 32768 MyFlags: 16
- T@2: | | | | | | | | | | | | <my_read 106
- T@2: | | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | | <Binlog_event_writer::write_event_part 1033
- T@2: | | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | | <Binlog_event_writer::write_event_part 1033
- T@2: | | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | | <Binlog_event_writer::write_event_part 1033
- T@2: | | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | | <Binlog_event_writer::write_event_part 1033
- T@2: | | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | | <Binlog_event_writer::write_event_part 1033
- T@2: | | | | | | | | | | | | >Binlog_event_writer::write_event_part
- T@2: | | | | | | | | | | | | | >my_b_flush_io_cache
The log is long, so I won't explain it line by line; here are the key points, in top-to-bottom order:
First, trans_commit: the transaction commit (COMMIT) begins.
debug: Acquire MDL commit lock: before the commit can proceed, an MDL (metadata lock) commit lock must be acquired.
MYSQL_BIN_LOG::prepare: once the MDL is held, the prepare phase begins.
MYSQL_BIN_LOG::commit: the commit phase begins.
T@2: | | | | | | | >binlog_cache_data::finalize
T@2: | | | | | | | | debug: trx_cache - pending: 0x0, bytes: 723876
T@2: | | | | | | | | >binlog_cache_data::write_event
These three lines are where this case gets interesting; more on them later. They mean the binlog cache has been finalized and is about to be flushed from memory out to disk.
T@2: | | | | | | | debug: Acquiring binlog protection lock
T@2: | | | | | | | >Global_backup_lock::acquire_protection
T@2: | | | | | | | | >Global_backup_lock::init_protection_request
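Before digging into the source, a hedged aside: these metadata-lock waits, including the commit lock and the binlog protection (backup) lock above, can be observed from SQL through performance_schema, assuming MySQL 5.7+ with the mdl instrument enabled:

UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES'
 WHERE NAME = 'wait/lock/metadata/sql/mdl';   -- enable MDL instrumentation

SELECT OBJECT_TYPE, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
  FROM performance_schema.metadata_locks
 WHERE LOCK_STATUS = 'PENDING';               -- sessions stuck waiting on MDL

The Global_backup_lock / MDL path in the trace leads straight into the following comments in the server source: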
volatile int32 Global_read_lock::m_active_requests;
/****************************************************************************
Handling of global read locks
Global read lock is implemented using metadata lock infrastructure.
Taking the global read lock is TWO steps (2nd step is optional; without
it, COMMIT of existing transactions will be allowed):
lock_global_read_lock() THEN make_global_read_lock_block_commit().
How blocking of threads by global read lock is achieved: that's
semi-automatic. We assume that any statement which should be blocked
by global read lock will either open and acquires write-lock on tables
or acquires metadata locks on objects it is going to modify. For any
such statement global IX metadata lock is automatically acquired for
its duration (in case of LOCK TABLES until end of LOCK TABLES mode).
And lock_global_read_lock() simply acquires global S metadata lock
and thus prohibits execution of statements which modify data (unless
they modify only temporary tables). If deadlock happens it is detected
by MDL subsystem and resolved in the standard fashion (by backing-off
metadata locks acquired so far and restarting open tables process
if possible).
Why does FLUSH TABLES WITH READ LOCK need to block COMMIT: because it's used
to read a non-moving SHOW MASTER STATUS, and a COMMIT writes to the binary
log.
Why getting the global read lock is two steps and not one. Because FLUSH
TABLES WITH READ LOCK needs to insert one other step between the two:
flushing tables. So the order is
1) lock_global_read_lock() (prevents any new table write locks, i.e. stalls
all new updates)
2) close_cached_tables() (the FLUSH TABLES), which will wait for tables
currently opened and being updated to close (so it's possible that there is
a moment where all new updates of server are stalled *and* FLUSH TABLES WITH
READ LOCK is, too).
3) make_global_read_lock_block_commit().
If we have merged 1) and 3) into 1), we would have had this deadlock:
imagine thread 1 and 2, in non-autocommit mode, thread 3, and an InnoDB
table t.
thd1: SELECT * FROM t FOR UPDATE;
thd2: UPDATE t SET a=1; # blocked by row-level locks of thd1
thd3: FLUSH TABLES WITH READ LOCK; # blocked in close_cached_tables() by the
table instance of thd2
thd1: COMMIT; # blocked by thd3.
thd1 blocks thd2 which blocks thd3 which blocks thd1: deadlock.
Note that we need to support that one thread does
FLUSH TABLES WITH READ LOCK; and then COMMIT;
(that's what innobackup does, for some good reason).
So in this exceptional case the COMMIT should not be blocked by the FLUSH
TABLES WITH READ LOCK.
****************************************************************************/
/**
Take global read lock, wait if there is protection against lock.
If the global read lock is already taken by this thread, then nothing is done.
See also "Handling of global read locks" above.
@param thd Reference to thread.
@retval False Success, global read lock set, commits are NOT blocked.
@retval True Failure, thread was killed.
*/
/**
Flush and commit the transaction.
This will execute an ordered flush and commit of all outstanding
transactions and is the main function for the binary log group
commit logic. The function performs the ordered commit in two
phases.
The first phase flushes the caches to the binary log and under
LOCK_log and marks all threads that were flushed as not pending.
The second phase executes under LOCK_commit and commits all
transactions in order.
The procedure is:
1. Queue ourselves for flushing.
2. Grab the log lock, which might result is blocking if the mutex is
already held by another thread.
3. If we were not committed while waiting for the lock
1. Fetch the queue
2. For each thread in the queue:
a. Attach to it
b. Flush the caches, saving any error code
3. Flush and sync (depending on the value of sync_binlog).
4. Signal that the binary log was updated
4. Release the log lock
5. Grab the commit lock
1. For each thread in the queue:
a. If there were no error when flushing and the transaction shall be committed:
- Commit the transaction, saving the result of executing the commit.
6. Release the commit lock
7. Call purge, if any of the committed thread requested a purge.
8. Return with the saved error code
@todo The use of @c skip_commit is a hack that we use since the @c
TC_LOG Interface does not contain functions to handle
savepoints. Once the binary log is eliminated as a handlerton and
the @c TC_LOG interface is extended with savepoint handling, this
parameter can be removed.
@param thd Session to commit transaction for
@param all This is @c true if this is a real transaction commit, and
@c false otherwise.
@param skip_commit
This is @c true if the call to @c ha_commit_low should
be skipped (it is handled by the caller somehow) and @c
false otherwise (the normal case).
*/
int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit)
After reading these comments in the source, I honestly didn't feel the need to read any further. The process is roughly this:
When a transaction commits, the thread first acquires a global metadata lock (MDL), then the transaction logs in the buffer are flushed to disk in order: redo log first, then binlog. While this is happening the buffer is locked, so other transactions cannot write into it. Then a log commit lock is taken to guarantee that both the redo log and the binlog have reached disk (this flow has been written up with diagrams many times elsewhere, so I won't repeat it here).
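As a hedged aside, the flush and sync behavior in these stages is governed by a few standard server variables, which are worth checking whenever commits stall:

SHOW VARIABLES LIKE 'sync_binlog';                     -- fsync the binlog every N group commits
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';  -- redo-log flush policy at commit
SHOW VARIABLES LIKE 'binlog_group_commit_sync_delay';  -- MySQL 5.7+: wait to build larger groups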
It looks simple; implementing it is the painful part. First a global MDL is taken, and then a log commit lock on top of it (these names come from the server-layer source functions; the finer details still need further study, my apologies, my technical depth is limited).
Viewed against this case, everything immediately becomes clear.
Because this transaction was so large (the binlog it produced was about 13 GB), flushing the binlog cache to disk took a long time, and during that window the cache is locked, so new transactions had no way to write into their buffers (logs and data are both written to a buffer first and then flushed to disk according to the configured policy). In the end, every transaction either failed or hung (where a retry mechanism existed).
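A related, hedged check: a transaction that outgrows binlog_cache_size spills to a temporary file on disk, which makes the flush at commit even slower; standard counters show whether that happened:

SHOW GLOBAL STATUS LIKE 'Binlog_cache%';       -- compare Binlog_cache_disk_use vs Binlog_cache_use
SHOW VARIABLES LIKE 'binlog_cache_size';       -- per-session in-memory binlog cache
SHOW VARIABLES LIKE 'max_binlog_cache_size';   -- hard cap; a bigger transaction errors out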
One last point I haven't fully thought through: what exactly is the mechanism by which transactions are written into the buffer? There must be some rule governing it, but I can't see through it yet, so I'll leave this question open for now; as my research deepens, it should resolve itself.
Summary: a transaction that touches a large amount of data is best split into several smaller transactions. If it truly cannot be split, consider disabling binary logging for the session (set sql_log_bin=0). Redo and undo are still written, but that cost is far smaller, because the redo/undo write path works differently from the binlog's; I'll cover that separately when time permits.
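As a hedged illustration of the batching advice (archive_in_batches, big_table, history, and id are all hypothetical names; xxxx stands in for the original, unspecified filter), the copy can be driven over fixed primary-key ranges so that each COMMIT, and hence each binlog transaction, stays small:

-- Sketch: archive in id ranges of 10k rows so every transaction stays small
DELIMITER //
CREATE PROCEDURE archive_in_batches(IN max_id BIGINT)
BEGIN
  DECLARE cur BIGINT DEFAULT 0;
  WHILE cur < max_id DO
    INSERT INTO history
    SELECT * FROM big_table
     WHERE id > cur AND id <= cur + 10000
       AND xxxx;                       -- the original filter condition goes here
    COMMIT;                            -- one small binlog transaction per chunk
    SET cur = cur + 10000;
  END WHILE;
END //
DELIMITER ;
SET @m = (SELECT MAX(id) FROM big_table);
CALL archive_in_batches(@m);

-- Or, if the copy really cannot be split, skip the binlog for this session only
-- (warning: the statement is then not written to the binlog and not replicated):
SET sql_log_bin = 0;
insert into history select * from table where xxxx;
SET sql_log_bin = 1;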
Finally, a big thank-you to the expert at my side, 八怪! All of the source-code research here was done with his help. Many thanks!