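This is original work and my knowledge is limited; please point out any mistakes. It is intended only as a study reference and as study notes.
I recently went through the kill session part of Ding Qi's MySQL course. In day-to-day work we often use the KILL command to terminate sessions, and occasionally a session stays in the Killed state afterwards, so I felt it was worth studying how killing a session is actually implemented. Ding Qi's lesson conveniently provides the theoretical basis.
First, a brief look at the life cycle of a statement's execution.
As an example, take the following execution plan: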
mysql> desc select * from t1 where name='gaopeng';
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   14 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (1.67 sec)
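Two parties are involved here: a mysql client process and a mysqld server thread, which communicate over a socket. To kill a session we normally open a new mysql client process and connect to mysqld, and the server then has to start another thread to serve that connection and respond to the KILL command. Call the thread that runs the statement thread 1 and the thread that serves the KILL command thread 2.
What we need to study is how thread 2 acts on thread 1. Sharing memory between threads is easy; it is inherent in what a thread is. In MySQL the shared variable is THD::killed, which both thread 1 and thread 2 can access. The kill works because the code checks THD::killed at certain places. In outline, for this case: thread 2 sets THD::killed on thread 1's THD, thread 1 notices it at one of those checks, the error is passed up the call stack, and thread 1 leaves its command loop and terminates.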
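The shared-flag idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions: Session, run_until_killed, kill_session and demo_shared_flag are hypothetical names, and std::atomic<bool> is my choice for the sketch, not a claim about how THD::killed is declared in the server.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the per-session THD object; only the flag is modeled.
struct Session {
  std::atomic<bool> killed{false};  // plays the role of THD::killed
};

// "Thread 1": works in small steps and polls the flag between steps.
// Returns the number of steps completed before the kill was noticed.
long run_until_killed(Session& s) {
  long steps = 0;
  while (!s.killed.load()) {
    ++steps;  // one unit of work, for example one row
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return steps;
}

// "Thread 2": the session that issues KILL only writes the shared flag.
void kill_session(Session& s) { s.killed.store(true); }

// Runs both roles; returns true if the worker stopped after the flag was set.
bool demo_shared_flag() {
  Session s;
  long steps = -1;
  std::thread worker([&] { steps = run_until_killed(s); });
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  kill_session(s);
  worker.join();  // returns only because the worker saw the flag
  return s.killed.load() && steps >= 0;
}
```

Calling demo_shared_flag() from main is enough to try it. The point of the sketch is that the killer never touches the worker directly; the worker stops only when it next looks at the flag, which is why the placement of the checks matters.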
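The above describes the kill flow for a SELECT statement, but not every case looks like this. The cases covered in the rest of this article are: the target thread is executing a statement inside InnoDB, it is waiting for an InnoDB row lock, it is idle and blocked on a socket read, or it is waiting in the MySQL layer (sleep is used as the example).
Note that these cases describe the state of the thread being killed. The thread issuing the command has only one path, which is to call kill_one_thread. I describe each of them in detail below. For the wake-up operations, see the appendix; I assume readers are already familiar with them.
Here is the stack trace of the thread issuing the kill:
#0  THD::awake (this=0x7ffe7800e870, state_to_set=THD::KILL_CONNECTION) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:2206
#1  0x00000000015d5430 in kill_one_thread (thd=0x7ffe7c000b70, id=18, only_kill_query=false) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:6859
#2  0x00000000015d5548 in sql_kill (thd=0x7ffe7c000b70, id=18, only_kill_query=false) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:6887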
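kill_one_thread looks up the THD of the target session and calls THD::awake on it:
tmp= Global_THD_manager::get_instance()->find_thd(&find_thd_with_id); // get the THD structure of the session to be killed
tmp->awake(only_kill_query ? THD::KILL_QUERY : THD::KILL_CONNECTION); // call THD::awake; in our case the state is THD::KILL_CONNECTION
THD::awake sets the kill state and shuts down the socket, so the client of the killed session reports:
ERROR 2013 (HY000): Lost connection to MySQL server during query
It then interrupts a target that is waiting inside the storage engine (InnoDB) and finally performs a wake-up. Why the wake-up is needed is covered later. The code is as follows:
killed= state_to_set;                 // set THD::killed to KILL_CONNECTION
vio_cancel(active_vio, SHUT_RDWR);    // shut down the socket; the client connection is closed as a result
/* Interrupt target waiting inside a storage engine. */
if (state_to_set != THD::NOT_KILLED)
  ha_kill_connection(this);           // leads to lock_trx_handle_wait
mysql_mutex_lock(current_mutex);
mysql_cond_broadcast(current_cond);   // perform the wake-up
mysql_mutex_unlock(current_mutex);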
When the target thread is executing a statement inside InnoDB, the kill takes effect through return-value checks at suitable places in the code, as in this stack trace:
#0  convert_error_code_to_mysql (error=DB_INTERRUPTED, flags=33, thd=0x7ffe74012f30)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:2064
#1  0x00000000019d651e in ha_innobase::general_fetch (this=0x7ffe7493c960, buf=0x7ffe7493cea0 "\377", direction=1, match_mode=0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:9907
#2  0x00000000019d658b in ha_innobase::index_next (this=0x7ffe7493c960, buf=0x7ffe7493cea0 "\377")
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:9929
The relevant code can be found in ha_innobase::general_fetch:
default:
error = convert_error_code_to_mysql(ret, m_prebuilt->table->flags, m_user_thd);
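If ret equals DB_INTERRUPTED here, the thread enters its exit logic; we look at the details later.
DB_INTERRUPTED represents the "terminated because killed" state. It is set by code like the following, which is the so-called "check point" (埋点 in Ding Qi's wording):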
if (trx_is_interrupted(prebuilt->trx)) {
ret = DB_INTERRUPTED;
trx_is_interrupted itself is very simple:
return(trx && trx->mysql_thd && thd_killed(trx->mysql_thd));
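And thd_killed is:
extern "C" int thd_killed(const MYSQL_THD thd)
{
  if (thd == NULL)
    return current_thd != NULL ? current_thd->killed : 0;
  return thd->killed; // returns THD::killed
}
So thd->killed is exactly the THD::killed that the thread issuing the kill set to THD::KILL_CONNECTION earlier. The error is returned layer by layer and finally causes the handle_connection loop to end, at which point the termination flow begins.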
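The way a check point turns a flag into an error that travels up the call stack can be sketched as follows. This is a simplified model, not InnoDB code: Err, Session, fetch_next_row, run_scan and serve are hypothetical names, and the kill is simulated inside the scan loop so that the example stays single-threaded and deterministic.

```cpp
#include <atomic>
#include <cassert>

enum class Err { Ok, EndOfData, Interrupted };  // hypothetical error codes

struct Session { std::atomic<bool> killed{false}; };  // stand-in for THD::killed

// Engine-level fetch: the "check point" sits at the top of each row fetch.
Err fetch_next_row(const Session& s, int& cursor, int total_rows) {
  if (s.killed.load()) return Err::Interrupted;  // analogous to ret = DB_INTERRUPTED
  if (cursor >= total_rows) return Err::EndOfData;
  ++cursor;
  return Err::Ok;
}

// Statement-level loop: any error other than end-of-data is passed upward.
// kill_at_row < 0 means that no kill arrives during this statement.
Err run_scan(Session& s, int total_rows, int kill_at_row, int& rows_read) {
  int cursor = 0;
  rows_read = 0;
  for (;;) {
    if (rows_read == kill_at_row) s.killed.store(true);  // simulate a KILL arriving now
    Err e = fetch_next_row(s, cursor, total_rows);
    if (e == Err::EndOfData) return Err::Ok;
    if (e != Err::Ok) return e;
    ++rows_read;
  }
}

// Connection-level loop: keeps serving statements until one of them fails.
// The third statement is the one that gets killed, after 5 rows.
int serve(Session& s) {
  int served = 0;
  int rows = 0;
  while (run_scan(s, 14, served == 2 ? 5 : -1, rows) == Err::Ok) ++served;
  return served;
}
```

In this model serve() completes two statements and stops on the third, which mirrors how the error reaching the top ends the command loop. The scan stops at the first fetch after the flag is set, not at the instant it is set.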
When the target thread is waiting for an InnoDB row lock, the situation is similar: the thread's THD::killed still has to be checked for THD::KILL_CONNECTION. But a thread blocked in pthread_cond_wait can only continue if some other thread wakes it up; otherwise it never reaches the check. First, the stack trace of the waiting thread:
#0  0x00007ffff7bca68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000001ab1d35 in os_event::wait (this=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/include/os0event.h:156
#2  0x0000000001ab167d in os_event::wait_low (this=0x7ffe74011f18, reset_sig_count=2) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/os/os0event.cc:131
#3  0x0000000001ab1aa6 in os_event_wait_low (event=0x7ffe74011f18, reset_sig_count=0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/os/os0event.cc:328
#4  0x0000000001a7305f in lock_wait_suspend_thread (thr=0x7ffe74005190) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:387
#5  0x0000000001b391fc in row_mysql_handle_errors (new_err=0x7fffec091c4c, trx=0x7fffd78045f0, thr=0x7ffe74005190, savept=0x0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/row/row0mysql.cc:1312
#6  0x0000000001b7c2ea in row_search_mvcc (buf=0x7ffe74010160 "\377", mode=PAGE_CUR_G, prebuilt=0x7ffe74004a20, match_mode=0, direction=0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/row/row0sel.cc:6318
#7  0x00000000019d5443 in ha_innobase::index_read (this=0x7ffe7400e280, buf=0x7ffe74010160 "\377", key_ptr=0x0, key_len=0, find_flag=HA_READ_AFTER_KEY) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:9536
Some thread has to wake it up. Note that this is a wake-up at the InnoDB layer, which is not the same thing as the MySQL-layer wake-up mentioned earlier (described later). So who wakes it? Setting a breakpoint on the wake-up path gives the following stack, which stops in os_event::broadcast:
#0  os_event::broadcast (this=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/include/os0event.h:166
#1  0x0000000001ab1be8 in os_event::set (this=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/include/os0event.h:61
#2  0x0000000001ab1a3a in os_event_set (event=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/os/os0event.cc:277
#3  0x0000000001a73460 in lock_wait_release_thread_if_suspended (thr=0x7ffe70013360) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:491
#4  0x0000000001a6a80d in lock_cancel_waiting_and_release (lock=0x30b1938) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0lock.cc:6896
#5  0x0000000001a736a6 in lock_wait_check_and_cancel (slot=0x7fff0060a2a0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:539
#6  0x0000000001a7383d in lock_wait_timeout_thread (arg=0x0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:599
#7  0x00007ffff7bc6aa1 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff6719bcd in clone () from /lib64/libc.so.6
A quick look at the code of lock_wait_check_and_cancel shows the following:
if (trx_is_interrupted(trx)
    || (slot->wait_timeout < 100000000
        && (wait_time > (double) slot->wait_timeout
            || wait_time < 0))) {
        /* Timeout exceeded or a wrap-around in system
        time counter: cancel the lock request queued
        by the transaction and release possible
        other transactions waiting behind; it is
        possible that the lock has already been
        granted: in that case do nothing */
        lock_mutex_enter();
        trx_mutex_enter(trx);
        if (trx->lock.wait_lock != NULL && !trx_is_high_priority(trx)) {
                ut_a(trx->lock.que_state == TRX_QUE_LOCK_WAIT);
                lock_cancel_waiting_and_release(trx->lock.wait_lock);
        }
        lock_mutex_exit();
        trx_mutex_exit(trx);
}
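The key point is that trx_is_interrupted checks THD::killed, which here is THD::KILL_CONNECTION. The same thread also does the wake-up for InnoDB row lock wait timeouts. It shows up in the thread list like this: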
| 35 | 4036 | innodb/srv_lock_timeout_thread | NULL | BACKGROUND | NULL | NULL |
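The division of labor seen in the stack traces above (a waiter that blocks without a timeout of its own, and a background thread that periodically checks for "killed or timed out" and then sets the event) can be sketched as below. This is a rough model only: WaitSlot, suspend_on_lock, check_and_cancel and demo_lock_wait_kill are hypothetical names, the 10 ms polling interval is arbitrary, and std::condition_variable stands in for InnoDB's os_event.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical wait slot: one per thread suspended on a row lock.
struct WaitSlot {
  std::mutex m;
  std::condition_variable event;    // stands in for the os_event
  bool released = false;            // set by the monitor when it cancels the wait
  std::atomic<bool> killed{false};  // stands in for THD::killed
  Clock::time_point suspended_at = Clock::now();
  std::chrono::milliseconds wait_timeout{50000};
};

// Waiter: blocks with no timeout of its own; another thread must release it.
void suspend_on_lock(WaitSlot& slot) {
  std::unique_lock<std::mutex> lk(slot.m);
  slot.event.wait(lk, [&] { return slot.released; });
}

// Monitor: one pass over a slot; cancels the wait if killed or timed out.
bool check_and_cancel(WaitSlot& slot) {
  bool timed_out = Clock::now() - slot.suspended_at > slot.wait_timeout;
  if (!slot.killed.load() && !timed_out) return false;
  {
    std::lock_guard<std::mutex> lk(slot.m);
    slot.released = true;
  }
  slot.event.notify_all();
  return true;
}

// Returns true if the waiter was still blocked before the kill and released after it.
bool demo_lock_wait_kill() {
  WaitSlot slot;
  std::atomic<bool> done{false};
  std::thread waiter([&] { suspend_on_lock(slot); done = true; });
  std::thread monitor([&] {
    while (!check_and_cancel(slot))
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(30));
  bool still_waiting = !done.load();
  slot.killed = true;  // the KILL: only the flag is set here
  waiter.join();
  monitor.join();
  return still_waiting && done.load();
}
```

In this model the kill itself wakes nobody; the waiter is released on the monitor's next pass, so the delay before a kill takes effect is bounded by the polling interval.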
For a statement that is in the middle of executing and needs to be rolled back, the rollback is done afterwards, as follows:
if (thd->is_error() || (thd->variables.option_bits & OPTION_MASTER_SQL_ERROR))
trans_rollback_stmt(thd);
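To sum up, InnoDB uses what Ding Qi calls "check points" (埋点) to decide whether a thread has been killed. Each check point tests whether THD::killed has been set (here to THD::KILL_CONNECTION). The checks are periodic, because the flag cannot be tested after every line of code. In practice, searching the whole code base for the places that set ret = DB_INTERRUPTED; locates the check points in the InnoDB layer.
The idle case is simpler. An idle target thread stays blocked on a socket read. Because the thread issuing the kill shuts down the socket, the target thread notices immediately. The following is an excerpt from net_read_raw_loop: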
/* On failure, propagate the error code. */
if (count)
{
  /* Socket should be closed. */
  net->error= 2;
  /* Interrupted by a timeout? */
  if (!eof && vio_was_timeout(net->vio))
    net->last_errno= ER_NET_READ_INTERRUPTED;
  else
    net->last_errno= ER_NET_READ_ERROR;
#ifdef MYSQL_SERVER
  my_error(net->last_errno, MYF(0)); // triggered here
#endif
}
This ends the handle_connection loop and the termination flow begins. In this case the rollback is done by clean_up inside release_resources.
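The effect of shutting down a socket on a thread blocked reading from it can be tried with plain POSIX calls. The sketch below is an assumption-laden illustration, not server code: it uses a local socketpair instead of a TCP connection, demo_shutdown_wakes_reader is a hypothetical name, and the result described is what I would expect on Linux; other platforms may behave differently.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Returns what the blocked read() returned after another thread shut the socket down.
// Returns -2 if the socket pair could not be created.
long demo_shutdown_wakes_reader() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -2;

  long read_result = -3;
  // "Idle session": blocked waiting for the next command from its client.
  std::thread idle([&] {
    char buf[16];
    read_result = static_cast<long>(read(fds[0], buf, sizeof buf));
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  // "Killer": shuts down the idle session's socket in both directions,
  // the same kind of call as vio_cancel(active_vio, SHUT_RDWR) in the text.
  shutdown(fds[0], SHUT_RDWR);

  idle.join();  // returns because the read was woken up
  close(fds[0]);
  close(fds[1]);
  return read_result;
}
```

On Linux the read should return 0 (end of stream) as soon as shutdown is called. That corresponds to the eof branch of the excerpt above, where last_errno becomes ER_NET_READ_ERROR.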
Recall that the thread issuing the kill performs a wake-up at the end of THD::awake. This matters when the target thread is waiting in the MySQL layer: just as with an InnoDB row lock wait, without a wake-up the code cannot move forward and never reaches a check point. I use sleep as the example. First, the logic of sleep. Item_func_sleep::val_int contains the following code:
timed_cond.set_timeout((ulonglong) (timeout * 1000000000.0)); // store the sleep duration in the timed_cond structure
mysql_cond_init(key_item_func_sleep_cond, &cond); // pthread_cond_init: initialize cond
mysql_mutex_lock(&LOCK_item_func_sleep); // pthread_mutex_lock on the LOCK_item_func_sleep mutex
thd->ENTER_COND(&cond, &LOCK_item_func_sleep, &stage_user_sleep, NULL); // THD::enter_cond; #define ENTER_COND(C, M, S, O) enter_cond(C, M, S, O, __func__, __FILE__, __LINE__)
// This step hands cond to the THD, so other threads can get hold of it and wake this thread; KILL relies on this condition variable for the wake-up
DEBUG_SYNC(current_thd, "func_sleep_before_sleep");
error= 0;
thd_wait_begin(thd, THD_WAIT_SLEEP);
while (!thd->killed)
{
  error= timed_cond.wait(&cond, &LOCK_item_func_sleep); // the wait that implements sleep: it calls pthread_cond_timedwait underneath and can be woken through the condition variable
  if (error == ETIMEDOUT || error == ETIME)
    break;
  error= 0;
}
Let us verify this. Here is the stack trace of the sleeping thread:
[Switching to Thread 0x7fffec064700 (LWP 4738)]
#0  THD::enter_cond (this=0x7ffe70000950, cond=0x7fffec061510, mutex=0x2e4d6a0, stage=0x2d8b630, old_stage=0x0, src_function=0x1f2598c "val_int",
    src_file=0x1f232e8 "/root/mysqlc/percona-server-locks-detail-5.7.22/sql/item_func.cc", src_line=6057)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.h:3395
#1  0x00000000010265d8 in Item_func_sleep::val_int (this=0x7ffe70006210) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/item_func.cc:6057
#2  0x0000000000fafea5 in Item::send (this=0x7ffe70006210, protocol=0x7ffe70001c68, buffer=0x7fffec0619b0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/item.cc:7564
#3  0x000000000156b10c in THD::send_result_set_row (this=0x7ffe70000950, row_items=0x7ffe700055d8)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:5026
#4  0x0000000001565708 in Query_result_send::send_data (this=0x7ffe700063a8, items=...) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:2932
Note the address of the structure, cond=0x7fffec061510. It is eventually stored in the THD so that other threads can get hold of it. Now look at the address of the condition variable that THD::awake broadcasts on:
[Switching to Thread 0x7fffec0f7700 (LWP 4051)]
Breakpoint 2, THD::awake (this=0x7ffe70000950, state_to_set=THD::KILL_CONNECTION) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:2206
......
(gdb) n
2288 mysql_cond_broadcast(current_cond);
(gdb) p current_cond
$6 = (mysql_cond_t * volatile) 0x7fffec061510
It is also 0x7fffec061510, so the two are the same condition variable, which shows that THD::awake is indeed what wakes our sleep. The code can then continue and reach a check point, and finally the handle_connection loop ends and the termination flow begins.
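The pattern just verified (the sleeper publishes the condition variable it is about to wait on, and the killer sets the flag and then broadcasts on whatever is published) can be sketched with the C++ standard library. This is a reduced model under my own assumptions: Session, interruptible_sleep, awake and demo_kill_sleep are hypothetical names, and the extra registry mutex is a device of this sketch for keeping the published pointers valid, not a description of the locking in the server.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical, heavily reduced THD: the kill flag plus the published wait objects.
struct Session {
  std::atomic<bool> killed{false};
  std::mutex registry;  // protects the two pointers below
  std::mutex* current_mutex = nullptr;
  std::condition_variable* current_cond = nullptr;
};

// Like SLEEP(): returns true if interrupted by a kill, false if the timeout elapsed.
bool interruptible_sleep(Session& s, std::chrono::milliseconds timeout) {
  std::mutex m;
  std::condition_variable cond;
  {  // analogous to ENTER_COND: publish what we are about to wait on
    std::lock_guard<std::mutex> g(s.registry);
    s.current_mutex = &m;
    s.current_cond = &cond;
  }
  bool interrupted;
  {
    std::unique_lock<std::mutex> lk(m);
    // Timed wait that re-checks the flag on every wake-up.
    interrupted = cond.wait_for(lk, timeout, [&] { return s.killed.load(); });
  }
  {  // unpublish before cond and m go out of scope
    std::lock_guard<std::mutex> g(s.registry);
    s.current_mutex = nullptr;
    s.current_cond = nullptr;
  }
  return interrupted;
}

// Like THD::awake: set the flag first, then broadcast on whatever the target waits on.
void awake(Session& s) {
  s.killed.store(true);
  std::lock_guard<std::mutex> g(s.registry);
  if (s.current_cond != nullptr) {
    std::lock_guard<std::mutex> lk(*s.current_mutex);
    s.current_cond->notify_all();
  }
}

// Starts a 30 second sleep and kills it after 50 ms; returns true if it was interrupted.
bool demo_kill_sleep() {
  Session s;
  bool interrupted = false;
  std::thread sleeper([&] { interrupted = interruptible_sleep(s, std::chrono::seconds(30)); });
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  awake(s);
  sleeper.join();
  return interrupted;
}
```

Setting the flag before broadcasting, and broadcasting while holding the waiter's mutex, is what prevents a lost wake-up in this sketch: the sleeper either sees the flag before it starts waiting or is already waiting when the broadcast arrives.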
In the end the loop in handle_connection meets its exit condition and the connection termination logic runs, as follows:
{
  while (thd_connection_alive(thd))
  {
    if (do_command(thd))
      break;
  }
  end_connection(thd);
}
close_connection(thd, 0, false, false);
thd->get_stmt_da()->reset_diagnostics_area();
thd->release_resources();
.....
thd_manager->remove_thd(thd); // removed from the THD list here; only after this is the thread in the KILLED state gone
Connection_handler_manager::dec_connection_count(extra_port_connection);
....
delete thd;
if (abort_loop) // Server is shutting down so end the pthread.
  break;
channel_info= Per_thread_connection_handler::block_until_new_connection();
if (channel_info == NULL)
  break;
pthread_reused= true;
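We can see that the thread passes through several functions, end_connection, get_stmt_da()->reset_diagnostics_area() and release_resources, and then arrives at thd_manager->remove_thd(thd); after that the thread can be reused for a new connection. In fact the entry stays visible in show processlist until release_resources has finished. This can be verified by modifying the code and adding sleep(10) before and after the call to release_resources:
sleep(10);
thd->release_resources();
sleep(10);
The test results are as follows: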
mysql> show processlist ; kill 31;kill 33;kill 35;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
|  7 | root | localhost | NULL | Query   |    0 | starting | show processlist |         0 |             0 |
| 31 | root | localhost | NULL | Sleep   |   35 |          | NULL             |         1 |             0 |
| 33 | root | localhost | NULL | Sleep   |   32 |          | NULL             |         1 |             0 |
| 35 | root | localhost | NULL | Sleep   |   29 |          | NULL             |         1 |             0 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
mysql> show processlist ;
+----+------+-----------+------+---------+------+-------------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State       | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+-------------+------------------+-----------+---------------+
|  7 | root | localhost | NULL | Query   |    0 | starting    | show processlist |         0 |             0 |
| 31 | root | localhost | NULL | Killed  |   44 | cleaning up | NULL             |         1 |             0 |
| 33 | root | localhost | NULL | Killed  |   41 | cleaning up | NULL             |         1 |             0 |
| 35 | root | localhost | NULL | Killed  |   38 | cleaning up | NULL             |         1 |             0 |
+----+------+-----------+------+---------+------+-------------+------------------+-----------+---------------+
4 rows in set (0.02 sec)
mysql> show processlist ;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
|  7 | root | localhost | NULL | Query   |    0 | starting | show processlist |         0 |             0 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
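The Killed state disappears only after roughly 10 seconds, and it does not last for 20 seconds, which confirms that the Killed thread disappears from show processlist once this step has completed.
I have not studied thd->release_resources() in detail; what the test shows is that once this function has finished, the Killed thread no longer appears in show processlist. Readers who are interested can dig deeper. I have also run into many cases in Oracle where kill session leaves the session marked as killed, and I believe the principle is much the same. The next time you meet a thread in the Killed state there is no need to panic. Check whether the thread is actually doing work, mainly whether it is consuming CPU. If it is not doing any work, a bug is a possibility, for example a wait on some mutex (as in the example above).
Appendix: some function interfaces for POSIX multithreaded programming
See my article: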
http://blog.itpub.net/7728585/viewspace-2139638/
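Part of it is excerpted here:
In thread synchronization we often use a mutex. Its purpose is to protect a critical section, so that concurrent operations by multiple threads do not corrupt the data in it.
A POSIX mutex is an advisory lock: shared data can still be accessed without taking the mutex, but doing so may be unsafe.
Its primitives include pthread_mutex_init, pthread_mutex_lock, pthread_mutex_unlock and pthread_mutex_destroy.
A condition variable (cond) expresses that when some condition is not met, the current thread should give up the lock and block itself. The typical example is the
producer-consumer problem: if the producer has not produced anything yet, the consumer should not consume; it should give up the lock and block
itself until the condition is met and the producer wakes it up. Its primitives include pthread_cond_init, pthread_cond_wait, pthread_cond_timedwait, pthread_cond_signal, pthread_cond_broadcast and pthread_cond_destroy.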
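The producer-consumer case described above can be written with the POSIX calls directly. This is a minimal sketch: Channel, producer, consume_five and demo_producer_consumer are hypothetical names chosen for the illustration, and error checking of the pthread calls is omitted for brevity.

```cpp
#include <cassert>
#include <pthread.h>
#include <queue>

// Shared state protected by `lock`; `not_empty` signals "the queue has items".
struct Channel {
  pthread_mutex_t lock;
  pthread_cond_t not_empty;
  std::queue<int> items;
};

// Producer thread: pushes 1..5, signalling the consumer after each push.
void* producer(void* arg) {
  Channel* ch = static_cast<Channel*>(arg);
  for (int i = 1; i <= 5; ++i) {
    pthread_mutex_lock(&ch->lock);
    ch->items.push(i);
    pthread_cond_signal(&ch->not_empty);
    pthread_mutex_unlock(&ch->lock);
  }
  return nullptr;
}

// Consumer: takes exactly five items and returns their sum.
int consume_five(Channel* ch) {
  int sum = 0;
  for (int n = 0; n < 5; ++n) {
    pthread_mutex_lock(&ch->lock);
    // pthread_cond_wait releases the mutex while blocked and re-acquires it on
    // wake-up; the loop re-checks the condition to cope with spurious wake-ups.
    while (ch->items.empty())
      pthread_cond_wait(&ch->not_empty, &ch->lock);
    sum += ch->items.front();
    ch->items.pop();
    pthread_mutex_unlock(&ch->lock);
  }
  return sum;
}

// Wires the two sides together and returns the consumer's sum.
int demo_producer_consumer() {
  Channel ch;
  pthread_mutex_init(&ch.lock, nullptr);
  pthread_cond_init(&ch.not_empty, nullptr);
  pthread_t tid;
  pthread_create(&tid, nullptr, producer, &ch);
  int sum = consume_five(&ch);
  pthread_join(tid, nullptr);
  pthread_cond_destroy(&ch.not_empty);
  pthread_mutex_destroy(&ch.lock);
  return sum;
}
```

The while loop around pthread_cond_wait has the same shape as the while (!thd->killed) loop in the sleep code earlier: after every wake-up the waiter re-tests the condition it actually cares about.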
Author's WeChat: gp_22389860
From "ITPUB博客", link: http://blog.itpub.net/7728585/viewspace-2565280/. If you repost this article, please credit the source; otherwise legal action may be taken.