Waiting for table metadata lock 这个mdl锁,我们最常见,这篇先拿它开刀。
1 长查询/mysqldump 阻塞DDL
session1 执行大的查询时候(select * from sbtest10 order by k desc;).
session2提交了一个对session中一张表的alter操作(alter table sbtest10 add column dd1111d int not null;).这时候session2会一直被阻塞,直到session1的查询完成。
session2未完成的这个时间段内, 执行show processlist 就会显示 Waiting for table metadata lock
session3 这时候又发起了一个非常简单的查询(select * from sbtest10 limit 3;),可以看到也是被阻塞的状态的。
> show full processlist; +--------+---------+------------+--------------------+---------+------+---------------------------------+------------------------------------------------------+-----------+ | Id | User | Host | db | Command | Time | State | Info | Rows_sent | +--------+---------+------------+--------------------+---------+------+---------------------------------+------------------------------------------------------+-----------+ | 115950 | root | localhost | performance_schema | Query | 0 | starting | show full processlist | 0 | | 117002 | root | localhost | test | Query | 2 | Waiting for table metadata lock | alter table sbtest10 add column dd1111d int not null | 0 | | 117027 | root | localhost | test | Query | 2 | executing | select * from sbtest10 order by k desc | 0 | | 115950 | root | localhost | test | Query | 6 | Waiting for table metadata lock | select * from sbtest10 limit 3 | 0 | +--------+---------+------------+--------------------+---------+------+---------------------------------+------------------------------------------------------+-----------+
会话1先执行select , 会话2后执行alter。
在会话1执行完毕前,会话2拿不到MDL锁,从表格上面来看,主要阻塞在rename阶段。会话1在执行完毕后,会话2拿到MDL锁,变为rename table状态,这个操作持续时间非常短,会话1再次执行查询,当会话2执行完后,此时会话1正常执行。这说明对于MDL锁而言,select会阻塞alter,而alter不会阻塞select。在rename的瞬间,alter是会阻塞select的.
当执行select语句时,只要select语句在获取MDL_SHARED_READ锁之前,alter没有执行到rename阶段,那么select获取MDL_SHARED_READ锁成功,后续有alter执行到rename阶段,请求MDL_EXCLUSIVE锁时,就会被阻塞。rename阶段会持有MDL_EXCLUSIVE锁,但由于这个过程时间非常短(大头都在copy数据阶段),并且是alter的最后一个阶段,所以基本感觉不到alter会阻塞select语句。由于MDL锁在事务提交后才释放,若线上存在大查询,或者存在未提交的事务,则会出现ddl卡住的现象。这里要注意的是,ddl卡住后,若再有select查询或DML进来,都会被堵住,就会出现thread running飙高的情况。
#### 注意: 对于MDL锁而言,select会阻塞alter,而alter加锁后DDL运行起来后会立马释放掉锁(这个加/释放锁的时间开销非常短),这时候是不会阻塞select。
2 未提交的事务阻塞 DDL
此时执行 show processlist; 如下:
+-----+------+-----------+---------+---------+------+---------------------------------+------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +-----+------+-----------+---------+---------+------+---------------------------------+------------------------------------+ | 143 | root | localhost | testdb | Query | 223 | Waiting for table metadata lock | alter table t add column c3 int | ---> session2 | 145 | root | localhost | testdb | Query | 4 | Waiting for table metadata lock | select count(*) from t | ---> session3 +-----+------+-----------+---------+---------+------+---------------------------------+------------------------------------+
session1 对表t进行update操作,存在未提交的事务,故一直持有 MDL_SHARED_WRITE锁,由于没有执行COMMIT,会一直持有。
session2 DDL 操作会请求 TABLE-TRANSACTION-EXCLUSIVE 锁,该锁与session1 的MDL_SHARED_WRITE 锁互斥,故 session2 的DDL 等待;session3的查询操作会请求TABLE- TRANSACTION- MDL_SHARED_READ锁,虽然MDL_SHARED_READ与活跃锁MDL_SHARED_WRITE不冲突,但是与session2的等待锁EXCLUSIVE冲突,因此也会等待。
解决该中场景的问题比较麻烦,但从show processlist 不能检查出哪个会话持有锁 。可以从两个方面进行调查
a 查询 select * from information_schema.innodb_trx\G
b 检查 show engine innodb status \G 查看里面transaction的ACTIVE的时间长度
active N sec 说明事务持续了N秒,一般而言超过10秒的事务都是有问题的。找到了活动的事务之后,要和开发沟通看看能否直接kill这个会话?
3 第1种情况的特例,存在一个查询失败的语句,比如查询不存在的列,语句失败返回,但是事务没有提交,此时alter仍然会被堵住。
通过show processlist看不到TableA上有任何操作,在information_schema.innodb_trx中也没有任何进行中的事务。这很可能是因为在一个显式的事务中,对TableA进行了一个失败的操作(比如查询了一个不存在的字段),这时事务没有开始,但是失败语句获取到的锁依然有效。从select * from performance_schema.events_statements_current\G 表中可以查到失败的语句。
官方手册上对此的说明如下:
If the server acquires metadata locks for a statement that is syntactically valid but fails during execution, it does not release the locks early. Lock release is still deferred to the end of the transaction because the failed statement is written to the binary log and the locks protect log consistency.
也就是说除了语法错误,其他错误语句获取到的锁在这个事务提交或回滚之前,仍然不会释放掉。because the failed statement is written to the binary log and the locks protect log consistency 但是解释这一行为的原因很难理解,因为错误的语句根本不会被记录到二进制日志。
查看锁情况:
普通版(不推荐使用):
> select * from sys.schema_table_lock_waits\G ***************************[ 1. row ]*************************** object_schema | sbtest object_name | sbtest3 waiting_thread_id | 746 waiting_pid | 244 waiting_account | root@localhost waiting_lock_type | EXCLUSIVE waiting_lock_duration | TRANSACTION waiting_query | alter table sbtest3 add column ccc22 int waiting_query_secs | 505 waiting_query_rows_affected | 0 waiting_query_rows_examined | 0 blocking_thread_id | 746 blocking_pid | 244 blocking_account | root@localhost blocking_lock_type | SHARED_UPGRADABLE blocking_lock_duration | TRANSACTION sql_kill_blocking_query | KILL QUERY 244 sql_kill_blocking_connection | KILL 244 ***************************[ 2. row ]*************************** object_schema | sbtest object_name | sbtest3 waiting_thread_id | 746 waiting_pid | 244 waiting_account | root@localhost waiting_lock_type | EXCLUSIVE waiting_lock_duration | TRANSACTION waiting_query | alter table sbtest3 add column ccc22 int waiting_query_secs | 505 waiting_query_rows_affected | 0 waiting_query_rows_examined | 0 blocking_thread_id | 100 blocking_pid | 18 blocking_account | root@localhost blocking_lock_type | SHARED_READ blocking_lock_duration | TRANSACTION sql_kill_blocking_query | KILL QUERY 18 sql_kill_blocking_connection | KILL 18
可以看到,我们实际上只有一个alter table操作,这里 sys库查出的却是两条记录,而且两条记录的kill对象竟然还不一样,对表结构不熟悉及不仔细看记录内容的话,难免会kill错对象。
不仅如此,如果有N个查询被DDL操作堵塞,则会产生N*2条记录。在阻塞操作较多的情况下,这N*2条记录完全是个噪音。
加强版(步骤稍繁琐些,但是也可以用):
> select object_type,object_schema,object_name,lock_type,lock_duration,lock_status,owner_thread_id from performance_schema.metadata_locks; +---------------+--------------------+----------------+---------------------+---------------+-------------+-----------------+ | object_type | object_schema | object_name | lock_type | lock_duration | lock_status | owner_thread_id | +---------------+--------------------+----------------+---------------------+---------------+-------------+-----------------+ | GLOBAL || | INTENTION_EXCLUSIVE | STATEMENT | GRANTED | 746 | | BACKUP LOCK | | | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | 746 | | SCHEMA | sbtest | | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | 746 | | TABLE | sbtest | sbtest3 | SHARED_UPGRADABLE | TRANSACTION | GRANTED | 746 | | BACKUP TABLES | | | INTENTION_EXCLUSIVE | STATEMENT | GRANTED | 746 | | TABLESPACE | | sbtest/sbtest3 | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | 746 | | TABLE | sbtest | #sql-41f69_f4 | EXCLUSIVE | STATEMENT | GRANTED | 746 | | TABLE | sbtest | sbtest3 | EXCLUSIVE | TRANSACTION | PENDING | 746 | | TABLE | sbtest | sbtest3 | SHARED_READ | TRANSACTION | GRANTED | 100 | | TABLE | performance_schema | metadata_locks | SHARED_READ | TRANSACTION | GRANTED | 810 | +---------------+--------------------+----------------+---------------------+---------------+-------------+-----------------+
这里,重点关注 lock_status , "PENDING" 代表线程在等待MDL , 而"GRANTED"则代表线程持有MDL。
> select * from performance_schema.threads where thread_id IN (746,100 )\G ***************************[ 1. row ]*************************** THREAD_ID | 100 NAME | thread/sql/one_connection TYPE | FOREGROUND PROCESSLIST_ID | 18 PROCESSLIST_USER | root PROCESSLIST_HOST | localhost PROCESSLIST_DB | sbtest PROCESSLIST_COMMAND | Sleep PROCESSLIST_TIME | 610 PROCESSLIST_STATE |PROCESSLIST_INFO | show full processlist -- 这是前一个异常的连接,最后执行的sql。一般我们把这个连接干掉,就可以解决掉mdl阻塞的问题。 PARENT_THREAD_ID | ROLE | INSTRUMENTED | YES HISTORY | YES CONNECTION_TYPE | Socket THREAD_OS_ID | 270396 RESOURCE_GROUP | USR_default ***************************[ 2. row ]*************************** THREAD_ID | 746 NAME | thread/sql/one_connection TYPE | FOREGROUND PROCESSLIST_ID | 244 PROCESSLIST_USER | root PROCESSLIST_HOST | localhost PROCESSLIST_DB | sbtest PROCESSLIST_COMMAND | Query PROCESSLIST_TIME | 502 PROCESSLIST_STATE | Waiting for table metadata lock PROCESSLIST_INFO | alter table sbtest3 add column ccc22 int PARENT_THREAD_ID | ROLE | INSTRUMENTED | YES HISTORY | YES CONNECTION_TYPE | Socket THREAD_OS_ID | 271454 RESOURCE_GROUP | USR_default
终极版(一步到位,直接找到要杀掉的连接id) :
SELECT a.OBJECT_SCHEMA AS locked_schema, a.OBJECT_NAME AS locked_table, "Metadata Lock" AS locked_type, c.PROCESSLIST_ID AS waiting_processlist_id, c.PROCESSLIST_TIME AS waiting_age, c.PROCESSLIST_INFO AS waiting_query, c.PROCESSLIST_STATE AS waiting_state, d.PROCESSLIST_ID AS blocking_processlist_id, d.PROCESSLIST_TIME AS blocking_age, d.PROCESSLIST_INFO AS blocking_query, concat('KILL ', d.PROCESSLIST_ID) AS sql_kill_blocking_connection FROM performance_schema.metadata_locks a JOIN performance_schema.metadata_locks b ON a.OBJECT_SCHEMA = b.OBJECT_SCHEMA AND a.OBJECT_NAME = b.OBJECT_NAME AND a.lock_status = 'PENDING' AND b.lock_status = 'GRANTED' AND a.OWNER_THREAD_ID <> b.OWNER_THREAD_ID AND a.lock_type = 'EXCLUSIVE' JOIN performance_schema.threads c ON a.OWNER_THREAD_ID = c.THREAD_ID JOIN performance_schema.threads d ON b.OWNER_THREAD_ID = d.THREAD_ID \G
*************************** 1. row ***************************
locked_schema: sbtest
locked_table: sbtest3
locked_type: Metadata Lock
waiting_processlist_id: 244
waiting_age: 2474
waiting_query: alter table sbtest3 add column ccc22 int
waiting_state: Waiting for table metadata lock
blocking_processlist_id: 18
blocking_age: 2582
blocking_query: show full processlist
sql_kill_blocking_connection: KILL 18 --- 我们杀掉这个连接,就可以解决这里的MDL锁问题了
1 row in set (0.00 sec)
暴击版(直接把造成mdl的会话的执行过的sql全部输出来):
SELECT locked_schema, locked_table, locked_type, waiting_processlist_id, waiting_age, waiting_query, waiting_state, blocking_processlist_id, blocking_age, substring_index(sql_text,"transaction_begin;" ,-1) AS blocking_query, sql_kill_blocking_connection FROM ( SELECT b.OWNER_THREAD_ID AS granted_thread_id, a.OBJECT_SCHEMA AS locked_schema, a.OBJECT_NAME AS locked_table, "Metadata Lock" AS locked_type, c.PROCESSLIST_ID AS waiting_processlist_id, c.PROCESSLIST_TIME AS waiting_age, c.PROCESSLIST_INFO AS waiting_query, c.PROCESSLIST_STATE AS waiting_state, d.PROCESSLIST_ID AS blocking_processlist_id, d.PROCESSLIST_TIME AS blocking_age, d.PROCESSLIST_INFO AS blocking_query, concat('KILL ', d.PROCESSLIST_ID) AS sql_kill_blocking_connection FROM performance_schema.metadata_locks a JOIN performance_schema.metadata_locks b ON a.OBJECT_SCHEMA = b.OBJECT_SCHEMA AND a.OBJECT_NAME = b.OBJECT_NAME AND a.lock_status = 'PENDING' AND b.lock_status = 'GRANTED' AND a.OWNER_THREAD_ID <> b.OWNER_THREAD_ID AND a.lock_type = 'EXCLUSIVE' JOIN performance_schema.threads c ON a.OWNER_THREAD_ID = c.THREAD_ID JOIN performance_schema.threads d ON b.OWNER_THREAD_ID = d.THREAD_ID ) t1, ( SELECT thread_id, group_concat( CASE WHEN EVENT_NAME = 'statement/sql/begin' THEN "transaction_begin" ELSE sql_text END ORDER BY event_id SEPARATOR ";" ) AS sql_text FROM performance_schema.events_statements_history GROUP BY thread_id ) t2 WHERE t1.granted_thread_id = t2.thread_id \G
结果如下:
*************************** 1. row ***************************
locked_schema: sbtest
locked_table: sbtest3
locked_type: Metadata Lock
waiting_processlist_id: 244
waiting_age: 2979
waiting_query: alter table sbtest3 add column ccc22 int
waiting_state: Waiting for table metadata lock
blocking_processlist_id: 18
blocking_age: 3087
blocking_query: select k,d from sbtest3;show full processlist;show full processlist;show full processlist;show databases;show tables;show slave status;select user,host,authentication_string,password_expired,password_last_changed from mysql.user -- 这里我们可以看到连接执行过哪些sql了
sql_kill_blocking_connection: KILL 18
1 row in set (0.00 sec)
具体的sql的分析,可以参考这篇老哥的文章,讲得非常好 https://www.cnblogs.com/ivictor/p/9460147.html