MySQL死锁case分析

死锁发现

  • 2018-01-18 14:10:03 线上环境批量更新库存的地方出现了死锁
  • 2018-01-25 16:50:03 线上环境批量更新库存的地方出现了死锁
org.springframework.dao.DeadlockLoserDataAccessException: PreparedStatementCallback; SQL [UPDATE goods_inventory SET goods_num = ? WHERE wms_type = ? AND barCode = ?]; Deadlock found when trying to get lock; try restarting transaction; nested exception is java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
        ...

死锁原因

死锁:一般是事务相互等待对方资源,最后形成环路造成的。如下图示,左图两辆车造成未造成死锁了,右图四辆车造成了死锁
MySQL死锁case分析_第1张图片

步骤 T1 T2
1 begin; begin;
2 update xxx set name = ‘aaa’ where id = 1;
3 update xxx set name = ‘bbb’ where id = 2;
4 update xxx set name = ‘ccc’ where id = 1;
5 update xxx set name = ‘ddd’ where id = 2;

死锁分析:

  1. 根据异常栈信息,死锁出现时,执行的SQL是:UPDATE goods_inventory SET goods_num = ? WHERE wms_type = ? AND barCode = ?。该SQL使用batchUpdate执行,一次会更新几千条数据
  2. 线上,还有一个地方:订单下单的时候,会批量扣减库存,该处的SQL也是使用batchUpdate执行。SQL如下:
    UPDATE goods_inventory SET goods_num = goods_num - ? WHERE barCode = ? AND wms_type = ?
  3. 根据死锁出现的时间,两次,都恰好是整10分出现,而线上有一个定时任务以10分钟的频率同步一次库存。

根据1、3可以猜测,造成死锁的原因是两个地方同时对一个表进行了大的事务操作(batchUpdate),而又互相等待对方资源

为了进一步验证猜测,需要对MySQL死锁日志进行分析,在MySQL上执行如下命令

show engine innodb status

得到的日志如下(无关信息已被忽略):

------------------------
LATEST DETECTED DEADLOCK
------------------------
2018-01-24 11:00:03 7f186658a700 
*** (1) TRANSACTION:
TRANSACTION 436005840, ACTIVE 0.104 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 6 lock struct(s), heap size 1184, 7 row lock(s), undo log entries 27
LOCK BLOCKING MySQL thread id: 940390784 block 823017822
MySQL thread id 823017822, OS thread handle 0x7f1864fbf700, query id 4007854070 10.26.162.33 aixuexidball updating
UPDATE goods_inventory SET goods_num = goods_num - 2 WHERE wms_type = 1 AND barCode = 'XSBSTG618H'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005840 lock_mode X locks rec but not gap waiting
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa0; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 436005809, ACTIVE 1.314 sec starting index read
mysql tables in use 1, locked 1
15 lock struct(s), heap size 2936, 1659 row lock(s), undo log entries 33
MySQL thread id 940390784, OS thread handle 0x7f186658a700, query id 4007854164 10.26.162.33 aixuexidball updating
UPDATE goods_inventory SET goods_num = 1438 WHERE wms_type = 1 AND barCode = 'XSBSTG318H'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap
Record lock, heap no 4 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4753523250593131313751; asc GSR2PY1117Q;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 800001af; asc     ;;

Record lock, heap no 7 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4753523250593131313753; asc GSR2PY1117S;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 800001b0; asc     ;;

Record lock, heap no 10 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4753523250593131313843; asc GSR2PY1118C;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000d0c; asc     ;;

Record lock, heap no 13 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4753523250593131313848; asc GSR2PY1118H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000bf2; asc     ;;

Record lock, heap no 16 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4753523254473130313751; asc GSR2TG1017Q;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 800001b1; asc     ;;


...

Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa0; asc     ;;

 ...

 Record lock, heap no 388 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4757523250593130313843; asc GWR2PY1018C;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80001589; asc     ;;

 *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap waiting
Record lock, heap no 317 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 10; hex 58534253544733313848; asc XSBSTG318H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa2; asc     ;;

*** WE ROLL BACK TRANSACTION (1)
下面将逐段解释(由# 标识的文字是笔者注释)

第一部分是上次死锁发生的时间

------------------------
LATEST DETECTED DEADLOCK
------------------------
2018-01-24 11:00:03 7f186658a700   
#这是上次发生死锁的时间

1.接着是发生死锁的第一个事务信息

*** (1) TRANSACTION:
TRANSACTION 436005840, ACTIVE 0.104 sec starting index read
# 事务436005840,事务处于活跃状态0.104s,正在使用索引读取数据行
mysql tables in use 1, locked 1
# 事务436005840正在使用1个table,且涉及锁的也是1个table
LOCK WAIT 6 lock struct(s), heap size 1184, 7 row lock(s), undo log entries 27
# 等待6把锁,占用内存1184字节,涉及7行记录(注:如果事务已经锁定了几行数据,这里将会有一行信息显示出锁定结构的数目和堆大小,堆是为了持有这些行锁而占用的内存大小)
LOCK BLOCKING MySQL thread id: 940390784 block 823017822
# 线程ID:940390784阻塞在线程ID:823017822上
MySQL thread id 823017822, OS thread handle 0x7f1864fbf700, query id 4007854070 10.26.162.33 aixuexidball updating
# 该事务的线程ID信息,操作系统句柄信息,连接来源、用户
UPDATE goods_inventory SET goods_num = goods_num - 2 WHERE wms_type = 1 AND barCode = 'XSBSTG618H'
# 事务涉及的SQL

2.当死锁发生时,第一个事务正在等待的锁信息

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
# 第一个事务正在等待锁被授予
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005840 lock_mode X locks rec but not gap waiting
# 等待的锁是一个record lock,空间id是4141,页面号是21,在页面的456bits处。锁发生在表managesystemv1.0.goods_inventory的索引idx_barcode_wms上,是一个X锁,但不是GAP锁
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
# record lock的heap no 位置
 0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa0; asc     ;;

3.接着是发生死锁的第二个事务信息

*** (2) TRANSACTION:
TRANSACTION 436005809, ACTIVE 1.314 sec starting index read
# 事务436005809,事务处于活跃状态1.314s,正在使用索引读取数据行
mysql tables in use 1, locked 1
# 事务436005809正在使用1个table,且涉及锁的也是1个table
15 lock struct(s), heap size 2936, 1659 row lock(s), undo log entries 33
# 涉及15把锁,占用内存2936字节,锁定了1659行记录
MySQL thread id 940390784, OS thread handle 0x7f186658a700, query id 4007854164 10.26.162.33 aixuexidball updating
# 该事务的线程ID信息,操作系统句柄信息,连接来源、用户
UPDATE goods_inventory SET goods_num = 1438 WHERE wms_type = 1 AND barCode = 'XSBSTG318H'
# 第二个事务涉及的SQL

4.事务二持有的锁信息

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap
# 持有的锁是一个record lock,空间id是4141,页面号是21,在页面的456bits处。锁发生在表managesystemv1.0.goods_inventory的索引idx_barcode_wms上,是一个X锁,但不是GAP锁
Record lock, heap no 4 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4753523250593131313751; asc GSR2PY1117Q;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 800001af; asc     ;;

...

Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
# record lock的heap no 位置
# 这个就是事务1在等待的锁
 0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa0; asc     ;;

 ...

 Record lock, heap no 388 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 11; hex 4757523250593130313843; asc GWR2PY1018C;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80001589; asc     ;;

5.当死锁发生时,第二个事务正在等待的锁等信息

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap waiting

Record lock, heap no 317 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
# 在等着heap no为317的锁
 0: len 10; hex 58534253544733313848; asc XSBSTG318H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa2; asc     ;;

*** WE ROLL BACK TRANSACTION (1)
# 事务1被回滚,如果两个事务回滚的开销不同(undo数量不同),回滚开销最小的那个事务。如果两个事务的回滚开销一样,选择后提交的事务进行回滚。

24可以看出,事务1在等待Record lock, heap no 341,而这个锁正好被事务2持有。

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0

*** (2) HOLDS THE LOCK(S):
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
 1: len 4; hex 80000001; asc     ;;
 2: len 4; hex 80000aa0; asc     ;;

5可以看出,事务2在等待Record lock, heap no 317,虽然日志里没有明确告知谁持有该锁,但这个锁应该恰好被事务1持有。InnoDB不会显示所有持有和等待的锁,但是,它显示了足够的信息来帮助我们定位问题。

*** (2) HOLDS THE LOCK(S):
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0

因此,可以得到如下结论:

  • T1持有no317,向T2索要no341
  • T2持有no341,向T1索要no317
sequenceDiagram
T1(no317)->>T2(no341): 给我no341的锁
T2(no341)->>T1(no317): 给我no317的锁
T1(no317)->>T2(no341): 你先给我,一会我再给你
T2(no341)->>T1(no317): 不,你先给我,我再给你

最终,证实了最初的猜想是正确的。

如何尽可能避免死锁

  1. 以固定的顺序访问表和行
  2. 大事务拆分成小事务。大事务更倾向于死锁,如果业务允许,将大事务拆小。
  3. 在同一个事务中,尽可能做到一次锁定所需要的所有资源,减少死锁概率。
  4. 降低隔离级别。如果业务允许,将隔离级别调低也是较好的选择,比如将隔离级别从RR调整为RC,可以避免掉很多因为gap锁造成的死锁。
  5. 为表添加合理的索引。如果不走索引,将会为表的每一行记录添加上锁,増加死锁的概率。

你可能感兴趣的:(数据库)