Troubleshooting a MySQL Deadlock in a Production Service

Cause of the Deadlock

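MySQL has many kinds of locks, and it ships with a built-in deadlock detection mechanism. Deadlocks are rare in everyday business development, but one of our team's services recently hit a real one!
Stay calm: when someone reports a bug or a problem, first confirm that it really is what they say it is, and only then decide what to do about it. So I asked ops to package the relevant service logs and send them over. Here is an excerpt: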

[2022-04-20 13:02:46.360] [http-nio-8080-exec-88] [ERROR] [ at com.cmonelink.osms.cnpnplatform.exception.CustomExceptionResolver.handleException(CustomExceptionResolver.java:137)
] 通用异常 
org.springframework.dao.CannotAcquireLockException: 
### Error updating database.  Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction
### The error may involve com.cmonelink.osms.cnpnplatform.mapper.CbDsmPerformanceNrcellMapper.deleteByList-Inline
### The error occurred while setting parameters
### SQL: delete from CB_DSM_PERFORMANCE_NRCELL         WHERE                        NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and 

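From the log we can conclude:

  • A MySQL lock-related exception did occur (MySQLTransactionRollbackException: Lock wait timeout exceeded)
  • The cause of the lock exception is a timeout
    So this alone does not prove that a deadlock occurred, because a lock wait timeout and a deadlock are two different things. I then asked ops to run SHOW ENGINE INNODB STATUS on the production MySQL instance. Here is an excerpt of the output: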
------------------------
LATEST DETECTED DEADLOCK
------------------------
2022-04-20 10:30:22 0x7f7eaf419700
*** (1) TRANSACTION:
TRANSACTION 51226606, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 129 lock struct(s), heap size 24784, 403 row lock(s), undo log entries 201
MySQL thread id 6687, OS thread handle 140181895087872, query id 41203034 10.215.0.50 cbapp update
INSERT INTO CB_DSM_ZG_GNODEB(NE_ID, LIFECYCLE_STATUS,DEVICE_TYPE,FREQUENCY,GNODEB_NAME,
        LONGITUDE, CREATE_TIME, LATITUDE)
        VALUES
          
            ('2299606',2,1,'2.6GHz','name1',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2318773',2,1,'2.6GHz','name2',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2318891',2,1,'2.6GHz','name3',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2318977',2,1,'2.6GHz','name4',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2311856',2,1,'2.6GHz','name5',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2306380',2,1,'2.6GHz','name6',null,
            '2022-04-20 10:30

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;


*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;


*** (2) TRANSACTION:
TRANSACTION 51226607, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 136 lock struct(s), heap size 24784, 403 row lock(s), undo log entries 201
MySQL thread id 6684, OS thread handle 140186323556096, query id 41203036 10.215.0.50 cbapp update
INSERT INTO CB_DSM_ZG_GNODEB(NE_ID, LIFECYCLE_STATUS,DEVICE_TYPE,FREQUENCY,GNODEB_NAME,
        LONGITUDE, CREATE_TIME, LATITUDE)
        VALUES
          
            ('2331081',2,1,'2.6GHz','name1',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331059',2,1,'2.6GHz','name2',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331761',2,1,'2.6GHz','name3',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331174',2,1,'2.6GHz','name4',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2334290',2,1,'700M','name5',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331086',2,1,'2.6GHz','name6',null,
            '2022-0

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

*** WE ROLL BACK TRANSACTION (1)
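
The service log reports a lock wait timeout, while the status output above reports a deadlock. One way to keep the two apart in code is the MySQL server error number: 1205 accompanies "Lock wait timeout exceeded" and 1213 accompanies "Deadlock found when trying to get lock". The sketch below is only an illustration; the function name `classify_lock_error` and its labels are made up here and are not part of the service.

```python
# MySQL server error numbers for the two lock-related failures.
ER_LOCK_WAIT_TIMEOUT = 1205  # "Lock wait timeout exceeded; try restarting transaction"
ER_LOCK_DEADLOCK = 1213      # "Deadlock found when trying to get lock; try restarting transaction"


def classify_lock_error(errno: int) -> str:
    """Map a MySQL error number to a coarse label (labels are illustrative)."""
    if errno == ER_LOCK_WAIT_TIMEOUT:
        return "lock_wait_timeout"
    if errno == ER_LOCK_DEADLOCK:
        return "deadlock"
    return "other"
```

A timeout only says that a lock was not granted within innodb_lock_wait_timeout seconds. A deadlock means InnoDB found a cycle of waiting transactions and rolled one of them back.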

From the output:

  1. A deadlock did occur.
 ------------------------
LATEST DETECTED DEADLOCK
------------------------
  2. The deadlock involves two batch INSERT transactions, with transaction ids 51226606 and 51226607
*** (1) TRANSACTION:
TRANSACTION 51226606, ACTIVE 1 sec inserting
.........................................................................
.........................................................................
..........................middle part omitted.........................
*** (2) TRANSACTION:
TRANSACTION 51226607, ACTIVE 1 sec inserting
  3. Both transactions hold a lock
    Transaction 51226606
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Transaction 51226607

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

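Judging from the output, the two transactions hold locks on exactly the same record, and the asc supremum;; marker suggests that both are effectively gap locks: a lock on the supremum pseudo-record covers only the gap after the largest record in the index.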

  4. Both transactions are waiting for a lock
    Transaction 51226606
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Transaction 51226607

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

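From lock_mode X insert intention waiting we can tell that both transactions are waiting to be granted an insert intention lock. The lock each one is waiting for is on the same record as the locks held above, which leads to a hypothesis: both transactions hold a fairly wide gap lock, and each subsequent INSERT has to wait for the other transaction to release its gap lock before it can get its insert intention lock, which produces a deadlock!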
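
The manual reading above can be sketched as a small parser that pulls the transaction id, held lock modes and awaited lock modes out of a LATEST DETECTED DEADLOCK section. This is a minimal sketch, assuming the section layout shown in this post; `summarize_deadlock` and the two regular expressions are hypothetical helpers, and the layout of the status output can differ between MySQL versions.

```python
import re

# Abbreviated copy of the production output quoted in this post.
SAMPLE = """\
*** (1) TRANSACTION:
TRANSACTION 51226606, ACTIVE 1 sec inserting
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X insert intention waiting
*** (2) TRANSACTION:
TRANSACTION 51226607, ACTIVE 1 sec inserting
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X insert intention waiting
*** WE ROLL BACK TRANSACTION (1)
"""

HEADER = re.compile(
    r"^\*\*\* \((\d+)\) "
    r"(TRANSACTION|HOLDS THE LOCK\(S\)|WAITING FOR THIS LOCK TO BE GRANTED):"
)
LOCK = re.compile(
    r"^RECORD LOCKS .* index (\S+) of table (\S+) trx id (\d+) lock_mode (.+)$"
)


def summarize_deadlock(text):
    """Return {slot: {"trx_id", "holds", "waits"}} for each numbered transaction."""
    summary = {}
    slot = section = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            slot, section = header.group(1), header.group(2)
            summary.setdefault(slot, {"trx_id": None, "holds": [], "waits": []})
            continue
        lock = LOCK.match(line)
        if lock is None or slot is None:
            continue
        index, table, trx_id, mode = lock.groups()
        entry = summary[slot]
        entry["trx_id"] = trx_id
        if section == "HOLDS THE LOCK(S)":
            entry["holds"].append((table, index, mode.strip()))
        elif section.startswith("WAITING"):
            entry["waits"].append((table, index, mode.strip()))
    return summary
```

Calling `summarize_deadlock(SAMPLE)` gives one entry per transaction. Each entry holds an X lock and waits for an "X insert intention waiting" lock on the same PRIMARY index, which is the symmetry the hypothesis relies on.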

Reproducing the Deadlock

The business logic of the service that deadlocked can be summarized by the following transaction:

begin
batch_delete_method() // delete rows from the table
batch_insert_method() // insert rows
commit

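The current suspicion is that the batch DELETE in each transaction included rows that do not exist, which produced gap locks; the subsequent batch INSERTs then had to wait for each other to release those gap locks. Let's verify that hypothesis.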

  1. First create a table, with id as the primary key and number as an ordinary (non-unique) secondary index
CREATE TABLE `test1` (
  `id` int(1) NOT NULL AUTO_INCREMENT,
  `number` int(1) NOT NULL COMMENT 'number',
  PRIMARY KEY (`id`),
  KEY `number` (`number`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=24 DEFAULT CHARSET=utf8;
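  2. Insert a few rows
-- Note: unless the NO_AUTO_VALUE_ON_ZERO SQL mode is set, MySQL replaces the explicit 0 below with the next AUTO_INCREMENT value; this does not affect the (10,15) gap used in the next step.
INSERT INTO `test1` VALUES (0, 0);
INSERT INTO `test1` VALUES (10, 10);
INSERT INTO `test1` VALUES (15, 15);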
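  3. Open two transactions
    Transaction A
begin;
delete from test1 where id = 12; -- row does not exist, so a gap lock is taken on (10,15)
insert into test1 values(12,12); -- has to wait for transaction B to release its gap lock
commit;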

Transaction B

begin;
delete from test1 where id = 14; -- row does not exist, so a gap lock is taken on (10,15)
insert into test1 values(14,14); -- has to wait for transaction A to release its gap lock
commit;

  4. Result

[Screenshot: deadlock.png]

As the screenshot shows, a deadlock did occur. Running SHOW ENGINE INNODB STATUS gives the following output:


=====================================
2022-04-21 16:20:48 0x5258 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 7 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 81 srv_active, 0 srv_shutdown, 83574 srv_idle
srv_master_thread log flush and writes: 0
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 19
OS WAIT ARRAY INFO: signal count 16
RW-shared spins 4, rounds 4, OS waits 0
RW-excl spins 4, rounds 120, OS waits 4
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 1.00 RW-shared, 30.00 RW-excl, 0.00 RW-sx
------------------------
LATEST DETECTED DEADLOCK
------------------------
2022-04-21 16:18:49 0x5258
*** (1) TRANSACTION:
TRANSACTION 56319, ACTIVE 12 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 23, OS thread handle 17400, query id 1806 localhost ::1 root update
insert into test1 values(12,12)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 353 page no 4 n bits 80 index PRIMARY of table `ocb_cp_5gmall`.`test1` trx id 56319 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000f; asc     ;;
 1: len 6; hex 00000000dbf9; asc       ;;
 2: len 7; hex 81000000c10151; asc       Q;;
 3: len 4; hex 8000000f; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 56320, ACTIVE 9 sec inserting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 26, OS thread handle 21080, query id 1810 localhost ::1 root update
insert into test1 values(14,14)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 353 page no 4 n bits 80 index PRIMARY of table `ocb_cp_5gmall`.`test1` trx id 56320 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000f; asc     ;;
 1: len 6; hex 00000000dbf9; asc       ;;
 2: len 7; hex 81000000c10151; asc       Q;;
 3: len 4; hex 8000000f; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 353 page no 4 n bits 80 index PRIMARY of table `ocb_cp_5gmall`.`test1` trx id 56320 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000f; asc     ;;
 1: len 6; hex 00000000dbf9; asc       ;;
 2: len 7; hex 81000000c10151; asc       Q;;
 3: len 4; hex 8000000f; asc     ;;

*** WE ROLL BACK TRANSACTION (2)
------------
TRANSACTIONS
------------
Trx id counter 56325
Purge done for trx's n:o < 56325 undo n:o < 0 state: running but idle
History list length 14
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 282892835750928, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835749248, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835753448, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835752608, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835748408, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835747568, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 56319, ACTIVE 131 sec
3 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 23, OS thread handle 17400, query id 1813 localhost ::1 root
.............................................................
.....................part of the output omitted.............................
END OF INNODB MONITOR OUTPUT
============================

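The output shows the same pattern as the production MySQL output (a gap lock held on the same gap, with the INSERT waiting for an insert intention lock), which confirms the hypothesis!
Finally, here is a table that walks through how the deadlock arises: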

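Step | Transaction A | Transaction B
1    | begin |
2    | | begin
3    | delete from test1 where id = 12 -- row does not exist, gap lock on (10,15) |
4    | | delete from test1 where id = 14 -- row does not exist, gap lock on (10,15)
5    | insert into test1 values(12,12) -- waits for transaction B to release its gap lock |
6    | | insert into test1 values(14,14) -- waits for transaction A to release its gap lock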
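
The six steps in the table can be replayed in a toy model to show why the cycle only closes at step 6. This is a simplified sketch, not InnoDB's lock manager: `GapLockModel` and `replay_table` are hypothetical names, and the model encodes just two rules, that gap locks never block each other and that an insert into a gap must wait for gap locks held by other transactions.

```python
from collections import defaultdict


class GapLockModel:
    """Toy model of gap locks and insert intention waits on index gaps."""

    def __init__(self):
        self.gap_holders = defaultdict(set)  # gap -> transactions holding a gap lock
        self.waits_for = defaultdict(set)    # transaction -> transactions it waits on

    def delete_missing_row(self, trx, gap):
        # Gap locks are compatible with each other, so this always succeeds.
        self.gap_holders[gap].add(trx)

    def insert_into_gap(self, trx, gap):
        # An insert intention lock waits for gap locks held by other transactions.
        blockers = self.gap_holders[gap] - {trx}
        if blockers:
            self.waits_for[trx] |= blockers
        return not blockers  # True if the insert could proceed immediately

    def has_deadlock(self):
        # Depth-first search for a cycle in the wait-for graph.
        def reaches(start, node, seen):
            for nxt in self.waits_for.get(node, ()):
                if nxt == start:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    if reaches(start, nxt, seen):
                        return True
            return False

        return any(reaches(trx, trx, set()) for trx in list(self.waits_for))


def replay_table():
    """Replay steps 3 to 6 of the table and report what happens at each insert."""
    model = GapLockModel()
    gap = (10, 15)
    model.delete_missing_row("A", gap)           # step 3
    model.delete_missing_row("B", gap)           # step 4
    a_proceeds = model.insert_into_gap("A", gap)  # step 5: A waits for B
    deadlock_after_5 = model.has_deadlock()
    b_proceeds = model.insert_into_gap("B", gap)  # step 6: B waits for A
    deadlock_after_6 = model.has_deadlock()
    return a_proceeds, deadlock_after_5, b_proceeds, deadlock_after_6
```

In this model, step 5 alone is only a wait (A waits for B). The wait-for graph gains a cycle once B also tries to insert at step 6, and a cycle like that is the condition InnoDB's deadlock detector acts on by rolling back one transaction.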

Resolving the Deadlock

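Having found the cause of the deadlock, we need a way to fix it. Here is the cause again:
The batch DELETE in each of the two transactions touched rows that do not exist in the table, so each transaction ended up holding a fairly wide gap lock. The subsequent INSERT in each transaction then had to wait for the other to release its gap lock before it could get an insert intention lock, which produced the deadlock!
In the end, the gap lock is what causes the deadlock. So there are two angles to attack it from: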

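  • Database side: set the transaction isolation level to RC (READ COMMITTED), under which InnoDB does not take gap locks for ordinary searches and index scans
  • Application side: run a SELECT before each DELETE to confirm the rows exist, and only then run the DELETE, so that no gap lock is taken for missing rows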
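
Both options can be sketched in a few lines. This is a sketch under stated assumptions, not the service's actual code: `delete_existing_only` and `demo` are hypothetical helpers, the table name is assumed to be a trusted constant, and Python's built-in sqlite3 stands in for MySQL only so that the select-then-delete flow runs without a server (SQLite has no gap locks). The `SET SESSION` statement is MySQL syntax and is shown as a string only.

```python
import sqlite3

# Option 1 (database side): a statement a MySQL session could issue before
# its transactions. Shown as a string only; this sketch never executes it.
SET_READ_COMMITTED = "SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED"


def delete_existing_only(conn, table, ids):
    """Option 2: look up which ids exist, then delete just those by primary key."""
    ids = list(ids)
    if not ids:
        return []
    marks = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id FROM {table} WHERE id IN ({marks})", ids
    ).fetchall()
    existing = [row[0] for row in rows]
    if existing:
        marks = ",".join("?" for _ in existing)
        conn.execute(f"DELETE FROM {table} WHERE id IN ({marks})", existing)
    return existing


def demo():
    """Run the flow against an in-memory stand-in for the test1 table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test1 (id INTEGER PRIMARY KEY, number INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO test1 VALUES (?, ?)", [(1, 1), (10, 10), (15, 15)]
    )
    deleted = delete_existing_only(conn, "test1", [12, 15])  # 12 does not exist
    remaining = [r[0] for r in conn.execute("SELECT id FROM test1 ORDER BY id")]
    return deleted, remaining
```

Here `demo()` deletes only id 15 and never issues a DELETE for the missing id 12. In MySQL the pre-check narrows the window rather than closing it, since another transaction can still remove a row between the SELECT and the DELETE. Deleting by primary key matters too, because a DELETE through a non-unique index can take gap locks even when the rows exist.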

Personal Takeaway

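No matter how many blog posts you read, until you run into a real problem it all stays armchair theory. Real knowledge comes from practice!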
