One morning, before I had even reached the office, I was told that all of the system's scheduled jobs were down. After an emergency restart of every application, we traced the problem to ZooKeeper.
Background:
- The system's scheduled jobs run on the open-source tbSchedule framework, which uses ZooKeeper as its task registry. The framework handles abnormal conditions poorly: when its ZooKeeper connection hits a network timeout or session expiry, it has no reliable recovery or reconnect mechanism.
- Besides serving as tbSchedule's registry, the same ZooKeeper cluster also provides distributed locks for concurrency control in a business system, so it sees a constant stream of lock and unlock operations.
Locating the problem:
- Going back to the time of the failure, the business logs show the scheduled jobs suddenly throwing exceptions and failing to dispatch. We had seen similar incidents before: a network fault between ZooKeeper and the application, or a reconnect triggered for any other reason, would break scheduling. This is a long-standing defect in tbSchedule that we never found time to fix, and its source code is messy enough that the bug cannot be pinned down overnight.
- Network monitoring showed nothing abnormal, and the Alibaba Cloud engineers we consulted confirmed there had been no network jitter.
- The ZooKeeper logs, however, showed that the cluster had held a new leader election at exactly that time. Why would ZooKeeper re-elect?
- Reading further, we found that the zxid had been exhausted, which triggered the re-election.
Conclusion:
zxid exhaustion caused a ZooKeeper re-election, and the scheduler's weak recovery ability turned that into a system outage. That is three problems; here we only discuss the first one: why did the zxid run out?
A zxid is a 64-bit long split into two parts: the epoch (leader period) and the counter (transaction count). Each election produces a new epoch; as long as the leader does not change, the epoch stays fixed and every transaction only increments the counter. That leaves 32 bits for counting, about 4.2 billion transactions per epoch. A little over four months, roughly 120 days, had passed since the last known election, and 4.2 billion / 120 ≈ 35 million, i.e. about 35 million zxids consumed per day. The heaviest known ZooKeeper caller was the business distributed-lock scenario, so we started there.
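As a quick illustration of the two fields, here is a minimal sketch (the class and method names are my own, not part of ZooKeeper's API) that splits a 64-bit zxid into its epoch and counter:

public class ZxidDecoder {
    // High 32 bits: leader epoch, bumped by each election.
    static long epochOf(long zxid)   { return zxid >>> 32; }
    // Low 32 bits: per-epoch transaction counter, bumped by every transaction.
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long zxid = 0x100007ecb9eL;   // a zxid taken from the production logs below
        System.out.printf("epoch=0x%x, counter=0x%x (%,d of ~4.29 billion)%n",
                epochOf(zxid), counterOf(zxid), counterOf(zxid));
        // Once the counter part is exhausted the leader can no longer assign zxids,
        // and the cluster elects a new leader to start a new epoch.
    }
}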
Looking at the ZooKeeper logs, the zxid did not grow by +1; there were severe gaps in the sequence:
2018-04-16 14:02:32,185 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x262cb9894cc0092 type:create cxid:0x681b8 zxid:0x100007ecb9e txntype:-1 reqpath:n/a Error Path:/lock/user/007e48345dc344cbb28f07d2749eb506 Error:KeeperErrorCode = NodeExists for /lock/user/007e48345dc344cbb28f07d2749eb506
2018-04-16 14:02:32,263 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x362cb989535009a type:create cxid:0x659b2 zxid:0x100007ecbd7 txntype:-1 reqpath:n/a Error Path:/lock/user/854f33a82fa842b2a52cb1a0d43b902d Error:KeeperErrorCode = NodeExists for /lock/user/854f33a82fa842b2a52cb1a0d43b902d
2018-04-16 14:02:32,281 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x262cb9894cc0092 type:create cxid:0x681fa zxid:0x100007ecbe1
INFO logs from the distributed-lock business code
create zookeeper node (in the application, a successful lock acquisition via ZooKeeper logs: create zookeeper node)
delete zookeeper node (in the application, releasing the lock deletes the node and logs: delete zookeeper node)
Counting the delete entries (create and delete appear in pairs), time window 00:00-22:00:
[root@user02 logs]# cat catalina.out |grep 'delete zookeeper node'|wc -l
1659651
// Extrapolated to 24 hours: ≈ 1.66 million * (24/22) ≈ 1.8 million deletes per day
In this business scenario, lock contention meant that it took roughly 8 create attempts to acquire the lock once (this article was written several months after the incident, so the exact figure is from memory, but it should be close). Estimating the zxid consumption:
Formula: number of machines * (create-node attempts + delete-node operations)
Total = 2 * (8 * 1.8 million + 1.8 million) = 32.4 million per day
That falls somewhat short of 35 million, but these locks are not ZooKeeper's only consumer, and business volume fluctuated over the four months, so day-to-day variation is expected.
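To make the formula concrete, here is a minimal sketch of the lock pattern, assuming the plain ZooKeeper Java client; the /lock/user path matches the logs above, but the class name and retry/back-off details are illustrative rather than the production code:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
    private final ZooKeeper zk;
    public ZkLockSketch(ZooKeeper zk) { this.zk = zk; }

    // Acquire: keep trying to create /lock/user/<key>. As the PrepRequestProcessor
    // logs above show, even an attempt that fails with NodeExists is assigned a zxid,
    // so every one of the ~8 attempts per successful acquisition counts.
    public void lock(String key) throws KeeperException, InterruptedException {
        String path = "/lock/user/" + key;
        while (true) {
            try {
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;                        // success: "create zookeeper node"
            } catch (KeeperException.NodeExistsException e) {
                Thread.sleep(50);              // lock held by someone else; back off and retry
            }
        }
    }

    // Release: delete the node ("delete zookeeper node"), which consumes one more zxid.
    public void unlock(String key) throws KeeperException, InterruptedException {
        zk.delete("/lock/user/" + key, -1);
    }
}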
Why the zxids skip numbers:
By ZooKeeper's design, create, update and delete operations on a node all advance the zxid. The production cluster logs at INFO level, at which delete and update operations are not printed, so the zxids they consume never show up in the INFO log and the visible zxids appear to jump. We verified this in the development environment with the log level set to DEBUG; after deleting a node the log shows:
2018-04-27 00:24:05,991 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x3630553d6520000 type:delete cxid:0x2d71 zxid:0x100005d7d txntype:2 reqpath:n/a
2018-04-27 00:24:05,991 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x3630553d6520000 type:delete cxid:0x2d71 zxid:0x100005d7d txntype:2 reqpath:n/a
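The same effect can also be observed from the client side. Below is a small sketch (the /zxid-demo path, the 127.0.0.1:2181 connect string and the class name are placeholders, not the incident environment) that reads the zxid assigned to a create and to a setData via the Stat structure:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZxidDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        String path = "/zxid-demo";
        zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        Stat afterCreate = zk.exists(path, false);
        System.out.println("czxid after create  = 0x" + Long.toHexString(afterCreate.getCzxid()));

        Stat afterUpdate = zk.setData(path, "x".getBytes(), -1);   // update consumes one zxid
        System.out.println("mzxid after setData = 0x" + Long.toHexString(afterUpdate.getMzxid()));

        zk.delete(path, -1);                                       // delete consumes one more
        zk.close();
    }
}

On an otherwise idle test cluster, the mzxid should be exactly one higher than the czxid, since the update is the very next transaction.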
Demonstrating the zxid progression in the test environment:
2018-04-27 11:05:03,321 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x263055405810004 type:create cxid:0x7 zxid:0x10001a30c txntype:-1 reqpath:/au/4
[root@mq01 bin]# tail -n 100 zookeeper.out |grep '0x10001a30d'
2018-04-27 11:05:08,669 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x263055405810004 type:error cxid:0x9 zxid:0x10001a30d txntype:-1 reqpath:n/a
2018-04-27 11:05:08,669 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x263055405810004 type:delete cxid:0x9 zxid:0x10001a30d txntype:-1 reqpath:/au
[root@mq01 bin]# tail -n 100 zookeeper.out |grep '0x10001a30e'
2018-04-27 11:05:14,008 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x263055405810004 type:error cxid:0xb zxid:0x10001a30e txntype:-1 reqpath:n/a
2018-04-27 11:05:14,009 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x263055405810004 type:delete cxid:0xb zxid:0x10001a30e txntype:-1 reqpath:/zu
[root@mq01 bin]# tail -n 100 zookeeper.out |grep '0x10001a30f'
2018-04-27 11:05:18,119 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x263055405810004 type:error cxid:0xd zxid:0x10001a30f txntype:-1 reqpath:n/a
2018-04-27 11:05:18,120 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x263055405810004 type:delete cxid:0xd zxid:0x10001a30f txntype:-1 reqpath:/au
[root@mq01 bin]# tail -n 100 zookeeper.out |grep '0x10001a310'
2018-04-27 11:05:21,336 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x263055405810004 type:error cxid:0xf zxid:0x10001a310 txntype:-1 reqpath:n/a
2018-04-27 11:05:21,336 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x263055405810004 type:delete cxid:0xf zxid:0x10001a310 txntype:-1 reqpath:/zu
[root@mq01 bin]# tail -n 100 zookeeper.out |grep '0x10001a311'
2018-04-27 11:05:28,967 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x263055405810004 type:error cxid:0x10 zxid:0x10001a311 txntype:-1 reqpath:n/a
2018-04-27 11:05:28,967 [myid:2] - DEBUG [CommitProcessor:2:FinalRequestProcessor@88] - Processing request:: sessionid:0x263055405810004 type:delete cxid:0x10 zxid:0x10001a311 txntype:-1 reqpath:/au/4
Because this was written several months after the incident, the log material and figures here may be somewhat incomplete, but the aim of this article is to draw the following conclusions: