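1. Error 1
Copyright notice: this is an original article by CSDN blogger "AllInCode", licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/DSLZTX/article/details/51596951
1.1 Error description
The ZooKeeper Server logs (on both the FOLLOWER and the LEADER) show the following error: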
2016-05-14 15:33:01,818 [myid:2] - ERROR [CommitProcessor:2:NIOServerCnxn@178] - Unexpected Exception:
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
1.2 Cause analysis
By the time the ZooKeeper Server sent its reply, the socket connection had already been closed.

1.3 Resolution
Add an sk.isValid() check before the ZooKeeper Server sends the reply. This is in fact a bug in ZooKeeper itself, and it was fixed in version 3.4.8.
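
The idea behind the sk.isValid() check can be sketched with plain Java NIO. This is only an illustration under stated assumptions, not the actual ZooKeeper patch: the class ValidKeyGuard and its methods are hypothetical names, and a Pipe stands in for the client socket whose key has been cancelled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical demo class; not part of ZooKeeper.
final class ValidKeyGuard {

    // Requests write interest only if the key is still valid.
    // Returns true if interest was updated, false if the key was already cancelled.
    static boolean enableWrite(SelectionKey sk) {
        if (!sk.isValid()) {
            return false; // connection already gone: skip the reply instead of throwing
        }
        try {
            sk.interestOps(sk.interestOps() | SelectionKey.OP_WRITE);
            return true;
        } catch (CancelledKeyException e) {
            // The key can still be cancelled between the check and the call.
            return false;
        }
    }

    // Registers a channel, cancels its key (standing in for a closed client
    // connection), then attempts to queue a reply through the guard.
    static boolean replyAfterClose() {
        try {
            Pipe pipe = Pipe.open();
            try (Selector selector = Selector.open();
                 Pipe.SourceChannel source = pipe.source();
                 Pipe.SinkChannel sink = pipe.sink()) {
                sink.configureBlocking(false);
                SelectionKey sk = sink.register(selector, 0);
                sk.cancel();
                return enableWrite(sk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints "reply queued: false"
        System.out.println("reply queued: " + replyAfterClose());
    }
}
```

Without the guard, the interestOps call on the cancelled key would throw java.nio.channels.CancelledKeyException, which is the exception at the top of the stack trace above.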

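1.4 Other notes
This error already occurred before the "use ZooKeeper to obtain the MQ address" scheme went live.
2. Error 2
2.1 Error description
The ZooKeeper Server (FOLLOWER) log shows the errors below. After they occur, that FOLLOWER stops working for a period of time: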
2016-05-15 04:04:40,569 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2243ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-05-14 15:32:50,764 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
2016-05-14 15:32:50,764 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
The corresponding ZooKeeper Server (LEADER) log shows the following errors:
2016-05-14 15:32:42,605 [myid:3] - WARN [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3 took 3041ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

2016-05-14 15:32:50,764 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
2016-05-14 15:32:50,764 [myid:3] - WARN [LearnerHandler-/10.110.20.23:39390:LearnerHandler@646] - *** GOODBYE /10.110.20.23:39390 ****
2016-05-14 15:32:50,764 [myid:3] - WARN [LearnerHandler-/10.110.20.23:39390:LearnerHandler@658] - Ignoring unexpected exception
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)

2.2 Cause analysis
While the FOLLOWER was syncing with the LEADER, the fsync operation took too long, which caused a timeout.
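
To see how often and how badly fsync stalls, one option is to pull the reported durations out of the server log. The following is a minimal sketch: the class FsyncWarnings and its method are hypothetical, and the regular expression is derived only from the warning lines quoted in this article, so other ZooKeeper versions may word the message differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of ZooKeeper.
final class FsyncWarnings {

    // Captures the duration from "fsync-ing the write ahead log in SyncThread:1 took 2243ms".
    private static final Pattern FSYNC =
            Pattern.compile("fsync-ing the write ahead log in \\S+ took (\\d+)ms");

    // Returns the fsync durations (in milliseconds) found in the given log lines, in order.
    static List<Long> fsyncMillis(List<String> logLines) {
        List<Long> durations = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FSYNC.matcher(line);
            if (m.find()) {
                durations.add(Long.parseLong(m.group(1)));
            }
        }
        return durations;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2016-05-15 04:04:40,569 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2243ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide",
            "2016-05-14 15:32:50,764 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called",
            "2016-05-14 15:32:42,605 [myid:3] - WARN [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3 took 3041ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide");
        // prints "[2243, 3041]"
        System.out.println(fsyncMillis(lines));
    }
}
```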

2.3 Resolution
Increase tickTime, or initLimit and syncLimit, or all of them.

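2.4 Other notes
This error also occurred before the "use ZooKeeper to obtain the MQ address" scheme went live, just not as frequently. After the scheme went live, the volume of data synchronized between the ZooKeeper Servers grew and the load on them increased, which is what ultimately made the error so frequent.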

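Some users online suggested increasing the time unit in the ZooKeeper configuration so that connection timeouts become longer and the sync latency stays within the session timeout. So I tried changing the configuration: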

tickTime=4000
# The number of ticks that the initial
# synchronization phase can take
initLimit=20
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
tickTime is ZooKeeper's basic time unit; the other time settings are expressed as multiples of it. Here it is 4 s. The original configuration was:

tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
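I doubled all of them. This way, as long as ZooKeeper's forced sync does not take too long, it can still return before the session expires, and the connection barely holds. In practice, though, the warning about excessive sync latency kept appearing: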

fsync-ing the write ahead log in SyncThread:0 took 8001ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
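Checking the Storm and Kafka logs, I still frequently saw entries such as disconnected and session timeout. The services mostly stayed up, but the problem clearly had not been solved.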
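
For reference, the two configurations above translate into wall-clock windows as sketched below. The class ZkTimeouts is a hypothetical helper. The session bounds assume that minSessionTimeout and maxSessionTimeout are not set explicitly, in which case ZooKeeper defaults them to 2 and 20 ticks respectively.

```java
// Hypothetical helper; not part of ZooKeeper.
final class ZkTimeouts {

    // Time a follower may take to connect to and sync with the leader: initLimit ticks.
    static long initWindowMs(int tickTimeMs, int initLimit) {
        return (long) tickTimeMs * initLimit;
    }

    // Time that may pass between a request and its acknowledgement: syncLimit ticks.
    static long syncWindowMs(int tickTimeMs, int syncLimit) {
        return (long) tickTimeMs * syncLimit;
    }

    // Default lower bound for negotiated session timeouts (assumes minSessionTimeout is unset).
    static long minSessionMs(int tickTimeMs) {
        return 2L * tickTimeMs;
    }

    // Default upper bound for negotiated session timeouts (assumes maxSessionTimeout is unset).
    static long maxSessionMs(int tickTimeMs) {
        return 20L * tickTimeMs;
    }

    static String describe(String label, int tickTimeMs, int initLimit, int syncLimit) {
        return String.format("%s: init=%d ms, sync=%d ms, session=%d..%d ms",
                label,
                initWindowMs(tickTimeMs, initLimit),
                syncWindowMs(tickTimeMs, syncLimit),
                minSessionMs(tickTimeMs),
                maxSessionMs(tickTimeMs));
    }

    public static void main(String[] args) {
        System.out.println(describe("original", 2000, 10, 5));
        System.out.println(describe("modified", 4000, 20, 10));
    }
}
```

Because both tickTime and the tick counts were doubled, the init and sync windows grow fourfold (from 20 s and 10 s to 80 s and 40 s), while the default session bounds only double (from 4..40 s to 8..80 s). An fsync of about 8 s fits comfortably inside the 40 s sync window but is comparable to the smallest session timeout a client can negotiate, which may be one reason clients still reported timeouts.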

In the end, out of other options, I followed another user's suggestion and added one setting to the zoo.cfg configuration file:

forceSync=no
This did solve the problem: the excessive sync latency no longer occurs, and the earlier warnings no longer appear in the logs.

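Of course, the meaning of this setting makes it clear that this is not a perfect solution, because it changes forceSync from its default of yes to no. It does eliminate the sync latency, but only because the forced sync is no longer performed at all. The trade-off is durability: if a machine crashes or loses power, updates that were acknowledged but not yet flushed to disk can be lost.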

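One way to understand it: forceSync defaults to yes, which means that each time ZooKeeper receives data, it immediately syncs the current state to the transaction log file on disk and replies only after the sync has completed. Under normal conditions this is not a problem. In my test environment, however, writing the log to disk was extremely slow for a reason I could not identify, so during that time the ZooKeeper log showed: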

fsync-ing the write ahead log in SyncThread:0 took 8001ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
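Because syncing the log took so long, clients got no reply. Once the wait exceeded the connection's timeout setting, the client (Kafka, for example) considered the connection dead and requested a new one. As a result, Kafka and Storm kept reporting errors and reconnecting, and occasionally crashed.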
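
Since the root cause of the slow disk writes was never identified, a quick way to check whether the disk itself is slow is to time a forced write directly. The sketch below is a rough probe, not ZooKeeper's own code: FsyncProbe and its methods are hypothetical, and it assumes that FileChannel.force is comparable to the sync the server performs on its transaction log.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical diagnostic; not part of ZooKeeper.
final class FsyncProbe {

    // Appends one record to the file and forces it to disk.
    // Returns the time spent in force() in milliseconds.
    static long timedForceMillis(Path file, byte[] record) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            long start = System.nanoTime();
            ch.force(false); // flush file content to the storage device
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the probe several times against a temporary file and returns the slowest force() in ms.
    static long probeTempFile(int rounds) {
        try {
            Path file = Files.createTempFile("fsync-probe", ".log");
            try {
                long worst = 0;
                for (int i = 0; i < rounds; i++) {
                    worst = Math.max(worst, timedForceMillis(file, new byte[4096]));
                }
                return worst;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("slowest force(): " + probeTempFile(10) + " ms");
    }
}
```

The temporary directory may be on a different device from the ZooKeeper data directory, so for a meaningful reading call timedForceMillis with a file on the same disk as the transaction log. If the probe also shows multi-second stalls, the disk or other I/O load on the host is the more likely culprit, rather than ZooKeeper.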