A case study of a Kafka broker ERROR log event: kafka.common.NotAssignedReplicaException

The key excerpt from this error log is shown below; most of the similar, repetitive error content in between has been omitted.

[2017-12-27 18:26:09,267] ERROR [KafkaApi-2] Error when handling request Name: FetchRequest; Version: 2; CorrelationId: 44771537; ClientId: ReplicaFetcherThread-2-2; ReplicaId: 4; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [test-topic02-rrt,12] -> PartitionFetchInfo(8085219,1048576),[test-topic01-ursr,22] -> PartitionFetchInfo(0,1048576),[test-topic02-rd,13] -> PartitionFetchInfo(787543,1048576),[test-topic02-st,12] -> PartitionFetchInfo(14804029,1048576),[test-topic04-ursr,7] -> PartitionFetchInfo(8,1048576),[test-topic04-rd,15] -> PartitionFetchInfo(2380092,1048576),[test-topic04-rrt,18] -> PartitionFetchInfo(27246143,1048576),[test-topic03-rrt,12] -> PartitionFetchInfo(12853720,1048576),[test-topic04-st,18] -> PartitionFetchInfo(25335299,1048576),[test-topic03-srt,11] -> PartitionFetchInfo(3750134,1048576),[test-topic05-ursd,17] -> PartitionFetchInfo(0,1048576),[test-topic05-srt,22] -> PartitionFetchInfo(33136074,1048576),[test-topic01-sd,1] -> PartitionFetchInfo(14361,1048576),[test-topic03-rd,21] -> PartitionFetchInfo(96366,1048576),[test-topic04-ursd,10] -> PartitionFetchInfo(0,1048576),[my-working-topic,15] -> PartitionFetchInfo(0,1048576),[test-topic02-ts_st,12] -> PartitionFetchInfo(0,1048576),[test-topic03-ursr,9] -> PartitionFetchInfo(1,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas  for partition [my-working-topic,15].

From it we can pick out at least two key pieces of information:

  • Error when handling request Name: FetchRequest; 
  • kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas  for partition [my-working-topic,15]
1. Analysis of the first error message: Error when handling request Name: FetchRequest
This message tells us that Kafka hit the error while handling partition data replication (a replica FetchRequest).
A couple of log lines above this one, the following entry shows that broker 2 removed the data-sync (fetcher) threads for four partitions of my-working-topic: 21, 15, 3 and 9.
[2017-12-27 18:26:09,219] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [my-working-topic,21],[my-working-topic,15],[my-working-topic,3],[my-working-topic,9] (kafka.server.ReplicaFetcherManager)
A bit of background here: ReplicaFetcherManager.
This class manages all replication from partition leaders on the current broker. Its main methods, addFetcherForPartitions and removeFetcherForPartitions, start or remove the threads that sync a given partition's data from its leader replica.
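
To make the manager's role concrete, here is a minimal Scala sketch of the idea. This is not Kafka's actual implementation (the real class also spawns and pools actual fetcher threads per source broker); it only illustrates what adding and removing partitions means:

// Minimal sketch of the idea behind ReplicaFetcherManager; NOT Kafka's real code.
case class TopicAndPartition(topic: String, partition: Int)
case class BrokerAndInitialOffset(brokerId: Int, initOffset: Long)

class SimplifiedFetcherManager {
  private val fetchState =
    scala.collection.mutable.Map.empty[TopicAndPartition, BrokerAndInitialOffset]

  // Analogous to addFetcherForPartitions: begin syncing these partitions from their leaders.
  def addFetcherForPartitions(partitions: Map[TopicAndPartition, BrokerAndInitialOffset]): Unit =
    fetchState ++= partitions

  // Analogous to removeFetcherForPartitions: stop syncing these partitions
  // (this is what the "Removed fetcher for partitions ..." INFO line reports).
  def removeFetcherForPartitions(partitions: Set[TopicAndPartition]): Unit =
    fetchState --= partitions
}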

Reading on, the log contains this entry:
[my-working-topic,15] -> PartitionFetchInfo(0,1048576)
PartitionFetchInfo is defined as class PartitionFetchInfo(offset: Long, fetchSize: Int). It describes a partition's replica-fetch request: the two parameters are the offset at which the replica sync should start and the number of bytes to fetch.
In other words, this FetchRequest includes a demand for a chunk of data from partition 15 of my-working-topic, starting at offset 0 with a fetch size of 1048576 (1024*1024) bytes: (0,1048576).
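
As a standalone illustration (not wired to a real broker), that entry can be written out in Scala from the definition quoted above:

// Standalone sketch using the PartitionFetchInfo definition quoted above.
case class PartitionFetchInfo(offset: Long, fetchSize: Int)

// The [my-working-topic,15] -> PartitionFetchInfo(0,1048576) entry from the log,
// keyed here by a simple (topic, partition) tuple:
val requestInfo = Map(
  ("my-working-topic", 15) -> PartitionFetchInfo(offset = 0L, fetchSize = 1024 * 1024)
)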

But as we just analyzed, the fetcher threads for those four partitions of my-working-topic (21, 15, 3 and 9) had only just been removed by the manager, yet a fetch request arriving 48 ms later still asks to sync exactly those partitions of this topic. (The log below keeps only the partition 15 portion; the identical errors for the other three partitions are omitted.)
Naturally, such an inconsistent operation has to fail, and the resulting error message is the second log excerpt, which we analyze next.

2. kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas  for partition [my-working-topic,15]
This message means the following: broker 2 holds the leader replica for partition 15 of my-working-topic. While fetching and recording the log position of one of its follower replicas, the one on broker 4, it failed, because the broker 4 replica is not currently recognized as a valid assigned replica of partition 15 of my-working-topic.
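
The logic behind the exception can be sketched like this. It is a simplified paraphrase of the check in kafka.cluster.Partition.updateReplicaLogReadResult (the top frame of the stack trace below), not the verbatim Kafka source:

// Simplified paraphrase of the check behind NotAssignedReplicaException;
// not the verbatim Kafka source.
class NotAssignedReplicaException(msg: String) extends RuntimeException(msg)

case class Replica(brokerId: Int, var logEndOffset: Long)

class PartitionState(leaderId: Int, assignedReplicas: Map[Int, Replica]) {
  def updateReplicaLogReadResult(followerId: Int, fetchOffset: Long): Unit =
    assignedReplicas.get(followerId) match {
      case Some(replica) =>
        replica.logEndOffset = fetchOffset // record how far the follower has read
      case None =>
        // The follower is not (or no longer) in this partition's assigned
        // replica list -> exactly the ERROR we see in the log.
        throw new NotAssignedReplicaException(
          s"Leader $leaderId failed to record follower $followerId's position " +
          s"$fetchOffset since the replica is not recognized to be one of the " +
          "assigned replicas for this partition")
    }
}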

The log below keeps only part of a single second's worth of entries; in reality these errors appeared in a brief, concentrated burst.
Why does this error occur?
What impact do these errors have on our production traffic?
The story starts with the topic my-working-topic: it is precisely what triggered these errors. This release added some new functionality and, with it, created a new topic named my-working-topic.
Kafka is a distributed cluster system, and every broker node runs several kinds of threads responsible for keeping the data of each node, partition and replica consistent and complete. Our guess is that creating a new topic with a relatively large number of partitions and replicas, such as 24 partitions with 3 replicas, can briefly produce this kind of data-consistency conflict. A broker that hits the error pauses processing of the affected data, waits for the Kafka controller broker to push down the correct partition/replica assignment, then handles its local log files according to that information, starts a data-sync thread for each of the topic's partitions, and so on.
So as long as the error is not continuously flooding the log, it is merely a transient, in-flight state; just keep an eye on it, and it has no real impact.
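
For context, the trigger can be reproduced simply by creating such a topic. The snippet below is a hypothetical reconstruction using the 0.10.x-era Scala admin API (kafka.admin.AdminUtils); the ZooKeeper address is the one from the describe command at the end of this post, and kafka-topics.sh --create achieves the same thing:

// Hypothetical reconstruction of the trigger: creating a topic with many
// partitions and replicas. Assumes a 0.10.x-era Kafka client on the classpath.
import kafka.admin.AdminUtils
import kafka.utils.ZkUtils

object CreateWorkingTopic extends App {
  val zkUtils = ZkUtils("172.17.140.91:2181", 30000, 30000, isZkSecurityEnabled = false)
  try {
    // 24 partitions, replication factor 3 -- the combination that produced
    // the brief NotAssignedReplicaException burst analyzed above.
    AdminUtils.createTopic(zkUtils, "my-working-topic", 24, 3)
  } finally {
    zkUtils.close()
  }
}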

[2017-12-27 18:26:09,219] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [my-working-topic,21],[my-working-topic,15],[my-working-topic,3],[my-working-topic,9] (kafka.server.ReplicaFetcherManager)
[2017-12-27 18:26:09,248] INFO Completed load of log my-working-topic01 with log end offset 0 (kafka.log.Log)
[2017-12-27 18:26:09,267] ERROR [KafkaApi-2] Error when handling request Name: FetchRequest; Version: 2; CorrelationId: 44771537; ClientId: ReplicaFetcherThread-2-2; ReplicaId: 4; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [test-topic02-rrt,12] -> PartitionFetchInfo(8085219,1048576),[test-topic01-ursr,22] -> PartitionFetchInfo(0,1048576),[test-topic02-rd,13] -> PartitionFetchInfo(787543,1048576),[test-topic02-st,12] -> PartitionFetchInfo(14804029,1048576),[test-topic04-ursr,7] -> PartitionFetchInfo(8,1048576),[test-topic04-rd,15] -> PartitionFetchInfo(2380092,1048576),[test-topic04-rrt,18] -> PartitionFetchInfo(27246143,1048576),[test-topic03-rrt,12] -> PartitionFetchInfo(12853720,1048576),[test-topic04-st,18] -> PartitionFetchInfo(25335299,1048576),[test-topic03-srt,11] -> PartitionFetchInfo(3750134,1048576),[test-topic05-ursd,17] -> PartitionFetchInfo(0,1048576),[test-topic05-srt,22] -> PartitionFetchInfo(33136074,1048576),[test-topic01-sd,1] -> PartitionFetchInfo(14361,1048576),[test-topic03-rd,21] -> PartitionFetchInfo(96366,1048576),[test-topic04-ursd,10] -> PartitionFetchInfo(0,1048576),[my-working-topic,15] -> PartitionFetchInfo(0,1048576),[test-topic02-ts_st,12] -> PartitionFetchInfo(0,1048576),[test-topic03-ursr,9] -> PartitionFetchInfo(1,1048576) (kafka.server.KafkaApis)
kafka.common.NotAssignedReplicaException: Leader 2 failed to record follower 4's position -1 since the replica is not recognized to be one of the assigned replicas  for partition [my-working-topic,15].
    at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:251)
    at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:864)
    at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:861)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
    at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:861)
    at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:470)
    at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:496)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:77)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
    at java.lang.Thread.run(Thread.java:745)

.............................................................

[2017-12-27 18:26:09,387] INFO Partition [my-working-topic,13] on broker 2: No checkpointed highwatermark is found for partition [my-working-topic,13] (kafka.cluster.Partition)
[2017-12-27 18:26:09,388] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [my-working-topic,5],[my-working-topic,10],[my-working-topic,8],[my-working-topic,13],[my-working-topic,18],[my-working-topic,4],[my-working-topic,19],[my-working-topic,14] (kafka.server.ReplicaFetcherManager)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-8 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-5 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-13 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-14 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-18 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,388] INFO Truncating log my-working-topic-4 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,389] INFO Truncating log my-working-topic-19 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,389] INFO Truncating log my-working-topic-10 to offset 0. (kafka.log.Log)
[2017-12-27 18:26:09,393] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([[my-working-topic,8], initOffset 0 to broker BrokerEndPoint(1,172.17.140.91,9092)] , [[my-working-topic,5], initOffset 0 to broker BrokerEndPoint(4,172.17.140.95,9092)] , [[my-working-topic,13], initOffset 0 to broker BrokerEndPoint(6,172.17.140.42,9092)] , [[my-working-topic,14], initOffset 0 to broker BrokerEndPoint(1,172.17.140.91,9092)] , [[my-working-topic,18], initOffset 0 to broker BrokerEndPoint(5,172.17.140.41,9092)] , [[my-working-topic,4], initOffset 0 to broker BrokerEndPoint(3,172.17.140.93,9092)] , [[my-working-topic,19], initOffset 0 to broker BrokerEndPoint(6,172.17.140.42,9092)] , [[my-working-topic,10], initOffset 0 to broker BrokerEndPoint(3,172.17.140.93,9092)] ) (kafka.server.ReplicaFetcherManager)

3. On INFO Partition [my-working-topic,13] on broker 2: No checkpointed highwatermark is found for partition [my-working-topic,13] (kafka.cluster.Partition)

This log line mentions the high watermark, which records the log offset position up to which the current leader replica's data has been replicated to the other follower replicas and flushed to the log files.
Then why does it say No checkpointed highwatermark is found? Because my-working-topic has only just been created: its log files contain no data at all yet, so there is no log offset, and hence no high watermark, to checkpoint. That is why this entry is only at INFO level; it describes a perfectly normal event.
In the subsequent log lines we can see that my-working-topic-13 first has its log file truncated to empty (Truncating log ... to offset 0) and then gets its data-sync thread started (Added fetcher for partitions ...).
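
Incidentally, the checkpointed high watermarks live in a small text file, replication-offset-checkpoint, in each log directory. Below is a minimal sketch that reads it, assuming the usual format (a version line, an entry-count line, then one "topic partition offset" triple per line); the log directory path is just an example:

// Minimal sketch: read the replication-offset-checkpoint file from a log dir.
// Assumed format: line 1 = version, line 2 = entry count, then
// "topic partition offset" triples, one per line.
import scala.io.Source

object ReadHwCheckpoint extends App {
  def readHighWatermarks(path: String): Map[(String, Int), Long] = {
    val src = Source.fromFile(path)
    try {
      src.getLines().drop(2).map { line =>
        val Array(topic, partition, offset) = line.trim.split("\\s+")
        (topic, partition.toInt) -> offset.toLong
      }.toMap
    } finally src.close()
  }

  // Hypothetical log dir path; substitute your broker's log.dirs value.
  readHighWatermarks("/data/kafka-logs/replication-offset-checkpoint")
    .foreach { case ((topic, p), hw) => println(s"$topic-$p hw=$hw") }
}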

Finally, a check of the topic's status shows everything is normal:
$ ./kafka-topics.sh --describe --zookeeper 172.17.140.91:2181 --topic my-working-topic
Topic:my-working-topic    PartitionCount:24    ReplicationFactor:3    Configs:
    Topic: my-working-topic    Partition: 0    Leader: 5    Replicas: 5,3,4    Isr: 5,3,4
    Topic: my-working-topic    Partition: 1    Leader: 6    Replicas: 6,4,5    Isr: 6,4,5
    Topic: my-working-topic    Partition: 2    Leader: 1    Replicas: 1,5,6    Isr: 1,5,6
    Topic: my-working-topic    Partition: 3    Leader: 2    Replicas: 2,6,1    Isr: 2,6,1
    Topic: my-working-topic    Partition: 4    Leader: 3    Replicas: 3,1,2    Isr: 3,1,2
    Topic: my-working-topic    Partition: 5    Leader: 4    Replicas: 4,2,3    Isr: 4,2,3
    Topic: my-working-topic    Partition: 6    Leader: 5    Replicas: 5,4,6    Isr: 5,4,6
    Topic: my-working-topic    Partition: 7    Leader: 6    Replicas: 6,5,1    Isr: 6,5,1
    Topic: my-working-topic    Partition: 8    Leader: 1    Replicas: 1,6,2    Isr: 1,6,2
    Topic: my-working-topic    Partition: 9    Leader: 2    Replicas: 2,1,3    Isr: 2,1,3
    Topic: my-working-topic    Partition: 10    Leader: 3    Replicas: 3,2,4    Isr: 3,2,4
    Topic: my-working-topic    Partition: 11    Leader: 4    Replicas: 4,3,5    Isr: 4,3,5
    Topic: my-working-topic    Partition: 12    Leader: 5    Replicas: 5,6,1    Isr: 5,6,1
    Topic: my-working-topic    Partition: 13    Leader: 6    Replicas: 6,1,2    Isr: 6,1,2
    Topic: my-working-topic    Partition: 14    Leader: 1    Replicas: 1,2,3    Isr: 1,2,3
    Topic: my-working-topic    Partition: 15    Leader: 2    Replicas: 2,3,4    Isr: 2,3,4
    Topic: my-working-topic    Partition: 16    Leader: 3    Replicas: 3,4,5    Isr: 3,4,5
    Topic: my-working-topic    Partition: 17    Leader: 4    Replicas: 4,5,6    Isr: 4,5,6
    Topic: my-working-topic    Partition: 18    Leader: 5    Replicas: 5,1,2    Isr: 5,1,2
    Topic: my-working-topic    Partition: 19    Leader: 6    Replicas: 6,2,3    Isr: 6,2,3
    Topic: my-working-topic    Partition: 20    Leader: 1    Replicas: 1,3,4    Isr: 1,3,4
    Topic: my-working-topic    Partition: 21    Leader: 2    Replicas: 2,4,5    Isr: 2,4,5
    Topic: my-working-topic    Partition: 22    Leader: 3    Replicas: 3,5,6    Isr: 3,5,6
    Topic: my-working-topic    Partition: 23    Leader: 4    Replicas: 4,6,1    Isr: 4,6,1

