Kafka Management and Monitoring: Consumers Fail After a Broker Crash

 

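Background

A full disk brought down every server in the Kafka cluster. After I restarted the cluster, the services came up successfully, but the logs showed some errors: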

broker1:

[Screenshot 1: broker1 error log]

broker2:

[Screenshot 2: broker2 error log]

broker3 (kept printing the following error over and over):

[Screenshot 3: broker3 error log]

 

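Despite these errors, Kafka started normally, and a command-line test confirmed that the cluster could produce and consume messages. The kafka-manager UI, however, showed partitions with unassigned replicas:

[Screenshot 4: kafka-manager view showing unassigned replicas]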

 

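I checked the applications that consume these topics and, sure enough, they were failing to consume and kept printing the following exception:

[Screenshot 5: consumer application exception]

  Note: the IP address in the screenshot belongs to the broker3 node.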

 

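At this point it was clear that something was wrong with the broker3 node, which is why the consumer applications could not connect. Oddly, though, when I created a test topic from the command line, consuming through broker3 worked.

Digging further into the broker3 log, the reported cause was that the cluster requires 2 replicas but only 1 was found.

So I looked at the details of the affected topics and found that replicas were indeed missing from the ISR list:

[Screenshot 6: topic details with replicas missing from the ISR list]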
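The ISR check described above can be sketched in a few lines of Python. This is only an illustration: it parses text in the layout printed by `kafka-topics.sh --describe` and flags partitions whose ISR is smaller than a required minimum. The sample text, the function name `partitions_below_min_isr` and the threshold of 2 are assumptions made for the example. The post does not say which setting enforces the "2 replicas" requirement, and the values below are not taken from the incident.

```python
import re

# Hypothetical sample of `kafka-topics.sh --describe` output. The broker ids
# and ISR values are invented for illustration, not copied from the incident.
SAMPLE_DESCRIBE = """\
Topic:__consumer_offsets\tPartitionCount:3\tReplicationFactor:2\tConfigs:
\tTopic: __consumer_offsets\tPartition: 0\tLeader: 2\tReplicas: 2,3\tIsr: 2
\tTopic: __consumer_offsets\tPartition: 1\tLeader: 3\tReplicas: 3,1\tIsr: 3,1
\tTopic: __consumer_offsets\tPartition: 2\tLeader: 1\tReplicas: 1,2\tIsr: 1
"""

# Matches only the per-partition lines; the summary line has "PartitionCount:"
# instead of "Partition:" and is therefore skipped.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+|none)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def partitions_below_min_isr(describe_output, min_isr=2):
    """Return (topic, partition, replicas, isr) for partitions whose ISR has
    fewer than min_isr members. min_isr=2 is an assumed threshold."""
    flagged = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue
        replicas = [int(b) for b in match["replicas"].split(",") if b]
        isr = [int(b) for b in match["isr"].split(",") if b]
        if len(isr) < min_isr:
            flagged.append((match["topic"], int(match["partition"]), replicas, isr))
    return flagged


if __name__ == "__main__":
    for topic, partition, replicas, isr in partitions_below_min_isr(SAMPLE_DESCRIBE):
        print(f"{topic}-{partition}: replicas={replicas} isr={isr}")
```

In practice the input would be the captured output of the describe command for the affected topics rather than the hard-coded sample.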

 

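My guess was that, after the crash, some replicas had fallen too far behind the leader and had not caught up yet, so they dropped out of the ISR list. I decided to wait for them to catch up on their own.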

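The next day nothing had changed: the replicas still had not caught up. I restarted the Kafka cluster and found that the ISR of some partitions grew back to 2 automatically, but the problem partitions stayed as they were.

[Screenshot 7: topic details after the restart]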

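Next I tried reassigning the partitions with explicitly specified replicas, hoping this would make the missing replicas recover. I ran the following command: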

bin/kafka-reassign-partitions.sh --zookeeper 10.0.xx.x:2181,10.0.xx.x:2181,10.0.xx.x:2181 --reassignment-json-file reassign.json --execute
Contents of reassign.json:
{"version":1, "partitions":[
 {"topic":"__consumer_offsets","partition":0,"replicas":[2,3]}, 
  {"topic":"__consumer_offsets","partition":1,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":2,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":3,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":4,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":5,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":6,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":7,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":8,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":9,"replicas":[2,1]},
  {"topic":"__consumer_offsets","partition":10,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":11,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":12,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":13,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":14,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":15,"replicas":[2,1]},
  {"topic":"__consumer_offsets","partition":16,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":17,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":18,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":19,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":20,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":21,"replicas":[2,1]},
  {"topic":"__consumer_offsets","partition":22,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":23,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":24,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":25,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":26,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":27,"replicas":[2,1]},
  {"topic":"__consumer_offsets","partition":28,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":29,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":30,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":31,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":32,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":33,"replicas":[2,1]},
  {"topic":"__consumer_offsets","partition":34,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":35,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":36,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":37,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":38,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":39,"replicas":[2,1]},  
  {"topic":"__consumer_offsets","partition":40,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":41,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":42,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":43,"replicas":[3,1]},
  {"topic":"__consumer_offsets","partition":44,"replicas":[1,2]},
  {"topic":"__consumer_offsets","partition":45,"replicas":[2,1]},
  {"topic":"__consumer_offsets","partition":46,"replicas":[3,2]},
  {"topic":"__consumer_offsets","partition":47,"replicas":[1,3]},
  {"topic":"__consumer_offsets","partition":48,"replicas":[2,3]},
  {"topic":"__consumer_offsets","partition":49,"replicas":[3,1]}  
]}
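Writing 50 entries by hand is error-prone, so a file in this format can also be generated. The following is a minimal sketch, assuming a simple round-robin placement over broker ids 1 to 3. The function name `build_reassignment` is hypothetical, and the layout it produces is not identical to the hand-written file above.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a reassignment plan in the JSON layout used by
    kafka-reassign-partitions.sh, placing replicas round-robin."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds the number of brokers")
    partitions = []
    for partition in range(num_partitions):
        # Start at a different broker for each partition so that the first
        # replica (the preferred leader) is spread across brokers.
        replicas = [
            broker_ids[(partition + offset) % len(broker_ids)]
            for offset in range(replication_factor)
        ]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    plan = build_reassignment("__consumer_offsets", 50, [1, 2, 3], 2)
    with open("reassign.json", "w") as handle:
        json.dump(plan, handle, indent=1)
```

Because consecutive broker ids are used for each partition, no partition ends up with the same broker listed twice as long as the broker ids are distinct.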

[Screenshot 8: output of the reassignment command]

 

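 Reassigning the partitions with explicit replicas did not work either, so I changed the Kafka configuration and lowered the number of replicas the cluster requires to 1:

vi server.properties

[Screenshot 9: the edited setting in server.properties]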

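 After restarting the Kafka cluster, broker3 no longer reported errors, and once I restarted the consumer applications they could connect to Kafka and consume normally again.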
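The screenshot of the edited file is not available, so the exact property that was changed is unknown. As a hedged aid for reviewing such a change, the sketch below reads a server.properties-style text and reports a few replication-related broker settings. The choice of keys, in particular `min.insync.replicas` as a likely candidate, and the sample contents are assumptions.

```python
# Broker settings related to replica requirements. Which of these the post
# actually changed is not known; the selection here is an assumption.
REPLICATION_KEYS = (
    "min.insync.replicas",
    "default.replication.factor",
    "offsets.topic.replication.factor",
)

# Hypothetical excerpt of a server.properties file, for illustration only.
SAMPLE_PROPERTIES = """\
# hypothetical excerpt
broker.id=3
log.dirs=/data/kafka-logs
min.insync.replicas=1
offsets.topic.replication.factor=2
"""


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and comments.
    This does not cover the full Java properties syntax."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def replication_settings(text):
    """Return the replication-related settings; None means the key is not
    set in the file, so the broker default applies."""
    props = read_properties(text)
    return {key: props.get(key) for key in REPLICATION_KEYS}


if __name__ == "__main__":
    for key, value in replication_settings(SAMPLE_PROPERTIES).items():
        print(f"{key} = {value if value is not None else '(broker default)'}")
```

Running the same check on every broker before and after the change makes it easier to confirm that all nodes carry the same value and to revert it to 2 later.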

 

 

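Summary:

After the Kafka outage, replicas dropped out of the ISR list because they had fallen too far behind the leader. Normally they would gradually catch up and then rejoin the ISR list automatically, but in my case they still had not done so after 20 hours, and restarting the Kafka cluster did not restore them either. This is what caused the problems when the services started.

The temporary workaround is to lower the required replica count to 1, let the cluster run for a while to see whether the replicas recover, and then set it back to 2.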

 

 

Open issues:

1. The root cause has not been found yet;

2. With this change Kafka can lose data (the data from the period of the incident was lost).
