Investigating a Kafka Startup Failure

The Incident

Around 9 a.m. on October 13, 2018, a Saturday, we suddenly received an alert that the Kafka process had died. I didn't pay much attention at first and asked a colleague to restart it, but it still would not come up. A closer look at the logs showed that a file named hs_err_pid25501.log was being generated, beginning with:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2640), pid=25501, tid=0x00007f035aef6700

Our first impression was that the machine was short on memory, so following the hints above we adjusted various JVM startup parameters and reduced the memory allocated to Kafka. After that the broker started normally, so we let our guard down.
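For reference, Kafka's startup script reads its JVM heap settings from the KAFKA_HEAP_OPTS environment variable, so lowering the heap looks roughly like the following; the 4 GB figure and the relative paths are only illustrative, not the values we actually used:

root@ubuntu # export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"    # illustrative heap size, set before the start script
root@ubuntu # bin/kafka-server-start.sh -daemon config/server.properties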

It Happens Again

Around 4 p.m. we received another alert that the Kafka process had died. This time nothing we tried would keep it up; at best it ran for about 35 minutes before crashing again, so we all went to the office to analyze the cause together. On closer inspection, we found that Kafka printed the following error log as it exited:

2018-10-13 19:12:08,103|ERROR|ReplicaFetcherThread-0-5|kafka.utils.Logging$class.error:76|[ReplicaFetcher replicaId=0, leaderId=5, fetcherId=0] Error while processing data for partition mts_mpid-5|
org.apache.kafka.common.errors.KafkaStorageException: Error while rolling log segment for mts_mpid-5 in dir /data8/kafka/logs
Caused by: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
        at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:67)
        at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:53)
        at kafka.log.LogSegment$.open(LogSegment.scala:560)
        at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1340)
        at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
        at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
        at kafka.log.Log.roll(Log.scala:1316)
        at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
        at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
        at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
        at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
        at kafka.log.Log.append(Log.scala:640)
        at kafka.log.Log.appendAsFollower(Log.scala:623)
        at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
        at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
        at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
        at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
        at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
        ... 34 more
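In hindsight, the innermost cause is the telling part: java.lang.OutOfMemoryError: Map failed is thrown by FileChannelImpl.map when the underlying mmap call fails, here while Kafka was rolling a new log segment and memory-mapping its offset index, so this is not ordinary Java heap exhaustion. Because those mappings are file-backed, they show up in /proc/<pid>/maps; a rough way to see how many of a broker's mappings belong to Kafka index files (commands are a sketch, run against whatever PID the broker currently has) would be:

root@ubuntu # pid=$(pgrep -f kafka.Kafka)                               # the running broker's PID
root@ubuntu # grep -c -e '\.index' -e '\.timeindex' /proc/$pid/maps     # mappings backed by segment index files
root@ubuntu # wc -l < /proc/$pid/maps                                   # total mappings held by the process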

We asked the system administrator to check whether the host had any hardware faults or kernel problems. After some investigation he found nothing: the host was perfectly healthy and had never reported a memory allocation error. Memory was plentiful, swap had remained completely unused, and no amount of kernel parameter tuning stopped the error from recurring.
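The checks amounted to roughly the following (the exact commands are reconstructed here, not copied from the session): physical memory and swap were nowhere near exhausted, and the kernel log showed no OOM-killer activity:

root@ubuntu # free -h                                                   # physical memory and swap usage
root@ubuntu # dmesg | grep -i -e "out of memory" -e oom-killer          # OOM-killer activity would show up here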

Solving the Problem

Searching online for solutions to this exception, most suggestions were to switch to a 64-bit JVM or to tune the memory-related runtime parameters. We were already running a 64-bit JVM and had already tuned the memory parameters, so none of this applied to us.
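(For anyone repeating this check: java -version prints its banner to stderr, and a 64-bit build includes "64-Bit Server VM" in the last line, so the bitness can be confirmed with something like the following.)

root@ubuntu # java -version 2>&1 | grep -i '64-bit'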

We asked the system administrator whether there might be a limit on memory-mapped files. He said the default settings were more than generous, but we checked anyway and found the current value was 65530:

root@ubuntu # cat /proc/sys/vm/max_map_count
65530

Our Kafka cluster stores a huge amount of data: 12 disks * 4 TB each per broker, so the number of open files was bound to be substantial. After starting the Kafka process this time, we checked how many files the process had mapped:

root@ubuntu # cat /proc/61764/maps  | wc -l
53620

It shot past 50,000 right away and kept climbing, so this was almost certainly the problem. We raised the limit:

root@ubuntu # sysctl -w vm.max_map_count=262144
root@ubuntu # sysctl -a|grep vm.max_map_count
vm.max_map_count = 262144
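Note that sysctl -w only changes the running kernel; for the setting to survive a reboot it also has to be persisted, roughly like this:

root@ubuntu # echo "vm.max_map_count = 262144" >> /etc/sysctl.conf
root@ubuntu # sysctl -p                                 # reload /etc/sysctl.conf so the value takes effect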

This time the broker finally started, although startup took extremely long, well over an hour, and the process's mapped-file count did indeed end up exceeding 65530.
With that, the immediate problem was basically solved. So what caused there to be so many mapped files? This article leaves that question open for now, because the next day we ran into another problem, and the answer will come out of that investigation. In the meantime, a rough way to estimate how many mappings a broker needs is sketched below.
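Since Kafka memory-maps each log segment's offset index and time index, counting those files under the log directories gives a rough lower bound on the mappings a broker will need. The path pattern below is illustrative, based on the /data8/kafka/logs directory seen in the error log:

root@ubuntu # find /data*/kafka/logs -name '*.index' -o -name '*.timeindex' | wc -l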
