Notes: a Kafka broker that could not restart properly because its partition data was too large

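One Kafka server was under excessive load, and the machine was down for a while. After killing the processes that were hogging memory, I restarted the Kafka service, but it never finished starting up and syncing data. The log looked like this: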
fset 0 to broker BrokerEndPoint(11,192.168.207.79,9092)] ) (kafka.server.ReplicaFetcherManager)
[2016-04-26 19:16:33,274] INFO [ReplicaFetcherManager on broker 13] Removed fetcher for partitions [ifindnotice_lp_queue,3],[newifindreport_lp_queue,1],[eventoutputqueue,0],[newifindreport_lp_queue,0],[NewEventOutputQueueLuyang,0],[weibo_dlstat_queue,0],[test_dlstat_queue,0],[investqa_lp_queue,2],[forum_yuqing_queue,3],[soniu_dlstat_queue,0] (kafka.server.ReplicaFetcherManager)
[2016-04-26 19:16:53,909] WARN [ReplicaFetcherThread-0-14], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@9c1b776. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:16:53,910] WARN [ReplicaFetcherThread-0-12], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@50dafd61. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:16:53,912] WARN [ReplicaFetcherThread-0-11], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@25493558. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:25,917] WARN [ReplicaFetcherThread-0-14], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5c570267. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:25,918] WARN [ReplicaFetcherThread-0-11], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@545ee78d. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:25,920] WARN [ReplicaFetcherThread-0-12], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@22b40541. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:48,844] INFO [Group Metadata Manager on Broker 13]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2016-04-26 19:17:57,924] WARN [ReplicaFetcherThread-0-14], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@14347cce. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:57,925] WARN [ReplicaFetcherThread-0-11], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1edc2bf. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:57,925] WARN [ReplicaFetcherThread-0-12], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@208c8a96. Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:18:17,984] INFO [ReplicaFetcherManager on broker 13] Removed fetcher for partitions [__consumer_offsets,30] (kafka.server.ReplicaFetcherManager)
[2016-04-26 19:18:17,985] INFO [Group Metadata Manager on Broker 13]: Loading offsets and group metadata from [__consumer_offsets,30] (kafka.coordinator.GroupMetadataManager)
[2016-04-26 19:18:17,999] INFO [Group Metadata Manager on Broker 13]: Finished loading offsets from [__consumer_offsets,30] in 14 milliseconds. (kafka.coordinator.GroupMetadataManager)

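The log above shows that after the restart, ReplicaFetcherThread kept logging the error "Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms", which means the replica fetch requests were timing out during data synchronization.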
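To see which peers the timeouts come from, the WARN lines can be tallied per fetcher thread. The following is a minimal sketch, not part of the original troubleshooting; it assumes the last number in the thread name (ReplicaFetcherThread-0-14) is the id of the broker being fetched from, and the function name count_fetch_timeouts is made up for illustration.

```python
import re
from collections import Counter

# Matches WARN lines such as:
# "[ReplicaFetcherThread-0-14], Error in fetch ... SocketTimeoutException: ..."
# Assumption: the second number in the thread name is the source broker id.
FETCH_TIMEOUT = re.compile(
    r"\[ReplicaFetcherThread-\d+-(\d+)\], Error in fetch .*SocketTimeoutException"
)


def count_fetch_timeouts(lines):
    """Return a Counter mapping source broker id -> number of fetch timeouts."""
    counts = Counter()
    for line in lines:
        match = FETCH_TIMEOUT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Two lines in the shape of the log above (shortened for the example).
sample = [
    "[2016-04-26 19:16:53,909] WARN [ReplicaFetcherThread-0-14], Error in fetch "
    "kafka.server.ReplicaFetcherThread$FetchRequest@9c1b776. Possible cause: "
    "java.net.SocketTimeoutException: No response received within 30000 ms",
    "[2016-04-26 19:17:48,844] INFO [Group Metadata Manager on Broker 13]: "
    "Removed 0 expired offsets in 0 milliseconds.",
]

for broker_id, hits in sorted(count_fetch_timeouts(sample).items()):
    print(f"broker {broker_id}: {hits} fetch timeout(s)")
```

If every peer shows up with roughly the same count, as in the log above (brokers 11, 12 and 14), the problem is more likely on the restarted machine than on any single peer.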

 

[2016-04-26 19:38:28,999] INFO Client session timed out, have not heard from server in 4846ms for sessionid 0x253850bfcad1cbe, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,100] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2016-04-26 19:38:29,399] INFO Opening socket connection to server 192.168.201.41/192.168.201.41:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,400] INFO Socket connection established to 192.168.201.41/192.168.201.41:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,402] INFO Session establishment complete on server 192.168.201.41/192.168.201.41:2181, sessionid = 0x253850bfcad1cbe, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,402] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
The log above shows that the connection to ZooKeeper was timing out as well.


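Checking the network card with dstat showed that it was saturated. My initial diagnosis was therefore that the large volume of replication traffic was saturating the NIC, which in turn caused some connections to time out and fail.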

[Figure 1: screenshot not preserved]


Looking at the Kafka data directory, the data directories were all over 1 GB.

[Figure 2: screenshot not preserved]
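The same check can be scripted. This is a rough sketch that assumes the usual layout of one subdirectory per partition under the Kafka data directory; the helper names partition_sizes and larger_than and the example path are hypothetical.

```python
import os


def partition_sizes(log_dir):
    """Return {subdirectory name: total bytes} for each directory under log_dir."""
    sizes = {}
    for entry in os.scandir(log_dir):
        if not entry.is_dir():
            continue
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes[entry.name] = total
    return sizes


def larger_than(sizes, threshold_bytes):
    """Return (name, bytes) pairs above the threshold, largest first."""
    big = [(name, size) for name, size in sizes.items() if size > threshold_bytes]
    return sorted(big, key=lambda pair: pair[1], reverse=True)


# Example call (the path is a placeholder, not taken from the original post):
# for name, size in larger_than(partition_sizes("/path/to/kafka-logs"), 1024 ** 3):
#     print(f"{name}: {size / 1024 ** 3:.1f} GiB")
```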


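For the broker to start up completely, data synchronization between the replicas has to finish. Because the data volume was too large, synchronization kept failing, so I looked for a way to reduce the amount of data that had to be synced.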
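A back-of-the-envelope calculation shows why shrinking the data helps: the time to catch up is bounded below by the bytes to copy divided by the usable network bandwidth. The numbers in this sketch are hypothetical (the original post does not give the total data size or the NIC speed), and min_sync_seconds is a made-up helper.

```python
def min_sync_seconds(bytes_to_copy, link_bits_per_second, usable_fraction=1.0):
    """Lower bound on the copy time, ignoring protocol overhead and disk speed."""
    bytes_per_second = link_bits_per_second * usable_fraction / 8
    return bytes_to_copy / bytes_per_second


GIB = 1024 ** 3

# Hypothetical figures: 500 GiB vs 50 GiB to replicate over a 1 Gbit/s NIC.
for label, size in (("before cleanup", 500 * GIB), ("after cleanup", 50 * GIB)):
    minutes = min_sync_seconds(size, 1_000_000_000) / 60
    print(f"{label}: at least {minutes:.0f} minutes")
```

While that copy is running the NIC stays saturated, which matches the fetch and ZooKeeper timeouts seen in the logs.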


So I changed the topic's retention.ms to 1 hour, and also changed log.retention.bytes and log.segment.bytes in the server configuration to 1 GB and 512 MB respectively.
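These settings are expressed in milliseconds and bytes, so the values have to be converted. The sketch below does the arithmetic; it assumes "1g" and "512m" mean binary units (GiB and MiB), which the original does not state, and as_properties is a made-up helper.

```python
HOUR_MS = 60 * 60 * 1000
MIB = 1024 ** 2
GIB = 1024 ** 3

# Topic-level override: keep messages for one hour.
topic_config = {"retention.ms": 1 * HOUR_MS}

# Broker-level settings (server.properties).
# log.retention.bytes is a per-partition limit, not a per-broker total.
broker_config = {
    "log.retention.bytes": 1 * GIB,
    "log.segment.bytes": 512 * MIB,
}


def as_properties(config):
    """Render a config dict as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


print(as_properties(topic_config))
print(as_properties(broker_config))
```

Retention removes whole log segments, so a smaller log.segment.bytes gives the cleaner smaller units to delete and lets the size limit take effect sooner.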


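Even then the broker may still fail to come online, and the only option left is to restart the other Kafka servers one by one. Once leadership of the ISR (the "master") switches to this machine, offsets and log data are synchronized against the current leader, so restarting the other machines effectively skips the synchronization. This can cause data loss.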


After restarting them one by one and waiting a short while, all Kafka servers came back online and the ISR returned to normal.


