Hard Lessons from Using Storm

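    While using Storm I ran into problems of every size and kind. Looking back, some of the mistakes were really basic. I was fortunate to have the full support of Intel experts, and the problems were resolved one by one. I am recording them here for future reference.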

    Software environment:

    flume: 1.5.0, kafka: 2.10-0.8.1.1, storm: 0.9.2

    Hardware environment:

    3 machines (8 CPUs, 16 GB memory): one nimbus and two supervisors

    Problem overview:

    While the topology was performing real-time computation, the following error appeared:

2014-11-10 16:32:21 b.s.m.n.Client [INFO] failed to send requests to ip:6705: 
java.nio.channels.ClosedChannelException: null
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:381) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:349) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.6.3.Final.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_17]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_17]
	at java.lang.Thread.run(Thread.java:636) [na:1.6.0_17]
2014-11-10 16:32:21 b.s.m.n.Client [INFO] failed to send requests to ip:6705: 
java.nio.channels.ClosedChannelException: null
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:381) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:349) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.6.3.Final.jar:na]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.6.3.Final.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_17]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_17]
	at java.lang.Thread.run(Thread.java:636) [na:1.6.0_17]
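    When this error occurred, the worker process hung and the job ended abnormally. At first I suspected STORM-329, but the problem persisted after rebuilding a patched version from GitHub. I then rolled back to 0.9.2 and set the topology's task and worker counts to match the actual hardware resources. The system became more stable than before, but the kafka spout still reported a small number of failed messages right after startup. According to the Intel Storm expert's analysis, this was possibly caused by STORM-350; after rolling disruptor back to version 2.10.1, the failed messages disappeared.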

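    After running stably for a while, the system would produce a large number of error messages, starting with ZooKeeper (zk) session timeouts. The log looked like this: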

2014-11-12 09:54:02 o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 13625ms for sessionid 0x1499db7e5930255, closing socket connection and attempting reconnect
2014-11-12 09:54:02 o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 13626ms for sessionid 0x1499db7e593025d, closing socket connection and attempting reconnect
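    Once this error occurred, the number of failed messages grew rapidly and the computation could no longer proceed normally. I had heard that other projects hit similar cases, where the old generation of the client-side JVM filled up and the client could no longer connect to zk. So I monitored the storm process with jstat and found that the cause was GC pauses. After taking a heap dump with jmap and analyzing it with MemoryAnalyzer, I found that the system was caching too much content, which filled up memory. After adjusting the bolt and rerunning, the system ran stably.
jstat -gcutil <pid> 5000 100000000 >> gc.log &
jmap -dump:live,format=b,file=filename <pid>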
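
    To spot GC trouble in a gc.log collected with the jstat command above, a small script can help. The following is only a rough sketch in Python, not the tooling actually used here. It assumes the log has the usual `jstat -gcutil` header row (S0, S1, E, O, ..., FGC, FGCT, GCT) and reads columns by name, since the column set differs between JDK versions. The function names, the 1-second threshold, and the sample numbers are all made up for illustration.

```python
# Illustrative sample in the JDK 6 column layout; the numbers are invented,
# not taken from the system described in this post.
SAMPLE = """\
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  45.10  60.20  70.50  59.80    120    2.400     3    1.200    3.600
  0.00  45.10  99.00  99.80  59.80    121    2.450     3    1.200    3.650
  0.00   0.00  10.00  99.90  59.90    121    2.450     5   15.700   18.150
"""


def _to_number(text):
    """Convert one jstat field to float; non-numeric fields such as '-' become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gcutil(lines):
    """Parse `jstat -gcutil` output into a list of dicts keyed by column name."""
    header = None
    samples = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S0":
            # Header row (jstat repeats it when the -h option is used).
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue  # skip anything that does not look like a data row
        samples.append({name: _to_number(value) for name, value in zip(header, fields)})
    return samples


def find_full_gc_spikes(samples, min_seconds=1.0):
    """Return (row_index, new_full_gcs, seconds) for each sampling interval in
    which the cumulative full GC time (FGCT) grew by at least min_seconds."""
    spikes = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if None in (prev["FGC"], prev["FGCT"], cur["FGC"], cur["FGCT"]):
            continue
        new_full_gcs = int(cur["FGC"] - prev["FGC"])
        seconds = cur["FGCT"] - prev["FGCT"]
        if new_full_gcs > 0 and seconds >= min_seconds:
            spikes.append((i, new_full_gcs, round(seconds, 3)))
    return spikes


if __name__ == "__main__":
    rows = parse_gcutil(SAMPLE.splitlines())
    for index, count, seconds in find_full_gc_spikes(rows):
        print("row %d: %d full GC(s), %.3f s, old gen at %.1f%%"
              % (index, count, seconds, rows[index]["O"]))
```

    To run it against a real log, pass `open("gc.log")` to `parse_gcutil` instead of the sample. FGCT is a cumulative total, so a reported spike is the full GC time spent within one sampling interval, not necessarily a single pause. If that time approaches the zk session timeout while the O column stays close to 100, it is consistent with the old-generation diagnosis described above.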
