Building a platform always brings plenty of challenges. Here is a quick summary of the pitfalls I have hit so far on our Storm platform:
- Kernel bug causing workers to be rescheduled (drift) frequently
- Random ephemeral network ports conflicting with worker ports
- Worker process deadlock
- Periodic QPS spikes from topologies
- Immature backpressure mechanism
- Topologies running as a black box, with no visibility into what is happening
- Spout not emitting data for a long time, causing TickTuples to stop
- Storm heartbeat storms causing ZooKeeper I/O problems
Kernel bug causing workers to be rescheduled frequently
For this issue, see this article: http://woodding2008.iteye.com/blog/2328115
Random ephemeral ports conflicting with worker ports
The fix is simple and blunt: use a kernel parameter to mark the local worker ports as reserved, so the kernel never hands them out as ephemeral client ports and they can only be taken by a process that explicitly listens on them.
echo "5710,5711,5712,5713,5714,15710,15711,15712,15713,15714" > /proc/sys/net/ipv4/ip_local_reserved_ports
Worker process deadlock [0.9.5]
The bug has already been fixed: https://issues.apache.org/jira/browse/STORM-839
The deadlock happens in the Netty messaging layer. We have only seen it once, so it occurs rarely; the fix is to upgrade to 0.9.6 or later. The thread dump we captured looked like this:
Found one Java-level deadlock:
=============================
"Thread-12-disruptor-worker-transfer-queue":
waiting to lock monitor 0x00007f85e000aee8 (object 0x00000007b4ffc8e8, a java.lang.Object),
which is held by "client-worker-3"
"client-worker-3":
waiting to lock monitor 0x00007f85dc021ef8 (object 0x000000079d717418, a backtype.storm.messaging.netty.Client),
which is held by "Thread-12-disruptor-worker-transfer-queue"
"Thread-12-disruptor-worker-transfer-queue" prio=10 tid=0x00007f8750cb9000 nid=0x8a3 waiting for monitor entry [0x00007f86627e6000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:398)
- waiting to lock <0x00000007b4ffc8e8> (a java.lang.Object)
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
at org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
at org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
at org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
at org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
at org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
at org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
at org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
at backtype.storm.messaging.netty.Client.flushMessages(Client.java:480)
- locked <0x000000079d717418> (a backtype.storm.messaging.netty.Client)
at backtype.storm.messaging.netty.Client.send(Client.java:400)
- locked <0x000000079d717418> (a backtype.storm.messaging.netty.Client)
at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__6940$fn__6941.invoke(worker.clj:336)
at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__6940.invoke(worker.clj:334)
at backtype.storm.disruptor$clojure_handler$reify__1605.onEvent(disruptor.clj:58)
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
at backtype.storm.disruptor$consume_loop_STAR_$fn__1618.invoke(disruptor.clj:94)
at backtype.storm.util$async_loop$fn__459.invoke(util.clj:463)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:745)
"client-worker-3" prio=10 tid=0x00007f8750d36800 nid=0x813 waiting for monitor entry [0x00007f86d0d65000]
java.lang.Thread.State: BLOCKED (on object monitor)
at backtype.storm.messaging.netty.Client.closeChannelAndReconnect(Client.java:501)
- waiting to lock <0x000000079d717418> (a backtype.storm.messaging.netty.Client)
at backtype.storm.messaging.netty.Client.access$1400(Client.java:78)
at backtype.storm.messaging.netty.Client$3.operationComplete(Client.java:492)
at org.apache.storm.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:427)
at org.apache.storm.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:413)
at org.apache.storm.netty.channel.DefaultChannelFuture.setFailure(DefaultChannelFuture.java:380)
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:417)
- locked <0x00000007b4ffc8e8> (a java.lang.Object)
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Periodic QPS spikes from topologies
Severe traffic peaks are not friendly to the platform; the right approach is to do whatever it takes to flatten those peaks.
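One way to flatten a peak at the source is to give the spout a fixed emit budget per second. The sketch below is my own illustration, not the platform's actual mechanism, and it assumes the pre-1.0 backtype.storm API used elsewhere in this post; the class name, the 5000/s budget and fetchNextMessage() are placeholders.

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import java.util.Map;

public class ThrottledSpout extends BaseRichSpout {
    private static final long MAX_EMITS_PER_SECOND = 5000; // illustrative budget
    private SpoutOutputCollector collector;
    private long windowStartMs;
    private long emittedInWindow;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.windowStartMs = System.currentTimeMillis();
    }

    @Override
    public void nextTuple() {
        long now = System.currentTimeMillis();
        if (now - windowStartMs >= 1000) {            // start a new one-second window
            windowStartMs = now;
            emittedInWindow = 0;
        }
        if (emittedInWindow >= MAX_EMITS_PER_SECOND) {
            return;                                   // budget spent, yield until the next call
        }
        String msg = fetchNextMessage();              // placeholder for the real upstream read
        if (msg != null) {
            collector.emit(new Values(msg));
            emittedInWindow++;
        }
    }

    private String fetchNextMessage() { return null; } // stands in for the real data source

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("msg"));
    }
}

On top of that, Config.setMaxSpoutPending(...) caps the number of un-acked tuples in flight, which also limits how hard a burst can hit the downstream bolts.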
Immature backpressure mechanism
The direct consequence of an immature backpressure mechanism is that flood traffic, or an inaccurate traffic estimate, makes a topology's workers OOM and get rescheduled over and over.
Version 1.0 already ships a new backpressure mechanism.
Community solution: https://issues.apache.org/jira/browse/STORM-886
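For reference, the automatic backpressure from STORM-886 on 1.0.x is driven by configuration roughly like the following (key names as I recall them from the 1.0 defaults; the watermark values are only illustrative):
topology.backpressure.enable: true
backpressure.disruptor.high.watermark: 0.9
backpressure.disruptor.low.watermark: 0.4
When an executor's receive queue fills past the high watermark the spouts are throttled, and they resume once the queue drains back below the low watermark.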
Topologies run as a black box, and you do not know what is happening
The community version exposes too little information about a running topology, so the monitoring system has to be built out.
Storm platform monitoring [part 1]: http://woodding2008.iteye.com/blog/2326358
Storm platform monitoring [part 2]: http://woodding2008.iteye.com/blog/2326675
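As a low-effort starting point (my sketch, not the approach from the two posts above), Storm's built-in LoggingMetricsConsumer can be registered on a topology so that per-component metrics at least show up in the worker logs:
import backtype.storm.Config;
import backtype.storm.metric.LoggingMetricsConsumer;

Config conf = new Config();
conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1); // one consumer task
A custom IMetricsConsumer implementation can then ship the same data to an external monitoring system instead of the log files.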
Spout not emitting data for a long time causes TickTuples to stop
Storm TickTuple stops unexpectedly: http://woodding2008.iteye.com/blog/2328114
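For context only (this is standard tick-tuple usage, added by me as background, not the fix from the linked post), a bolt requests tick tuples and recognizes them like this:
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.tuple.Tuple;
import java.util.HashMap;
import java.util.Map;

// fragment from a bolt class:
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> conf = new HashMap<String, Object>();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60); // request a tick every 60 s
    return conf;
}

public static boolean isTickTuple(Tuple tuple) {
    return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
        && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}
If the ticks stop arriving, any periodic work hung off isTickTuple() silently stops with them, which is what made the problem above painful.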
Storm heartbeat storms cause ZooKeeper I/O problems
Solving high ZooKeeper disk I/O: http://woodding2008.iteye.com/blog/2327100
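The actual fix is in the post above; as generic mitigations (my assumptions, independent of that post), the usual levers are writing heartbeats to ZooKeeper less often and keeping the ZooKeeper transaction log on its own disk, for example:
# storm.yaml: write executor heartbeats to ZooKeeper less frequently (default is 3 s)
task.heartbeat.frequency.secs: 10
# zoo.cfg: keep transaction logs apart from snapshots (paths are illustrative) and purge old files
dataDir=/data/zookeeper/snapshot
dataLogDir=/data1/zookeeper/txlog
autopurge.snapRetainCount=5
autopurge.purgeInterval=1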