[2] Storm Bug Fix: supervisor {taskid} still hasn't started

Original article; reposts are welcome. When reposting, please credit the source: http://blog.csdn.net/jmppok/article/details/17073397


1. Problem description

After a Topology is submitted to Storm, it stays in the assigning state indefinitely. The Supervisor log shows:

2013-12-02 14:49:52 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started
2013-12-02 14:49:52 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started
2013-12-02 14:49:53 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started
2013-12-02 14:49:53 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started
2013-12-02 14:49:54 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started
2013-12-02 14:49:54 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started
2013-12-02 14:49:55 supervisor [INFO] 46b25fa5-b333-4985-9c1d-3f112d5c615a still hasn't started

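Note: if the message appears only a few times and then stops, either the worker started successfully or the task was moved to another supervisor.

Only when the message keeps repeating without end does it mean that the worker meant to run the task cannot start.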
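
To tell a short burst apart from an endless repeat, the messages can be counted per id. The following is a minimal sketch, assuming the log line format shown above; the function names and the threshold of 5 are illustrative choices, not part of Storm.

```python
import re
from collections import Counter

# Matches supervisor log lines such as:
#   2013-12-02 14:49:52 supervisor [INFO] <id> still hasn't started
NOT_STARTED = re.compile(r"supervisor \[INFO\] (\S+) still hasn't started")


def count_not_started(lines):
    """Count "still hasn't started" messages per id."""
    counts = Counter()
    for line in lines:
        match = NOT_STARTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspicious_ids(lines, threshold=5):
    """Return ids whose message count reaches the threshold (an arbitrary cutoff)."""
    counts = count_not_started(lines)
    return sorted(wid for wid, n in counts.items() if n >= threshold)
```

For example, `suspicious_ids(open("supervisor.log"))` would list the ids worth investigating; a high count is only a hint, so the worker log remains the place to confirm the actual error.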


The worker log contains the detailed error message:

2013-12-02 13:28:02 worker [ERROR] Error on initialization of server mk-worker
org.zeromq.ZMQException: Invalid argument(0x16)
        at org.zeromq.ZMQ$Socket.connect(Native Method)
        at zilch.mq$connect.invoke(mq.clj:74)
        at backtype.storm.messaging.zmq.ZMQContext.connect(zmq.clj:65)
        at backtype.storm.daemon.worker$mk_refresh_connections$this__4293$iter__4300__4304$fn__4305.invoke(worker.clj:244)
        at clojure.lang.LazySeq.sval(LazySeq.java:42)
        at clojure.lang.LazySeq.seq(LazySeq.java:60)
        at clojure.lang.RT.seq(RT.java:473)
        at clojure.core$seq.invoke(core.clj:133)
        at clojure.core$dorun.invoke(core.clj:2725)
        at clojure.core$doall.invoke(core.clj:2741)
        at backtype.storm.daemon.worker$mk_refresh_connections$this__4293.invoke(worker.clj:238)
        at backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:351)
        at clojure.lang.AFn.applyToHelper(AFn.java:185)
        at clojure.lang.AFn.applyTo(AFn.java:151)
        at clojure.core$apply.invoke(core.clj:601)
        at backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
        at clojure.lang.RestFn.invoke(RestFn.java:512)
        at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
        at clojure.lang.AFn.applyToHelper(AFn.java:172)
        at clojure.lang.AFn.applyTo(AFn.java:151)
        at backtype.storm.daemon.worker.main(Unknown Source)

To find out which worker log to read, look at the launch command recorded in supervisor.log. For example, supervisor.log contains the following:

2013-12-02 14:49:51 supervisor [INFO] Launching worker with command: java -server -Xmx768m  -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dlogfile.name=worker-6703.log -Dstorm.home=/opt/storm -Dlog4j.configuration=storm.log.properties -cp /opt/storm/storm-0.8.2.jar:/opt/storm/lib/commons-exec-1.1.jar:/opt/storm/lib/jetty-util-6.1.26.jar:/opt/storm/lib/minlog-1.2.jar:/opt/storm/lib/snakeyaml-1.9.jar:/opt/storm/lib/clj-time-0.4.1.jar:/opt/storm/lib/compojure-1.1.3.jar:/opt/storm/lib/curator-framework-1.0.1.jar:/opt/storm/lib/joda-time-2.0.jar:/opt/storm/lib/reflectasm-1.07-shaded.jar:/opt/storm/lib/log4j-1.2.16.jar:/opt/storm/lib/json-simple-1.1.jar:/opt/storm/lib/jline-0.9.94.jar:/opt/storm/lib/hiccup-0.3.6.jar:/opt/storm/lib/slf4j-log4j12-1.5.8.jar:/opt/storm/lib/clojure-1.4.0.jar:/opt/storm/lib/asm-4.0.jar:/opt/storm/lib/carbonite-1.5.0.jar:/opt/storm/lib/servlet-api-2.5.jar:/opt/storm/lib/servlet-api-2.5-20081211.jar:/opt/storm/lib/disruptor-2.10.1.jar:/opt/storm/lib/ring-servlet-0.3.11.jar:/opt/storm/lib/junit-3.8.1.jar:/opt/storm/lib/ring-jetty-adapter-0.3.11.jar:/opt/storm/lib/core.incubator-0.1.0.jar:/opt/storm/lib/tools.macro-0.1.0.jar:/opt/storm/lib/math.numeric-tower-0.0.1.jar:/opt/storm/lib/zookeeper-3.3.3.jar:/opt/storm/lib/curator-client-1.0.1.jar:/opt/storm/lib/libthrift7-0.7.0.jar:/opt/storm/lib/tools.cli-0.2.2.jar:/opt/storm/lib/tools.logging-0.2.3.jar:/opt/storm/lib/jgrapht-0.8.3.jar:/opt/storm/lib/kryo-2.17.jar:/opt/storm/lib/guava-13.0.jar:/opt/storm/lib/commons-logging-1.1.1.jar:/opt/storm/lib/ring-core-1.1.5.jar:/opt/storm/lib/commons-codec-1.4.jar:/opt/storm/lib/httpclient-4.1.1.jar:/opt/storm/lib/commons-lang-2.5.jar:/opt/storm/lib/commons-io-1.4.jar:/opt/storm/lib/slf4j-api-1.5.8.jar:/opt/storm/lib/jetty-6.1.26.jar:/opt/storm/lib/jzmq-2.1.0.jar:/opt/storm/lib/httpcore-4.1.jar:/opt/storm/lib/clout-1.0.1.jar:/opt/storm/lib/commons-fileupload-1.2.1.jar:/opt/storm/lib/objenesis-1.2.jar:/opt/storm/log4j:/opt/storm/conf:/tmp/storm_tmp/
supervisor/stormdist/mytest-2-1385966991/stormjar.jar backtype.storm.daemon.worker mytest-2-1385966991 dc89a2b5-267f-4ed8-b94a-f900ed6300e4 6703 0916c7a9-c47d-43ae-9d88-13ec574ee5e6
2013-12-02 14:49:51 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started
2013-12-02 14:49:52 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started
2013-12-02 14:49:52 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started
2013-12-02 14:49:53 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started
2013-12-02 14:49:53 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started
2013-12-02 14:49:54 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started
2013-12-02 14:49:54 supervisor [INFO] 0916c7a9-c47d-43ae-9d88-13ec574ee5e6 still hasn't started

Note the worker-6703.log in the launch command on the first line: that is the log file to read.
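
Because the launch command is very long, the log file name is easy to miss. As a rough sketch, it can be extracted from the `-Dlogfile.name=` option that appears in the command above; the function name `worker_log_names` is hypothetical.

```python
import re

# The launch command carries the worker log name as -Dlogfile.name=<file>
LOGFILE_OPTION = re.compile(r"-Dlogfile\.name=(\S+)")


def worker_log_names(lines):
    """Return the log file name from each "Launching worker" line, in order."""
    names = []
    for line in lines:
        if "Launching worker with command:" not in line:
            continue
        match = LOGFILE_OPTION.search(line)
        if match:
            names.append(match.group(1))
    return names
```

Calling `worker_log_names(open("supervisor.log"))` returns the names in the order the workers were launched, so the last entry is the most recent attempt.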


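2. Solution

ZMQ and ZooKeeper connection errors in Storm are usually caused by a misconfigured hosts file on the local machine, which makes it impossible to connect. Apply the following changes to the hosts file (/etc/hosts on Linux) on every node in the Storm cluster: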

1) Add the IP address and hostname of the local machine, e.g. 192.168.0.2    node1

2) Add the other hosts in the Storm cluster, e.g. 192.168.0.3  node2

                                                   192.168.0.4 node3


This lets ZMQ and ZooKeeper resolve each hostname to the correct host when connecting.
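
One way to confirm the change took effect is to compare what the hosts file says with what the system resolver actually returns. The sketch below is only an illustration under that assumption: `parse_hosts` and `check_resolution` are hypothetical helper names, and the resolver is passed in as a parameter so that it can be replaced in tests.

```python
import socket


def parse_hosts(text):
    """Map each hostname found in hosts-file text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry for a name
    return mapping


def check_resolution(expected, resolver=socket.gethostbyname):
    """Return {hostname: problem} for names that fail or resolve elsewhere."""
    problems = {}
    for name, ip in expected.items():
        try:
            actual = resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[name] = f"cannot resolve: {exc}"
            continue
        if actual != ip:
            problems[name] = f"resolves to {actual}, expected {ip}"
    return problems
```

On each node, `check_resolution(parse_hosts(open("/etc/hosts").read()))` should return an empty dict once the entries are in place; any remaining entry names a host that ZMQ or ZooKeeper would likely fail to reach by name.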
