After a cluster restart, Spark suddenly stopped serving: neither of the two masters came up properly. Firewall rules were ruled out, and so was the earlier port change.
1. Symptoms:
spark-master errors:
20/07/16 16:52:35 WARN ClientCnxn: Session 0x57355eb34540d0c for server hadoop5/"ip":7072, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len4588079 is out of range!
at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
20/07/16 16:52:36 INFO ConnectionStateManager: State change: SUSPENDED
20/07/16 16:53:16 ERROR Inbox: Ignoring error
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /spark/master_status
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:38)
at org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:52)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:310)
at org.apache.spark.deploy.master.PersistenceEngine.readPersistedData(PersistenceEngine.scala:86)
at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:220)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
20/07/16 16:53:17 INFO ClientCnxn: Socket connection established to hadoop1:7072, initiating session
20/07/16 16:53:17 INFO ConnectionStateManager: State change: RECONNECTED
20/07/16 16:53:17 INFO Master: hadoop20:55652 got disassociated, removing it.
20/07/16 16:53:17 INFO Master: hadoop53:47406 got disassociated, removing it.
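
The key line above is "Packet len4588079 is out of range!": the getChildren response for /spark/master_status had grown to about 4.4 MB, over the ZooKeeper client's default jute.maxbuffer limit of roughly 4 MB, so the standby master could never read its persisted recovery state. To see what has piled up there, the znode can be inspected from zkCli with a temporarily raised client buffer; a minimal sketch (the 16 MB value is an arbitrary assumption; CLIENT_JVMFLAGS is picked up by zkEnv.sh):

export CLIENT_JVMFLAGS="-Djute.maxbuffer=16777216"   # raise the client-side limit for this shell only
sh /usr/local/cloud/zookeeper/bin/zkCli.sh -server hadoop5:7072
ls /spark/master_status    # should now succeed and list the accumulated app_*/worker_* entries
quit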

spark-worker errors:
20/07/16 18:18:56 INFO Worker: Connecting to master hadoop5:8088...
20/07/16 18:19:09 INFO Worker: Retrying connection to master (attempt # 3)
20/07/16 18:19:09 INFO Worker: Connecting to master hadoop5:8088...
20/07/16 18:19:22 INFO Worker: Retrying connection to master (attempt # 4)
20/07/16 18:19:22 INFO Worker: Connecting to master hadoop5:8088...
20/07/16 18:19:35 INFO Worker: Retrying connection to master (attempt # 5)
20/07/16 18:19:35 INFO Worker: Connecting to master hadoop5:8088...
20/07/16 18:19:48 INFO Worker: Retrying connection to master (attempt # 6)
20/07/16 18:19:48 INFO Worker: Connecting to master hadoop5:8088...

2. Observed state:
Both masters stayed in STANDBY even though all of the relevant ports were reachable between the nodes; with the read of /spark failing (see the master log above), neither master could finish recovery and become ALIVE.
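
The standby state can be confirmed from each master's web UI, which serves a JSON summary at /json; a quick check sketch (the web UI port is an assumption, 8080 is the default but this cluster runs on custom ports, and the second master's hostname is a placeholder):

curl -s http://hadoop5:8080/json | grep '"status"'            # reported "STANDBY"
curl -s http://<other-master>:8080/json | grep '"status"'     # also "STANDBY"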

3. Remediation:
"spark"
Clean out Spark's work directories and make that cleanup periodic so they do not grow back, and raise the maximum packet size ZooKeeper will transfer (a sketch of the latter follows the zkCli steps below).
a. Brute-force removal: use rm -r to delete each worker's work directory, then recreate it empty (a sketch of both steps follows this list)
b. Edit spark-env.sh and add the configuration below so finished applications are cleaned up automatically:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
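
A sketch of both steps on one worker node (the default work directory under $SPARK_HOME is assumed; the interval and TTL values are just the standard spark.worker.cleanup.* knobs, not values from this incident):

# a. one-off removal of the accumulated application directories (run on every worker)
cd $SPARK_HOME
rm -rf work
mkdir work

# b. spark-env.sh: enable the periodic cleaner, set how often it runs (seconds)
#    and how long finished-application data is kept before deletion (seconds)
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.cleanup.interval=1800 \
 -Dspark.worker.cleanup.appDataTtl=604800"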

"zk"
[zk历史日志清理]
cd /cache11/cloud/zookeeper/version-2; ls -lht snapshot.* | tail -n +66 | awk '{print $9}' | xargs rm -f
cd /usr/local/cloud/zookeeper/logs/version-2; ls -lht log.* | tail -n +66 | awk '{print $9}' | xargs rm -f
(each command sorts the files newest-first and deletes everything past the 65 most recent)
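
Rather than running the deletion above by hand or from cron, ZooKeeper 3.4+ can purge its own history; a sketch of the standard autopurge settings in zoo.cfg (the retention count mirrors the 65 files kept above; the interval is an assumption):

autopurge.snapRetainCount=65   # snapshots/transaction logs to retain
autopurge.purgeInterval=24     # purge task interval, in hours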

[Clear the Spark recovery state cached in ZooKeeper]
(this removes everything under /spark, i.e. the worker/application records the master tries to recover on startup; acceptable here because the whole cluster is being restarted anyway)
sh /usr/local/cloud/zookeeper/bin/zkCli.sh -server hadoop5:7072
rmr /spark
ls /spark
quit
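
The last remediation item, raising ZooKeeper's packet limit, means setting the jute.maxbuffer system property on both the ZK servers and the Spark masters (the ZK clients in this failure). A sketch; the 8 MB value and file locations are assumptions, and the setting should be kept consistent across all nodes or the symptom can return:

# ZooKeeper server side: conf/java.env (sourced by zkEnv.sh)
export SERVER_JVMFLAGS="-Djute.maxbuffer=8388608 $SERVER_JVMFLAGS"

# Spark master side: spark-env.sh (SPARK_DAEMON_JAVA_OPTS reaches the master daemon's JVM)
SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Djute.maxbuffer=8388608"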

4. Restart Spark; the masters come up and the cluster serves normally again.