运行flink程序所出现的异常及其解决方案

下面的我在测试中遇到的几个问题及解决方案。

1. The assigned slot container_e08_1539148828017_15937_01_003564_0 was removed.

org.apache.flink.util.FlinkException: The assigned slot container_e08_1539148828017_15937_01_003564_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:340)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)

程序内存占用过大,导致TaskManager在yarn上kill了,分析原因应该是资源不够,可以将程序放在资源更大的集群上,再不行就设置减少Slot中共享的task的个数,也可能是内存泄露或内存资源配置不合理造成,需要进行合理分配。

2. The heartbeat of TaskManager with id container ....... timed out
此错误是container心跳超时,出现此种错误一般有两种可能:
1、分布式物理机网络失联,这种原因一般情况下failover后作业能正常恢复,如果出现的不频繁可以不用关注;
2、failover的节点对应TM的内存设置太小,GC严重导致心跳超时,建议调大对应节点的内存值。

3. Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@flink88:15265/user/taskmanager_0#6643546564]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".

在flink-conf.yaml中添加或修改:
akka.ask.timeout: 100s
web.timeout: 100000

4. Checkpoint:Checkpoint expired before completing

checkpointConf.setCheckpointTimeout(5000L)这个值设置过小,默认是10min,需要进行调大测试。

你可能感兴趣的:(运行flink程序所出现的异常及其解决方案)