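1. A production job has fairly large state: a checkpoint is about 100 GB, and the job fails once the checkpoint has progressed to roughly 10 GB.
2. The error logs are as follows:
Second error log: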
2022-04-16 00:05:23
org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition 55780195e63c343e4a320329203bbb8a#13@83443caed0bbd8a73684a77ae7f82b55 not reachable.
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:183)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:322)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:291)
at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:359)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:323)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/10.151.213.20:12115' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:145)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:114)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:81)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:70)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:179)
... 15 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.151.213.20:12115
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
at org.apache.flink.shaded.netty4.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
First error log:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '10.151.213.20/10.151.213.20:12115'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
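Both traces above bury the real cause several "Caused by:" levels down. As a reading aid, the exception chain can be pulled out of a trace with a few lines of Python. This is only a sketch: `exception_chain` is a hypothetical helper name, and the pattern assumes class names that end in `Exception` or `Error`. The sample lines are copied from the second error log.

```python
import re

# First line of a trace or a "Caused by:" line: exception class, optional message.
_EXC_LINE = re.compile(r"^(?:Caused by: )?([\w.$]+(?:Exception|Error))(?::\s*(.*))?$")


def exception_chain(trace: str):
    """Return [(exception_class, message), ...] from outermost to root cause."""
    chain = []
    for line in trace.splitlines():
        line = line.strip()
        if line.startswith("at ") or line.startswith("..."):
            continue  # skip stack frames and "... N more" markers
        match = _EXC_LINE.match(line)
        if match:
            chain.append((match.group(1), match.group(2) or ""))
    return chain


sample = """\
org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition 55780195e63c343e4a320329203bbb8a#13@83443caed0bbd8a73684a77ae7f82b55 not reachable.
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/10.151.213.20:12115' has failed. This might indicate that the remote task manager has been lost.
... 15 more
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at java.lang.Thread.run(Thread.java:748)
"""

for depth, (cls, msg) in enumerate(exception_chain(sample)):
    print("  " * depth + cls.rsplit(".", 1)[-1] + ": " + msg)
```

The last entry of the chain is the root cause. Here it is the refused connection to 10.151.213.20:12115, which suggests looking at the logs of that TaskManager next.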
3. The job configuration is as follows:
table.dynamic-table-options.enabled=true;
state.backend.rocksdb.compaction.style=level;
table.exec.mini-batch.enabled=true;
state.backend.rocksdb.thread.num=8;
state.backend.rocksdb.checkpoint.transfer.thread.num=8;
table.exec.mini-batch.size=35000;
table.optimizer.distinct-agg.split.enabled=true;
state.backend.rocksdb.block.blocksize=32 kb;
state.backend.rocksdb.writebuffer.number-to-merge=2;
table.exec.mini-batch.allow-latency=15 s;
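Searching for the error messages showed that other people have hit the same situation (see "Analysis of Flink job restart causes" on L13763338360's CSDN blog). To determine whether the problem was caused by the container's disk or memory being too small, I followed the log information to the TaskManager (TM) and checked its error output.
4. The TM log contains the following error: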
2022-04-16 00:05:23,933 WARN org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - Exception while restoring keyed state backend for KeyedMapBundleOperator_7cf1b594451647e7fb445b612e152cd7_(1/40) from alternative (1/1), will retry while more alternatives are available.
org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:394) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:465) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:90) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
Caused by: java.lang.InterruptedException
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:347) ~[?:1.8.0_262]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_262]
at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:84) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.transferAllStateDataToDirectory(RocksDBStateDownloader.java:63) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.transferRemoteStateToLocalDirectory(RocksDBIncrementalRestoreOperation.java:253) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:221) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:187) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:167) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:314) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
... 17 more
2022-04-16 00:05:23,935 INFO org.apache.hadoop.hdfs.client.impl.BlockReaderRemote [] - Reader block: 10085135602
The memory turned out not to be very large, so I adjusted the memory resources.
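To see what a given managed memory fraction means in absolute terms, the arithmetic can be sketched as below. It assumes that managed memory is derived as total Flink memory times `taskmanager.memory.managed.fraction` (no explicit managed size configured) and that it is shared evenly between the task slots of a TaskManager. The 8192 MB total and the 4 slots are made-up illustration values, not figures from this job, and `managed_memory_mb` is a hypothetical helper name.

```python
def managed_memory_mb(total_flink_memory_mb: float, managed_fraction: float, slots: int):
    """Return (managed memory per TaskManager, managed memory per slot) in MB."""
    if not 0.0 <= managed_fraction <= 1.0:
        raise ValueError("managed_fraction must be between 0 and 1")
    if slots < 1:
        raise ValueError("slots must be at least 1")
    per_tm = total_flink_memory_mb * managed_fraction
    return per_tm, per_tm / slots


# Hypothetical TaskManager: 8192 MB total Flink memory, fraction 0.3, 4 slots.
per_tm, per_slot = managed_memory_mb(8192, 0.3, 4)
print(f"managed memory per TaskManager: {per_tm:.1f} MB")
print(f"managed memory per slot:        {per_slot:.1f} MB")
```

Comparing the per-slot figure with the RocksDB write buffer and block cache sizes gives a rough sense of whether the memory budget is plausible for each stateful operator instance.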
5. The adjusted configuration is as follows:
The managed memory size was adjusted, and the cache size was adjusted along with it.
state.backend.incremental=true;
taskmanager.memory.managed.fraction=0.3;
state.backend.rocksdb.block.blocksize=64 kb;
state.backend.rocksdb.block.cache-size=128 mb;
state.backend.rocksdb.files.open=-1;
state.backend.rocksdb.writebuffer.size=128 mb;
state.backend.rocksdb.writebuffer.count=4;
state.backend.rocksdb.writebuffer.number-to-merge=2;
state.backend.rocksdb.compaction.style=level;
state.backend.rocksdb.thread.num=4;
state.backend.rocksdb.metrics.block-cache-usage=true;
state.backend.rocksdb.checkpoint.transfer.thread.num=8;
table.dynamic-table-options.enabled=true;
table.exec.mini-batch.enabled=true;
table.exec.mini-batch.size=35000;
table.optimizer.distinct-agg.split.enabled=true;
table.exec.mini-batch.allow-latency=15 s;
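Because the two configuration lists are long, it is easy to miss what actually changed between step 3 and step 5. The sketch below parses both lists (copied from this post) and prints only the keys that were added or changed. `parse_settings` and `diff_settings` are hypothetical helper names, and the parser assumes one `key=value;` setting per line.

```python
BEFORE = """
table.dynamic-table-options.enabled=true;
state.backend.rocksdb.compaction.style=level;
table.exec.mini-batch.enabled=true;
state.backend.rocksdb.thread.num=8;
state.backend.rocksdb.checkpoint.transfer.thread.num=8;
table.exec.mini-batch.size=35000;
table.optimizer.distinct-agg.split.enabled=true;
state.backend.rocksdb.block.blocksize=32 kb;
state.backend.rocksdb.writebuffer.number-to-merge=2;
table.exec.mini-batch.allow-latency=15 s;
"""

AFTER = """
state.backend.incremental=true;
taskmanager.memory.managed.fraction =0.3;
state.backend.rocksdb.block.blocksize=64 kb;
state.backend.rocksdb.block.cache-size=128 mb;
state.backend.rocksdb.files.open = -1;
state.backend.rocksdb.writebuffer.size =128 mb;
state.backend.rocksdb.writebuffer.count=4;
state.backend.rocksdb.writebuffer.number-to-merge=2;
state.backend.rocksdb.compaction.style=level;
state.backend.rocksdb.thread.num=4;
state.backend.rocksdb.metrics.block-cache-usage=true;
state.backend.rocksdb.checkpoint.transfer.thread.num=8;
table.dynamic-table-options.enabled=true;
table.exec.mini-batch.enabled=true;
table.exec.mini-batch.size=35000;
table.optimizer.distinct-agg.split.enabled=true;
table.exec.mini-batch.allow-latency=15 s;
"""


def parse_settings(text: str) -> dict:
    """Parse lines of the form `key = value;` into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip().rstrip(";")
        if not line or "=" not in line:
            continue  # ignore blank or malformed lines
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for keys that were added or changed."""
    return {k: (before.get(k), v) for k, v in after.items() if before.get(k) != v}


changes = diff_settings(parse_settings(BEFORE), parse_settings(AFTER))
for key, (old, new) in sorted(changes.items()):
    print(f"{key}: {old!r} -> {new!r}")
```

A value of `None` on the left means the key was not set in the original list. The `table.*` settings are identical in both lists and therefore do not appear in the output.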
References:
Flink on RocksDB parameter tuning guide - Cloud+ Community - Tencent Cloud
Configuration | Apache Flink: nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/