Flink: Errors Encountered with Checkpoints

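Errors encountered while running a Flink job

I. Detours taken while maintaining state in Flink:

1. Flink has three ways to maintain state (state backends):

<1> MemoryStateBackend
<2> FsStateBackend
<3> RocksDBStateBackend

<1> MemoryStateBackend

Keeps state in memory; not covered here.

<2> FsStateBackend

This backend holds in-flight data in the TaskManager's memory. When a checkpoint runs, it writes a snapshot of the state to the configured file system; a distributed file system such as HDFS can be used. By default, FsStateBackend takes asynchronous snapshots.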

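  import org.apache.flink.core.fs.Path
  import org.apache.flink.runtime.state.StateBackend
  import org.apache.flink.runtime.state.filesystem.FsStateBackend

  val checkPointPath = new Path("hdfs:///flink/checkpoints")
  val fsStateBackend: StateBackend = new FsStateBackend(checkPointPath)
  env.setStateBackend(fsStateBackend)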
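
<3> RocksDBStateBackend

The RocksDBStateBackend holds in-flight data in a RocksDB database that is (by default) stored in the TaskManager data directories. It also needs a remote file system URI (usually HDFS). When a checkpoint runs, the local data is copied to that file system; on failover, it is restored from the file system to local disk. Minimal metadata is stored in the JobManager's memory (or, in high-availability mode, in the metadata checkpoint). The RocksDBStateBackend always takes asynchronous snapshots.

Open question: RocksDB is disk-based while Redis is memory-based, so why choose RocksDB? Unresolved.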

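 import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

 // the second argument (true) enables incremental checkpoints
 val rocksdbBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)
 // State data does not need snapshot compression: the compression option has no effect
 // on incremental snapshots, because they use RocksDB's internal format
 rocksdbBackend.setOptions(new MyOptions())
 env.setStateBackend(rocksdbBackend)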
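
The `MyOptions` class passed to `setOptions` is not shown in this article. The following is only a minimal sketch of what such a class could look like, assuming the `OptionsFactory` interface of Flink 1.7's RocksDB state backend module. The tuning values are illustrative placeholders, not the settings actually used here.

```scala
import org.apache.flink.contrib.streaming.state.OptionsFactory
import org.rocksdb.{ColumnFamilyOptions, DBOptions}

// Hypothetical stand-in for the article's MyOptions; the values are
// illustrative placeholders, not the author's actual settings.
class MyOptions extends OptionsFactory {

  // Adjusts database-wide options on top of the predefined options.
  override def createDBOptions(currentOptions: DBOptions): DBOptions =
    currentOptions
      .setIncreaseParallelism(4) // threads for background flush and compaction
      .setUseFsync(false)

  // Adjusts per-column-family options (one column family per Flink state).
  override def createColumnOptions(currentOptions: ColumnFamilyOptions): ColumnFamilyOptions =
    currentOptions.setMaxWriteBufferNumber(3) // max memtables kept in memory
}
```

Because the factory is shipped with the job, the class has to be available on the cluster's classpath in the same way as the rest of the user code.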

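2. Errors encountered:

<1> Errors when using FsStateBackend, as shown in the figure below:

[image 1547198994759]

For the meaning of the Overview tab in the figure, see:

https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/checkpoint_monitoring.html

The checkpoint parameters were set as follows: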

// start a checkpoint every 20 s (value in milliseconds)
env.enableCheckpointing(10000 * 2)
// make sure at least 2 s pass between checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000*2)
// checkpoints have to complete within 5 minutes, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(60000 * 5)
// allow up to 4 checkpoints to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(4)

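Why State Size kept growing: with FsStateBackend the state is held in TaskManager memory, so as it keeps growing it eventually causes an OOM.

The error (the application's log message means "failed to insert data into MySQL"):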
2019-01-11 02:07:30,781 ERROR com.miaoke.flink.classnet.MySQLSink$                          - 数据插入MySQL失败 :    java.lang.OutOfMemoryError: GC overhead limit exceeded userId 的值为:796346time 的值为: 2019-01-11 01:28:23

Why End to End Duration kept growing: unresolved at this point.

It eventually leads to the failure cause: Checkpoint expired before completing
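
To make the link between a growing End to End Duration and an expired checkpoint concrete, here is a small self-contained sketch in plain Scala (no Flink dependency). It uses the values from the configuration above. The `CheckpointSettings` case class and the `expires` helper are illustrative names, not Flink APIs, and the model is deliberately simplified: a checkpoint is treated as expired once it runs longer than the timeout.

```scala
object CheckpointExpirySketch {
  // Illustrative container, not a Flink class.
  final case class CheckpointSettings(intervalMs: Long, minPauseMs: Long, timeoutMs: Long)

  // Values taken from the checkpoint configuration in the article.
  val settings: CheckpointSettings =
    CheckpointSettings(intervalMs = 10000L * 2, minPauseMs = 1000L * 2, timeoutMs = 60000L * 5)

  // Simplified model: a checkpoint expires when it runs longer than the timeout.
  def expires(endToEndMs: Long, s: CheckpointSettings): Boolean =
    endToEndMs > s.timeoutMs

  def main(args: Array[String]): Unit = {
    // A duration that grows with every checkpoint eventually crosses the timeout.
    val durationsMs = Seq(30000L, 120000L, 290000L, 310000L)
    durationsMs.foreach { d =>
      println(s"duration=${d / 1000}s expired=${expires(d, settings)}")
    }
  }
}
```

Raising the timeout only postpones the failure while the duration keeps growing, so the cause of the growth has to be found.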

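<2> Errors when using RocksDBStateBackend:

State Size no longer keeps growing, for the reason explained earlier.

End to End Duration still keeps growing, the same symptom as with FsStateBackend, as shown in the figure below:

[image 1547436814226]

Not yet resolved:

Suspecting a problem in my own configuration, I read the official Tuning RocksDB documentation and changed the configuration as follows: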

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.contrib.streaming.state.{PredefinedOptions, RocksDBStateBackend}
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setRestartStrategy(RestartStrategies.noRestart())
// start a checkpoint every 20 s (value in milliseconds)
env.enableCheckpointing(10000 * 2)
// make sure at least 2 s pass between checkpoints (default is 0)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000*2)
// checkpoints have to complete within 2 minutes, or are discarded (default is 10 minutes)
env.getCheckpointConfig.setCheckpointTimeout(60000 * 2)
// allow up to 4 checkpoints to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(4)
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)  // The default is true.
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
val rocksdbBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints",true)
// State data does not need snapshot compression: the compression option has no effect
// on incremental snapshots, because they use RocksDB's internal format
rocksdbBackend.setOptions(new MyOptions())
rocksdbBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)
env.setStateBackend(rocksdbBackend)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION) // keep the checkpoint when the job is cancelled

With this configuration, submitting the job failed with the following error:

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9ebe0758cd287126758d57b15fb5e5a3)
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:260)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at com.miaoke.flink.classnet.KafkaDataInsertMySQL$.main(KafkaDataInsertMySQL.scala:20)
	at com.miaoke.flink.classnet.KafkaDataInsertMySQL.main(KafkaDataInsertMySQL.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:426)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 12 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 10 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953)
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
	... 4 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
	at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:310)
	at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:294)
	at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
	... 5 more

After removing the following setting and submitting the job again:

rocksdbBackend.setOptions(new MyOptions()) 

the job was submitted successfully.

On whether new MyOptions() is needed at all, see:

https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/state/large_state_tuning.html

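Why this error occurred: unresolved.

The job was then submitted to both the offline cluster and the online cluster:

Offline cluster:

[image 1547454238486]

Online cluster:

[image 1547455415258]

When the checkpoint duration approached the configured timeout, checkpoints started to fail, as shown in the figure:

[image 1547455525859]

The earlier exception appeared again and the job stopped.

Question: why do the online and offline clusters behave differently? (The reason is that the offline data volume is small.)

Suspected a network buffer problem:

Tuning Network Buffers

Added the following configuration: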

#To support, for example, a cluster of 20 8-slot machines, you should use roughly 5000 network buffers for optimal throughput.
taskmanager.network.numberOfBuffers: 2500

Because changing this setting would affect other jobs, I gave up on it.
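
The figure of roughly 5000 buffers in the quoted comment follows from the rule of thumb given in the Flink configuration documentation for this setting, `#slots-per-TM^2 * #TMs * 4`. A small plain-Scala sketch of that arithmetic (the object and method names are illustrative, not part of Flink):

```scala
object NetworkBufferSizing {
  // Rule of thumb: slots-per-TaskManager^2 * number of TaskManagers * 4
  def recommendedBuffers(slotsPerTaskManager: Int, taskManagers: Int): Int =
    slotsPerTaskManager * slotsPerTaskManager * taskManagers * 4

  def main(args: Array[String]): Unit = {
    // 20 machines with 8 slots each, as in the comment quoted from the docs
    println(recommendedBuffers(slotsPerTaskManager = 8, taskManagers = 20)) // prints 5120
  }
}
```

Whether 2500 is a suitable value for this cluster depends on its actual slot and TaskManager counts, which are not stated here.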

Unresolved.

Changed the configuration once more:

Removed this setting:

// minimum pause before the next checkpoint is triggered
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000*2)

Submitted the job again:

[image 1547464186283]

Following FLINK-10615, changed the configuration again:

https://issues.apache.org/jira/browse/FLINK-10615

https://issues.apache.org/jira/browse/FLINK-10930 (this issue suggests separating the directories; I do not know how to do that yet, unresolved)

https://issues.apache.org/jira/browse/FLINK-10855

env.getCheckpointConfig.setFailOnCheckpointingErrors(false)  // The default is true.

When End to End Duration reached 1m 59s, the following errors appeared:

2019-01-16 15:21:44,438 INFO  org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator  - Cancelling sample 0
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@bigdata05:35756/user/taskmanager_0#528141446]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:748)
2019-01-16 15:21:45,111 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Received late message for now expired checkpoint attempt 166 from 3739f018527af86e160aa509e421ad39 of job 7e325f3227fa7b6ce9679acc633eede6.
2019-01-16 15:21:52,867 INFO  org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator  - Cancelling sample 1
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@bigdata05:35756/user/taskmanager_0#528141446]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:748)
2019-01-16 15:22:03,739 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Received late message for now expired checkpoint attempt 167 from 3739f018527af86e160aa509e421ad39 of job 7e325f3227fa7b6ce9679acc633eede6.

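However, the job kept running and did not exit, and data was still inserted into MySQL normally; only the checkpoints failed to complete.

Two questions to think about:

1. When the job restarts, can EXACTLY_ONCE semantics still be guaranteed, and how should the state be restored? Unresolved.

2. Is the state still doing its job at this point?

Eventually the following exception appeared: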

2019-01-16 18:13:24,234 ERROR com.miaoke.flink.classnet.MySQLSink$                          - 数据插入MySQL失败:  java.lang.OutOfMemoryError: GC overhead limit exceeded

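Why the OOM occurred: unresolved.

Idea: configure the JVM to write a heap dump on OOM for later analysis. Not tried yet.

Options: -Xmx10M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=d://

Logs: yarn logs -applicationId application_1545305009142_0290 | less

Changing the code

I suspected that data was being consumed much faster than it could be inserted into MySQL, so records piled up and End to End Duration kept increasing.

This put the Flink job under back pressure (Back Pressure), as shown in the figure:

[image 1547705967826]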
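
The suspicion above can be sketched with a little arithmetic: if records arrive faster than all sink instances together can write them, the job is back-pressured, checkpoint barriers wait behind the buffered records, and checkpoints take longer and longer. The rates below are made-up numbers for illustration only (no throughput figures were measured for this job), and the helper names are hypothetical.

```scala
object SinkSizingSketch {
  // Records per second that the sinks cannot keep up with (0 if they keep up).
  def shortfallPerSecond(inputRate: Double, perSinkRate: Double, sinkParallelism: Int): Double =
    math.max(0.0, inputRate - perSinkRate * sinkParallelism)

  // Smallest sink parallelism whose combined write rate covers the input rate.
  def minSinkParallelism(inputRate: Double, perSinkRate: Double): Int =
    math.ceil(inputRate / perSinkRate).toInt

  def main(args: Array[String]): Unit = {
    val inputRate = 5000.0   // assumed records/s read from Kafka
    val perSinkRate = 1500.0 // assumed records/s one MySQL sink instance can insert
    println(shortfallPerSecond(inputRate, perSinkRate, sinkParallelism = 1))
    println(minSinkParallelism(inputRate, perSinkRate))
  }
}
```

In a Flink job, the corresponding change is a `setParallelism(n)` call on the sink; batching the MySQL inserts would be another way to raise the per-sink rate.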

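I increased the parallelism of the sink and submitted the job again. The problem was solved, as shown in the figure below:

[image 1547708152002]

The steadily growing End to End Duration described above is resolved, as shown in the figure below:

[image 1547709263777]

The final configuration is as follows: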

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.contrib.streaming.state.{PredefinedOptions, RocksDBStateBackend}
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,Time.of(10,TimeUnit.SECONDS)))   // question: can the job get its previous state back after a restart?
env.setRestartStrategy(RestartStrategies.noRestart())  // do not restart the job after a failure
// start a checkpoint every 60 s (value in milliseconds)
env.enableCheckpointing(10000 * 6)
// make sure at least 30 s pass before the next checkpoint is triggered (default is 0)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(10000*3)
// a checkpoint that has not completed within 10 minutes is aborted (default is 10 minutes)
env.getCheckpointConfig.setCheckpointTimeout(60000 * 10)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)  // The default is true. If true, the task fails on a checkpoint error
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
val rocksdbBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints",true)
// State data does not need snapshot compression: the compression option has no effect
// on incremental snapshots, because they use RocksDB's internal format
//rocksdbBackend.setOptions(new MyOptions())
rocksdbBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)
env.setStateBackend(rocksdbBackend)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION) // keep the checkpoint when the job is cancelled

Such a simple problem, and yet I took a long detour to get here. Ignorance, ignorance, ignorance!!!
