Recently, while working on a project, I ran into the following scenario: a fairly small table that rarely changes needs to act as a dimension table and be matched against a real-time stream. The table lives in MySQL, so my first instinct was to read it and broadcast it.
The simplified code looks like this:
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Read the MySQL dimension table and broadcast it as broadcast state
val broadcastMysql = env.addSource(new SourceFromMySQL)
  .map(x => Poi(x._1, x._2))
  .broadcast(new MapStateDescriptor[String, Poi]("broadcast-state",
    BasicTypeInfo.STRING_TYPE_INFO, TypeInformation.of(new TypeHint[Poi] {})))
// Consume the detail stream from Kafka
val kafkaDs = env.addSource(new FlinkKafkaConsumer011("flinkTest",
  new JSONKeyValueDeserializationSchema(false), properties))
val resultDs = kafkaDs.connect(broadcastMysql).process() // the matching logic (a BroadcastProcessFunction) goes here
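For context, SourceFromMySQL and Poi are not shown in the snippet above; below is a minimal sketch of what they might look like. The JDBC URL, credentials, table and column names are illustrative placeholders, not taken from the original job. The detail that matters for the rest of this post is that run() returns as soon as the table has been read once.

import java.sql.{Connection, DriverManager}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

case class Poi(id: String, name: String)

// Sketch of a one-shot MySQL source; all connection details are placeholders.
class SourceFromMySQL extends RichSourceFunction[(String, String)] {
  private var conn: Connection = _

  override def open(parameters: Configuration): Unit = {
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
  }

  // Reads the whole table once and then returns -- after that the source task
  // transitions to FINISHED, which is what breaks checkpointing later on.
  override def run(ctx: SourceFunction.SourceContext[(String, String)]): Unit = {
    val rs = conn.createStatement().executeQuery("SELECT id, name FROM poi")
    while (rs.next()) {
      ctx.collect((rs.getString(1), rs.getString(2)))
    }
  }

  override def cancel(): Unit = if (conn != null) conn.close()
}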
With this approach everything worked fine in local testing, and the stream records matched the dimension table as expected. But just as I thought everything was settled and submitted the job to the cluster, it failed with the following error:
Checkpoint triggering task XXX of job XXX is not in state RUNNING but FINISHED instead. Aborting checkpoint.
I then checked the Checkpoints page in the Flink Web UI and found no checkpoints at all. But I had clearly enabled checkpointing, so why was nothing ever being triggered?
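For context, checkpointing itself was enabled in the usual way; a minimal sketch is below, where the 60-second interval and exactly-once mode are illustrative rather than the exact settings of the original job.

import org.apache.flink.streaming.api.CheckpointingMode

// Enable checkpointing; the interval and mode shown here are illustrative
env.enableCheckpointing(60000)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)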
After repeated debugging, I finally pinned down the cause: with the code above, I read a bounded stream out of MySQL and broadcast it. Once the job has read the entire MySQL table, the source's run() method returns and the corresponding task is marked FINISHED, which is exactly what the Flink Web UI showed for that task.
A look at the source code, specifically the triggerCheckpoint method of org.apache.flink.runtime.checkpoint.CheckpointCoordinator, makes the reason clear:
// check if all tasks that we need to trigger are running.
// if not, abort the checkpoint
Execution[] executions = new Execution[tasksToTrigger.length];
for (int i = 0; i < tasksToTrigger.length; i++) {
    Execution ee = tasksToTrigger[i].getCurrentExecutionAttempt();
    if (ee == null) {
        LOG.info("Checkpoint triggering task {} of job {} is not being executed at the moment. Aborting checkpoint.",
                tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
                job);
        throw new CheckpointException(CheckpointFailureReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
    } else if (ee.getState() == ExecutionState.RUNNING) {
        executions[i] = ee;
    } else {
        LOG.info("Checkpoint triggering task {} of job {} is not in state {} but {} instead. Aborting checkpoint.",
                tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
                job,
                ExecutionState.RUNNING,
                ee.getState());
        throw new CheckpointException(CheckpointFailureReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
    }
}
We can see that when a checkpoint is triggered, the state of every Execution that needs to be triggered is checked. If any of them is not in the RUNNING state, the coordinator logs "Checkpoint triggering task {} of job {} is not in state {} but {} instead. Aborting checkpoint." and aborts triggerCheckpoint by throwing a CheckpointException.
So the culprit was my use of the broadcast stream: it left one Execution in the FINISHED state, and that is why the checkpoints failed.
Starting from the root cause, the requirement is clear: we still want to use a broadcast stream, but checkpoints must not fail, so the task backing the broadcast stream has to stay in the RUNNING state.
The usual solution is therefore to broadcast an unbounded stream, for example keeping the rule data in one Kafka topic while reading the detail records from another Kafka topic in real time, and matching the two:
// Rules come from one Kafka topic, detail events from another; both streams are unbounded
val rule = env.addSource(new FlinkKafkaConsumer011(topic1, new SimpleStringSchema(), kafkaPro1))
val event = env.addSource(new FlinkKafkaConsumer011(topic2, new SimpleStringSchema(), kafkaPro2))
val ruleBroadcast = rule.broadcast(new MapStateDescriptor[String, String]("broadcast-state",
  BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO))
event.connect(ruleBroadcast).process()...
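For completeness, the matching logic that goes into process() is a BroadcastProcessFunction; below is a minimal sketch, assuming both topics carry plain strings and each rule message is a "key,value" pair. The class name and the parsing logic are illustrative, not from the original job.

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

// Sketch only: matches detail events against rules held in broadcast state.
class RuleMatchFunction extends BroadcastProcessFunction[String, String, String] {

  // Must use the same descriptor (name and types) as the one passed to broadcast()
  private val ruleDesc = new MapStateDescriptor[String, String]("broadcast-state",
    BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

  // Detail events: look up the matching rule in the read-only broadcast state
  override def processElement(value: String,
                              ctx: BroadcastProcessFunction[String, String, String]#ReadOnlyContext,
                              out: Collector[String]): Unit = {
    val rule = ctx.getBroadcastState(ruleDesc).get(value)
    if (rule != null) out.collect(s"$value -> $rule")
  }

  // Rule events: update the broadcast state so every parallel instance sees the new rule
  override def processBroadcastElement(value: String,
                                       ctx: BroadcastProcessFunction[String, String, String]#Context,
                                       out: Collector[String]): Unit = {
    val parts = value.split(",", 2)
    if (parts.length == 2) ctx.getBroadcastState(ruleDesc).put(parts(0), parts(1))
  }
}

With something like this in place, the last line becomes event.connect(ruleBroadcast).process(new RuleMatchFunction). Since both sources are unbounded Kafka consumers, every task stays RUNNING and checkpoints go through normally.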