Hudi has two headline capabilities: streaming (incremental) queries and support for upsert/delete. Every data change in Hudi is recorded on the timeline, so the instant (a point in time on that timeline) becomes the basis for incremental queries. In the Flink integration, enabling streaming read is essentially a continuous incremental query, and the options read.start-commit and read.end-commit specify the initial query range for a stateless Flink job.
tEnv.executeSql("CREATE TABLE tb_person_hudi ( id BIGINT, age INT, name STRING,create_time TIMESTAMP ( 3 ), time_stamp TIMESTAMP(3),PRIMARY KEY ( id ) NOT ENFORCED ) WITH (\n" +
"\t'connector' = 'hudi',\n" +
"\t'table.type' = 'MERGE_ON_READ',\n" +
"\t'path' = 'file:///D:/data/hadoop3.2.1/warehouse/tb_person_hudi',\n" +
"\t'read.start-commit' = '20220722103000',\n" +
"\t'read.task' = '1',\n" +
"\t'read.streaming.enabled' = 'true',\n" +
"\t'read.streaming.check-interval' = '30' \n" +
")");
Table table = tEnv.sqlQuery("select * from tb_person_hudi");
tEnv.toChangelogStream(table).print().setParallelism(1);
env.execute("incremental query");
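For comparison, the same options also support a bounded incremental read: with streaming disabled and both ends of the commit range pinned, the job returns only the changes committed in that window and then finishes. The following is a minimal sketch; the table name tb_person_hudi_batch and the end-commit timestamp are illustrative only:
tEnv.executeSql("CREATE TABLE tb_person_hudi_batch (\n" +
        "  id BIGINT,\n" +
        "  age INT,\n" +
        "  name STRING,\n" +
        "  create_time TIMESTAMP(3),\n" +
        "  time_stamp TIMESTAMP(3),\n" +
        "  PRIMARY KEY (id) NOT ENFORCED\n" +
        ") WITH (\n" +
        "  'connector' = 'hudi',\n" +
        "  'table.type' = 'MERGE_ON_READ',\n" +
        "  'path' = 'file:///D:/data/hadoop3.2.1/warehouse/tb_person_hudi',\n" +
        "  'read.streaming.enabled' = 'false',\n" +       // bounded (batch) incremental read
        "  'read.start-commit' = '20220722103000',\n" +   // lower bound of the commit range
        "  'read.end-commit' = '20220722120000'\n" +      // illustrative upper bound
        ")");
tEnv.sqlQuery("select * from tb_person_hudi_batch").execute().print();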
HoodieTableSource implements the ScanTableSource, SupportsPartitionPushDown, SupportsProjectionPushDown, SupportsLimitPushDown and SupportsFilterPushDown interfaces. The last four exist to optimize the query plan by pushing partition pruning, projection, limit and filters down to the source; ScanTableSource provides the actual read path for the Hudi table. The core method is org.apache.hudi.table.HoodieTableSource#getScanRuntimeProvider:
if (conf.getBoolean(FlinkOptions.READ_AS_STREAMING)) { // streaming read enabled (read.streaming.enabled)
  // monitors the timeline for new instants and generates input splits
  StreamReadMonitoringFunction monitoringFunction = new StreamReadMonitoringFunction(
      conf, FilePathUtils.toFlinkPath(path), maxCompactionMemoryInBytes, getRequiredPartitionPaths());
  InputFormat<RowData, ?> inputFormat = getInputFormat(true);
  // reads the splits emitted by the monitor
  OneInputStreamOperatorFactory<MergeOnReadInputSplit, RowData> factory =
      StreamReadOperator.factory((MergeOnReadInputFormat) inputFormat);
  SingleOutputStreamOperator<RowData> source = execEnv
      .addSource(monitoringFunction, getSourceOperatorName("split_monitor"))
      .setParallelism(1)                                          // the split monitor is single-parallelism
      .transform("split_reader", typeInfo, factory)
      .setParallelism(conf.getInteger(FlinkOptions.READ_TASKS));  // read.tasks parallel readers
  return new DataStreamSource<>(source);
}
The code above wires two pieces into the streaming job: a SourceFunction (StreamReadMonitoringFunction) and a custom operator (StreamReadOperator).
StreamReadMonitoringFunction periodically (every read.streaming.check-interval seconds) scans the table's metadata directory .hoodie. When it finds newly completed instants on the active timeline (action = commit, deltacommit, compaction or replace, state = completed), it derives from those instants which files (parquet base files and log files) the changes were written to and packages them into split objects (MergeOnReadInputSplit).
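As a side note, the timeline that the monitor scans can also be inspected directly with Hudi's client API. The sketch below is an assumption-laden illustration: it assumes a Hudi 0.11/0.12-era HoodieTableMetaClient.builder() (where setConf takes a Hadoop Configuration and getInstants() returns a stream) and reuses the table path from the example above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

// Build a meta client for the table and list the completed instants on the
// active timeline -- these are the instants the split monitor reacts to.
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
    .setConf(new Configuration())
    .setBasePath("file:///D:/data/hadoop3.2.1/warehouse/tb_person_hudi")
    .build();
HoodieTimeline completed = metaClient.getActiveTimeline()
    .getCommitsTimeline()
    .filterCompletedInstants();
completed.getInstants().forEach(instant ->
    System.out.println(instant.getTimestamp() + " " + instant.getAction()));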
The monitoring function then calls IncrementalInputSplits#inputSplits to obtain the incremental splits (already sorted) and forwards them to the downstream operator (StreamReadOperator):
public void monitorDirAndForwardSplits(SourceContext<MergeOnReadInputSplit> context) {
  HoodieTableMetaClient metaClient = getOrCreateMetaClient();
  // compute the incremental splits produced since the last issued instant
  IncrementalInputSplits.Result result =
      incrementalInputSplits.inputSplits(metaClient, this.hadoopConf, this.issuedInstant);
  // forward each split to the downstream StreamReadOperator
  for (MergeOnReadInputSplit split : result.getInputSplits()) {
    context.collect(split);
  }
}
The main logic lives in IncrementalInputSplits#inputSplits(metaClient, hadoopConf, issuedInstant). Following it requires some of Hudi's basic concepts around the timeline and instants; the detailed flow is shown in the figure below:
Note that if the first run of the Flink job specifies read.start-commit and read.end-commit, but that range lies so far in the past that its instants have already been archived off the active timeline, the streaming job will never be able to consume any data:
https://github.com/apache/hudi/issues/6167
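One common way to avoid pointing the first run at an already-archived range is to start from the earliest instant that still exists on the timeline. The sketch below uses Flink's dynamic table options hint (table.dynamic-table-options.enabled must be on in older Flink versions) and assumes the Hudi release in use accepts the special value 'earliest' for read.start-commit:
// Override read.start-commit at query time: 'earliest' starts the streaming
// read from the earliest instant still present on the timeline instead of a
// fixed timestamp that may already have been archived.
Table fromEarliest = tEnv.sqlQuery(
        "select * from tb_person_hudi /*+ OPTIONS('read.start-commit' = 'earliest') */");
tEnv.toChangelogStream(fromEarliest).print().setParallelism(1);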
After receiving a split, the StreamReadOperator operator buffers it in an internal queue (splits), then keeps taking splits off that queue and scheduling them for asynchronous processing. The core of processSplits (abridged) looks like this:
private void processSplits() throws IOException {
  MergeOnReadInputSplit split = splits.peek();  // next buffered split, if any
  if (split == null) {
    return;                     // nothing to process right now
  }
  format.open(split);           // open MergeOnReadInputFormat for this split
  consumeAsMiniBatch(split);    // read one mini-batch of records and emit them
  enqueueProcessSplits();       // re-schedule itself to keep draining the queue
}
Each round thus has three main steps: open the split with MergeOnReadInputFormat, consume one mini-batch of records and emit them downstream, then re-enqueue the processing task. Depending on what a split contains (base files only, log files only, or both) and on the configured merge type, MergeOnReadInputFormat builds a different record iterator: BaseFileOnlyFilteringIterator, BaseFileOnlyIterator, LogFileOnlyIterator, MergeIterator or SkipMergeIterator.
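To make the difference between the last two iterators concrete, here is a purely conceptual sketch (not Hudi code, all names are illustrative): a merge-style read folds the log-file rows into the base-file rows per record key, while a skip-merge read simply concatenates both sides without de-duplication.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only. Each row is keyed by its record key (row[0]);
// log rows represent newer versions of base rows.
class MergeSketch {
  static List<String[]> mergeRead(List<String[]> baseRows, List<String[]> logRows) {
    Map<String, String[]> merged = new LinkedHashMap<>();
    baseRows.forEach(r -> merged.put(r[0], r));
    logRows.forEach(r -> merged.put(r[0], r));    // newer version wins (upsert semantics)
    return new ArrayList<>(merged.values());
  }

  static List<String[]> skipMergeRead(List<String[]> baseRows, List<String[]> logRows) {
    List<String[]> all = new ArrayList<>(baseRows);
    all.addAll(logRows);                          // no de-duplication: both versions are emitted
    return all;
  }
}
A skip-merge read is cheaper because it avoids the key-based lookup, at the cost of possibly returning multiple versions of the same key.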