Unlike Spark, Flink integrates with Hudi purely through SPI, so none of the explicit configuration that Spark requires is needed. On the Spark side you have to set:
spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
(Not counting org.apache.spark.sql.sources.DataSourceRegister here.)
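For instance, when a SparkSession is built programmatically, the same two options have to be passed in explicitly. A minimal sketch using the standard SparkSession builder API:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    // Hudi's SQL extension and catalog must be wired in by hand on the Spark side
    .config("spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate();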
For Flink, all you need to do is put the corresponding Hudi jar on the classpath; the factories are registered via SPI:
META-INF/services/org.apache.flink.table.factories.Factory
org.apache.hudi.table.HoodieTableFactory
org.apache.hudi.table.catalog.HoodieCatalogFactory
Here HoodieTableFactory is where Hudi data is read and written, while HoodieCatalogFactory provides the Catalog used to manage Hudi tables.
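As a quick illustration of the SPI wiring (a sketch, not taken from the Hudi docs: the class name, table name, columns, and path below are made up, and it assumes the hudi-flink bundle jar is on the Flink classpath), a plain CREATE TABLE with 'connector' = 'hudi' is all that is needed for Flink's factory discovery to locate HoodieTableFactory through the service file above:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiConnectorDemo {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // 'connector' = 'hudi' is matched against HoodieTableFactory#factoryIdentifier(),
    // which Flink discovered via META-INF/services/org.apache.flink.table.factories.Factory
    tEnv.executeSql(
        "CREATE TABLE hudi_sink (\n"
            + "  uuid STRING,\n"
            + "  name STRING,\n"
            + "  ts TIMESTAMP(3),\n"
            + "  PRIMARY KEY (uuid) NOT ENFORCED\n"
            + ") WITH (\n"
            + "  'connector' = 'hudi',\n"
            + "  'path' = 'file:///tmp/hudi_sink',\n"
            + "  'table.type' = 'MERGE_ON_READ'\n"
            + ")");
    tEnv.executeSql(
        "INSERT INTO hudi_sink VALUES ('id1', 'Danny', TIMESTAMP '2024-01-01 00:00:01')");
  }
}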
Let's start with how data is written to Hudi:
HoodieTableFactory#createDynamicTableSink:
@Override
public DynamicTableSink createDynamicTableSink(Context context) {
  // copy the table's WITH options into a Flink Configuration
  Configuration conf = FlinkOptions.fromMap(context.getCatalogTable().getOptions());
  checkArgument(!StringUtils.isNullOrEmpty(conf.getString(FlinkOptions.PATH)),
      "Option [path] should not be empty.");
  ResolvedSchema schema = context.getCatalogTable().getResolvedSchema();
  // validate the schema and required options (e.g. record key / pre-combine field)
  sanityCheck(conf, schema);
  // fill in derived options (table name, key and partition fields, etc.) from the catalog table
  setupConfOptions(conf, context.getObjectIdentifier(), context.getCatalogTable(), schema);
  return new HoodieTableSink(conf, schema);
}
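To make the first two lines concrete: FlinkOptions.fromMap simply turns the DDL's WITH options into a Flink Configuration, and the path check refuses a table declared without a storage location. A rough, hand-written equivalent (not the Hudi code; the option values reuse the DDL example above):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.configuration.Configuration;

Map<String, String> options = new HashMap<>();
options.put("connector", "hudi");
options.put("path", "file:///tmp/hudi_sink");
options.put("table.type", "MERGE_ON_READ");

// roughly what FlinkOptions.fromMap does: copy the WITH options into a Configuration
Configuration conf = Configuration.fromMap(options);

// the factory will not create a sink for a table without a 'path'
if (conf.getString("path", "").isEmpty()) {
  throw new IllegalArgumentException("Option [path] should not be empty.");
}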
The HoodieTableSink created here is the class that actually writes data to Hudi:
public class HoodieTableSink implements DynamicTableSink, SupportsPartitioning, SupportsOverwrite {
  ...
  @Override
  public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
    return (DataStreamSinkProviderAdapter) dataStream -> {
      // setup configuration
      long ckpTimeout = dataStream.getExecutionEnvironment()
          .getCheckpointConfig().getCheckpointTimeout();
      conf.setLong(FlinkOptions.WRITE_COMMIT_ACK_TIMEOUT, ckpTimeout);
      // set up default parallelism
      OptionsInference.setupSinkTasks(conf, dataStream.getExecutionConfig().getParallelism());
      RowType rowType = (RowType) schema.toSinkRowDataType().notNull().getLogicalType();
      // bulk_insert mode
      final String writeOperation = this.conf.get(FlinkOptions.OPERATION);
      if (WriteOperationType.fromValue(writeOperation) == WriteOperationType.BULK_INSERT) {
        return Pipelines.bulkInsert(conf, rowType, dataStream);
      }
      // Append mode
      if (OptionsResolver.isAppendMode(conf)) {
        // builds an append-only write pipeline (branch body elided in this excerpt)
        ...
      }

      // Default streaming write pipeline (built in Pipelines.hoodieStreamWrite):
      // records are shuffled by record key to the bucket assigner, then shuffled
      // by bucket ID (fileId) to the write tasks:
      //
      //   | input1 | ===\     /=== | bucket assigner | ===\     /=== | task1 |
      //                shuffle(by PK)                    shuffle(by bucket ID)
      //   | input2 | ===/     \=== | bucket assigner | ===/     \=== | task2 |
      WriteOperatorFactory<HoodieRecord> operatorFactory = StreamWriteOperator.getFactory(conf);
      return dataStream
          // Key-by record key, to avoid multiple subtasks write to a bucket at the same time
          .keyBy(HoodieRecord::getRecordKey)
          .transform(
              "bucket_assigner",
              TypeInformation.of(HoodieRecord.class),
              new KeyedProcessOperator<>(new BucketAssignFunction<>(conf)))
          .uid(opUID("bucket_assigner", conf))
          .setParallelism(conf.getInteger(FlinkOptions.BUCKET_ASSIGN_TASKS))
          // shuffle by fileId(bucket id)
          .keyBy(record -> record.getCurrentLocation().getFileId())
          .transform(opName("stream_write", conf), TypeInformation.of(Object.class), operatorFactory)
          .uid(opUID("stream_write", conf))
          .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS));
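Note how the lambda at the top reads the checkpoint timeout from the execution environment and copies it into FlinkOptions.WRITE_COMMIT_ACK_TIMEOUT: the streaming write path only commits to the Hudi timeline when a checkpoint completes, so a job writing to Hudi in streaming mode must have checkpointing enabled. A minimal sketch (the interval and timeout values are arbitrary):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Hudi's stream_write operators flush and commit on checkpoint completion
env.enableCheckpointing(60_000L);
// this is the value the sink picks up as WRITE_COMMIT_ACK_TIMEOUT above
env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000L);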