Changes on HDFS When Creating a Hudi Table

The Spark SQL statement for creating the Hudi table:

CREATE TABLE t71 (
    ds BIGINT,
    ut STRING,
    pk BIGINT,
    f0 BIGINT,
    f1 BIGINT,
    f2 BIGINT,
    f3 BIGINT,
    f4 BIGINT
) USING hudi
PARTITIONED BY (ds)
TBLPROPERTIES ( -- OPTIONS can also be used here (https://hudi.apache.org/docs/table_management)
  type = 'mor',
  primaryKey = 'pk',
  preCombineField = 'ut',
  hoodie.index.type = 'BUCKET',
  hoodie.bucket.index.num.buckets = '2',
  hoodie.compaction.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
  hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
  hoodie.archive.merge.enable = 'true',
  hoodie.datasource.write.operation = 'upsert'
);

After CREATE TABLE is executed, a subdirectory and files are created under the table directory:

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71
Found 1 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie
Found 5 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
-rw-r--r--   3 zhangsan dfsusers       1501 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
Found 1 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap
Found 2 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids

[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties
#Properties saved on 2023-05-31T03:09:25.601Z
#Wed May 31 11:09:25 CST 2023
hoodie.table.precombine.field=ut
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=ds
hoodie.bucket.index.num.buckets=2
hoodie.table.type=MERGE_ON_READ
hoodie.archivelog.folder=archived
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload
hoodie.table.version=5
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=pk
hoodie.database.name=test
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.name=t71
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.create.schema={"type"\:"record","name"\:"t71_record","namespace"\:"hoodie.t71","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"ut","type"\:["string","null"]},{"name"\:"pk","type"\:["long","null"]},{"name"\:"f0","type"\:["long","null"]},{"name"\:"f1","type"\:["long","null"]},{"name"\:"f2","type"\:["long","null"]},{"name"\:"f3","type"\:["long","null"]},{"name"\:"f4","type"\:["long","null"]},{"name"\:"ds","type"\:["long","null"]}]}
hoodie.index.type=BUCKET
hoodie.table.checksum=3938074607
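As an aside, hoodie.properties is stored in java.util.Properties format, where ':' inside values is escaped as '\:' (as seen in hoodie.table.create.schema above). A minimal Python sketch for decoding such a line; the sample value below is abbreviated from the listing above:

```python
import json

# One line from hoodie.properties (abbreviated; the real schema value is longer).
# java.util.Properties escapes ':' as '\:' inside values.
line = r'hoodie.table.create.schema={"type"\:"record","name"\:"t71_record","fields"\:[{"name"\:"pk","type"\:["long","null"]}]}'

key, _, raw_value = line.partition("=")          # split at the first '='
schema = json.loads(raw_value.replace("\\:", ":"))  # undo Properties escaping

print(key)                                    # hoodie.table.create.schema
print([f["name"] for f in schema["fields"]])  # field names declared in the schema
```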

After DROP TABLE is executed, the table directory, e.g. hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71, is deleted.

For a partitioned table, when data is inserted, subdirectories named after the partition values are created under the table directory. For example:

insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);

The statement above creates a subdirectory named "ds=20230101" on HDFS:

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 2 items
-rw-r--r--   3 zhangsan dfsusers         96 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r--   3 zhangsan dfsusers     435756 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-6776-4b80-915b-ad6bdff96948-0_1-21-19_20230531112913107.parquet

Execute three INSERT statements in a row (each followed by a SELECT):

insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f1) values (20230101,CURRENT_TIMESTAMP,1102,2);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f2) values (20230101,CURRENT_TIMESTAMP,1102,3);
select * from t71 where pk=1102;

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 3 items
-rw-r--r--   3 zhangsan dfsusers       1048 2023-05-31 14:26 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r--   3 zhangsan dfsusers       2096 2023-05-31 14:31 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r--   3 zhangsan dfsusers         96 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r--   3 zhangsan dfsusers     435757 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet

The listing above shows two ".log." entries with the same name: they are the same file as it appeared after the second insert and after the third insert, placed together here for easier comparison (hence four lines despite "Found 3 items").

The first insert into a partition always produces a ".parquet" file rather than a ".log." file. The ".parquet" file is the column-oriented base file; both COW and MOR tables have base files, but a COW table rewrites the entire base file on every insert, while a MOR table does not. The ".log." file is the row-oriented incremental log file, which only MOR tables have. The file .hoodie_partition_metadata holds the partition metadata:

[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
#partition metadata
#Wed May 31 11:29:49 CST 2023
commitTime=20230531112913107
partitionDepth=1
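The base and log file names in these listings follow a recognizable layout. Below is a minimal Python sketch that decomposes them into file ID, write token, and instant time; the patterns are inferred from the listings above, not taken from an official specification:

```python
import re

# Base file:  <fileId>_<writeToken>_<instantTime>.parquet
# Log file:   .<fileId>_<baseInstantTime>.log.<version>_<writeToken>
# (layout inferred from the directory listings above)
BASE_RE = re.compile(r"^(?P<file_id>.+)_(?P<write_token>\d+-\d+-\d+)_(?P<instant>\d+)\.parquet$")
LOG_RE = re.compile(r"^\.(?P<file_id>.+)_(?P<instant>\d+)\.log\.(?P<version>\d+)_(?P<write_token>\d+-\d+-\d+)$")

def parse_hudi_file(name: str) -> dict:
    """Classify a Hudi data file name and extract its components."""
    m = BASE_RE.match(name)
    if m:
        return {"kind": "base", **m.groupdict()}
    m = LOG_RE.match(name)
    if m:
        return {"kind": "log", **m.groupdict()}
    return {"kind": "other", "name": name}

base = parse_hudi_file("00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet")
log = parse_hudi_file(".00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6")
print(base["kind"], base["instant"])   # base 20230531141236926
print(log["kind"], log["version"])     # log 1
```

Note how both names share the same file ID and instant time: the log file records incremental changes against that base file.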

The ".parquet" file

Opening the ".parquet" file with the online tool https://parquet-viewer-online.com/result shows that its content is exactly the same as the SELECT result:

_hoodie_commit_time _hoodie_commit_seqno  _hoodie_record_key _hoodie_partition_path _hoodie_file_name                                                        ut	                     pk	  f0 f1	  f2   f3   f4   ds
20230531141236926   20230531141236926_1_0 1102               ds=20230101            00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet 2023-05-31 14:12:37.126 1102 1  null null null null 20230101

The ".log." file

After the second insert, a ".log." file is generated:

#HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531142614512       •         ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ    ª¿¥          

After the third insert, the ".log." file is updated (a new block is appended):

#HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531142614512       •         ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ    ª¿¥          #HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531143136695       •         ‰"20230531143136695*20230531143136695_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:31:36.801œ    ª¿¥          

Here we can see two records with pk=1102, whose ut values are 2023-05-31 14:26:14.761 and 2023-05-31 14:31:36.801. When reading with OverwriteNonDefaultsWithLatestAvroPayload, only the 2023-05-31 14:31:36.801 record is visible. This follows the "larger preCombineField wins" rule, implemented in HoodieRecordPayload::preCombine.

Related source code

// OverwriteNonDefaultsWithLatestAvroPayload does not override the preCombine method of OverwriteWithLatestAvroPayload
public class OverwriteNonDefaultsWithLatestAvroPayload extends OverwriteWithLatestAvroPayload {
}

public class OverwriteWithLatestAvroPayload extends BaseAvroPayload
    implements HoodieRecordPayload<OverwriteWithLatestAvroPayload> {

  @Override
  public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
    if (oldValue.recordBytes.length == 0) {
      // use natural order for delete record
      return this;
    }
    if (oldValue.orderingVal.compareTo(orderingVal) > 0) {
      // pick the payload with greatest ordering value
      return oldValue;
    } else {
      return this;
    }
  }
}

If PartialUpdateAvroPayload is used instead, the SELECT result becomes:

_hoodie_commit_time _hoodie_commit_seqno  _hoodie_record_key _hoodie_partition_path _hoodie_file_name                       ut         pk           f0      f1 f2   f3  f4
20230531164701237   20230531164701237_0_1 1006	             ds=20230101	        00000000-ad06-474e-a7ac-0580f60307e1-0	2023-05-31 16:47:02.337	1006	1	2	3	NULL	NULL
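The difference between the two payloads can be sketched as a simplified Python model. This is an illustration only, not Hudi's actual implementation, and it models only the preCombine step (the later merge against the base file, combineAndGetUpdateValue, adds further rules); the field names follow the example above:

```python
# Simplified model of the two payload merge behaviors (illustration only,
# not Hudi's actual code). Each record is a dict; 'ut' is the preCombine field.

def pre_combine_overwrite_latest(old, new):
    # OverwriteWithLatestAvroPayload.preCombine: the record with the larger
    # ordering value wins as a whole; the other record is discarded.
    return old if old["ut"] > new["ut"] else new

def pre_combine_partial_update(old, new):
    # PartialUpdateAvroPayload (simplified): the latest record wins, but its
    # null fields are filled from the older record, so earlier values survive.
    newer, older = (old, new) if old["ut"] > new["ut"] else (new, old)
    return {k: v if v is not None else older.get(k) for k, v in newer.items()}

# The three inserts from the example, each setting a different field.
r1 = {"ut": "14:12", "pk": 1102, "f0": 1, "f1": None, "f2": None}
r2 = {"ut": "14:26", "pk": 1102, "f0": None, "f1": 2, "f2": None}
r3 = {"ut": "14:31", "pk": 1102, "f0": None, "f1": None, "f2": 3}

# Whole-record overwrite: only the last insert's non-null field remains.
a = pre_combine_overwrite_latest(pre_combine_overwrite_latest(r1, r2), r3)
print(a["f0"], a["f1"], a["f2"])   # None None 3

# Partial update: all three field values are retained, as in the SELECT above.
b = pre_combine_partial_update(pre_combine_partial_update(r1, r2), r3)
print(b["f0"], b["f1"], b["f2"])   # 1 2 3
```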
